Alan Cox [Sat, 30 Mar 2024 20:35:32 +0000 (15:35 -0500)]
arm64: enable superpage mappings by pmap_mapdev{,_attr}()
In order for pmap_kenter{,_device}() to create superpage mappings,
either 64 KB or 2 MB, pmap_mapdev{,_attr}() must request appropriately
aligned virtual addresses.
Eliot Solomon [Sun, 24 Mar 2024 19:01:47 +0000 (14:01 -0500)]
arm64 pmap: Add ATTR_CONTIGUOUS support [Part 1]
The ATTR_CONTIGUOUS bit within an L3 page table entry designates that
L3 page as being part of an aligned, physically contiguous collection
of L3 pages. For example, 16 aligned, physically contiguous 4 KB pages
can form a 64 KB superpage, occupying a single TLB entry. While this
change only creates ATTR_CONTIGUOUS mappings in a few places,
specifically, the direct map and pmap_kenter{,_device}(), it adds all
of the necessary code for handling them once they exist, including
demotion, protection, and removal. Consequently, new ATTR_CONTIGUOUS
usage can be added (and tested) incrementally.
Modify the implementation of sysctl vm.pmap.kernel_maps so that it
correctly reports the number of ATTR_CONTIGUOUS mappings on machines
configured to use a 16 KB base page size, where an ATTR_CONTIGUOUS
mapping consists of 128 base pages.
Additionally, this change adds support for creating L2 superpage
mappings to pmap_kenter{,_device}().
thread_single(9): decline external requests for traced or debugger-stopped procs
Debugger has the powers to cause unbound delay in single-threading,
which then blocks the threaded taskqueue. The reproducer is
`truss -f timeout 2 sleep 10`.
Reported by: mjg
Reviewed by: markj
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D44523
Robert Evans [Sat, 30 Mar 2024 00:11:52 +0000 (20:11 -0400)]
Linux 5.18+ compat: Detect filemap_range_has_page
In v5.18 `filemap_range_has_page` moved to `pagemap.h`
`pagemap.h` has been around since 3.10 so just include both
Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Rob Norris <robn@despairlabs.com> Signed-off-by: Robert Evans <evansr@google.com>
Closes #16034
Update the I2C controller logic to be more consistent with the
newer version of the controller reference manual.
This makes it work better on modern LS/LX platforms and avoids
unnecessary delays. Also fixes a lock leak.
vf_i2c: split up and add ACPI attachments in addition to FDT
Move the code from the arm specific to the iicbus controller directory.
Split up between general logic and bus attachment code.
Add support for ACPI attachment in addition to FDT.
MFC after: 7 days
Tested by: bz (LS1088a FDT), Pierre-Luc Drouin (Honeycomb, ACPI)
Based on: D24917 by Val Packett (initial early version)
Differential Revision: https://reviews.freebsd.org/D44020
Robert Evans [Fri, 29 Mar 2024 21:59:23 +0000 (17:59 -0400)]
Fix buffer underflow if sysfs file is empty
Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Jason Lee <jasonlee@lanl.gov> Signed-off-by: Robert Evans <evansr@google.com>
Closes #16028
Closes #16035
Rob N [Fri, 29 Mar 2024 21:51:33 +0000 (08:51 +1100)]
vdev_disk: clean up spa/bdev mode conversion
43e8f6e37 introduced a subtle API misuse, in that it passed the output
from vdev_bdev_mode() back into itself. Fortunately, the
SPA_MODE_(READ|WRITE) bit values exactly map to the FMODE_(READ|WRITE) &
BLK_OPEN_(READ|WRITE) bit values, so it didn't result in a bug, but it
was hard to read and understand, so I cleaned it up.
In doing so, I noticed that the only call to vdev_bdev_mode() without
the "exclusive" flag set was in that misuse, and actually, we never do a
non-exclusive blkdev_get_by_path(). So I've just made exclusive be
always-on.
Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc.
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Allan Jude <allan@klarasystems.com> Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Closes #15995
currently, the linux kernel allows 2^20 minor devices per major device
number. ZFS reserves blocks of 2^4 minors per zvol: 1 for the zvol
itself, the other 15 for the first partitions of that zvol. as a result,
only 2^16 such blocks are available for use.
there are no checks in place to avoid overflowing into the major device
number when more than 2^16 zvols are allocated (with volmode=dev or
default). instead of ignoring this limit, which comes with all sorts of
weird knock-on effects, detect this situation and simply fail allocating
the zvol block device early on.
without this safeguard, the kernel will reject the attempt to create an
already existing block device, but ZFS doesn't handle this error and
gets confused about which zvol occupies which minor slot, potentially
resulting in kernel NULL derefs and other issues later on.
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Fabian Grünbichler <f.gruenbichler@proxmox.com>
Closes #16006
Gleb Smirnoff [Fri, 29 Mar 2024 20:35:51 +0000 (13:35 -0700)]
linux: make linux_netlink_p->msg_from_linux be able to fail
The KPI for this function was misleading. From the NetLink perspective it
looked like a function that: a) allocates new hdr, b) can fail. Neither
was true. Let the function return a error code instead of returning the
same hdr it was passed to. In case if future Linux NetLink compatibility
support calls for reallocating header, pass hdr as pointer to pointer.
With KPI that returns a error, propagate domain conversion errors all the
way up to NetLink module. This fixes panic when unknown domain is
converted to 0xff and this invalid value is passed into NetLink
processing.
Gleb Smirnoff [Fri, 29 Mar 2024 20:35:37 +0000 (13:35 -0700)]
linux: use sa_family_t for address family conversions
Express "conversion failed" with maximum possible value. This allows to
reduce number of size/signedness conversion in the code that utilizes the
functions.
Gleb Smirnoff [Fri, 29 Mar 2024 19:35:41 +0000 (12:35 -0700)]
if_tuntap: simplify storage of per-vnet cloners
There is no need for a separate structure neither for a linked list.
Provide each VNET with an array of pointers to if_clone that has the same
size as the driver list.
Bojan Novković [Fri, 29 Mar 2024 19:17:19 +0000 (20:17 +0100)]
kern_ctf.c: Don't print out warning messages unconditionally
The kernel CTF loading routines print various warnings when attempting
to load CTF data from an ELF file. After the changes in c21bc6f3c242
those warnings are unnecessarily printed for each kernel module
that was compiled without CTF data.
The kernel linker already uses the bootverbose flag to conditionally
print CTF loading errors. This patch alters kern_ctf.c
routines to do the same.
Gleb Smirnoff [Fri, 29 Mar 2024 19:16:59 +0000 (12:16 -0700)]
inpcb: fully retire inp_ppcb pointer
Before a protocol specific control block started to embed inpcb in self
(see 0aa120d52f3c, e68b3792440c, 483fe96511ec) this pointer used to point
at it.
Retain kf_sock_inpcb field in the struct kinfo_file in <sys/user.h>. The
exp-run detected a minimal use of the field in ports:
* sysutils/lsof - patched upstream
* net-mgmt/netdata - patch accepted upstream
* emulators/qemu-user-static - upstream master branch seems not using
the field anymore
We can keep the field around for some time, but eventually it may be
reused for something else.
George Wilson [Fri, 29 Mar 2024 19:15:56 +0000 (15:15 -0400)]
Add ashift validation when adding devices to a pool
Currently, zpool add allows users to add top-level vdevs that have
different ashifts but doing so prevents users from being able to
perform a top-level vdev removal. Often times consumers may not realize
that they have mismatched ashifts until the top-level removal fails.
This feature adds ashift validation to the zpool add command and will
fail the operation if the sector size of the specified vdev does not
match the existing pool. This behavior can be disabled by using the -f
flag. In addition, new flags have been added to provide fine-grained
control to disable specific checks. These flags
are:
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Mark Maybee <mmaybee@delphix.com> Signed-off-by: George Wilson <gwilson@delphix.com>
Closes #15509
Prevent a use-after-free in kern_poll() by making sure the buffer's
selinfo is drained. This is required for a subsequent patch that
implements asynchronous audio device detach.
Reported by: KASAN
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
Reviewed by: markj
Differential Revision: https://reviews.freebsd.org/D44544
Colin Percival [Fri, 29 Mar 2024 07:10:50 +0000 (00:10 -0700)]
release.sh: Don't install git if already present
Prior to this commit, we install git from ports if there is a ports
tree available and git is not installed, and we install git from pkg
otherwise -- including the case where git is already installed.
Rework the logic to not (re)install git at all if it is already
installed.
Gleb Smirnoff [Thu, 28 Mar 2024 21:10:15 +0000 (14:10 -0700)]
pfilctl: fix 'pfilctl hooks' when nothing is connected
The 'hooks' command actually worked accidentially until now. It used
PFILIOC_LISTHEADS to determine current number of hooks. This worked when
at least one head had a hook connected to it.
kerneldump: Add flag to indicate kernel core was successfully dumped
This allows for shutdown_final EVENTHANDLERs to know that a core dump
successfully occurred. Embedded systems may want to record this fact
or act on it.
Kristof Provost [Wed, 27 Mar 2024 14:47:21 +0000 (15:47 +0100)]
pf: fix reply-to after rdr and dummynet
If we redirect a packet to localhost and it gets dummynet'd it may be
re-injected later (e.g. when delayed) which means it will be passed
through ip_input() again. ip_input() will then reject the packet because
it's directed to the loopback address, but did not arrive on a loopback
interface.
Fix this by having pf set the rcvif to V_iflo if we redirect to
loopback.
See also: https://redmine.pfsense.org/issues/15363
Sponsored by: Rubicon Communications, LLC ("Netgate")
Randall Stewart [Thu, 28 Mar 2024 12:12:37 +0000 (08:12 -0400)]
Optimize HPTS so that little work is done until we have a hpts thread that is over the connection threshold
HPTS inserts a softclock for system call return that optimizes performance. However when
no HPTS threads need the help (i.e. when they have less than 100 or so connections) then
there should be little work done i.e. check the counter and return instead of running through
all the threads getting locks etc.ptimize HPTS so that little work is done until we have a hpts
thread that is over the connection threshold.
Robert Evans [Wed, 27 Mar 2024 21:59:16 +0000 (17:59 -0400)]
ZTS: fix flakiness in cp_files_002_pos
Fix RANDOM to not return zero.
Overwriting with `dd ... count=0` does not test anything.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: George Melikov <mail@gmelikov.ru> Reviewed-by: Allan Jude <allan@klarasystems.com> Signed-off-by: Robert Evans <evansr@google.com>
Closes #16029
Alexander Motin [Mon, 18 Mar 2024 18:19:53 +0000 (14:19 -0400)]
BRT: Fix holes cloning.
- When reading L0 block pointers handle buffers without ones and
without dirty records as a holes. Those appear when dnode size
was increased, but the end was never written, so there are no new
indirection levels to store the pointers. It makes no sense to
return EAGAIN here, since sync won't create new indirection levels
until there will be actual writes.
- When cloning blocks set destination hole logical birth time
to the current TXG. Otherwise if we are cloning over existing
data, newly created holes may not be properly replicated later.
Use BP_SET_BIRTH() when possible to not replicate its logic.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #15994
Closes #16007
Mike Karels [Wed, 27 Mar 2024 20:10:43 +0000 (15:10 -0500)]
bsdinstall: draw attention to new network config options
The network configuration options have changed in bsdinstall, with
an Auto option to proceed directly to DHCP and IPv6 autoconfig (which
is the default) as well as Manual (the old mode). For users like me
that were used to hitting return automatically to select an interface,
but want manual configuration, attempt to call out the difference:
Change the menu caption to say "Please select a network interface
and configuration mode:" and not just an interface.
Gleb Smirnoff [Wed, 27 Mar 2024 19:19:44 +0000 (12:19 -0700)]
sockets: define shutdown(2) constants in cpp namespace
There is software that uses SHUT_RD, SHUT_WR as preprocessor defines and
its build was broken by enum declaration. Keep the enum, but provide
defines to propagate the constants to cpp namespace.
Michael Tuexen [Wed, 27 Mar 2024 13:31:48 +0000 (14:31 +0100)]
tcp bblog: use correct length
The length of tldl_reason is TCP_LOG_REASON_LEN, not TCP_LOG_ID_LEN.
No functional change intended.
Reported by: Coverity Scan
CID: 1418074
CID: 1418276
Reviewed by: glebius, rscheff
MFC after: 3 days
Sponsored by: Netflix, Inc.
Differential Revision: https://reviews.freebsd.org/D44510
Note that VFS internally interprets a timestamp of -1 as “do not set”,
so this has no effect, but at least touch won't incorrectly reject the
given date / time (1969-12-31 23:59:59 UTC) as invalid.
* Mention that mktime() and timegm() set errno on failure.
* Correctly determining whether mktime() / timegm() succeeded with
arbitrary input (where -1 can be a valid result) is non-trivial.
Document the recommended procedure.
This adds support for two new diff algorithms, Myers diff and Patience
diff.
These algorithms perform a different form of search compared to the
classic Stone algorithm and support escapes when worst case scenarios
are encountered.
Add the -A flag to allow selection of the algorithm, but default to
using the new Myers diff implementation.
The libdiff implementation currently only supports a subset of input and
output options supported by diff. When these options are used, but the
algorithm is not selected, automatically fallback to the classic Stone
algorithm until support for these modes can be added.
Based on work originally done by thj@ with contributions from kevans@.
Sponsored by: Klara, Inc.
Reviewed by: thj
Differential Revision: https://reviews.freebsd.org/D44302
Tom Jones [Tue, 26 Mar 2024 09:52:07 +0000 (09:52 +0000)]
netmap: Address errors on memory free in netmap_generic
netmap_generic keeps a pool of mbufs for handling transfers, these mbufs
have an external buffer attached to them.
If some cases other parts of the network stack can chain these mbufs,
when this happens the normal pool destructor function can end up
free'ing the pool mbufs twice:
- A first time if a pool mbuf has been chained with another mbuf when
its chain is freed
- A second time when its entry in the pool is freed
Additionally, if other parts of the stack demote a pool mbuf its
interface reference will be cleared. In this case we deference a NULL
pointer when trying to free the mbuf through the destructor. Store a
reference to the adapter in ext_arg1 with the destructor callback so we
can find the correct adapter when free'ing a pool mbuf.
This change enables using netmap with epair interfaces.
Reviewed By: vmaffione
MFC after: 1 week
Relnotes: yes
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D44371
Zhenlei Huang [Tue, 26 Mar 2024 08:47:02 +0000 (16:47 +0800)]
kern linker: Do not touch userrefs of the kernel file
A nonzero `userrefs` of a linker file indicates that the file, either
loaded from kldload(2) or preloaded, can be unloaded via kldunload(2).
As for the kernel file, it can be unloaded by the loader but should not
be after initialization.
This change fixes regression from d9ce8a41eac9 which incidentally
increases `userrefs` of the kernel file.
Zhenlei Huang [Tue, 26 Mar 2024 03:55:45 +0000 (11:55 +0800)]
kern linker: Do not unload a module if it has dependants
Despite the name, linker_file_unload() will drop a reference and return
success when the module file has dependants, i.e. it has more than one
reference. When user request to unload such modules then the kernel
should reject unambiguously and immediately.
amd64: initialize td_frame stack area for init(8) main thread
Unitialized td_frame mostly does not matter since all registers are
overwritten on exec to activate init(8). Except PSL_T bit from the
%rflags which might leak into fresh init as garbage, causing spurious
SIGTRAPs delivered to init until first syscall is executed.
Reviewed by: emaste, jhb, jhibbits
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D44498
Alexander Motin [Tue, 26 Mar 2024 00:13:45 +0000 (20:13 -0400)]
BRT: Skip getting length in brt_entry_lookup()
Unlike DDT, where ZAP values may have different lengths due to
compression, all BRT entries are identical 8-byte counters. It
does not make sense to first fetch the length only to assert it.
zap_lookup_uint64() is specifically designed to work with counters
of different size and should return error if something odd found.
Calling it straight allows to save some measurable CPU time.
Reviewed-by: Pawel Jakub Dawidek <pawel@dawidek.net> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Rob Norris <robn@despairlabs.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #15950
Rob Norris [Wed, 13 Mar 2024 23:57:30 +0000 (10:57 +1100)]
abd_iter_page: don't use compound heads on Linux <4.5
Before 4.5 (specifically, torvalds/linux@ddc58f2), head and tail pages
in a compound page were refcounted separately. This means that using the
head page without taking a reference to it could see it cleaned up later
before we're finished with it. Specifically, bio_add_page() would take a
reference, and drop its reference after the bio completion callback
returns.
If the zio is executed immediately from the completion callback, this is
usually ok, as any data is referenced through the tail page referenced
by the ABD, and so becomes "live" that way. If there's a delay in zio
execution (high load, error injection), then the head page can be freed,
along with any dirty flags or other indicators that the underlying
memory is used. Later, when the zio completes and that memory is
accessed, its either unmapped and an unhandled fault takes down the
entire system, or it is mapped and we end up messing around in someone
else's memory. Both of these are very bad.
The solution on these older kernels is to take a reference to the head
page when we use it, and release it when we're done. There's not really
a sensible way under our current structure to do this; the "best" would
be to keep a list of head page references in the ABD, and release them
when the ABD is freed.
Since this additional overhead is totally unnecessary on 4.5+, where
head and tail pages share refcounts, I've opted to simply not use the
compound head in ABD page iteration there. This is theoretically less
efficient (though cleaning up head page references would add overhead),
but its safe, and we still get the other benefits of not mapping pages
before adding them to a bio and not mis-splitting pages.
There doesn't appear to be an obvious symbol name or config option we
can match on to discover this behaviour in configure (and the mm/page
APIs have changed a lot since then anyway), so I've gone with a simple
version check.
Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Rob Norris <rob.norris@klarasystems.com> Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc.
Closes #15533
Closes #15588
Rob Norris [Wed, 21 Feb 2024 00:07:21 +0000 (11:07 +1100)]
vdev_disk: use bio_chain() to submit multiple BIOs
Simplifies our code a lot, so we don't have to wait for each and
reassemble them.
Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Rob Norris <rob.norris@klarasystems.com> Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc.
Closes #15533
Closes #15588
Rob Norris [Tue, 9 Jan 2024 02:28:57 +0000 (13:28 +1100)]
vdev_disk: add module parameter to select BIO submission method
This makes the submission method selectable at module load time via the
`zfs_vdev_disk_classic` parameter, allowing this change to be backported
to 2.2 safely, and disabled in favour of the "classic" submission method
if new problems come up.
Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Rob Norris <rob.norris@klarasystems.com> Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc.
Closes #15533
Closes #15588
Rob Norris [Tue, 18 Jul 2023 01:11:29 +0000 (11:11 +1000)]
vdev_disk: rewrite BIO filling machinery to avoid split pages
This commit tackles a number of issues in the way BIOs (`struct bio`)
are constructed for submission to the Linux block layer.
The kernel has a hard upper limit on the number of pages/segments that
can be added to a BIO, as well as a separate limit for each device
(related to its queue depth and other scheduling characteristics).
ZFS counts the number of memory pages in the request ABD
(`abd_nr_pages_off()`, and then uses that as the number of segments to
put into the BIO, up to the hard upper limit. If it requires more than
the limit, it will create multiple BIOs.
Leaving aside the fact that page count method is wrong (see below), not
limiting to the device segment max means that the device driver will
need to split the BIO in half. This is alone is not necessarily a
problem, but it interacts with another issue to cause a much larger
problem.
The kernel function to add a segment to a BIO (`bio_add_page()`) takes a
`struct page` pointer, and offset+len within it. `struct page` can
represent a run of contiguous memory pages (known as a "compound page").
In can be of arbitrary length.
The ZFS functions that count ABD pages and load them into the BIO
(`abd_nr_pages_off()`, `bio_map()` and `abd_bio_map_off()`) will never
consider a page to be more than `PAGE_SIZE` (4K), even if the `struct
page` is for multiple pages. In this case, it will load the same `struct
page` into the BIO multiple times, with the offset adjusted each time.
With a sufficiently large ABD, this can easily lead to the BIO being
entirely filled much earlier than it could have been. This is also
further contributes to the problem caused by the incorrect segment limit
calculation, as its much easier to go past the device limit, and so
require a split.
Again, this is not a problem on its own.
The logic for "never submit more than `PAGE_SIZE`" is actually a little
more subtle. It will actually never submit a buffer that crosses a 4K
page boundary.
In practice, this is fine, as most ABDs are scattered, that is a list of
complete 4K pages, and so are loaded in as such.
Linear ABDs are typically allocated from slabs, and for small sizes they
are frequently not aligned to page boundaries. For example, a 12K
allocation can span four pages, eg:
Such an allocation would be loaded into a BIO as you see:
[1K, 4K, 4K, 3K]
This tends not to be a problem in practice, because even if the BIO were
filled and needed to be split, each half would still have either a start
or end aligned to the logical block size of the device (assuming 4K at
least).
---
In ideal circumstances, these shortcomings don't cause any particular
problems. Its when they start to interact with other ZFS features that
things get interesting.
Aggregation will create a "gang" ABD, which is simply a list of other
ABDs. Iterating over a gang ABD is just iterating over each ABD within
it in turn.
Because the segments are simply loaded in order, we can end up with
uneven segments either side of the "gap" between the two ABDs. For
example, two 12K ABDs might be aggregated and then loaded as:
[1K, 4K, 4K, 3K, 2K, 4K, 4K, 2K]
Should a split occur, each individual BIO can end up either having an
start or end offset that is not aligned to the logical block size, which
some drivers (eg SCSI) will reject. However, this tends not to happen
because the default aggregation limit usually keeps the BIO small enough
to not require more than one split, and most pages are actually full 4K
pages, so hitting an uneven gap is very rare anyway.
If the pool is under particular memory pressure, then an IO can be
broken down into a "gang block", a 512-byte block composed of a header
and up to three block pointers. Each points to a fragment of the
original write, or in turn, another gang block, breaking the original
data up over and over until space can be found in the pool for each of
them.
Each gang header is a separate 512-byte memory allocation from a slab,
that needs to be written down to disk. When the gang header is added to
the BIO, its a single 512-byte segment.
Pulling all this together, consider a large aggregated write of gang
blocks. This results a BIO containing lots of 512-byte segments. Given
our tendency to overfill the BIO, a split is likely, and most possible
split points will yield a pair of BIOs that are misaligned. Drivers that
care, like the SCSI driver, will reject them.
---
This commit is a substantial refactor and rewrite of much of `vdev_disk`
to sort all this out.
`vdev_bio_max_segs()` now returns the ideal maximum size for the device,
if available. There's also a tuneable `zfs_vdev_disk_max_segs` to
override this, to assist with testing.
We scan the ABD up front to count the number of pages within it, and to
confirm that if we submitted all those pages to one or more BIOs, it
could be split at any point with creating a misaligned BIO. If the
pages in the BIO are not usable (as in any of the above situations), the
ABD is linearised, and then checked again. This is the same technique
used in `vdev_geom` on FreeBSD, adjusted for Linux's variable page size
and allocator quirks.
`vbio_t` is a cleanup and enhancement of the old `dio_request_t`. The
idea is simply that it can hold all the state needed to create, submit
and return multiple BIOs, including all the refcounts, the ABD copy if
it was needed, and so on. Apart from what I hope is a clearer interface,
the major difference is that because we know how many BIOs we'll need up
front, we don't need the old overflow logic that would grow the BIO
array, throw away all the old work and restart. We can get it right from
the start.
Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Rob Norris <rob.norris@klarasystems.com> Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc.
Closes #15533
Closes #15588
Rob Norris [Tue, 9 Jan 2024 01:29:19 +0000 (12:29 +1100)]
vdev_disk: make read/write IO function configurable
This is just setting up for the next couple of commits, which will add a
new IO function and a parameter to select it.
Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Rob Norris <rob.norris@klarasystems.com> Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc.
Closes #15533
Closes #15588
Rob Norris [Tue, 9 Jan 2024 01:23:30 +0000 (12:23 +1100)]
vdev_disk: reorganise vdev_disk_io_start
Light reshuffle to make it a bit more linear to read and get rid of a
bunch of args that aren't needed in all cases.
Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Rob Norris <rob.norris@klarasystems.com> Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc.
Closes #15533
Closes #15588
Rob Norris [Tue, 9 Jan 2024 01:12:56 +0000 (12:12 +1100)]
vdev_disk: rename existing functions to vdev_classic_*
This is just renaming the existing functions we're about to replace and
grouping them together to make the next commits easier to follow.
Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Rob Norris <rob.norris@klarasystems.com> Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc.
Closes #15533
Closes #15588
Rob Norris [Mon, 11 Dec 2023 05:05:54 +0000 (16:05 +1100)]
abd: add page iterator
The regular ABD iterators yield data buffers, so they have to map and
unmap pages into kernel memory. If the caller only wants to count
chunks, or can use page pointers directly, then the map/unmap is just
unnecessary overhead.
This adds adb_iterate_page_func, which yields unmapped struct page
instead.
Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Rob Norris <rob.norris@klarasystems.com> Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc.
Closes #15533
Closes #15588
Rob Norris [Mon, 13 Nov 2023 06:55:29 +0000 (17:55 +1100)]
linux 5.4 compat: page_size()
Before 5.4 we have to do a little math.
Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Rob Norris <rob.norris@klarasystems.com> Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc.
Closes #15533
Closes #15588
Alexander Motin [Mon, 25 Mar 2024 22:02:38 +0000 (18:02 -0400)]
BRT: Make BRT block sizes configurable
Similar to DDT make BRT data and indirect block sizes configurable
via module parameters. I am not sure what would be the best yet,
but similar to DDT 4KB blocks kill all chances of compression on
vdev with ashift=12 or more, that on my tests reaches 3x.
While here, fix documentation for respective DDT parameters.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #15967
George Wilson [Mon, 25 Mar 2024 22:01:54 +0000 (18:01 -0400)]
Provide macros for setting and getting blkptr birth times
There exist a couple of macros that are used to update the blkptr birth
times but they can often be confusing. For example, the
BP_PHYSICAL_BIRTH() macro will provide either the physical birth time
if it is set or else return back the logical birth time. The
complement to this macro is BP_SET_BIRTH() which will set the logical
birth time and set the physical birth time if they are not the same.
Consumers may get confused when they are trying to get the physical
birth time and use the BP_PHYSICAL_BIRTH() macro only to find out that
the logical birth time is what is actually returned.
This change cleans up these macros and makes them symmetrical. The same
functionally is preserved but the name is changed. Instead of calling
BP_PHYSICAL_BIRTH(), consumer can now call BP_GET_BIRTH(). In
additional to cleaning up this naming conventions, two new sets of
macros are introduced -- BP_[SET|GET]_LOGICAL_BIRTH() and
BP_[SET|GET]_PHYSICAL_BIRTH. These new macros allow the consumer to
get and set the specific birth time.
As part of the cleanup, the unused GRID macros have been removed and
that portion of the blkptr are currently unused.
Reviewed-by: Matthew Ahrens <mahrens@delphix.com> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Mark Maybee <mark.maybee@delphix.com> Signed-off-by: George Wilson <gwilson@delphix.com>
Closes #15962
Alexander Motin [Mon, 25 Mar 2024 21:59:55 +0000 (17:59 -0400)]
BRT: Relax brt_pending_apply() locking
Since brt_pending_apply() is running in syncing context, no other
brt_pending_tree accesses are possible for the TXG. We don't need
to acquire brt_pending_lock here.
Reviewed-by: Pawel Jakub Dawidek <pawel@dawidek.net> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Brian Atkinson <batkinson@lanl.gov> Reviewed-by: Rob Norris <robn@despairlabs.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #15955
Alexander Motin [Mon, 25 Mar 2024 21:58:50 +0000 (17:58 -0400)]
ZAP: Massively switch to _by_dnode() interfaces
Before this change ZAP called dnode_hold() for almost every block
access, that was clearly visible in profiler under heavy load, such
as BRT. This patch makes it always hold the dnode reference between
zap_lockdir() and zap_unlockdir(). It allows to avoid most of dnode
operations between those. It also adds several new _by_dnode() APIs
to ZAP and uses them in BRT code. Also adds dmu_prefetch_by_dnode()
variant and uses it in the ZAP code.
After this there remains only one call to dmu_buf_dnode_enter(),
which seems to be unneeded. So remove the call and the functions.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #15951
Alexander Motin [Mon, 25 Mar 2024 21:58:04 +0000 (17:58 -0400)]
BRT: Skip duplicate BRT prefetches
If there is a pending entry for this block, then we've already
issued BRT prefetch for it within this TXG, so don't do it again.
BRT vdev lookup and following zap_prefetch_uint64() call can be
pretty expensive and should be avoided when not necessary.
Reviewed-by: Pawel Jakub Dawidek <pawel@dawidek.net> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #15941
Robert Evans [Mon, 25 Mar 2024 21:56:49 +0000 (17:56 -0400)]
Fix corruption caused by mmap flushing problems
1) Make mmap flushes synchronous. Linux may skip flushing dirty pages
already in writeback unless data-integrity sync is requested.
2) Change zfs_putpage to use TXG_WAIT. Otherwise dirty pages may be
skipped due to DMU pushing back on TX assign.
3) Add missing mmap flush when doing block cloning.
4) While here, pass errors from putpage to writepage/writepages.
This change fixes corruption edge cases, but unfortunately adds
synchronous ZIL flushes for dirty mmap pages to llseek and bclone
operations. It may be possible to avoid these sync writes later
but would need more tricky refactoring of the writeback code.
Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Robert Evans <evansr@google.com>
Closes #15933
Closes #16019
Ed Maste [Tue, 23 Jan 2024 18:04:43 +0000 (13:04 -0500)]
bsdlabel: add deprecation notice
gpart is the preferred tool for managing partitions of all types,
including BSD disklabels.
Note that this is only about bsdlabel/disklabel, the tool -- there is no
current plan to remove support for MBR or BSD disk labels from the
kernel or from gpart.
Reviewed by: imp, olce
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D43563
Mark Peek [Mon, 25 Mar 2024 15:58:46 +0000 (16:58 +0100)]
certctl: Revert to symlinks.
Unfortunately tar will not be able to extract base.txz to a system where
/etc and /usr are not on the same filesystem if the certificates are
hard links.
* Add a dummy getopt(3) loop to handle `--`.
* Move interval parsing out into a separate function.
* Print a diagnostic for every invalid interval.
* Check for NaN and infinity.
* Improve bounds checks.
Manual page:
* Miscellaneous markup fixes.
* Reword DESCRIPTION section.
* Move text about GNU compatibility to STANDARDS section.
* Convert examples from csh to sh.
Sponsored by: Klara, Inc.
Reviewed by: kevans
Differential Revision: https://reviews.freebsd.org/D44471
Kristof Provost [Sun, 24 Mar 2024 08:46:31 +0000 (09:46 +0100)]
pfsync: fix use of invalidated stack variable
Calls to pfsync_send_plus() pass pointers to stack variables.
If pfsync_sendout() then fails it retains the pointer to these stack
variables, accesing them later.
Allocate a buffer and copy the data instead, so that we can retain the
pointer safely.
Kristof Provost [Sat, 23 Mar 2024 16:02:50 +0000 (17:02 +0100)]
pf: fix use-after-free
If we fragment the packet in pf_route() the first transmitted packet
will free the pf_mtag we have stored in pf_pdesc (pd). Ensure we
update that pointer for every packet to avoid using a freed pointer in
pf_dummynet_route().
Eliot Solomon [Sat, 18 Nov 2023 21:13:21 +0000 (15:13 -0600)]
arm64: fix free queue and reservation configuration for 16KB pages
Correctly configure the free page queues and the reservation size when
the base page size is 16KB. In particular, the reservation size was
less than the L2 Block size, making L2 promotions and mappings all but
impossible.