Notable upstream pull request merges:
#14925 Another set of vdev queue optimizations
#14964 Use big transactions for small recordsize writes
#14999 ZIL: Fix another use-after-free
Mark Johnston [Wed, 28 Jun 2023 20:13:37 +0000 (16:13 -0400)]
arm64: Make register definitions const
No functional change intended.
Reviewed by: andrew
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Sponsored by: Klara, Inc. (hardware)
Differential Revision: https://reviews.freebsd.org/D40502
Mark Johnston [Wed, 28 Jun 2023 20:12:47 +0000 (16:12 -0400)]
arm64: Add a masked get_kernel_reg()
This lets consumers fetch the value of a system register and apply a
mask over individual fields. That is, each field in the returned value
will be the "smaller" of the two provided by "mask" and the value saved
in kern_cpu_desc. This will be used by vmm to sanitize host system
register fields.
Reviewed by: andrew
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Sponsored by: Klara, Inc. (hardware)
Differential Revision: https://reviews.freebsd.org/D40500
Mark Johnston [Wed, 28 Jun 2023 20:06:37 +0000 (16:06 -0400)]
bhyve: Rename a pci_cfgrw() parameter
pci_cfgrw() may be called via a write to the extended config space,
which is memory-mapped. In this case, the name "eax" is misleading.
Give it a more generic name. No functional change intended.
Mark Johnston [Wed, 28 Jun 2023 20:06:21 +0000 (16:06 -0400)]
bhyve: Stop calling pci_lintr_request() in the NVMe device model
The device model effectively assumes that MSI-X is enabled (it never
asserts the legacy interrupt), so any guest which relies on being able
to use the legacy PCI interrupt will fail.
The WIP arm64 port does not implement legacy PCI interrupts, but NVMe
emulation is potentially useful there. Simply remove the call.
Ed Maste [Wed, 28 Jun 2023 18:50:49 +0000 (14:50 -0400)]
login.conf: document how to specify env var values with commas
As of f32db406504e quotes may be used to specify login class
capabilities that include commas. This is true in general but is
particularly relevant for setenv, a comma-separated list of environment
variables and values, so mention it there.
John Baldwin [Wed, 28 Jun 2023 18:12:50 +0000 (11:12 -0700)]
camdd: Remove some dead code but also make -E functional.
- Pass down the top-level arglist from main down to camdd_rw and use
it in place of the hardcoded CAMDD_ARG_ERR_RECOVER when calling
camdd_probe_pass. This now disables error recovery by default
unless -E is specified.
- Use the return value of parse_btl to determine if an explicit LUN
was specified.
- Remove most CAMDD_ARG_* flags that are only set and never checked.
CAMDD_ARG_VERBOSE remains, but could perhaps be removed (and possibly
-v should be removed as well since it is currently a no-op).
Reported by: clang -Wunused-but-set-variable (arglist in main)
Differential Revision: https://reviews.freebsd.org/D40666
John Baldwin [Wed, 28 Jun 2023 18:11:00 +0000 (11:11 -0700)]
bsdinstall: Handle errors from geom_gettree.
geom_gettree probably never fails, and if it does there isn't much of
a fallback other than aborting partitioning. However, a few places
were checking the return value and not doing anything with it
triggering a unused-but-set-variable warning. Checking the errors
resolves the warning.
While here, check for errors in other places that weren't checking for
them at all, remove a spurious double call (the second call overwrote
the mesh structure leaking all the pointers from the first), and close
a few resource leaks on error paths.
John Baldwin [Wed, 28 Jun 2023 18:10:05 +0000 (11:10 -0700)]
mrsas: Use mrsas_sge64 instead of iovec for the S/G list for passthru.
The DMA scatter/gather list for mrsas passthrough ioctl commands is
stored in a SGL at the end of the DCMD frame. Given the SGL member at
the end of the DCMD frame it seems likely this S/G list is formatted
as a fixed-width structure such as the type mrsas_sge64 and not as a
iovec which contains a kernel pointer and length that vary with the
native architecture size.
Andrew Turner [Tue, 27 Jun 2023 08:32:12 +0000 (09:32 +0100)]
Continue searching for an irq map from the start
When searching for a free irq map location continue the search from the
beginning of the list. There may be holes in the map before
irq_map_first_free_idx, e.g. when removing an entries in order will
increase the index past the current free entry.
PR: 271990
Reviewed by: mhorne
Sponsored by: Arm Ltd
Randall Stewart [Wed, 28 Jun 2023 15:18:47 +0000 (11:18 -0400)]
tcp: With the right options in the kernel cc_cubic stays in slowstart always.
So this is a subtle bug I have found in cubic. If you compile a number of options into the kernel (accounting and tracking) then
all of the kernel code "knows" the proper offset of t_rttupdatecnt however cubic does not see the options (and htcp too) and thus
will look for the count in the wrong place seeing 0. Which then means it never builds an MIN RTT which then means it
always hangs in slowstart.
The solution is to put all options that effect tcpcb size into the opt_global.h so that the kernel
has a consistent view of the tcpcb size.
Alexander Motin [Wed, 28 Jun 2023 00:03:37 +0000 (20:03 -0400)]
ZIL: Fix another use-after-free.
lwb->lwb_issued_txg can not be accessed after lwb_state is set to
LWB_STATE_FLUSH_DONE and zl_lock is dropped, since the lwb may be
freed by zil_sync(). We must save the txg number before that.
This is similar to the 55b1842f92, but as I see the bug is not new.
It existed for quite a while, just was not triggered due to smaller
race window.
Reviewed-by: Allan Jude <allan@klarasystems.com> Reviewed-by: Brian Atkinson <batkinson@lanl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #14988
Closes #14999
Alexander Motin [Wed, 28 Jun 2023 00:00:30 +0000 (20:00 -0400)]
Use big transactions for small recordsize writes.
When ZFS appends files in chunks bigger than recordsize, it borrows
buffer from ARC and fills it before opening transaction. This
supposed to help in case of page faults to not hold transaction open
indefinitely. The problem appears when recordsize is set lower than
default 128KB. Since each block is committed in separate transaction,
per-transaction overhead becomes significant, and what is even worse,
active use of of per-dataset and per-pool locks to protect space use
accounting for each transaction badly hurts the code SMP scalability.
The same transaction size limitation applies in case of file rewrite,
but without even excuse of buffer borrowing.
To address the issue, disable the borrowing mechanism if recordsize
is smaller than default and the write request is 4x bigger than it.
In such case writes up to 32MB are executed in single transaction,
that dramatically reduces overhead and lock contention. Since the
borrowing mechanism is not used for file rewrites, and it was never
used by zvols, which seem to work fine, I don't think this change
should create significant problems, partially because in addition to
the borrowing mechanism there are also used pre-faults.
My tests with 4/8 threads writing several files same time on datasets
with 32KB recordsize in 1MB requests show reduction of CPU usage by
the user threads by 25-35%. I would measure it in GB/s, but at that
block size we are now limited by the lock contention of single write
issue taskqueue, which is a separate problem we are going to work on.
Reviewed-by: Brian Atkinson <batkinson@lanl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #14964
Jessica Clarke [Tue, 27 Jun 2023 23:10:49 +0000 (00:10 +0100)]
Generalise libcompat to be a list rather than a single option
Whilst the kernel can support any number of COMPAT_FOO, world can only
build a single libfoo. Upstream this isn't such an issue, since the only
option is lib32 anyway, but downstreams, such as CheriBSD, may wish to
support multiple at the same time. Thus, adjust the top-level Makefiles
to turn _LIBCOMPAT into a _LIBCOMPATS list that gets iterated over, and
adjust bsd.compat.mk to support this use-case.
For the normal NEED_COMPAT/WANT_COMPAT case, LIBCOMPATFOO remain set and
refer to the requested compat's, preserving the current interface. For
the top-level Makefiles those variables are no longer set (since there
is no longer "the" compat) and only the per-compat ones are available.
Jessica Clarke [Tue, 27 Jun 2023 23:10:30 +0000 (00:10 +0100)]
Makefile.libcompat: Handle MK_FOO varying for native and compat arches
Currently Makefile.libcompat queries a few MK_FOO variables to determine
what is being built. However, it is plausible (and indeed, downstream
in CheriBSD, this is the case) that these may vary between the native
and the compat architecture. In order to correctly determine their
values for the compat architecture, we need to defer their evaluation
until we are in the compat sub-make where src.opts.mk will give us the
right value for the compat MACHINE_ARCH.
Alfonso Gregory [Tue, 27 Jun 2023 22:18:55 +0000 (16:18 -0600)]
ifconfig: skip calling fnmatch once the result no longer matters
Because fnmatch has no side effects, we can safely avoid calling fnmatch
if the end result does not matter anyway (the compiler cannot see this,
so it calls fnmatch in the event it has side-effects).
o optimize string matching for ':M' and ':N'
o warn about malformed patterns in ':M', ':N' and '.if make(...)'
o allow guards to be targets as well as variables
The guard targets may include variable references like
__${.PARSEDIR:tA}/${.PARSEFILE}__
o optimization for makefiles protected from multiple-inclusion
skip even opening the file after first include.
o var.c: do not allow delete of readOnly variable
o parse.c: .break takes no args
Doug Moore [Tue, 27 Jun 2023 17:21:11 +0000 (12:21 -0500)]
radix_trie: pass fewer params to node_get
Let node_get calculate it's own owner value. Don't pass the count
parameter, since it's always 2. Save 16 bytes in insert(). Move,
without modifying, slot and trimkey to handle use-before-declaration
problem.
Reviewed by: markj
Differential Revision: https://reviews.freebsd.org/D40723
John Baldwin [Tue, 27 Jun 2023 17:19:32 +0000 (10:19 -0700)]
xo: Disable -Wunused-but-set-variable.
Presumably these warnings will be fixed upstream at some point.
contrib/libxo/xo/xo.c:99:9: error: variable 'hflag' set but not used [-Werror,-Wunused-but-set-variable]
int hflag = 0, jflag = 0, tflag = 0,
^
contrib/libxo/xo/xo.c:99:20: error: variable 'jflag' set but not used [-Werror,-Wunused-but-set-variable]
int hflag = 0, jflag = 0, tflag = 0,
^
contrib/libxo/xo/xo.c:99:31: error: variable 'tflag' set but not used [-Werror,-Wunused-but-set-variable]
int hflag = 0, jflag = 0, tflag = 0,
^
contrib/libxo/xo/xo.c:100:2: error: variable 'zflag' set but not used [-Werror,-Wunused-but-set-variable]
zflag = 0, qflag = 0, star1 = 0, star2 = 0;
^
contrib/libxo/xo/xo.c:100:13: error: variable 'qflag' set but not used [-Werror,-Wunused-but-set-variable]
zflag = 0, qflag = 0, star1 = 0, star2 = 0;
^
John Baldwin [Tue, 27 Jun 2023 17:19:32 +0000 (10:19 -0700)]
bsdinstall: Replace correct, but fragile, string builder with open_memstream.
The old one triggered a false positive -Warray-bounds from GCC (the
compiler assumed len was always 0), but it was also fragile with
manually computed lengths paired with strcat vs using a string
builder.
Warner Losh [Tue, 27 Jun 2023 17:11:03 +0000 (11:11 -0600)]
openzfs: Remove broken symlinks
These symlinks are broken. They point to files that don't exist, and
rely on having them be built using a 'in-tree' build that FreeBSD
doesn't use. They also show up as errors on grep -r. Since they are
broken and can't possibly work, remove them for now since non-functional
symlinks matching an upstrem repo that can't work in our repo aren't
worth the error messages.
Though one could argue the whole debian directory should be removed, I
did the minimal change necessary. These can return when the fundamental
issue is fixed and the contrib/debian tree is useful on FreeBSD systems.
Doug Moore [Tue, 27 Jun 2023 16:59:12 +0000 (11:59 -0500)]
radix_trie: clean up overlong lines
This is purely a cosmetic change. vm_radix.c has lines that reach past
column 80 and this change cleans that up. The associated changes to
subr_pctrie.c are just to keep mirroring vm_radix.c.
Reviewed by: markj
Differential Revision: https://reviews.freebsd.org/D40764
Alexander Motin [Tue, 27 Jun 2023 16:09:48 +0000 (12:09 -0400)]
Another set of vdev queue optimizations.
Switch FIFO queues (SYNC/TRIM) and active queue of vdev queue from
time-sorted AVL-trees to simple lists. AVL-trees are too expensive
for such a simple task. To change I/O priority without searching
through the trees, add io_queue_state field to struct zio.
To not check number of queued I/Os for each priority add vq_cqueued
bitmap to struct vdev_queue. Update it when adding/removing I/Os.
Make vq_cactive a separate array instead of struct vdev_queue_class
member. Together those allow to avoid lots of cache misses when
looking for work in vdev_queue_class_to_issue().
Introduce deadline of ~0.5s for LBA-sorted queues. Before this I
saw some I/Os waiting in a queue for up to 8 seconds and possibly
more due to starvation. With this change I no longer see it. I
had to slightly more complicate the comparison function, but since
it uses all the same cache lines the difference is minimal. For a
sequential I/Os the new code in vdev_queue_io_to_issue() actually
often uses more simple avl_first(), falling back to avl_find() and
avl_nearest() only when needed.
Arrange members in struct zio to access only one cache line when
searching through vdev queues. While there, remove io_alloc_node,
reusing the io_queue_node instead. Those two are never used same
time.
Remove zfs_vdev_aggregate_trim parameter. It was disabled for 4
years since implemented, while still wasted time maintaining the
offset-sorted tree of TRIM requests. Just remove the tree.
Remove locking from txg_all_lists_empty(). It is racy by design,
while 2 pair of locks/unlocks take noticeable time under the vdev
queue lock.
With these changes in my tests with volblocksize=4KB I measure vdev
queue lock spin time reduction by 50% on read and 75% on write.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #14925
Greg Becker [Tue, 27 Jun 2023 01:08:29 +0000 (20:08 -0500)]
libpthread: allocate rwlocks and spinlocks in dedicated cachelines
Reduces severe performance degradation due to false-sharing. Note that this
does not account for hardware which can perform adjacent cacheline prefetch.
[mjg: massaged the commit message and the patch to use aligned_alloc
instead of malloc]
Instead of using VV_READLINK vnode flag and checking it in one place,
just assign VLNK type to the Fdesc vnodes for linrdlnk mounts. Then all
places where symlinks needs to be followed, e.g. lookup(), are handled.
PR: 272127
Reported by: Peter Eriksson <pen@lysator.liu.se>
Reviewed by: markj
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D40700
Doug Moore [Tue, 27 Jun 2023 05:42:41 +0000 (00:42 -0500)]
radix_trie: skip compare in lookup_le, lookup_ge
In _lookup_ge, where a loop "looks for an available edge or val within
the current bisection node" (to quote the code comment), the value of
index has already been modified to guarantee that it is the least
value than can be found in the non-NULL child node being
examined. Therefore, if the non-NULL child is a leaf, there's no need
to compare 'index' to anything, and the value can just be returned.
The same is true for _lookup_le with 'most' replacing 'least'.
Reviewed by: alc
Tested by: pho
Differential Revision: https://reviews.freebsd.org/D40746
Alan Cox [Fri, 23 Jun 2023 17:00:32 +0000 (12:00 -0500)]
vm: Fix anonymous memory clustering under ASLR
By default, our ASLR implementation is supposed to cluster anonymous
memory allocations, unless the application's mmap(..., MAP_ANON, ...)
call included a non-zero address hint. Unfortunately, clustering
never occurred because kern_mmap() always replaced the given address
hint when it was zero. So, the ASLR implementation always believed
that a non-zero hint had been provided and randomized the mapping's
location in the address space. To fix this problem, I'm pushing down
the point at which we convert a hint of zero to the minimum allocatable
address from kern_mmap() to vm_map_find_min().
John Baldwin [Tue, 27 Jun 2023 03:33:25 +0000 (20:33 -0700)]
cam_xpt: Properly fail if a sim uses an unsupported transport.
The default xport ops for a new bus is xport_default, not NULL, so
check for that when determining if a bus failed to find a suitable
transport. In addition, the path needs to be freed with xpt_free_path
instead of a plain free so that the path's reference on the sim is
dropped; otherwise, cam_sim_free in the caller after xpt_bus_register
returns failure will hang forever.
Note that we have to exempt the xpt bus from this check as it uses
xport_default on purpose.
John Baldwin [Tue, 27 Jun 2023 03:32:29 +0000 (20:32 -0700)]
nvme: Tidy up transfer rate settings in XPT_GET_TRAN_SETTINGS.
- Replace a magic number with CTS_NVME_VALID_SPEC.
- Set the transport and protocol versions the same as for XPT_PATH_INQ.
Probably we shouldn't bother with setting the version in the 'spec'
member of ccb_trans_settings_nvme at all and use the transport
and/or protocol version field instead.
Rich Ercolani [Mon, 26 Jun 2023 20:57:12 +0000 (16:57 -0400)]
Add a delay to tearing down threads.
It's been observed that in certain workloads (zvol-related being a
big one), ZFS will end up spending a large amount of time spinning
up taskqs only to tear them down again almost immediately, then
spin them up again...
I noticed this when I looked at what my mostly-idle system was doing
and wondered how on earth taskq creation/destroy was a bunch of time...
So I added a configurable delay to avoid it tearing down tasks the
first time it notices them idle, and the total number of threads at
steady state went up, but the amount of time being burned just
tearing down/turning up new ones almost vanished.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Rich Ercolani <rincebrain@gmail.com>
Closes #14938
Andrew Fengler [Mon, 26 Jun 2023 18:40:21 +0000 (18:40 +0000)]
rc.d/routing: Correct setting default gateway for each FIB
There was a mistake in the previous commit, it used the incorrect
spelling of the FIB variable name and was not functional
Also corrects an issue with the IPv6 default route variable name.
Fixes: 30659d1dcbcc ("Add support for adding default routes for other FIBs")
Sponsored-by: ScaleEngine Inc.
Differential Revision: https://reviews.freebsd.org/D37685
The ichsmb driver does not actually handle SMBALERT, other than by logging the
first 16 occurences of the ICH_HST_STA_SMBALERT_STS_SMBALERT status
flag. Because the SMBALERT is not acknowledged by the host, clearing it in the
host status register does not appear to work as long as some slave device is
pulling the SMBALERT line low, at least for C2000 chips. As a result, if a slave
device does pull SMBALERT low the interrupt handler will always loop its maximum
of 16 times attempting to clear all status register flags and device_printf the
status register. The result is the kernel message buffer is littered with these
device_printfs at every interrupt.
To remedy the problem, the ICH_HST_STA_SMBALERT_STS flag is zeroed in the read
host status register value, just as with ICH_HST_STA_INUSE_STS and
ICH_HST_STA_HOST_BUSY. This allows the loop to break when no other flags that
must be handled are set in the host status register. Additionally, because the
SMBALERT is not actually handled the SMBALERT logging is omitted as it has no
actual function at this time.
Ed Maste [Tue, 19 Jul 2022 19:36:09 +0000 (15:36 -0400)]
Assemble .s to .o using cc, not as
As of commit fd71da37d478 we no longer have an `as` in the default
toolchain. Although we do not make use of this rule in the base system
some ports or downstream projects might. Use `cc -x assembler` instead
of as.
Reviewed by: arichardson
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D35859
to make __cxa_thread_call_dtors() operational for statically linked
binaries.
Noted by: andrew
Reviewed by: emaste, dim
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D40748
Doug Moore [Sun, 25 Jun 2023 17:49:15 +0000 (12:49 -0500)]
radix_trie: simplify trimkey functions
Replacing a branch and two shifts with a single masking operation saves 64 bytes the pair of functions lookup_le and lookup_ge on amd64. Refresh the associated comments.
Reviewed by: alc
Differential Revision: https://reviews.freebsd.org/D40722
Ed Maste [Sat, 24 Jun 2023 17:52:08 +0000 (13:52 -0400)]
libcrypto: build nistp* on all little-endian 64-bit targets
libcrypto intends to provide these routines on little-endian 64-bit
targets. This was previously done by including them in the ASM_aarch64
and ASM_amd64 blocks in the Makefile, but this excluded powerpc64le and
riscv64.
Reported by: ci.freebsd.org
Reviewed by: jrtc27
Fixes: b077aed33b7b ("Merge OpenSSL 3.0.9")
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D40749
Enji Cooper [Thu, 22 Jun 2023 03:53:54 +0000 (20:53 -0700)]
libfetch: remove all old OpenSSL support
This change removes pre-OpenSSL 1.1 supporting code and removes/adjusted
preprocessor conditionals which were tautilogically true as FreeBSD main
has shipped with OpenSSL 1.1+ for some time.
Doug Moore [Fri, 23 Jun 2023 23:47:23 +0000 (18:47 -0500)]
radix_trie: avoid reloading radix node
In the vm_radix:remove loop that searches for the last child, load
that child once, without loading it again after the search is over.
Change KASSERTS from index check to NULL node check.
Reviewed by: alc
Differential Revision: https://reviews.freebsd.org/D40721