Notable upstream pull request merges:
#15018 Increase limit of redaction list by using spill block
#15161 Make zoned/jailed zfsprops(7) make more sense
#15216 Relax error reporting in zpool import and zpool split
#15218 Selectable block allocators
#15227 ZIL: Tune some assertions
#15228 ZIL: Revert zl_lock scope reduction
#15233 ZIL: Change ZIOs issue order
ZFS historically has had several space allocators that were
dynamically selectable. While these have been retained in
OpenZFS, only a single allocator has been statically compiled
in. This patch compiles all allocators for OpenZFS and provides
a module parameter to allow for manual selection between them.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Ameer Hamza <ahamza@ixsystems.com>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Edmund Nadolski <edmund.nadolski@ixsystems.com>
Closes #15218
Relax error reporting in zpool import and zpool split
For zpool import and zpool split, zpool_enable_datasets is called
to mount and share all datasets in a pool. If there is an error
while mounting or sharing any dataset in the pool, the status of
import or split is reported as failure. However, the changes do
show up in zpool list.
This commit updates the error reporting in the zpool import and zpool
split paths. More descriptive messages are shown to the user in case
there is an error during mount or share. Errors in mount or share
do not affect the overall status of zpool import and zpool split.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Umer Saleem <usaleem@ixsystems.com>
Closes #15216
Andrea Righi [Sat, 2 Sep 2023 00:21:40 +0000 (02:21 +0200)]
Linux 6.5 compat: safe cleanup in spl_proc_fini()
If we fail to create a proc entry in spl_proc_init() we may end up
calling unregister_sysctl_table() twice: once in the failure path of
spl_proc_init() and again during spl_proc_fini().
Avoid the double call to unregister_sysctl_table() and while at it
refactor the code a bit to reduce code duplication.
This was accidentally introduced when the spl code was
updated for Linux 6.5 compatibility.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Ameer Hamza <ahamza@ixsystems.com>
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
Closes #15234
Closes #15235
Alexander Motin [Sat, 2 Sep 2023 00:14:50 +0000 (20:14 -0400)]
ZIL: Change ZIOs issue order.
In zil_lwb_write_issue(), after issuing lwb_root_zio/lwb_write_zio,
we have no right to access lwb->lwb_child_zio. If it was not there,
the first two ZIOs may have already completed and freed the lwb.
Issuing the ZIOs in the opposite order, from children to parent, should keep
the lwb valid till the end, since the lwb can be freed only after
the lwb_root_zio completion callback.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #15233
Alexander Motin [Sat, 2 Sep 2023 00:13:52 +0000 (20:13 -0400)]
ZIL: Revert zl_lock scope reduction.
While I have no reports of it, I suspect possible use-after-free
scenario when zil_commit_waiter() tries to dereference zcw_lwb
for lwb already freed by zil_sync(), while zcw_done is not set.
Extension of zl_lock scope as it was originally should block
zil_sync() from freeing the lwb, closing this race.
This reverts #14959 and a couple of chunks of #14841.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #15228
Brooks Davis [Fri, 1 Sep 2023 16:42:52 +0000 (17:42 +0100)]
Add INIT_ALL build option
This option replaces WITH_INIT_ALL_PATTERN and WITH_INIT_ALL_ZERO with
INIT_ALL=pattern and INIT_ALL=zero respectively. As these are
relatively rarely used options no backwards compatibility is
implemented.
Brooks Davis [Fri, 1 Sep 2023 16:42:39 +0000 (17:42 +0100)]
libc: add LIBC_MALLOC option
This will enable alternative mallocs to be included in the tree and
selected by setting LIBC_MALLOC. As there is only one today (jemalloc)
this option does nothing, but we expect to add other implementations
in the future. This will also reduce diffs to CheriBSD.
Brooks Davis [Fri, 1 Sep 2023 16:41:59 +0000 (17:41 +0100)]
makeman: add minimal support for group options
Ignore OPT_* values in showconfig output in existing code paths and add
a new path to include descriptions for each. For now, hardcode the
description contents rather than attempting to generate them. This runs
the risk of the docs getting out of date, but limits the amount of new shell
code added today while a Lua rewrite is nearly ready to land.
This change requires a followup commit to enable OPT_* values in
"make showconfig" in order to actually find group options.
Brooks Davis [Fri, 1 Sep 2023 16:41:07 +0000 (17:41 +0100)]
share/mk: support for "single" group options
Support group options where 1 of n values will be selected (or a default
value will be used). After processing, an OPT_FOO will be set to one
value from __FOO_OPTIONS for each FOO in __SINGLE_OPTIONS. If the user
sets FOO that value will be used, otherwise __FOO_DEFAULT will be used.
Options that don't work on a particular system can be remapped to an
alternative using BROKEN_SINGLE_OPTIONS which can be set to a list of
3-tuples of the form:
OPTION broken_value replacement_value
This is somewhat inspired by OPTIONS_SINGLE from ports, but the
structure is quite different with a per-option variable in the style of
MK_FOO={yes,no}.
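The selection rule described above can be sketched in C to make the order of the steps explicit. This is only an illustration, not the share/mk implementation: the function name, the struct, and the choice to return NULL for a value outside the allowed list are all assumptions, since the commit message does not say how an invalid user value is handled.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* One BROKEN_SINGLE_OPTIONS-style tuple: OPTION broken_value replacement_value. */
struct sketch_broken_tuple {
    const char *option;
    const char *broken_value;
    const char *replacement_value;
};

/*
 * Hypothetical helper: compute the value that OPT_FOO would end up with.
 * user_value may be NULL (the user did not set FOO); valid[] plays the role
 * of __FOO_OPTIONS and def the role of __FOO_DEFAULT.
 * Returns NULL if the user supplied a value that is not in valid[]
 * (an assumption of this sketch).
 */
const char *
sketch_select_single_option(const char *option, const char *user_value,
    const char *const *valid, size_t nvalid, const char *def,
    const struct sketch_broken_tuple *broken, size_t nbroken)
{
    const char *chosen = def;
    size_t i;

    /* Step 1: a user-supplied value wins over the default. */
    if (user_value != NULL) {
        chosen = NULL;
        for (i = 0; i < nvalid; i++) {
            if (strcmp(user_value, valid[i]) == 0) {
                chosen = valid[i];
                break;
            }
        }
        if (chosen == NULL)
            return (NULL);
    }

    /* Step 2: remap values that are marked broken on this system. */
    for (i = 0; i < nbroken; i++) {
        if (strcmp(broken[i].option, option) == 0 &&
            strcmp(broken[i].broken_value, chosen) == 0)
            return (broken[i].replacement_value);
    }
    return (chosen);
}
```

Keeping the remap as a separate second step means a broken value is replaced whether it came from the user or from the default.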
Zachary Leaf [Thu, 31 Aug 2023 13:11:53 +0000 (14:11 +0100)]
armv8_crypto: fix recursive fpu_kern_enter call
Now that armv8_crypto is using FPU_KERN_NOCTX, a kernel panic occurs
in armv8_crypto.c:armv8_crypto_cipher_setup:
panic: recursive fpu_kern_enter while in PCB_FP_NOSAVE state
This is because in armv8_crypto.c:armv8_crypto_cipher_process,
directly after calling fpu_kern_enter() a call is made to
armv8_crypto_cipher_setup(), resulting in nested calls to
fpu_kern_enter() without the required fpu_kern_leave() in between.
Move fpu_kern_enter() in armv8_crypto_cipher_process() after the
call to armv8_crypto_cipher_setup() to resolve this.
Reviewed by: markj, andrew
Fixes: 6485286f536f ("armv8_crypto: Switch to using FPU_KERN_NOCTX")
Sponsored by: Arm Ltd
Differential Revision: https://reviews.freebsd.org/D41671
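The ordering problem in the armv8_crypto fix can be modelled in userspace. The sketch below uses mock functions in place of fpu_kern_enter() and fpu_kern_leave(); every name in it is a stand-in, and the mock counts a nested enter instead of panicking.

```c
#include <assert.h>
#include <stdbool.h>

/* Stand-ins for the kernel FPU section state; not the real kernel API. */
static bool sketch_fpu_in_section;
static int sketch_fpu_recursive_enters;

void
sketch_fpu_enter(void)
{
    if (sketch_fpu_in_section)
        sketch_fpu_recursive_enters++;  /* the kernel would panic here */
    sketch_fpu_in_section = true;
}

void
sketch_fpu_leave(void)
{
    sketch_fpu_in_section = false;
}

/* Setup uses the FPU itself, so it opens and closes its own section. */
void
sketch_cipher_setup(void)
{
    sketch_fpu_enter();
    /* ... key schedule work would go here ... */
    sketch_fpu_leave();
}

/* Ordering before the fix: setup runs inside the caller's section. */
void
sketch_process_before_fix(void)
{
    sketch_fpu_enter();
    sketch_cipher_setup();      /* nested enter */
    /* ... cipher work ... */
    sketch_fpu_leave();
}

/* Ordering after the fix: setup completes before the section is opened. */
void
sketch_process_after_fix(void)
{
    sketch_cipher_setup();
    sketch_fpu_enter();
    /* ... cipher work ... */
    sketch_fpu_leave();
}
```

Calling sketch_process_before_fix() records one recursive enter, while sketch_process_after_fix() records none.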
Andrew Turner [Tue, 22 Aug 2023 16:01:21 +0000 (17:01 +0100)]
gicv3: Support indirect ITS tables
The GICv3 ITS device supports two options for device tables. Currently
we support a single table to hold all device IDs, however when the
device ID space grows large this can be too large for the GITS_BASER
register to describe.
To handle this case, and to reduce the memory needed when this space
is sparse, support the second option, the indirect table. The indirect
table is a 2 level table where the first level contains the physical
address of the second with a valid bit. The second level is an ITS
page sized table where each entry is the original entry size.
As we don't need to allocate a second level table for device IDs that
don't exist, this can reduce the allocation size.
Reviewed by: gallatin
Sponsored by: Arm Ltd
Differential Revision: https://reviews.freebsd.org/D41555
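The two-level lookup described in the gicv3 commit comes down to a division and a remainder. The following sketch shows that arithmetic only; the function and struct names are hypothetical, and the page and entry sizes are parameters rather than values taken from the driver.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Position of a device ID inside a two-level (indirect) table. */
struct sketch_its_index {
    size_t l1;  /* first-level slot holding the second-level page address */
    size_t l2;  /* entry within that second-level page */
};

/*
 * Hypothetical helper: each second-level table is one ITS page, so it holds
 * page_size / entry_size device entries.
 */
struct sketch_its_index
sketch_its_indirect_index(uint32_t devid, size_t page_size, size_t entry_size)
{
    struct sketch_its_index idx;
    size_t per_page;

    assert(entry_size != 0 && page_size >= entry_size);
    per_page = page_size / entry_size;
    idx.l1 = devid / per_page;
    idx.l2 = devid % per_page;
    return (idx);
}
```

For example, assuming a 64 KiB ITS page and 8-byte entries, each second-level page covers 8192 device IDs, so device ID 8193 lands in first-level slot 1, entry 1. Only the first-level slots that contain an existing device need a second-level page allocated.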
Andrew Turner [Wed, 23 Aug 2023 12:34:09 +0000 (13:34 +0100)]
arm: Add a userspace physical timer check
We currently use the same Arm generic timer in both userspace and the
kernel. As we always enable userspace access to the virtual timer we
can tell userspace to use it.
Reviewed by: imp
Sponsored by: Arm Ltd
Differential Revision: https://reviews.freebsd.org/D41565
jail: Add the ability to access system-level filesystem extended attributes
Prior to this commit, privileged accounts in a jail could not access the
filesystem extended attributes in the system namespace. To control access to
the system namespace on a per-jail basis, add a new configuration parameter,
allow.extattr, which is off by default.
linux(4): Return ENOTSUP from xattr syscalls instead of EPERM
FreeBSD does not permit unprivileged accounts to manipulate extended
attributes in the system namespace, even if the account has appropriate
privileges to access the filesystem object.
In Linux the system namespace is used to preserve POSIX ACLs. Some GNU
coreutils binaries use POSIX ACLs, e.g., install and ls, and fail if we
unexpectedly return an EPERM error from xattr system calls.
On the other hand, in Linux read and write access to the system
namespace depends on the policy implemented by each filesystem, so we
mimic a filesystem that prohibits this for unprivileged accounts.
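The error translation in the linux(4) change can be pictured as a small mapping function. This is a sketch under stated assumptions, not the linux(4) code: the function name is hypothetical, and limiting the remap to the system namespace is an assumption drawn from the commit message.

```c
#include <assert.h>
#include <errno.h>

/*
 * Hypothetical helper: translate the native result of an xattr operation
 * into the error reported to a Linux binary. A permission failure in the
 * system namespace is reported as "not supported", which is the answer a
 * filesystem without that capability would give.
 */
int
sketch_linux_xattr_error(int native_error, int is_system_namespace)
{
    if (is_system_namespace && native_error == EPERM)
        return (ENOTSUP);
    return (native_error);
}
```

All other errors pass through unchanged, so a missing file is still reported as ENOENT.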
arm64: initialize pcb in the TBI/PAC/etc. fault case
After 2c10be9e06d, we may jump to the bad_far label without `pcb` being
set, resulting in a follow-up fault as we may dereference it immediately
after the jump if td_intr_nesting_level == 0. In this branch, it should
be safe to dereference `td` as we're not handling the special case
mentioned below of accessing it during promotion/demotion.
This seems to fix a null ptr deref I hit during my most recent pkgbase
build attempt on the Windows DevKit, though that was admittedly
encountered while we were on the way to a panic from an apparent
use-after-free in ZFS bits.
dmu_buf_will_clone: change assertion to fix 32-bit compiler warning
Building module/zfs/dbuf.c for 32-bit targets can result in a warning:
In file included from /usr/src/sys/contrib/openzfs/include/sys/zfs_context.h:97,
                 from /usr/src/sys/contrib/openzfs/module/zfs/dbuf.c:32:
/usr/src/sys/contrib/openzfs/module/zfs/dbuf.c: In function 'dmu_buf_will_clone':
/usr/src/sys/contrib/openzfs/lib/libspl/include/assert.h:116:33: error: cast from pointer to integer of different size [-Werror=pointer-to-int-cast]
  116 | const uint64_t __left = (uint64_t)(LEFT); \
      |                         ^
/usr/src/sys/contrib/openzfs/lib/libspl/include/assert.h:148:25: note: in expansion of macro 'VERIFY0'
  148 | #define ASSERT0 VERIFY0
      |                 ^~~~~~~
/usr/src/sys/contrib/openzfs/module/zfs/dbuf.c:2704:9: note: in expansion of macro 'ASSERT0'
 2704 | ASSERT0(dbuf_find_dirty_eq(db, tx->tx_txg));
      | ^~~~~~~
This is because dbuf_find_dirty_eq() returns a pointer, which if
pointers are 32-bit results in a warning about the cast to uint64_t.
Instead, use the ASSERT3P() macro, with == and NULL as second and third
arguments, which should work regardless of the target's bitness.
Reviewed-by: Kay Pedersen <mail@mkwg.de>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Brian Atkinson <batkinson@lanl.gov>
Signed-off-by: Dimitry Andric <dimitry@andric.com>
Closes #15224
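The difference between the two assertion styles in the dmu_buf_will_clone change can be shown with a simplified macro. The macro and function below are stand-ins written for this note, not the OpenZFS definitions, which also report the file, line and operand values.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Simplified stand-in for a pointer-comparison assertion. Both operands go
 * through uintptr_t, which matches the pointer width on common 32-bit and
 * 64-bit targets, so no pointer is cast to a wider integer type.
 */
#define SKETCH_ASSERT3P(left, op, right) do {                   \
    const uintptr_t sketch_left_ = (uintptr_t)(left);           \
    const uintptr_t sketch_right_ = (uintptr_t)(right);         \
    assert(sketch_left_ op sketch_right_);                      \
} while (0)

/* Hypothetical caller: insist that no dirty record was found. */
int
sketch_check_no_dirty_record(const void *dirty_record)
{
    SKETCH_ASSERT3P(dirty_record, ==, NULL);
    return (1);
}
```

A macro that instead stores its argument in a uint64_t forces a pointer-to-wider-integer cast when pointers are 32 bits wide, which is what the compiler output above complains about.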
Kristof Provost [Tue, 29 Aug 2023 09:33:17 +0000 (11:33 +0200)]
mcast: fix memory leak in imf_purge()
The IGMP code buffers packets in the imf_inm->inm_scq mbufq, but does
not clear this queue when struct in_mfilter is freed by imf_purge().
This can cause memory leaks if IGMPv3 is used.
Kristof Provost [Tue, 29 Aug 2023 15:16:19 +0000 (17:16 +0200)]
snmp_pf: use libpfctl's pfctl_get_status() rather than DIOCGETSTATUS
Prefer libpfctl functions over direct access to the ioctl whenever
possible. This will allow subsequent removal of DIOCGETSTATUS (in 15) as
there already is an nvlist-based alternative.
Kristof Provost [Tue, 29 Aug 2023 15:04:17 +0000 (17:04 +0200)]
libpfctl: implement status counter accessor functions
The new nvlist-based status call allows us to easily add new counters.
However, the libpfctl interface defines a TAILQ, so it's not quite
trivial to find the counter consumers are interested in.
Provide convenience functions to access the counters.
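A convenience accessor over a TAILQ of counters could look roughly like the sketch below. The struct layout and function name are hypothetical and are not libpfctl's real interface; only the TAILQ macros from sys/queue.h are real.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/queue.h>

/* Hypothetical counter record; libpfctl's own structures may differ. */
struct sketch_counter {
    uint64_t id;
    uint64_t value;
    TAILQ_ENTRY(sketch_counter) entry;
};
TAILQ_HEAD(sketch_counter_list, sketch_counter);

/*
 * Walk the list and return the value of the counter with the given id,
 * or fallback when no such counter is present.
 */
uint64_t
sketch_counter_get(struct sketch_counter_list *list, uint64_t id,
    uint64_t fallback)
{
    struct sketch_counter *c;

    TAILQ_FOREACH(c, list, entry) {
        if (c->id == id)
            return (c->value);
    }
    return (fallback);
}
```

Hiding the list walk behind a lookup function means consumers do not need to know that the counters are stored in a TAILQ.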
Kristof Provost [Tue, 29 Aug 2023 15:00:44 +0000 (17:00 +0200)]
pf (t)ftp-proxy: use libpfctl instead of DIOCGETSTATUS
Prefer libpfctl functions over direct access to the ioctl whenever
possible. This will allow subsequent removal of DIOCGETSTATUS (in 15) as
there already is an nvlist-based alternative.
Kristof Provost [Thu, 31 Aug 2023 07:32:54 +0000 (09:32 +0200)]
vmxnet3: do restart on VLAN changes
At least one user reports issues with vmx interfaces after 725e4008ef,
where we default to not resetting the interface on VLAN changes. This
was on an ESXi 7.0.3 setup.
nss_tacplus: Provide dummy setpwent(), getpwent_r(), endpwent().
These aren't really needed, since TACACS+ does not support enumeration, but providing placeholders keeps nsdispatch() from complaining that they're missing.
Simon J. Gerraty [Wed, 30 Aug 2023 14:46:08 +0000 (07:46 -0700)]
Add sys.dirdeps.mk to share/mk FILES
A few recent makefiles should have been added to FILES.
Rename sys.machine.mk to local.sys.machine.mk as it is very
tree-specific and so does not belong in /usr/share/mk/
Zhenlei Huang [Wed, 30 Aug 2023 09:36:38 +0000 (17:36 +0800)]
net: Remove vlan metadata on pcp / vlan encapsulation
For outbound traffic, the M_VLANTAG flag is set in the mbuf packet header to
indicate that the underlying interface should do hardware VLAN tag insertion
if capable; otherwise the net stack does 802.1Q encapsulation instead.
Commit 868aabb4708d introduced per-flow priority, which sets the priority ID
in the mbuf packet header. There is a corner case: when hardware VLAN tag
insertion is disabled in the driver and the net stack does the 802.1Q
encapsulation, the result is double-tagged packets if the driver does
not check the enabled capability (hardware VLAN tag insertion).
Unfortunately some drivers, currently known to include cxgbe(4), re(4),
ure(4), igc(4) and vmx(4), have this issue. From a quick review of other
interface drivers I believe a lot more drivers have the same issue. It makes
more sense to fix this in the net stack than to try to change every single
driver.
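The double-tagging corner case from the VLAN metadata commit can be modelled with a mock packet. The struct, the flag constant and both functions below are stand-ins for illustration and do not use the real mbuf API.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_M_VLANTAG 0x1u   /* stand-in for the mbuf M_VLANTAG flag */

struct sketch_pkt {
    uint32_t flags;             /* packet header flags */
    uint16_t vtag;              /* VLAN metadata (tag and priority) */
    int tags_in_frame;          /* 802.1Q headers present in the frame */
};

/*
 * Software encapsulation in the stack: add the 802.1Q header, then drop
 * the VLAN metadata so that nothing downstream acts on it a second time.
 */
void
sketch_sw_encap(struct sketch_pkt *p)
{
    p->tags_in_frame++;
    p->flags &= ~SKETCH_M_VLANTAG;
    p->vtag = 0;
}

/* A driver that inserts a tag without checking its enabled capability. */
void
sketch_careless_driver_tx(struct sketch_pkt *p)
{
    if (p->flags & SKETCH_M_VLANTAG)
        p->tags_in_frame++;
}
```

With the metadata cleared by the stack, even a driver that skips the capability check sends a frame with exactly one tag.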
Kristof Provost [Tue, 29 Aug 2023 09:23:49 +0000 (11:23 +0200)]
igmp: do not upgrade IGMP version beyond net.inet.igmp.default_version
IGMP requires hosts to use the lowest version they've seen on the
network. When the IGMP timers expire we take the opportunity to upgrade again.
However, we did not take the net.inet.igmp.default_version sysctl
setting into account, so we could end up switching to IGMPv3 even if the
user had requested IGMPv2 or IGMPv1 via the sysctl.
Check V_igmp_default_version before we upgrade the IGMP version.
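The clamping rule in the igmp fix can be expressed as a small function. This is a sketch, not the kernel code: the function name is hypothetical, and treating IGMPv3 as the upgrade target is an assumption based on the commit message.

```c
#include <assert.h>

#define SKETCH_IGMP_NEWEST 3    /* IGMPv3, the newest version mentioned */

/*
 * Hypothetical helper: the version to run once the older-version timers
 * have expired. default_version plays the role of the
 * net.inet.igmp.default_version sysctl.
 */
int
sketch_igmp_upgrade(int current, int default_version)
{
    int target = SKETCH_IGMP_NEWEST;

    assert(current >= 1 && current <= SKETCH_IGMP_NEWEST);
    assert(default_version >= 1 && default_version <= SKETCH_IGMP_NEWEST);

    /* Never upgrade past the administrator's configured version. */
    if (target > default_version)
        target = default_version;
    /* An upgrade must not move to an older version than the current one. */
    return (target > current ? target : current);
}
```

With default_version set to 2, a host currently running IGMPv1 moves to IGMPv2 and stays there.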
John Baldwin [Tue, 29 Aug 2023 21:39:36 +0000 (14:39 -0700)]
libcrypto: Refactor Makefile.asm so it can be run outside of buildenv
Currently Makefile.asm relies on the current buildenv to set CFLAGS
for i386. The current approach also leaves various temporary *.s
files around in the current directory. To make this a bit better:
- Instead of using CFLAGS from buildenv for i386, define the actual
flags the perl scripts need: -DOPENSSL_IA32_SSE2 to enable SSE2.
- Change i386 to have the perl scripts write to /dev/stdout to avoid
creating temporaries. Previously i386 was generating the temporary
files in the OpenSSL contrib src.
- Clean up temporary *.s files in the all target after generating the
real *.S files for architectures which need them.
Tom Cosgrove [Tue, 29 Aug 2023 21:38:11 +0000 (14:38 -0700)]
OpenSSL: Fix handling of the "0:" label in arm-xlate.pl
When $label == "0", $label is not truthy, so `if ($label)` thinks there isn't
a label. Correct this by looking at the result of the s/// command.
Verified that there are no changes in the .S files created during a normal
build, and that the "0:" labels appear in the translation given in the error
report (and they are the only difference in the before and after output).
Ed Maste [Fri, 18 Aug 2023 03:39:08 +0000 (23:39 -0400)]
x86: Introduce APIC ID limit by default on AMD hardware
Lack of an AMD IOMMU driver means we cannot successfully route
interrupts to APIC IDs 255 and over. Do not add the corresponding CPUs
to the per-domain lists of CPUs to which interrupts can be assigned.
This change (or at least the APIC ID limit) should be reverted once we
have an AMD IOMMU / interrupt remapping driver.
See also commits fa5f94140a83 ("msi: handle error from BUS_REMAP_INTR in
msi_assign_cpu") and 4258eb5a0d97 ("x86: handle domains with no CPUs
usable for intr delivery.").
Reviewed by: markj, jhb
Tested by: cperciva (earlier version)
MFC after: 3 days
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D41618
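The filtering described in the APIC ID limit commit amounts to skipping CPUs above a threshold when building the list of interrupt targets. The sketch below illustrates that idea with a plain array; the names are hypothetical and the per-domain CPU sets used by the kernel are not modelled.

```c
#include <assert.h>
#include <stddef.h>

/* APIC IDs 255 and over cannot be targeted, per the commit message. */
#define SKETCH_APIC_ID_LIMIT 255u

/*
 * Hypothetical helper: copy the APIC IDs that may receive interrupts into
 * out[] (which must have room for n entries) and return how many were kept.
 */
size_t
sketch_intr_cpus(const unsigned *apic_ids, size_t n, unsigned *out)
{
    size_t i, kept = 0;

    for (i = 0; i < n; i++) {
        if (apic_ids[i] < SKETCH_APIC_ID_LIMIT)
            out[kept++] = apic_ids[i];
    }
    return (kept);
}
```

A domain whose CPUs all sit at or above the limit ends up with an empty list, which is the situation the referenced commit 4258eb5a0d97 handles.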
Mark Johnston [Tue, 29 Aug 2023 01:26:53 +0000 (21:26 -0400)]
aesni: Push FPU sections down further
After commit 937b4473be21 aesni_cipher_crypt() and aesni_cipher_mac()
execute in a FPU_KERN_NOCTX section, which means that they must run with
preemption disabled. These functions handle discontiguous I/O buffers
by allocating a contiguous buffer and copying as necessary, but this
allocation cannot happen with preemption disabled. Fix the problem by
pushing the FPU section down into aesni_cipher_crypt() and
aesni_cipher_mac(). In particular, encrypt-then-auth transforms need
not be handled with a single FPU section.
Reported by: syzbot+78258dbb02eb92157357@syzkaller.appspotmail.com
Discussed with: jhb
Fixes: 937b4473be21 ("aesni: Switch to using FPU_KERN_NOCTX.")
Justin Hibbits [Mon, 28 Aug 2023 23:27:11 +0000 (19:27 -0400)]
spibus: Make ofw_spibus probe just a little more favored
With ade70a1ad (svn r332196) ofw_spibus probes at BUS_PROBE_DEFAULT
instead of 0. However, this races with spibus, resulting in ofw_spibus
often losing the race and the OFW node not being referenced. This
in turn causes child device tree nodes to not be attached. Solve this
by returning 1 higher than spibus, just like acpi_spibus.
Sponsored by: Juniper Networks, Inc.
MFC after: 1 week
Jamie Gritton [Mon, 28 Aug 2023 18:22:36 +0000 (11:22 -0700)]
jail: make jail(8) man page more readable and more correct
The synopsis section of jail(8) is fine at showing everything that could
be on the command line, but doesn't make much sense. Add some
subsections for the different uses of the command.
Also fix up the paragraph about command-line parameter specification,
including removing some clearly erroneous information.
Reviewed by: dvl
MFC after: 3 days
Differential Revision: https://reviews.freebsd.org/D41606
This adds formatted input/output of binary integer numbers to the printf(), scanf(), and strtol() families, including their wide-character counterparts.
libzfs uses librt as a dependency. Following 315ee00fa961, systems with
a separate / and /usr will fail to load the libzfs.so library because
librt.so is not available, since /usr is not mounted yet.
Install librt in /lib, making it available to libzfs.
Pietro Cerutti [Thu, 17 Aug 2023 11:42:23 +0000 (11:42 +0000)]
netcat: add --crlf to convert LF into CRLF
This adds the --crlf option to netcat, which triggers translation of \n
characters into \r\n sequences in the input -> network direction.
The Linux version of nc also supports this functionality with --crlf and
-C. The OpenBSD version uses -C to specify client certificates. Our
version is too old and doesn't have it, but I avoided adding -C anyway
to ease future syncs with upstream.
Attempts to upstream the feature were unsuccessful:
https://marc.info/?t=169282068500001
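The translation performed by --crlf can be sketched as a buffer-to-buffer copy. This is an illustration only, not the netcat source; the function name is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper: copy in[0..inlen) to out, writing "\r\n" for every
 * '\n'. out must have room for 2 * inlen bytes. Returns the number of
 * bytes written.
 */
size_t
sketch_lf_to_crlf(const char *in, size_t inlen, char *out)
{
    size_t i, o = 0;

    for (i = 0; i < inlen; i++) {
        if (in[i] == '\n')
            out[o++] = '\r';
        out[o++] = in[i];
    }
    return (o);
}
```

The sketch translates every \n unconditionally, so input that already contains \r\n comes out as \r\r\n; how the real option treats that case is not stated in the commit message.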
Wei Hu [Mon, 28 Aug 2023 09:15:16 +0000 (09:15 +0000)]
mana: batch ringing RX queue doorbell on receiving packets
It's inefficient to ring the doorbell page every time a WQE is posted to
the receive queue. Excessive MMIO writes result in the CPU spending more
time waiting on LOCK instructions (atomic operations), resulting in
poor scaling performance.
Move the code for ringing the doorbell page to after we have posted all
WQEs to the receive queue in mana_poll_rx_cq().
In addition, use the correct WQE count for ringing the RQ doorbell.
The hardware specification specifies that WQE_COUNT should be set to 0 for
the Receive Queue. Although the hardware doesn't currently enforce the
check, future releases may check this value.
Tested by: whu
MFC after: 1 week
Sponsored by: Microsoft
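The effect of batching in the mana change can be seen by counting doorbell writes in a mock receive queue. The struct and functions below are stand-ins written for this note and do not reflect the driver's data structures.

```c
#include <assert.h>

/* Mock receive queue that only counts events. */
struct sketch_rxq {
    int posted;             /* WQEs posted to the receive queue */
    int doorbell_rings;     /* MMIO doorbell writes issued */
};

void
sketch_post_wqe(struct sketch_rxq *q)
{
    q->posted++;
}

void
sketch_ring_doorbell(struct sketch_rxq *q)
{
    q->doorbell_rings++;
}

/* Before: one doorbell write for every posted WQE. */
void
sketch_refill_per_wqe(struct sketch_rxq *q, int n)
{
    int i;

    for (i = 0; i < n; i++) {
        sketch_post_wqe(q);
        sketch_ring_doorbell(q);
    }
}

/* After: post everything first, then ring the doorbell once. */
void
sketch_refill_batched(struct sketch_rxq *q, int n)
{
    int i;

    for (i = 0; i < n; i++)
        sketch_post_wqe(q);
    if (n > 0)
        sketch_ring_doorbell(q);
}
```

For a batch of eight WQEs the per-WQE variant issues eight doorbell writes and the batched variant issues one.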
Notable upstream pull request merges:
#15024 Add missed DMU_PROJECTUSED_OBJECT prefetch
#15029 Do not request data L1 buffers on scan prefetch
#15036 FreeBSD: catch up to __FreeBSD_version 1400093
#15039 Fix raw receive with different indirect block size
#15047 FreeBSD: Fix build on stable/13 after 1302506
#15049 Fix the ZFS checksum error histograms with larger record sizes
#15052 Reduce bloat in ereport.fs.zfs.checksum events
#15056 Avoid extra snprintf() in dsl_deadlist_merge()
#15061 Ignore pool ashift property during vdev attachment
#15063 Don't panic if setting vdev properties is unsupported for this vdev type
#15067 spa_min_alloc should be GCD, not min
#15071 Add explicit prefetches to bpobj_iterate()
#15072 Adjust prefetch parameters
#15076 Refactor dmu_prefetch()
#15079 set autotrim default to 'off' everywhere
#15080 ZIL: Fix config lock deadlock
#15088 metaslab: tuneable to better control force ganging
#15096 Avoid waiting in dmu_sync_late_arrival()
#15097 BRT should return EOPNOTSUPP
#15103 Remove zl_issuer_lock from zil_suspend()
#15107 Remove fastwrite mechanism
#15113 libzfs: sendrecv: send_progress_thread: handle SIGINFO/SIGUSR1
#15122 ZIL: Second attempt to reduce scope of zl_issuer_lock
#15129 zpool_vdev_remove() should handle EALREADY error return
#15132 ZIL: Replay blocks without next block pointer
#15148 zfs_clone_range should return descriptive error codes
#15153 ZIL: Avoid dbuf_read() before dmu_sync()
#15172 copy_file_range: fix fallback when source create on same txg
#15180 Update outdated assertion from zio_write_compress
Paul Dagnelie [Sat, 26 Aug 2023 18:34:43 +0000 (11:34 -0700)]
Increase limit of redaction list by using spill block
Currently redaction bookmarks and their associated redaction lists
have a relatively low limit of 36 redaction snapshots. This is imposed
by the number of snapshot GUIDs that fit in the bonus buffer of the
redaction list object. While this is more than enough for most use
cases, there are some limited cases where larger numbers would be
useful to support.
We tweak the redaction list creation code to use a spill block if
the number of redaction snapshots is above the amount that would fit
in the bonus buffer. We also make a small change to allow spill blocks
to be used for types of data besides SA. In order to fully leverage
this logic, we also change the redaction code to use vmem_alloc, to
handle extremely large allocations if needed. Finally, small tweaks
were made to the zfs commands and the test suite.
Reviewed-by: Matthew Ahrens <mahrens@delphix.com> Signed-off-by: Paul Dagnelie <pcd@delphix.com>
Closes #15018
Rich Ercolani [Sat, 26 Aug 2023 18:25:46 +0000 (14:25 -0400)]
Avoid save/restoring AMX registers to avoid a SPR erratum
Intel SPR erratum SPR4 says that if you trip into a vmexit while
doing FPU save/restore, your AMX register state might misbehave...
and by misbehave, I mean save all zeroes incorrectly, leading to
explosions if you restore it.
Since we're not using AMX for anything, the simple way to avoid
this is to just not save/restore those when we do anything, since
we're killing preemption of any sort across our save/restores.
If we ever decide to use AMX, it's not clear that we have any
way to mitigate this, on Linux...but I am not an expert.
Justin Hibbits [Fri, 25 Aug 2023 15:36:35 +0000 (11:36 -0400)]
dtsec: Support multicast receive.
Implemented based on the tsec(4) multicast support. This is the minimum
required to support VLANs. The hardware does support vlan tagging,
among other acceleration features, which will be added at a later time.
Doug Rabson [Sat, 26 Aug 2023 09:32:32 +0000 (10:32 +0100)]
Fix MNT_IGNORE for devfs, fdescfs and nullfs
The MNT_IGNORE flag can be used to mark certain filesystem mounts so
that utilities such as df(1) and mount(8) can filter out those mounts by
default. This can be used, for instance, to reduce the noise from
running container workloads inside jails which often have at least three
and sometimes as many as ten mounts per container.
The flag is supplied by the nmount(2) system call and is recorded so
that it can be reported by statfs(2). Unfortunately several filesystems
override the default behaviour and mask out the flag, defeating its
purpose. This change preserves the MNT_IGNORE flag for those filesystems
so that it can be reported correctly.