32-bit compatibility code is conventionally stored in
sys/compat/freebsd32. Move freebsd32_timerfd_gettime() and
freebsd32_timerfd_settime() from sys/kern/sys_timerfd.c to
sys/compat/freebsd32/freebsd32_misc.c.
Do not pollute userspace with <sys/proc.h>, instead declare struct thread
when _KERNEL is defined.
Include <sys/time.h> instead of <sys/timespec.h>. This causes intentional
namespace pollution that mimics Linux. g/musl libcs include <time.h> in
their <sys/timerfd.h>, exposing clock gettime, settime functions and
CLOCK_ macro constants. Ports like Chromium expect this namespace
pollution and fail without it.
Define a locking regime for the members of struct timerfd and document
it so future code can follow the standard. The lock legend can be found
in a comment above struct timerfd.
Additionally,
* Add assertions based on locking regime.
* Fill kn_data with the expiration count when EVFILT_READ is triggered.
* Report st_ctim for stat(2).
* Check if file has f_type == DTYPE_TIMERFD before assigning timerfd
pointer to f_data.
cxgbe(4): Avoid hang on kldunload on netlink enabled kernels.
netlink(4) calls back into the driver during detach and it attempts to
start an internal synchronized op recursively, causing an interruptible
hang. Fix it by failing the ioctl if the VI has been marked as DOOMED
by cxgbe_detach.
Here's the stack for the hang for reference.
#6 begin_synchronized_op
#7 cxgbe_media_status
#8 ifmedia_ioctl
#9 cxgbe_ioctl
#10 if_ioctl
#11 get_operstate_ether
#12 get_operstate
#13 dump_iface
#14 rtnl_handle_ifevent
#15 rtnl_handle_ifnet_event
#16 rt_ifmsg
#17 if_unroute
#18 if_down
#19 if_detach_internal
#20 if_detach
#21 ether_ifdetach
#22 cxgbe_vi_detach
#23 cxgbe_detach
#24 DEVICE_DETACH
MFC after: 3 days
Sponsored by: Chelsio Communications
Warner Losh [Tue, 5 Sep 2023 20:58:06 +0000 (14:58 -0600)]
msun: LIBCSRCDIR is too fragile, use ${SRCTOP}/lib/libc instead
LIBCSRCDIR is defined in bsd.libnames.mk, which is read in later in the
Makefile than the line:
.if exists(${LIBCSRCDIR}/${MACHINE_ARCH})
so we test to see if /${MARCHIN_ARCH} exists which it usually doesn't
(but did for me since I mounted 13.2R SD image there). Move to defining
our own LIBC_SRCTOP in terms of SRCTOP to treat these uniformily.
Update ieee80211_request_smps() to the new number of arguments in
LinuxKPI (which was already prepared) and update the one call in the
older iwlwifi driver version.
This will allow iwlwifi as-is now and rtw88 to compile in case someone
else wants to work on the latter in parallel to predominant efforts on
the former.
Sponsored by: The FreeBSD Foundation
MFC after: 20 days
This update follows other currently disconnected LinuxKPI based wireless
drivers to lift them all to a same version in case someone else wants to
work on this driver in parallel to predominant iwlwifi efforts.
As announced on freebsd-wireless [1] disconnect rtw88 from the build.
Add a note to the man page about the current state but leave the man
page in place for now as this is supposed to be temporary.
Bjoern A. Zeeb [Thu, 8 Jun 2023 21:35:09 +0000 (21:35 +0000)]
rtw88/rtw89: remove local firmware.
Remove firmware from src/ in favor of the ports/packages and fwget(8).
This will allow us to shrink the size of src (and installed modules).
Update the rtw88 man page to reflect the change.
These won't be added elsewhere, so add a little bit of room to make
these messages a little easier to read. The existing set is a mixed
bag, there are somewhere in the ballpark of 45, 46 printfs to stderr and
19 of those had newlines.
Reviewed by: yuripv
Sponsored by: Klara, Inc.
Differential Revision: https://reviews.freebsd.org/D41693
POSIX defines a number of other control characters as well as
alternative aliases for some that should be provided in the default set,
so let's go ahead and add those.
The lr register is cleared at the beginning of the _dl_start and _start,
so there is no need to initialize it.
Gnu libc _start takes an rtld_fini pointer in x0 which is set by ld.so
for __libc_start_main, the kernel does not register any atexit pointers.
John Baldwin [Tue, 5 Sep 2023 16:04:54 +0000 (09:04 -0700)]
riscv: Save gp in the trapframe in both modes
Similar to d95fbf4e1a12565908b04b442263fe60c9e890b4, always save gp in
the trapframe even though it is only restored when returning to user
mode. This is mostly a debugging aid so that dump_regs() doesn't
print out random stack garbage as the value of gp for kernel faults
(e.g. sysctl debug.kdb.trap=1) as well as keeping kgdb's trapframe
unwinder from reporting bogus values of $gp for lower frames.
Vico Chen [Tue, 5 Sep 2023 08:53:02 +0000 (11:53 +0300)]
linux(4): Convert flags in timerfd_create
The timerfd is introduced in FreeBSD 14, and the Linux ABI timerfd is
also moved to FreeBSD native timerfd, but it can't work well as Linux
TFD_CLOEXEC and TFD_NONBLOCK haven't been converted to FreeBSD
TFD_CLOEXEC and TFD_NONBLOCK.
linux(4): Return ENOTSUP from listxattr instead of EPERM
FreeBSD does not permits manipulating extended attributes in the system
namespace by unprivileged accounts, even if account has appropriate
privileges to access filesystem object.
In Linux the system namespace is used to preserve posix acls. Some Gnu
coreutils binaries uses posix acls, eg, install, ls, cp. And fails if
we unexpectedly return EPERM error from xattr system calls.
In the other hands, in Linux read and write access to the system
namespace depend on the policy implemented for each filesystem, so we'll
mimics we're a filesystem that prohibits this for unpriveleged accounts.
The module entries should generally allow whatever is allowed as an
env_var in the pattern table. Notably, we're missing periods which
would allow proper entries for .dtb files in loader.conf that don't need
to specify a module_name entry for it.
%d in this expression is actually redundant as %w is actually
"all alphanumerics," but I've included it for now to match the env_var
entry. We should really remove it from both.
Reported by: "aribi" on the forums via allanjude@
MFC after: 1 week
Wei Hu [Mon, 4 Sep 2023 14:53:10 +0000 (14:53 +0000)]
mana: fix tso parameters and set hwassist bits
The parameters for tso on mana were not set correctly. Also the
hwassist bits were not set. These two cause tso on mana not work.
Fixed the issues and make tso working on mana.
Tested by: whu
MFC after: 3 days
Sponsored by: Microsoft
Wei Hu [Mon, 4 Sep 2023 10:15:06 +0000 (10:15 +0000)]
Hyper-V: hn: use VF's capabilities when it is attached
Current code in hn/netvsc tries to merge (logical AND) VF and
its own capability bits when a VF is attached. This results in
losing some key VF features, especially in tx path. For example,
the VF's txcsum, rxcsum or tso bits could be lost if any of
these are not in hn/netvsc's own capablility field.
Actually when VF is attached, hn just needs to use VF's caps
as all the tx packets would be forwarded to the VF interface.
Fix this problem by doing so.
Reported by: whu
Tested by: whu
MFC after: 3 days
Sponsored by: Microsoft
Whilst Clang does not use these, recent GCC does, and so on systems such
as FreeBSD that wish to use compiler-rt as the system runtime library
but also wish to support building programs with GCC these interfaces are
needed.
This is a light adaptation of the code committed to GCC by Sebastian Pop
<spop@amazon.com>, relicensed with permission for use in compiler-rt.
The pm_ev field of struct pmc_op_pmcallocate and struct pmc
traditionally contains the index of the chosen event, corresponding to
the __PMC_EVENTS array in pmc_events.h. This is a static list of events,
maintained by FreeBSD.
In the usual case, libpmc translates the user supplied event name
(string) into the pm_ev index, which is passed as an argument to the
allocation syscall. On the kernel side, the allocation method for the
relevant hwpmc class translates the given index into the event code that
will be written to an event selection register.
In 2018, a new source of performance event definitions was introduced:
the pmu-events json files, which are maintained by the Linux kernel. The
result was better coverage for newer Intel processors with a reduced
maintenance burden for libpmc/hwpmc. Intel and AMD CPUs were
unconditionally switched to allocate events from pmu-events instead of
the traditional scheme (959826ca1bb0a, 81eb4dcf9e0d).
Under the pmu-events scheme, the pm_ev field contains an index
corresponding to the selected event from the pmu-events table, something
which the kernel has no knowledge of. The configuration for the
performance counting registers is instead passed via class-dependent
fields (struct pmc_md_op_pmcallocate).
In 2021 I changed the allocation logic so that it would attempt to
pull from the pmu-events table first, and fall-back to the traditional
method (dfb4fb41166bc3). Later, pmu-events support for arm64 and power8
CPUs was added (28dd6730a5d6 and b48a2770d48b).
The problem that remains is that the pm_ev field is overloaded, without
a definitive way to determine whether the event allocation came from the
pmu-events table or FreeBSD's statically-defined PMC events. This
resulted in a recent fix, 21f7397a61f7.
Change:
To disambiguate these two supported but separate use-cases, add a new
flag, PMC_F_EV_PMU, to be set as part of the allocation, indicating that
the event index came from pmu-events.
This is useful in two ways:
1. On the kernel side, we can validate the syscall arguments better.
Some classes support only the traditional event scheme (e.g.
hwpmc_armv7), while others support only the pmu-events method (e.g.
hwpmc_core for Intel). We can now check for this. The hwpmc_arm64
class supports both methods, so the new flag supersedes the existing
MD flag, PM_MD_EVENT_RAW.
2. The flag will be tracked in struct pmc for the duration of its
lifetime, meaning it is communicated back to userspace. This allows
libpmc to perform the reverse index-to-event-name translation
without speculating about the meaning of the index value.
Adding the flag is a backwards-incompatible ABI change. We recently
bumped the major version of the hwpmc module, so this breakage is
acceptable.
Reviewed by: jkoshy
MFC after: 3 days
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D40753
Mitchell Horne [Mon, 19 Jun 2023 22:32:22 +0000 (19:32 -0300)]
libpmc: make pmc_pmu_pmcallocate() machine-independent
Have it call the platform-dependent version. For better layering, move
the reset logic inside the new function. This is mainly to facilitate an
upcoming change.
Reviewed by: jkoshy
MFC after: 3 days
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D40752
kern_kthread: fork1() does not handle locked Giant
fork1() does not behave if called under Giant. For instance, it might
need to call thread_suspend_check() which explicitly verifies that Giant
is not locked. On the other hand, the kthread KPI is often called from
SYSINIT() which is still Giant-locked.
Correct this by dropping Giant in kthread_add() and kproc_create().
Reported by: pho
Reviewed by: jhb
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D41694
tcp: Initialize the maximum number of entries in a client cookie cache bucket
This vnet loader tunable is defined with SYSCTL_PROC, thus will not be
initialized by kernel on vnet creating and will always have the default
value TCP_FASTOPEN_CCACHE_BUCKET_LIMIT_DEFAULT.
Fix by fetching the value from the corresponding kernel environment during
vnet constructing.
PR: 273509
Reviewed by: #transport, tuexen
Fixes: c560df6f12f1 This is an implementation of the client side of TCP Fast Open (TFO) [RFC7413]
MFC after: 3 days
Differential Revision: https://reviews.freebsd.org/D41691
Warner Losh [Sat, 2 Sep 2023 19:12:31 +0000 (13:12 -0600)]
tests: Skip all tests that require mdconfig when /dev/mdctl missing
When run in a jail, /dev/mdctl is missing. So skip any tests that use
mdconfig or mdmfs with md in this case: they can't possibly work. This
is in line with other tests that test for presence of required features
and skip if they aren't present. I did this instead of checking for
jails so they can still run in jails that allow creation of md devices.
Notable upstream pull request merges:
#15018 Increase limit of redaction list by using spill block
#15161 Make zoned/jailed zfsprops(7) make more sense
#15216 Relax error reporting in zpool import and zpool split
#15218 Selectable block allocators
#15227 ZIL: Tune some assertions
#15228 ZIL: Revert zl_lock scope reduction
#15233 ZIL: Change ZIOs issue order
ZFS historically has had several space allocators that were
dynamically selectable. While these have been retained in
OpenZFS, only a single allocator has been statically compiled
in. This patch compiles all allocators for OpenZFS and provides
a module parameter to allow for manual selection between them.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Ameer Hamza <ahamza@ixsystems.com> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: Edmund Nadolski <edmund.nadolski@ixsystems.com>
Closes #15218
Relax error reporting in zpool import and zpool split
For zpool import and zpool split, zpool_enable_datasets is called
to mount and share all datasets in a pool. If there is an error
while mounting or sharing any dataset in the pool, the status of
import or split is reported as failure. However, the changes do
show up in zpool list.
This commit updates the error reporting in zpool import and zpool
split path. More descriptive messages are shown to user in case
there is an error during mount or share. Errors in mount or share
do not effect the overall status of zpool import and zpool split.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: Umer Saleem <usaleem@ixsystems.com>
Closes #15216
Andrea Righi [Sat, 2 Sep 2023 00:21:40 +0000 (02:21 +0200)]
Linux 6.5 compat: safe cleanup in spl_proc_fini()
If we fail to create a proc entry in spl_proc_init() we may end up
calling unregister_sysctl_table() twice: one in the failure path of
spl_proc_init() and another time during spl_proc_fini().
Avoid the double call to unregister_sysctl_table() and while at it
refactor the code a bit to reduce code duplication.
This was accidentally introduced when the spl code was
updated for Linux 6.5 compatibility.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Ameer Hamza <ahamza@ixsystems.com> Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
Closes #15234
Closes #15235
Alexander Motin [Sat, 2 Sep 2023 00:14:50 +0000 (20:14 -0400)]
ZIL: Change ZIOs issue order.
In zil_lwb_write_issue(), after issuing lwb_root_zio/lwb_write_zio,
we have no right to access lwb->lwb_child_zio. If it was not there,
the first two ZIOs may have already completed and freed the lwb.
ZIOs issue in opposite order from children to parent should keep
the lwb valid till the end, since the lwb can be freed only after
lwb_root_zio completion callback.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #15233
Alexander Motin [Sat, 2 Sep 2023 00:13:52 +0000 (20:13 -0400)]
ZIL: Revert zl_lock scope reduction.
While I have no reports of it, I suspect possible use-after-free
scenario when zil_commit_waiter() tries to dereference zcw_lwb
for lwb already freed by zil_sync(), while zcw_done is not set.
Extension of zl_lock scope as it was originally should block
zil_sync() from freeing the lwb, closing this race.
This reverts #14959 and couple chunks of #14841.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #15228
Brooks Davis [Fri, 1 Sep 2023 16:42:52 +0000 (17:42 +0100)]
Add INIT_ALL build option
This option replaces WITH_INIT_ALL_PATTERN and WITH_INIT_ALL_ZERO with
INIT_ALL=pattern and INIT_ALL=zero respectively. As these are
relatively rarely used options no backwards compatibility is
implemented.
Brooks Davis [Fri, 1 Sep 2023 16:42:39 +0000 (17:42 +0100)]
libc: add LIBC_MALLOC option
This will enable alternative mallocs to be included in the tree and
selected by setting LIBC_MALLOC. As there is only one today (jemalloc)
this option does nothing, but we expect to add other implementations
in the future. This will also reduce diffs to CheriBSD.
Brooks Davis [Fri, 1 Sep 2023 16:41:59 +0000 (17:41 +0100)]
makeman: add minimal support for group options
Ignore OPT_* values in showconfig out in exising code paths and add
a new path to include descriptions for each. For now, hardcode the
description contents rather than attempting to generate it. This runs
the risk of docs getting out of date, limits the amount of new shell
code added today while a lua rewrite is nearly ready to land.
This change requires a followup commit to enable OPT_* values in
"make showconfig" in order to actually find group options.
Brooks Davis [Fri, 1 Sep 2023 16:41:07 +0000 (17:41 +0100)]
share/mk: support for "single" group options
Support group options where 1 of n values will be selected (or a default
value will be used). After processing, an OPT_FOO will be set to one
value from __FOO_OPTIONS for each FOO in __SINGLE_OPTIONS. If the user
sets FOO that value will be used, otherwise __FOO_DEFAULT will be used.
Options that don't work an a particular system can be remapped to an
alternative using BROKEN_SINGLE_OPTIONS which can be set to a list of
3-tuples of the form:
OPTION broken_value replacement_value
This is somewhat inspired by OPTIONS_SINGLE from ports, but the
structure is quite different with a per-option variable in the style of
MK_FOO={yes,no}.
Zachary Leaf [Thu, 31 Aug 2023 13:11:53 +0000 (14:11 +0100)]
armv8_crypto: fix recursive fpu_kern_enter call
Now armv8_crypto is using FPU_KERN_NOCTX, this results in a kernel panic
in armv8_crypto.c:armv8_crypto_cipher_setup:
panic: recursive fpu_kern_enter while in PCB_FP_NOSAVE state
This is because in armv8_crypto.c:armv8_crypto_cipher_process,
directly after calling fpu_kern_enter() a call is made to
armv8_crypto_cipher_setup(), resulting in nested calls to
fpu_kern_enter() without the required fpu_kern_leave() in between.
Move fpu_kern_enter() in armv8_crypto_cipher_process() after the
call to armv8_crypto_cipher_setup() to resolve this.
Reviewed by: markj, andrew Fixes: 6485286f536f ("armv8_crypto: Switch to using FPU_KERN_NOCTX")
Sponsored by: Arm Ltd
Differential Revision: https://reviews.freebsd.org/D41671
Andrew Turner [Tue, 22 Aug 2023 16:01:21 +0000 (17:01 +0100)]
gicv3: Support indirect ITS tables
The GICv3 ITS device supports two options for device tables. Currently
we support a single table to hold all device IDs, however when the
device ID space grows large this can be too large for the GITS_BASER
register to describe.
To handle this case, and to reduce the memory needed when this space
is sparse support the second option, the indirect table. The indirect
table is a 2 level table where the first level contains the physical
address of the second with a valid bit. The second level is an ITS
page sized table where each entry is the original entry size.
As we don't need to allocate a second level table for devices IDs that
don't exist this can reduce the allocation size.
Reviewed by: gallatin
Sponsored by: Arm Ltd
Differential Revision: https://reviews.freebsd.org/D41555
Andrew Turner [Wed, 23 Aug 2023 12:34:09 +0000 (13:34 +0100)]
arm: Add a userspace physical timer check
We currently use the same Arm generic time in both userspace and the
kernel. As we always enable userspace access to the virtual timer we
can tell userspace to use it.
Reviewed by: imp
Sponsored by: Arm Ltd
Differential Revision: https://reviews.freebsd.org/D41565
jail: Add the ability to access system-level filesystem extended attributes
Prior to this commit privileged accounts in a jail could not access to the
filesystem extended attributes in the system namespace. To control access to
the system namespace in a per-jail basis add a new configuration parameter
allow.extattr which is off by default.
linux(4): Return ENOTSUP from xattr syscalls instead of EPERM
FreeBSD does not permits manipulating extended attributes in the system
namespace by unprivileged accounts, even if account has appropriate
privileges to access filesystem object.
In Linux the system namespace is used to preserve posix acls. Some Gnu
coreutils binaries uses posix acls, eg, install, ls. And fails if we
unexpectedly return EPERM error from xattr system calls.
In the other hands, in Linux read and write access to the system
namespace depend on the policy implemented for each filesystem, so we'll
mimics we're a filesystem that prohibits this for unpriveleged accounts.
arm64: initialize pcb in the TBI/PAC/etc. fault case
After 2c10be9e06d, we may jump to the bad_far label without `pcb` being
set, resulting in a follow-up fault as we may dereference it immediately
after the jump if td_intr_nesting_level == 0. In this branch, it should
be safe to dereference `td` as we're not handling the special case
mentioned below of accessing it during promotion/demotion.
This seems to fix a null ptr deref I hit during my most recent pkgbase
build attempt on the Windows DevKit, though that was admittedly
encountered while we were on the way to a panic from an apparent
use-after-free in ZFS bits.
dmu_buf_will_clone: change assertion to fix 32-bit compiler warning
Building module/zfs/dbuf.c for 32-bit targets can result in a warning:
In file included from
/usr/src/sys/contrib/openzfs/include/sys/zfs_context.h:97,
from /usr/src/sys/contrib/openzfs/module/zfs/dbuf.c:32:
/usr/src/sys/contrib/openzfs/module/zfs/dbuf.c: In function
'dmu_buf_will_clone':
/usr/src/sys/contrib/openzfs/lib/libspl/include/assert.h:116:33: error:
cast from pointer to integer of different size
[-Werror=pointer-to-int-cast]
116 | const uint64_t __left = (uint64_t)(LEFT);
\
| ^
/usr/src/sys/contrib/openzfs/lib/libspl/include/assert.h:148:25: note:
in expansion of macro 'VERIFY0'
148 | #define ASSERT0 VERIFY0
| ^~~~~~~
/usr/src/sys/contrib/openzfs/module/zfs/dbuf.c:2704:9: note: in
expansion of macro 'ASSERT0'
2704 | ASSERT0(dbuf_find_dirty_eq(db, tx->tx_txg));
| ^~~~~~~
This is because dbuf_find_dirty_eq() returns a pointer, which if
pointers are 32-bit results in a warning about the cast to uint64_t.
Instead, use the ASSERT3P() macro, with == and NULL as second and third
arguments, which should work regardless of the target's bitness.
Reviewed-by: Kay Pedersen <mail@mkwg.de> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Brian Atkinson <batkinson@lanl.gov> Signed-off-by: Dimitry Andric <dimitry@andric.com>
Closes #15224
Kristof Provost [Tue, 29 Aug 2023 09:33:17 +0000 (11:33 +0200)]
mcast: fix memory leak in imf_purge()
The IGMP code buffers packets in the imf_inm->inm_scq mbufq, but does
not clear this queue when struct in_mfilter is freed by imf_purge().
This can cause memory leaks if IGMPv3 is used.
Kristof Provost [Tue, 29 Aug 2023 15:16:19 +0000 (17:16 +0200)]
snmp_pf: use libpfctl's pfctl_get_status() rather than DIOCGETSTATUS
Prefer libpfctl functions over direct access to the ioctl whenever
possible. This will allow subsequent removal of DIOCGETSTATUS (in 15) as
there already is an nvlist-based alternative.
Kristof Provost [Tue, 29 Aug 2023 15:04:17 +0000 (17:04 +0200)]
libpfctl: implement status counter accessor functions
The new nvlist-based status call allows us to easily add new counters.
However, the libpfctl interface defines a TAILQ, so it's not quite
trivial to find the counter consumers are interested in.
Provide convenience functions to access the counters.
Kristof Provost [Tue, 29 Aug 2023 15:00:44 +0000 (17:00 +0200)]
pf (t)ftp-proxy: use libpfctl instead of DIOCGETSTATUS
Prefer libpfctl functions over direct access to the ioctl whenever
possible. This will allow subsequent removal of DIOCGETSTATUS (in 15) as
there already is an nvlist-based alternative.
Kristof Provost [Thu, 31 Aug 2023 07:32:54 +0000 (09:32 +0200)]
vmxnet3: do restart on VLAN changes
At least one user reports issues with vmx interfaces after 725e4008ef,
where we default to not resetting the interface on VLAN changes. This
was on an ESXi 7.0.3 setup.