Simon J. Gerraty [Fri, 25 Aug 2023 00:41:22 +0000 (17:41 -0700)]
Add mac_grantbylabel
This module allows controlled privilege escalation via MAC labels
securely associated with a process via mac_veriexec.
There are over 700 PRIV_* but we can compress many of them into
a single GBL_* thus constraining the size of gbl labels.
The goal is to allow a daemon to run as an unprivileged process while
still being able to perform the set of privileged operations it needs.
We add APIs to libveriexec so that userland processes can check labels
and an exec_script API that allows a suitably labeled process to run
something like a python interpreter directly if necessary;
overcoming the 'indirect' flag applied to the interpreter.
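As a rough illustration of compressing many PRIV_* values into a few GBL_* labels, the sketch below maps several privileges onto one label bit each. Every name in it (the DEMO_PRIV_* numbers, the DEMO_GBL_* bits, and demo_priv_granted()) is hypothetical and is not taken from mac_grantbylabel itself.

```c
#include <stdint.h>

/* Hypothetical stand-ins for a handful of the real PRIV_* values. */
enum demo_priv {
	DEMO_PRIV_NET_BINDRESERVED,
	DEMO_PRIV_NET_RAW,
	DEMO_PRIV_NET_SETIFFLAGS,
	DEMO_PRIV_KMEM_READ,
	DEMO_PRIV_REBOOT,
	DEMO_PRIV_COUNT
};

/* Hypothetical grant labels: one bit covers a whole family of privileges. */
#define DEMO_GBL_EMPTY	0u
#define DEMO_GBL_NET	(1u << 0)
#define DEMO_GBL_KMEM	(1u << 1)

/* Many privileges collapse onto the same label bit. */
static const uint32_t demo_priv_to_gbl[DEMO_PRIV_COUNT] = {
	[DEMO_PRIV_NET_BINDRESERVED] = DEMO_GBL_NET,
	[DEMO_PRIV_NET_RAW] = DEMO_GBL_NET,
	[DEMO_PRIV_NET_SETIFFLAGS] = DEMO_GBL_NET,
	[DEMO_PRIV_KMEM_READ] = DEMO_GBL_KMEM,
	[DEMO_PRIV_REBOOT] = DEMO_GBL_EMPTY,	/* never granted by label */
};

/* Return 1 if the process label grants the privilege, 0 otherwise. */
int
demo_priv_granted(uint32_t proc_label, int priv)
{
	if (priv < 0 || priv >= DEMO_PRIV_COUNT)
		return (0);
	return ((proc_label & demo_priv_to_gbl[priv]) != 0);
}
```

With a table like this, a label needs only as many bits as there are privilege families, however many individual privileges each family covers.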
Mateusz Guzik [Thu, 24 Aug 2023 05:34:08 +0000 (05:34 +0000)]
vfs: try harder to find free vnodes when recycling
The free vnode marker can slide past eligible entries.
Artificially reducing vnode limit to 300k and spawning 104 workers each
creating a million files results in all of them trying to recycle, which
often fails when it should not have to.
Because of the excessive traffic in this scenario, the trylock to
requeue is virtually guaranteed to fail, meaning nothing gets pushed
forward.
Since no vnodes were found, the most unfortunate sleep for 1 second is
induced (see vn_alloc_hard, the "vlruwk" msleep).
Without the fix the machine is mostly idle with almost everyone stuck
off CPU waiting for the sleep to finish. With the fix it is busy
creating files.
Unrelated to the above problem the marker could have landed in a
similarly problematic spot because of any failure in vtryrecycle.
Originally reported as poudriere builders stalling in a vnode-count
restricted setup.
Fixes: 138a5dafba31 ("vfs: trylock vnode requeue")
Reported by: Mark Millard
Kevin Bowling [Thu, 24 Aug 2023 20:42:23 +0000 (13:42 -0700)]
iflib: invert default restart on VLAN changes
In rS360398, a new iflib device method was added to opt out of VLAN
events needing an interface reset.
I am switching the default to not requiring a restart for:
* VLAN events
* unknown events
After fixing various bugs, I do not think this would be a common need
of hardware and it is undesirable from the user's perspective causing
link flaps and much slower VLAN configuration. Currently, there are no
other restart events besides VLAN events, and setting the
ifdi_needs_restart default to false will alleviate the need to churn
every driver if an odd event is added in the future for specific
hardware.
markj points out this could cause churn in the other direction; I will
solve that problem with an event registration system as he mentions in
the review should we need it in the future.
These drivers will opt into restart and need further inspection or work:
* ixv (needs code audit, 61a8231 fixed principal issue; re-init probably
not necessary)
* axgbe (needs code audit; re-init probably not necessary)
* iavf - (needs code audit; interaction with Malicious Driver Detection
mentioned in rS360398)
* mgb - no VLAN functions are currently implemented. Left a comment.
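The inverted default can be pictured with the following sketch. It does not use the real iflib types: the event enum, the driver structure, and the function names are all hypothetical, and the opt-in callback's behaviour is an assumption for illustration.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical events; only VLAN events exist today per the text above. */
enum demo_restart_event {
	DEMO_RESTART_VLAN_CONFIG,
	DEMO_RESTART_UNKNOWN
};

struct demo_driver {
	const char *name;
	/* NULL means the driver relies on the framework default. */
	bool (*needs_restart)(enum demo_restart_event);
};

/* New default: no event requires an interface restart. */
bool
demo_default_needs_restart(enum demo_restart_event ev)
{
	(void)ev;
	return (false);
}

/* A driver that opts back in for VLAN events only (assumed policy). */
bool
demo_optin_needs_restart(enum demo_restart_event ev)
{
	return (ev == DEMO_RESTART_VLAN_CONFIG);
}

bool
demo_needs_restart(const struct demo_driver *drv, enum demo_restart_event ev)
{
	if (drv->needs_restart != NULL)
		return (drv->needs_restart(ev));
	return (demo_default_needs_restart(ev));
}
```

A driver that supplies no callback gets no restart for any event, including ones added later, which is the churn-avoidance argument made above.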
Kevin Bowling [Thu, 24 Aug 2023 20:16:24 +0000 (13:16 -0700)]
bnxt: Don't restart on VLAN changes
In rS360398, a new iflib device method was added with default of opt out
for VLAN events needing an interface reset.
This is unintentional for bnxt(4) and is causing another bug in its VLAN
initialization code to affect the common case of adding and removing
VLANs on an existing interface.
Jake Freeland [Thu, 24 Aug 2023 04:39:54 +0000 (22:39 -0600)]
timerfd: Move implementation from linux compat to sys/kern
Move the timerfd implementation from linux compat code to sys/kern. Use
it to implement the new system calls for timerfd. Add a hook to kern_tc
to allow timerfd to know when the system time has stepped. Add kqueue
support to timerfd. Adjust a few names to be less Linux centric.
ps: add a new option -D to reimplement tree traversal
It takes a non-optional parameter string, one of "up", "down", or "both"
that can request tree traversal in the chosen directions. This adds PIDs
from the paths to the selection of PIDs and can be used together with -d
to draw a subset of the process tree.
By committing ca8c0d5e8110 I was hoping that the existing option -d
could just be extended to work with -p to implement a feature that was
and I think is still needed, that is to show all descendant processes
of a given process id or a set of process ids.
After a complaint from -current which may represent a wider
dissatisfaction with this change in the program's behavior, I think it
will be better to revert ca8c0d5e8110 and reintroduce this feature
using a separate option -D.
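The "up", "down", and "both" directions can be sketched over a simple parent table as below. This is an illustrative model only, not the ps(1) code: the array-of-parent-indexes representation and the demo_select() name are assumptions.

```c
#include <stdbool.h>
#include <stddef.h>

/*
 * ppid[i] is the index of process i's parent, or -1 for the root.
 * On return, selected[i] is true for the start process plus its
 * ancestors (up), its descendants (down), or both.
 */
void
demo_select(const int *ppid, size_t n, size_t start, bool up, bool down,
    bool *selected)
{
	size_t i, steps;
	int p;

	for (i = 0; i < n; i++)
		selected[i] = false;
	if (start >= n)
		return;
	selected[start] = true;

	if (up) {
		/* Walk the parent chain; the step bound guards against cycles. */
		p = ppid[start];
		for (steps = 0; p >= 0 && (size_t)p < n && steps < n; steps++) {
			selected[p] = true;
			p = ppid[p];
		}
	}
	if (down) {
		/* A process is a descendant if its parent chain reaches start. */
		for (i = 0; i < n; i++) {
			p = ppid[i];
			for (steps = 0; p >= 0 && (size_t)p < n && steps < n;
			    steps++) {
				if ((size_t)p == start) {
					selected[i] = true;
					break;
				}
				p = ppid[p];
			}
		}
	}
}
```

Descendants are tested against the start process rather than against the selected set, so siblings of the start process are not pulled in when both directions are requested.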
Michael Tuexen [Thu, 24 Aug 2023 13:52:55 +0000 (15:52 +0200)]
sctp: improve handling of socket shutdown for reading
If a socket is marked as cannot read anymore, drop chunks which
should be added to a control element in the receive queue.
This is consistent with dropping control elements instead of
adding them in the same situation.
pf: Access r->rpool.cur->kif under mutex protection
pf_route() sends traffic to a specified next hop over a specific
interface. The next hop is obtained in pf_map_addr() but the interface
is obtained directly via r->rpool.cur->kif outside of the lock held in
pf_map_addr() in multiple places around pf. The chosen interface is not
stored in source node.
Move the interface selection into pf_map_addr(), have the function
return it together with the chosen IP address, ensure it is stored in
the source node (struct pf_ksrc_node), and use the stored value when
needed.
Robert Wing [Wed, 23 Aug 2023 18:39:13 +0000 (10:39 -0800)]
bectl: make mount subcommand less verbose
The mount subcommand currently produces output such as:
# bectl mount <bootenv>
Successfully mounted <bootenv> at <mountpoint>
This commit changes it to only print the mountpoint:
# bectl mount <bootenv>
<mountpoint>
This makes it easier to script the mount subcommand. If an error occurs
while mounting, an error message is printed to stderr and bectl will
exit with a non-zero value.
Jessica Clarke [Wed, 23 Aug 2023 17:00:16 +0000 (18:00 +0100)]
Makefile: Support universe-toolchain on non-FreeBSD
We currently pass MACHINE and MACHINE_ARCH as TARGET and TARGET_ARCH
respectively for universe-toolchain, but on non-FreeBSD these may not
have values that we understand (e.g. on Linux it will be x86_64 rather
than amd64) for TARGET/TARGET_ARCH (note that we do support them for
MACHINE/MACHINE_ARCH). Since the choice is a bit arbitrary and merely
determines what LLVM's default triple will be, use amd64 on non-FreeBSD
as a known-good default.
Jessica Clarke [Wed, 23 Aug 2023 16:56:56 +0000 (17:56 +0100)]
tools/build/make.py: Make --with-default-sys-path mirror usr.bin/bmake
The top-level Makefile passes -m to its sub-makes in order to ensure
they use the in-tree mk files in share/mk, but the top-level make itself
has to rely on whatever environment the bmake used has. For FreeBSD, we
configure the system bmake with .../share/mk:/usr/share/mk, which means
it will pick up src's share/mk whenever run from within the src tree,
but currently for non-FreeBSD we configure our bootstrap bmake only with
bmake's own mk files. This is mostly compatible, with two exceptions:
1. "targets" runs at the top level, but needs TARGET_MACHINE_LIST and
the corresponding MACHINE_ARCH_LIST_${target}, otherwise it will just
print an empty list.
2. "universe" and "universe-toolchain", when run at the top level (i.e.
not via the various wrappers around universe like tinderbox), end up
failing in universe-toolchain itself with:
bmake[1]: "/path/to/freebsd/share/mk/src.sys.obj.mk" line 112: Cannot use MAKEOBJDIR=
Unset MAKEOBJDIR to get default: MAKEOBJDIR='${.CURDIR:S,^${SRCTOP},${OBJTOP},}'
By including .../share/mk in the default sys path like FreeBSD's system
bmake we ensure that we get the in-tree mk files for the top-level make,
not just sub-makes, and avoid such issues.
Note that we cannot (yet) stop using the installed mk files, since the
MAKEOBJDIRPREFIX check in Makefile runs in the object directory and uses
env -i, thereby losing the MAKESYSPATH exported by src.sys.env.mk. Other
such issues may also exist, though are likely rare if so.
We currently assume that any existing bootstrapped bmake binary will
work, but this means it never gets updated as contrib/bmake is, and
similarly we won't rebuild it as and when the configure arguments given
to boot-strap change. Whilst the former isn't necessarily a huge problem
given WANT_MAKE_VERSION rarely gets bumped in Makefile, having fewer
variables is a good thing, and so it's easiest if we just always keep it
up-to-date rather than trying to do something similar to what's already
in Makefile (which may or may not be accurate, given updating FreeBSD
gives you an updated bmake, but nothing does so for our bootstrapped
bmake on non-FreeBSD). The latter is more problematic, though, and the
next commit will be changing this configuration.
We thus now add in two checks. The first is to compare MAKE_VERSION
against _MAKE_VERSION from contrib/bmake/VERSION. The second is to
record at bootstrap time the exact configuration used, and compare that
against what we would bootstrap with.
Andrew Turner [Wed, 23 Aug 2023 14:32:56 +0000 (15:32 +0100)]
Support dynamically sized register sets
We don't always know the size of the register set at compile time,
e.g. on arm64 the size of the SVE registers needs to be queried on boot.
To support register sets whose size needs to be calculated at run time,
query the correct size when it is zero.
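A minimal sketch of the "zero means ask at run time" convention follows. The structure layout, the callback, and the probe value are hypothetical and only illustrate the idea; they are not the kernel's regset definitions.

```c
#include <stddef.h>

struct demo_regset {
	const char *name;
	size_t size;			/* 0: not known at compile time */
	size_t (*query_size)(void);	/* consulted only when size == 0 */
};

/* Hypothetical probe; a real one would ask the hardware at boot. */
size_t
demo_probe_size(void)
{
	return (544);	/* placeholder value, not a real SVE size */
}

/* Return the static size if known, otherwise query it. */
size_t
demo_regset_size(const struct demo_regset *rs)
{
	if (rs->size != 0)
		return (rs->size);
	if (rs->query_size != NULL)
		return (rs->query_size());
	return (0);
}
```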
Andrew Turner [Tue, 22 Aug 2023 10:51:26 +0000 (11:51 +0100)]
gicv3: Split out finding the page size
When adding indirect (2 level) tables we will need to know the page
size to calculate the size of the level 1 table. To allow for this find
the page size before entering the loop to calculate the final register
value.
Piotr Kubaj [Tue, 22 Aug 2023 10:45:56 +0000 (12:45 +0200)]
iavf: remove compatibility code and address some warnings
Code for pre-11 FreeBSD versions is removed.
Unused macros are also removed, and an "i" variable no longer shadows
another "i" variable.
Zhenlei Huang [Wed, 23 Aug 2023 09:48:12 +0000 (17:48 +0800)]
net: Do not overwrite if_vlan's PCP
In commit c7cffd65c5d8 the function ether_8021q_frame() was slightly
refactored to take a pointer to struct ether_8021q_tag as its qtag
parameter, in order to include the new option proto.
It is wrong to write to qtag->pcp as it will effectively change the memory
that qtag points to. Unfortunately the transmit routine of if_vlan passes
a pointer to the ifv_qtag member of its softc, which stores the vlan
interface's PCP internally, so when transmitting mbufs that contain a PCP
the vlan interface's PCP gets overwritten.
Fix by operating on a local copy of qtag->pcp. Also mark 'struct ether_8021q_tag'
as const so that compilers can pick up such kind of bug.
PR: 273304
Reviewed by: kp
Fixes: c7cffd65c5d85 Add support for stacked VLANs (IEEE 802.1ad, AKA Q-in-Q)
MFC after: 3 days
Differential Revision: https://reviews.freebsd.org/D39505
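The bug pattern and the fix described in the commit above can be reduced to the sketch below. The structure, the "no priority" sentinel, and the function names are simplified, hypothetical stand-ins rather than the real ether_8021q_frame() code; the TCI layout (3-bit PCP above a 12-bit VLAN ID) follows IEEE 802.1Q.

```c
#include <stdint.h>

struct demo_qtag {
	uint16_t vid;
	uint8_t pcp;
};

#define DEMO_PCP_NONE	0xff	/* hypothetical: packet carries no PCP */

/* Buggy shape: the per-packet PCP is stored into the caller's tag. */
uint16_t
demo_tci_buggy(struct demo_qtag *qtag, uint8_t pkt_pcp)
{
	if (pkt_pcp != DEMO_PCP_NONE)
		qtag->pcp = pkt_pcp;	/* clobbers the interface setting */
	return ((uint16_t)(((qtag->pcp & 0x7) << 13) | (qtag->vid & 0x0fff)));
}

/* Fixed shape: const parameter, per-packet PCP kept in a local copy. */
uint16_t
demo_tci_fixed(const struct demo_qtag *qtag, uint8_t pkt_pcp)
{
	uint8_t pcp = qtag->pcp;

	if (pkt_pcp != DEMO_PCP_NONE)
		pcp = pkt_pcp;
	return ((uint16_t)(((pcp & 0x7) << 13) | (qtag->vid & 0x0fff)));
}
```

With the const qualifier in place, an assignment such as the one in demo_tci_buggy() would be rejected at compile time.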
Michael Tuexen [Wed, 23 Aug 2023 06:36:15 +0000 (08:36 +0200)]
sctp: improve handling of SHUTDOWN and SHUTDOWN ACK chunks
When handling a SHUTDOWN or SHUTDOWN ACK chunk detect if the peer
is violating the protocol by not having made sure all user messages
are received by the peer. If this situation is detected, abort the
association.
Kyle Evans [Wed, 23 Aug 2023 03:40:45 +0000 (22:40 -0500)]
libc: iconv: zero out cv_shared on allocation
Right now we have to zero-initialize most fields in the various callers,
but this is a little error prone. Simplify it by zeroing it out upon
allocation instead, drop the other redundant initialization.
Reviewed by: markj
Sponsored by: Klara, Inc.
Differential Revision: https://reviews.freebsd.org/D41546
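The zero-on-allocation pattern from the commit above looks roughly like this. struct demo_shared and its fields are hypothetical and do not mirror the real iconv structure.

```c
#include <stdlib.h>

/* Hypothetical structure with several fields callers used to clear. */
struct demo_shared {
	void *handle;
	char *name;
	int refcount;
	unsigned int flags;
};

/*
 * Allocate with calloc(3) so every field starts out zeroed and callers
 * no longer need to initialize each one by hand. This relies on null
 * pointers being all-bits-zero, which holds on FreeBSD's platforms.
 */
struct demo_shared *
demo_shared_alloc(void)
{
	return (calloc(1, sizeof(struct demo_shared)));
}
```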
Kyle Evans [Wed, 23 Aug 2023 03:40:45 +0000 (22:40 -0500)]
libc: fix c*rtomb/mbrtoc*
In 693f88c9da8d ("iconv_std: complete the //IGNORE support"), we
more completely implemented //IGNORE, which changed the semantics of
ci_discard_ilseq. DISCARD_ILSEQ semantics are supposed to match
//IGNORE, so we really can't do much about that particular
incompatibility. This broke c*rtomb and mbrtoc* handling of invalid
sequences, but it turns out they don't want DISCARD_ILSEQ semantics at
all; they really want the subset that we call
_CITRUS_ICONV_F_HIDE_INVALID.
This restores the exact flow in iconv_std to precisely how it happened
prior to 693f88c9da8d.
PR: 265871
Fixes: 693f88c9da8d ("iconv_std: complete the //IGNORE support")
Sponsored by: Klara, Inc.
Differential Revision: https://reviews.freebsd.org/D41513
This is an attempt at a clean-room implementation of Linux's
membarrier(2) syscall. For documentation, you would need to read
the Linux membarrier(2) man page and the comments in the Linux
kernel/sched/membarrier.c implementation, and possibly look at
actual uses.
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D32360
For amd64, i386, arm, and riscv, i.e. all architectures except arm64,
the custom implementation is provided since we maintain the bitmask of
active CPUs anyway.
Arm64 uses a somewhat naive iteration over CPUs and matches the current
vmspace's pmap against the argument. It is not guaranteed that
vmspace->pmap is the same as the active pmap, but the inaccuracy should
be tolerable.
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D32360
Bartosz Sobczak [Tue, 22 Aug 2023 23:07:11 +0000 (16:07 -0700)]
ofed: mask seq_num identifier to occupy only 3 bytes
The seq_num among other things is used to assign rq_psn value, which is
a 24-bit identifier. When the seq_num is full 4-byte value, we are
usually receiving: '_ib_modify_qp rq_psn overflow, masking to 24 bits'
warning.
This is burdensome for running rdma traffic with large number of
connections, because the number of logs is growing fast.
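The masking itself is a one-liner; a sketch, with hypothetical names, assuming the low 24 bits are the ones kept:

```c
#include <stdint.h>

#define DEMO_PSN_MASK	0x00ffffffu	/* 24 bits = 3 bytes */

/* Reduce a 32-bit sequence number to a value that fits in a 24-bit PSN. */
uint32_t
demo_seq_to_psn(uint32_t seq_num)
{
	return (seq_num & DEMO_PSN_MASK);
}
```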
Jessica Clarke [Tue, 22 Aug 2023 20:01:03 +0000 (21:01 +0100)]
libzstd: Explicitly define ZSTD_DISABLE_ASM
On FreeBSD, ZSTD_ASM_SUPPORTED is defined as 0, but on macOS and Linux
it is defined as 1, yet we don't build any of the assembly sources.
Rather than add them just for bootstrapping on non-FreeBSD, explicitly
define ZSTD_DISABLE_ASM so they're not needed and everything is
consistent.
This fixes building a bootstrap LLVM toolchain on non-FreeBSD amd64 (the
only architecture with assembly available).
Jessica Clarke [Tue, 22 Aug 2023 20:00:37 +0000 (21:00 +0100)]
arm: Add missing no-ctfconvert for fw_stub.awk target
This target produces a C file not an object file, so using ctfconvert on
it should not be attempted. This keeps it in sync with all other uses of
fw_stub.awk, squashes a warning seen during the build of TEGRA124 on
FreeBSD and avoids the same issue failing the build on non-FreeBSD (such
errors are #ifdef'ed into being warnings on FreeBSD in ctfconvert, which
should be revisited in the future).
Reviewed by: manu
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D41542
Jessica Clarke [Tue, 22 Aug 2023 20:00:28 +0000 (21:00 +0100)]
kbdcontrol: Support building as a bootstrap tool on old and non-FreeBSD
Systems that predate 971bac5ace7a ("kbd: consolidate kb interfaces
(phase one)") cannot build kbdcontrol since kbdelays and kbrates moved
to sys/kbio.h. Moreover, on non-FreeBSD, it requires all kinds of ioctls
and sysctls that are highly FreeBSD-specific to build, but we use it as
a bootstrap tool to generate the keymaps used by some kernels (LINT ones
in particular). Thus, when bootstrapping kbdcontrol, disable everything
that's not needed for that singular use, and use the in-tree kbio.h to
get the definitions of the necessary structures.
This allows KBDMUX_DFLT_KEYMAP, UKBD_DFLT_KEYMAP and ATKBD_DFLT_KEYMAP
to be enabled when building on non-FreeBSD, and thus LINT kernels.
Marius Strobl [Tue, 22 Aug 2023 18:12:59 +0000 (20:12 +0200)]
tcp_info: Add and export more FreeBSD-specific fields
This change adds struct tcp_info fields corresponding to the following
struct tcpcb ones:
- snd_una
- snd_max
- rcv_numsacks
- rcv_adv
- dupacks
Note that while both tcp_fill_info() and fill_tcp_info_from_tcb() are
extended accordingly, no counterpart of rcv_numsacks is available in
the cxgbe(4) TOE PCB, though.
Zhenlei Huang [Tue, 22 Aug 2023 09:20:10 +0000 (17:20 +0800)]
geom_linux_lvm: Check the offset of physical volume header
The LVM label is stored on any of the first four sectors, and the
PV (physical volume) header is stored within the same sector following
the LVM label. The current implementation does not fully check the
offset of PV header, when attaching a bad formatted LVM PV the kernel
may crash due to out-of-bounds memory read.
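A bounds check of the kind described might look like the sketch below. The sizes are assumptions made for illustration (a 512-byte sector, a 32-byte label header, and a 40-byte minimum PV header), and demo_pv_offset_ok() is a hypothetical name, not the function in geom_linux_lvm.

```c
#include <stdbool.h>
#include <stdint.h>

#define DEMO_SECTOR_SIZE	512u
#define DEMO_LABEL_SIZE		32u	/* assumed label header size */
#define DEMO_PV_HEADER_SIZE	40u	/* assumed minimum PV header size */

/*
 * The offset is relative to the start of the label. The PV header must
 * begin after the label header and end within the same sector.
 */
bool
demo_pv_offset_ok(uint32_t offset)
{
	if (offset < DEMO_LABEL_SIZE)
		return (false);
	/* Written as a subtraction so a huge offset cannot overflow. */
	if (offset > DEMO_SECTOR_SIZE - DEMO_PV_HEADER_SIZE)
		return (false);
	return (true);
}
```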
bhyve: add config option to load ACPI tables into memory
For backward compatibility, the ACPI tables are loaded into the guest
memory. Windows scans the memory, finds the ACPI tables and uses them.
It ignores the ACPI tables provided by the UEFI. We are patching the
ACPI tables in the guest memory, so that's mostly fine. However, Windows
will break when the ACPI tables become too large or when we add entries
which can't be patched by bhyve. One example of an unpatchable entry is
a TPM log. The TPM log has to be allocated by the guest firmware. As the
address of the TPM log is unpredictable, bhyve can't assign it in the
memory version of the ACPI tables. Additionally, this makes it
impossible for bhyve to calculate a correct checksum of the table.
By default ACPI tables are still loaded into guest memory for backward
compatibility. The new acpi_tables_in_memory config value can be set to
false to avoid this behaviour.
John Baldwin [Tue, 22 Aug 2023 04:02:42 +0000 (21:02 -0700)]
libcrypto: Update assembly build glue for x86 for OpenSSL 3.0.
Notably, define AES_ASM which is required for any AES acceleration
(OpenSSL 1.0 gated all AES acceleration on OPENSSL_CPUID_OBJ instead).
Enabling this exposed that new assembly files added in OpenSSL 3.0
needed to be included in the build (aes-x86-64.S and aes-586.S). Both
of these files supplant both aes_core.c and aes_cbc.c. The last file
had to be moved out of the MI SRCS line for aes and into each ASM_*
for non-x86.
As part of this I audited the generated configdata.pm for amd64, i386,
and aarch64 and found the following additional discrepancies that are
fixed here as well:
- Enabled BSAES_ASM on amd64 which requires bsaes-x86_64.S
- Enabled WHIRLPOOL_ASM on amd64 (asm sources already built)
- Enabled CMLL_ASM on amd64 and i386 (asm sources already built)
aarch64 had no discrepancies in configdata.pm, and no *.pl asm
generators were missing for aarch64 in Makefile.asm. I did not check
powerpc or armv7, but for armv7 all of the asm generators seem to be
present in Makefile.asm.
Reported by: gallatin (AES-GCM using plain software on amd64)
Reviewed by: gallatin, ngie, emaste
Differential Revision: https://reviews.freebsd.org/D41539
Ed Maste [Fri, 18 Aug 2023 03:29:33 +0000 (23:29 -0400)]
x86: handle domains with no CPUs usable for intr delivery
We can end up with a domain having no CPUs capable of receiving I/O
interrupts. This can occur, for example, when all APIC IDs in a given
domain are 256 or greater, and we have no IOMMU.
In this case disable per-domain interrupt support, effectively reverting
to the behaviour before commit a48de40bcc09 ("Only use CPUs in the
domain the device is attached to for default"). This has a performance
impact but at least allows the system to be functional. It is a stop-
gap until we can rely on the presence of an IOMMU on all x86 platforms.
Thanks to AMD for providing the high-thread-count machine I used for
testing this change, and to cperciva for testing on other hardware.
Reviewed by: jhb
Tested by: cperciva, emaste
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D41501
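The decision in the commit above amounts to checking that every domain still has at least one interrupt-capable CPU. A sketch using plain 64-bit masks instead of the kernel's cpuset_t, with a hypothetical function name:

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * domain_cpus[i] is the CPU mask of domain i; intr_cpus is the mask of
 * CPUs able to receive I/O interrupts. Per-domain interrupt placement
 * is usable only if no domain ends up with an empty intersection.
 */
bool
demo_per_domain_intr_ok(const uint64_t *domain_cpus, size_t ndomains,
    uint64_t intr_cpus)
{
	size_t i;

	for (i = 0; i < ndomains; i++) {
		if ((domain_cpus[i] & intr_cpus) == 0)
			return (false);
	}
	return (true);
}
```

When the check fails, the caller would fall back to choosing from all interrupt-capable CPUs regardless of domain.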
This changeset adds a baseline implementation of memcmp and bcmp
for amd64. The same code is used for both functions with conditional
code where the behaviour differs (we need more precise output for the
memcmp case).
FreeBSD documents that memcmp returns the difference between the
mismatching characters. Slightly faster code would be possible could
we relax this requirement to the ISO/IEC 9899:1999 requirement of
merely returning a negative/positive integer or zero.
Performance is better than bionic and glibc, except for long strings
where the two are 13% faster. This could be because they use SSE4
ptest which we cannot use in a baseline kernel.
Sponsored by: The FreeBSD Foundation
Approved by: mjg
Differential Revision: https://reviews.freebsd.org/D41442
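The difference in return-value requirements between the two functions is easiest to see in a scalar reference version. This is an illustrative sketch with hypothetical names, not the SIMD code added by the commit above.

```c
#include <stddef.h>

/* memcmp-style: return the difference of the first mismatching bytes. */
int
demo_memcmp(const void *a, const void *b, size_t len)
{
	const unsigned char *p = a, *q = b;
	size_t i;

	for (i = 0; i < len; i++) {
		if (p[i] != q[i])
			return (p[i] - q[i]);
	}
	return (0);
}

/* bcmp-style: only zero versus non-zero matters. */
int
demo_bcmp(const void *a, const void *b, size_t len)
{
	const unsigned char *p = a, *q = b;
	size_t i;

	for (i = 0; i < len; i++) {
		if (p[i] != q[i])
			return (1);
	}
	return (0);
}
```

The bcmp variant never has to locate or subtract the mismatching bytes, which is why the two can share code while differing only at the point where the result is produced.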
This commit adds a baseline implementation of stpcpy(3) for amd64.
It performs quite well in comparison to the previous scalar implementation
as well as against bionic and glibc (though glibc is faster for very long
strings). Fiddle with the Makefile to also have strcpy(3) call into the
optimised stpcpy(3) code, fixing an oversight from D9841.
Sponsored by: The FreeBSD Foundation
Reviewed by: imp ngie emaste
Approved by: mjg kib
Fixes: D9841
Differential Revision: https://reviews.freebsd.org/D41349
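Having strcpy(3) call into stpcpy(3) works because the two differ only in their return value. A scalar sketch with hypothetical demo_ names (the real change wires this up in the Makefile and assembly, not in C like this):

```c
/* Copy src to dst and return a pointer to the terminating NUL in dst. */
char *
demo_stpcpy(char *dst, const char *src)
{
	while ((*dst = *src) != '\0') {
		dst++;
		src++;
	}
	return (dst);
}

/* strcpy is stpcpy with the return value replaced by dst. */
char *
demo_strcpy(char *dst, const char *src)
{
	(void)demo_stpcpy(dst, src);
	return (dst);
}
```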
Doug Moore [Mon, 21 Aug 2023 17:28:51 +0000 (12:28 -0500)]
pctrie: change for vm_radix compatibility
Restructure parts of pctrie code to make it more compatible with the
needs of vm_radix code.
1. End passing function pointers for memory management.
By breaking insertion into two functions, the call for allocating
memory can happen at the top level and be inlined, rather than
happening via a function pointer to a memory allocator.
By changing the remove function slightly, freeing of memory, when
necessary, can happen at the top level and be inlined.
By turning the reclamation code into two functions, one for starting
iteration over to-be-freed nodes and the other continuing it, all the
freeing can happen at the top level and be inlined.
2. Offer a version of remove that does not panic and returns the freed
value (or NULL).
3. Offer a 'replace' operation, to replace one leaf with another that
has the same key.
These are three of the roadblocks that prevent code sharing between
pctrie and vm_radix code.
It is modelled after aligned_alloc(3). Most importantly, to free the
allocation, __crt_free() can be used. Additionally, the caller may specify
an offset into the aligned allocation, so that the returned pointer is
displaced from the aligned address by that offset.
Reviewed by: markj
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D41150