Kristof Provost [Wed, 13 Oct 2021 14:06:47 +0000 (16:06 +0200)]
pfctl: delay label macro expansion until after rule optimisation
We used to expand the $nr macro in labels into the rule number prior to
the optimisation step. This would occasionally produce incorrect rule
numbers in the labels.
Delay all macro expansion until after the optimisation step to ensure
that we expand the correct values.
Rick Macklem [Fri, 15 Oct 2021 21:25:38 +0000 (14:25 -0700)]
nfscl: Add an argument to nfscl_tryclose()
This patch adds a new argument to nfscl_tryclose() to indicate
whether or not it should loop when a NFSERR_DELAY reply is received
from the NFSv4 server. Since this new argument is always passed in
as "true" at this time, no semantics change should occur.
This is being done to prepare the code for a future patch that fixes
the case where an NFSv4.1/4.2 server replies NFSERR_DELAY to a Close
operation.
Ed Maste [Thu, 7 Oct 2021 00:42:40 +0000 (20:42 -0400)]
Add libcbor to the build
From https://github.com/PJK/libcbor:
libcbor is a C library for parsing and generating CBOR, the general-
purpose schema-less binary data format.
libcbor will be used by ssh to support FIDO/U2F keys. It is currently
intended only for use by ssh, and so is installed as a PRIVATELIB and is
placed in the ssh pkgbase package.
cbor_export.h and configuration.h were generated by the upstream CMake
build. We could create them with bmake rules instead (as NetBSD has
done) but this is a fine start.
This is currently disabled for the 32-bit library build as libfido2 is
not compatible with the COMPAT_32BIT hack in usb_ioctl.h, and there is
no need for libcbor without libfido2.
Reviewed by: kevans
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D32347
mixer(8): Fix mixer status line for /dev/dspX.vpY mixer devices.
In some cases when passing /dev/dspX.vpY as mixer devices, m->ci.longname and
m->ci.hw_info will be empty. Don't print any brackets and parentheses
in this case.
Dawid Gorecki [Wed, 13 Oct 2021 19:06:05 +0000 (21:06 +0200)]
libthr: Use kern.stacktop for thread stack calculation.
Use the new kern.stacktop sysctl to retrieve the address of stack top
instead of kern.usrstack. kern.usrstack does not have any knowledge
of the stack gap, so this can cause problems with thread stacks.
Using kern.stacktop sysctl should fix most of those problems.
kern.usrstack is used as a fallback when kern.stacktop cannot be read.
Rename usrstack variables to stacktop to reflect this change.
Fixes problems with firefox and thunderbird not starting with
stack gap enabled.
Dawid Gorecki [Wed, 13 Oct 2021 19:03:37 +0000 (21:03 +0200)]
kern_exec: Add kern.stacktop sysctl.
With stack gap enabled top of the stack is moved down by a random
amount of bytes. Because of that some multithreaded applications
which use kern.usrstack sysctl to calculate address of stacks for
their threads can fail. Add kern.stacktop sysctl, which can be used
to retrieve address of the stack after stack gap is applied to it.
Returns value identical to kern.usrstack for processes which have
no stack gap.
Dawid Gorecki [Wed, 13 Oct 2021 19:01:08 +0000 (21:01 +0200)]
setrlimit: Take stack gap into account.
Calling setrlimit with stack gap enabled and with low values of stack
resource limit often caused the program to abort immediately after
exiting the syscall. This happened due to the fact that the resource
limit was calculated assuming that the stack started at sv_usrstack,
while with stack gap enabled the stack is moved by a random number
of bytes.
Save information about stack size in struct vmspace and adjust the
rlim_cur value. If the rlim_cur and stack gap is bigger than rlim_max,
then the value is truncated to rlim_max.
Corvin Köhne [Fri, 15 Oct 2021 07:25:54 +0000 (09:25 +0200)]
bhyve: ignore low bits of CFGADR
Bhyve could emulate wrong PCI registers.
In the best case, the guest reads wrong registers and the device driver would
report some errors.
In the worst case, the guest writes to wrong PCI registers and could brick
hardware when using PCI passthrough.
According to Intels specification, low bits of CFGADR should be
ignored. Some OS like linux may rely on it. Otherwise, bhyve could
emulate a wrong PCI register.
E.g.
If linux would like to read 2 bytes from offset 0x02, following would
happen.
linux:
outl 0x80000002 at CFGADR
inw at CFGDAT + 2
bhyve:
cfgoff = 0x80000002 & 0xFF = 0x02
coff = cfgoff + (port - CFGDAT) = 0x02 + 0x02 = 0x04
Bhyve would emulate the register at offset 0x04 not 0x02.
Rick Macklem [Fri, 15 Oct 2021 00:28:01 +0000 (17:28 -0700)]
nfscl: Restructure nfscl_freeopen() slightly
This patch factors the unlinking of the nfsclopen structure out of
nfscl_freeopen() into a separate function called nfscl_unlinkopen().
It also adds a new argument to nfscl_freeopen() to conditionally do
the unlink. Since this new argument is always passed in as "true"
at this time, no semantics change should occur.
This is being done to prepare the code for a future patch that fixes
the case where an NFSv4.1/4.2 server replies NFSERR_DELAY to a Close
operation.
John Baldwin [Thu, 14 Oct 2021 22:48:34 +0000 (15:48 -0700)]
ktls: Defer creation of threads and zones until first use.
Run ktls_init() when the first KTLS session is created rather than
unconditionally during boot. This avoids creating unused threads and
allocating unused resources on systems which do not use KTLS.
John Baldwin [Thu, 14 Oct 2021 17:59:16 +0000 (10:59 -0700)]
cxgbe: Only run ktls_tick when NIC TLS is enabled.
Previously the body of ktls_tick was a nop when NIC TLS was disabled,
but the callout was still scheduled consuming power on otherwise-idle
systems with Chelsio T6 adapters. Now the callout only runs while NIC
TLS is enabled on at least one interface of an adapter.
Leandro Lupori [Thu, 14 Oct 2021 16:13:27 +0000 (13:13 -0300)]
powerpc64: make radix with superpages default
As Radix MMU with superpages enabled is now stable, make it the
default choice on supported hardware (POWER9 and above), since its
performance is greater than that of HPT MMU.
Reviewed by: alfredo, jhibbits
Sponsored by: Instituto de Pesquisas Eldorado (eldorado.org.br)
Differential Revision: https://reviews.freebsd.org/D30797
Warner Losh [Thu, 14 Oct 2021 14:44:37 +0000 (08:44 -0600)]
nvme: Reduce traffic to the doorbell register
Reduce traffic to doorbell register when processing multiple completion
events at once. Only write it at the end of the loop after we've
processed everything (assuming we found at least one completion,
even if that completion wasn't valid).
Leandro Lupori [Thu, 14 Oct 2021 13:39:52 +0000 (10:39 -0300)]
powerpc64: fix OFWFB with Radix MMU
Current implementation of Radix MMU doesn't support mapping
arbitrary virtual addresses, such as the ones generated by
"direct mapping" I/O addresses. This caused the system to hang, when
early I/O addresses, such as those used by OpenFirmware Frame Buffer,
were remapped after the MMU was up.
To avoid having to modify mmu_radix_kenter_attr just to support this
use case, this change makes early I/O map use virtual addresses from
KVA area instead (similar to what mmu_radix_mapdev_attr does), as
these can be safely remapped later.
Reviewed by: alfredo (earlier version), jhibbits (in irc)
MFC after: 2 weeks
Sponsored by: Instituto de Pesquisas Eldorado (eldorado.org.br)
Differential Revision: https://reviews.freebsd.org/D31232
unionfs: Ensure SAVENAME is set for unionfs vnode operations
"rm-style" system calls such as kern_frmdirat() and kern_funlinkat()
don't supply SAVENAME to preserve the pathname buffer for subsequent
vnode ops. For unionfs this poses an issue because the pathname may
be needed for a relookup operation in unionfs_remove()/unionfs_rmdir().
Currently unionfs doesn't check for this case, leading to a panic on
DIAGNOSTIC kernels and use-after-free of cn_nameptr otherwise.
The unionfs node's stored buffer would suffice as a replacement for
cnp->cn_nameptr in some (but not all) cases, but it's cleaner to just
ensure that unionfs vnode ops always have a valid cn_nameptr by setting
SAVENAME in unionfs_lookup().
While here, do some light cleanup in unionfs_lookup() and assert that
HASBUF is always present in the relevant relookup calls.
Rick Macklem [Wed, 13 Oct 2021 22:48:54 +0000 (15:48 -0700)]
nfscl: Make nfscl_getlayout() acquire the correct pNFS layout
Without this patch, if a pNFS read layout has already been acquired
for a file, writes would be redirected to the Metadata Server (MDS),
because nfscl_getlayout() would not acquire a read/write layout for
the file. This happened because there was no "mode" argument to
nfscl_getlayout() to indicate whether reading or writing was being done.
Since doing I/O through the Metadata Server is not encouraged for some
pNFS servers, it is preferable to get a read/write layout for writes
instead of redirecting the write to the MDS.
This patch adds a access mode argument to nfscl_getlayout() and
nfsrpc_getlayout(), so that nfscl_getlayout() knows to acquire a read/write
layout for writing, even if a read layout has already been acquired.
This patch only affects NFSv4.1/4.2 client behaviour when pNFS ("pnfs" mount
option against a server that supports pNFS) is in use.
This problem was detected during a recent NFSv4 interoperability
testing event held by the IETF working group.
John Baldwin [Wed, 13 Oct 2021 19:30:15 +0000 (12:30 -0700)]
ktls: Ensure FIFO encryption order for TLS 1.0.
TLS 1.0 records are encrypted as one continuous CBC chain where the
last block of the previous record is used as the IV for the next
record. As a result, TLS 1.0 records cannot be encrypted out of order
but must be encrypted as a FIFO.
If the later pages of a sendfile(2) request complete before the first
pages, then TLS records can be encrypted out of order. For TLS 1.1
and later this is fine, but this can break for TLS 1.0.
To cope, add a queue in each TLS session to hold TLS records that
contain valid unencrypted data but are waiting for an earlier TLS
record to be encrypted first.
- In ktls_enqueue(), check if a TLS record being queued is the next
record expected for a TLS 1.0 session. If not, it is placed in
sorted order in the pending_records queue in the TLS session.
If it is the next expected record, queue it for SW encryption like
normal. In addition, check if this new record (really a potential
batch of records) was holding up any previously queued records in
the pending_records queue. Any of those records that are now in
order are also placed on the queue for SW encryption.
- In ktls_destroy(), free any TLS records on the pending_records
queue. These mbufs are marked M_NOTREADY so were not freed when the
socket buffer was purged in sbdestroy(). Instead, they must be
freed explicitly.
Gleb Smirnoff [Fri, 8 Oct 2021 19:56:24 +0000 (12:56 -0700)]
Remove in_ifaddr_lock acquisiton to access in_ifaddrhead.
An IPv4 address is embedded into an ifaddr which is freed
via epoch. And the in_ifaddrhead is already a CK list. Use
the network epoch to protect against use after free.
Next step would be to CK-ify the in_addr hash and get rid of the...
Mark Johnston [Wed, 13 Oct 2021 00:11:02 +0000 (20:11 -0400)]
mount: Check for !VDIR mount points before handling -o emptydir
To implement -o emptydir, vfs_emptydir() checks that the passed
directory is empty. This should be done after checking whether the
vnode is of type VDIR, though, or vfs_emptydir() may end up calling
VOP_READDIR on a non-directory.
Reported by: syzbot+4006732c69fb0f792b2c@syzkaller.appspotmail.com
Reviewed by: kib, imp
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D32475
Kyle Evans [Wed, 13 Oct 2021 09:21:28 +0000 (04:21 -0500)]
native-xtools: avoid libllvm while populating the sysroot
Prior to 021385aba562, MK_CLANG=no was sufficient to avoid descending
into lib/clang, but the referenced change added a couple of other
enabling knobs. Turn those off, too, to continue avoiding libllvm.
With this change, we no longer end up with a libllvm using the wrong
default target triple; `poudriere jail -cx` works once again.
Hartmut Brandt [Sun, 10 Oct 2021 15:03:51 +0000 (17:03 +0200)]
Allow the BPF to be select for write. This is needed for boost:asio
which otherwise fails to handle BPFs.
Reviewed by: ae
Differential Revision: https://reviews.freebsd.org/D31967
Rick Macklem [Wed, 13 Oct 2021 00:21:01 +0000 (17:21 -0700)]
nfscl: Fix another deadlock related to the NFSv4 clientID lock
Without this patch, it is possible to hang the NFSv4 client,
when a rename/remove is being done on a file where the client
holds a delegation, if pNFS is being used. For a delegation
to be returned, dirty data blocks must be flushed to the NFSv4
server. When pNFS is in use, a shared lock on the clientID
must be acquired while doing a write to the DS(s).
However, if rename/remove is doing the delegation return
an exclusive lock will be acquired on the clientID, preventing
the write to the DS(s) from acquiring a shared lock on the clientID.
This patch stops rename/remove from doing a delegation return
if pNFS is enabled. Since doing delegation return in the same
compound as rename/remove is only an optimization, not doing
so should not cause problems.
This problem was detected during a recent NFSv4 interoperability
testing event held by the IETF working group.
crt_malloc: Be more persistent when handling mmap() failure
In the situation with limited address space, together with
fragmentation, it is possible for mmap() request in morecore() to fail
when asking for required size + NPOOLPAGES, but succeed without the
addend. Retry allocation there.
John Baldwin [Tue, 12 Oct 2021 21:03:07 +0000 (14:03 -0700)]
Stop creating socket aio kprocs during boot.
Create the initial pool of kprocs on demand when the first socket AIO
request is submitted instead. The pool of kprocs used for other AIO
requests is similarly created on first use.
This partially reverts e81e77c5a055, leaving the option both in
GENERICs on amd64/arm64/arm, and in global NOTES file. Apparently
this better matches existing practice, where we do not try to hard
to make LINT and GENERIC complimentary.
Requested and reviewed by: jhb
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Andrew Turner [Tue, 12 Oct 2021 11:39:14 +0000 (12:39 +0100)]
Stop reading the arm64 domain when it's known
There is no need to read the domain on arm64 when there is only one
in the ACPI tables. This can also happen when the table is missing
as it is unneeded.
Reported by: dch
Sponsored by: The FreeBSD Foundation
Kyle Evans [Sat, 2 Oct 2021 05:23:03 +0000 (00:23 -0500)]
fifos: delegate unhandled kqueue filters to underlying filesystem
This gives the vfs layer a chance to provide handling for EVFILT_VNODE,
for instance. Change pipe_specops to use the default vop_kqfilter to
accommodate fifoops that don't specify the method (i.e. all in-tree).
Rick Macklem [Tue, 12 Oct 2021 04:58:24 +0000 (21:58 -0700)]
nfscl: Fix a deadlock related to the NFSv4 clientID lock
Without this patch, it is possible for a process doing an NFSv4
Open/create of a file to block to allow another process
to acquire the exclusive lock on the clientID when holding
a shared lock on the clientID. As such, both processes
deadlock, with one wanting the exclusive lock, while the
other holds the shared lock. This deadlock is unlikely to occur
unless delegations are in use on the NFSv4 mount.
This patch fixes the problem by not deferring to the process
waiting for the exclusive lock when a shared lock (reference cnt)
is already held by the process.
This problem was detected during a recent NFSv4 interoperability
testing event held by the IETF working group.
Warner Losh [Mon, 11 Oct 2021 18:59:39 +0000 (12:59 -0600)]
forward declare struct thread
sys/sysctl.h moved struct thread forward declaration under #ifdef
_KERNEL and so this header fails when included from userland. Add a
forward declaration here.
Although the change worked locally, it's breaking something in the CI
build for the riscv64 build (which makes no sense it would only break
that since we're building host tools to bootstrap at that point).
Warner Losh [Mon, 11 Oct 2021 17:14:51 +0000 (11:14 -0600)]
sysctl: make sys/sysctl.h self contained
sys/sysctl.h only needs u_int and size_t from sys/types.h. When the
sysctl interface was designed, having one more more prerequisites
(especially sys/types.h) was the norm. Times have changed, and to make
things more portable, make sys/types.h optional. We do this by including
sys/_types.h, defining size_t if needed, and changing u_int to 'unsigned
int' in a prototype for userland builds. For kernel builds, sys/types.h
is still required.
Warner Losh [Mon, 11 Oct 2021 17:13:39 +0000 (11:13 -0600)]
bootstrap: No need to disable shared libraries for bootstrap tools
There's no need to disable shared libraries when building the bootstrap
tools. This was added on 2000 (commit ad879ce9552c) when the perl
bootstrap was added (libperl and miniperl) and saved a fair amount of
time (perl took a long time to build on 2000-era hardware).
For many years now, however, we rarely build any libraries when
bootstrapping. Even when we do, the optimization saves at most a few
seconds when upgrading since the libraries built have been small. Shared
libraries are more robust accross versions that static libraries due to
creaping dependencies (we aren't crossing versions of share libraries,
though, just using what's on the host). In addition, linux and macos
have been building like this for some time because static binaries on
those systems are difficult to impossible.
last: improve non-UTF8 locale output after libxo support was added
Some strftime(3) conversion specifications will generate strings encoded
with the current locale, not necessarily UTF8. As per xo_format.5, use
the h string modifier so that libxo interprets it appropriately.
Reviewed by: eugen, philip
Differential Revision: https://reviews.freebsd.org/D32437
Alex Richardson [Mon, 11 Oct 2021 10:46:30 +0000 (11:46 +0100)]
Update OptionalObsoleteFiles.inc after 021385aba562
I forgot to update this file so make delete-old would incorrectly remove
the newly-installed LLVM binutils. While touching the file also update
for 8e1c989abbd1 since ObsoleteFiles.inc now inludes the tablegen binaries.
Reported by: Herbert J. Skuhra <herbert@gojira.at>
Reviewed By: emaste, imp
Andrew Turner [Wed, 6 Oct 2021 16:38:22 +0000 (17:38 +0100)]
Only demote when needed in the arm64 pmap_change_props_locked
When changing page table properties there is no need to demote a
level 1 or level 2 block if we are changing the entire memory range the
block is mapping. In this case just change the block directly.
Reported by: alc, kib, markj
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D32339
Rick Macklem [Mon, 11 Oct 2021 01:46:02 +0000 (18:46 -0700)]
nfsd: Disable the NFSv4.2 Allocate operation by default
Some exported file systems, such as ZFS ones, cannot do VOP_ALLOCATE().
Since an NFSv4.2 server must either support the Allocate operation for
all file systems or not support it at all, define a sysctl called
vfs.nfsd.enable_v42allocate to enable the Allocate operation.
This sysctl is false by default and can only be set true if all
exported file systems (or all DSs for a pNFS server) can perform
VOP_ALLOCATE().
Unfortunately, there is no way to know if a ZFS file system will
be exported once the nfsd is operational, even if there are none
exported when the nfsd is started up, so enabling Allocate must
be done manually for a server configuration.
This problem was detected during a recent NFSv4 interoperability
testing event held by the IETF working group.
Rick Macklem [Sun, 10 Oct 2021 21:27:52 +0000 (14:27 -0700)]
nfscl: Fix NFS VOP_ALLOCATE for mounts without Allocate support
Without this patch, nfs_allocate() fell back on using vop_stdallocate()
for NFS mounts without Allocate operation support. This was incorrect,
since some file systems, such as ZFS, cannot do allocate via
vop_stdallocate(), which uses writes to try and allocate blocks.
Also, fix nfs_allocate() to return EINVAL when mounts cannot do Allocate,
since that is the correct error for posix_fallocate(2).
Note that Allocate is only supported by some NFSv4.2 servers.
Mark Peek [Sat, 9 Oct 2021 21:21:16 +0000 (14:21 -0700)]
vmci: fix panic due to freeing unallocated resources
Summary:
An error mapping PCI resources results in a panic due to unallocated
resources being freed up. This change puts the appropriate checks in
place to prevent the panic.
PR: 252445
Reported by: Marek Zarychta <zarychtam@plan-b.pwste.edu.pl>
Tested by: marcus
MFC after: 1 week
Sponsored by: VMware
Test Plan:
Along with user testing, also simulated error by inserting a ENXIO
return in vmci_map_bars().
Reviewed by: marcus
Subscribers: imp
Differential Revision: https://reviews.freebsd.org/D32016
Mark Johnston [Sat, 9 Oct 2021 15:36:19 +0000 (11:36 -0400)]
bhyve: Map the MSI-X table unconditionally for passthrough
It is possible for the PBA to reside in the same page as the MSI-X
table. And, while devices are not supposed to do this, at least some
Intel wifi devices place registers in a page shared with the MSI-X
table. To handle the first case we currently map the PBA page using
/dev/mem, and the second case is not handled.
Kill two birds with one stone: map the MSI-X table BAR using the
PCIOCBARMMAP ioctl instead of /dev/mem, and map the entire table so that
accesses beyond the bounds of the table can be emulated. Regions of the
BAR not containing the table are left unmapped.
Reviewed by: bz, grehan, jhb
MFC after: 3 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D32359
Tracking the number of unused holes in the trie and the range table
was a bad metric based on which full trie and / or range rebuilds
were triggered, which would happen in vain by far too frequently,
particularly with live BGP feeds.
Instead, track the total unused space inside the trie and range table
structures, and trigger rebuilds if the percentage of unused space
exceeds a sysctl-tunable threshold.