unionfs: Ensure SAVENAME is set for unionfs vnode operations
"rm-style" system calls such as kern_frmdirat() and kern_funlinkat()
don't supply SAVENAME to preserve the pathname buffer for subsequent
vnode ops. For unionfs this poses an issue because the pathname may
be needed for a relookup operation in unionfs_remove()/unionfs_rmdir().
Currently unionfs doesn't check for this case, leading to a panic on
DIAGNOSTIC kernels and use-after-free of cn_nameptr otherwise.
The unionfs node's stored buffer would suffice as a replacement for
cnp->cn_nameptr in some (but not all) cases, but it's cleaner to just
ensure that unionfs vnode ops always have a valid cn_nameptr by setting
SAVENAME in unionfs_lookup().
While here, do some light cleanup in unionfs_lookup() and assert that
HASBUF is always present in the relevant relookup calls.
Rick Macklem [Wed, 13 Oct 2021 22:48:54 +0000 (15:48 -0700)]
nfscl: Make nfscl_getlayout() acquire the correct pNFS layout
Without this patch, if a pNFS read layout has already been acquired
for a file, writes would be redirected to the Metadata Server (MDS),
because nfscl_getlayout() would not acquire a read/write layout for
the file. This happened because there was no "mode" argument to
nfscl_getlayout() to indicate whether reading or writing was being done.
Since doing I/O through the Metadata Server is not encouraged for some
pNFS servers, it is preferable to get a read/write layout for writes
instead of redirecting the write to the MDS.
This patch adds an access mode argument to nfscl_getlayout() and
nfsrpc_getlayout(), so that nfscl_getlayout() knows to acquire a read/write
layout for writing, even if a read layout has already been acquired.
This patch only affects NFSv4.1/4.2 client behaviour when pNFS ("pnfs" mount
option against a server that supports pNFS) is in use.
This problem was detected during a recent NFSv4 interoperability
testing event held by the IETF working group.
John Baldwin [Wed, 13 Oct 2021 19:30:15 +0000 (12:30 -0700)]
ktls: Ensure FIFO encryption order for TLS 1.0.
TLS 1.0 records are encrypted as one continuous CBC chain where the
last block of the previous record is used as the IV for the next
record. As a result, TLS 1.0 records cannot be encrypted out of order
but must be encrypted as a FIFO.
If the later pages of a sendfile(2) request complete before the first
pages, then TLS records can be encrypted out of order. For TLS 1.1
and later this is fine, but this can break for TLS 1.0.
To cope, add a queue in each TLS session to hold TLS records that
contain valid unencrypted data but are waiting for an earlier TLS
record to be encrypted first.
- In ktls_enqueue(), check if a TLS record being queued is the next
record expected for a TLS 1.0 session. If not, it is placed in
sorted order in the pending_records queue in the TLS session.
If it is the next expected record, queue it for SW encryption like
normal. In addition, check if this new record (really a potential
batch of records) was holding up any previously queued records in
the pending_records queue. Any of those records that are now in
order are also placed on the queue for SW encryption.
- In ktls_destroy(), free any TLS records on the pending_records
queue. These mbufs are marked M_NOTREADY so were not freed when the
socket buffer was purged in sbdestroy(). Instead, they must be
freed explicitly.
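The ordering rule described above can be sketched in userland C. This is only an illustration, not the ktls code: struct record, struct session, session_enqueue() and the dispatched[] log are hypothetical names, and "dispatch" merely records the order in which records would be handed to software encryption.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a TLS record (or a batch of records). */
struct record {
    uint64_t seqno;         /* sequence number of the first record */
    uint64_t count;         /* number of records in this batch */
    struct record *next;    /* link in the pending list */
};

/* Hypothetical per-session state. */
struct session {
    uint64_t next_seqno;        /* next record allowed to be encrypted */
    struct record *pending;     /* out-of-order records, sorted by seqno */
    uint64_t dispatched[16];    /* dispatch order, kept for illustration */
    size_t ndispatched;
};

/* Hand a record to "encryption"; it must be the next one expected. */
static void
dispatch(struct session *s, struct record *r)
{
    assert(r->seqno == s->next_seqno);
    assert(s->ndispatched < 16);
    s->dispatched[s->ndispatched++] = r->seqno;
    s->next_seqno += r->count;
}

void
session_enqueue(struct session *s, struct record *r)
{
    struct record **pp;

    if (r->seqno != s->next_seqno) {
        /* Not next in line: insert into the pending list in sorted order. */
        for (pp = &s->pending; *pp != NULL && (*pp)->seqno < r->seqno;
            pp = &(*pp)->next)
            ;
        r->next = *pp;
        *pp = r;
        return;
    }

    /* Next in line: dispatch it, then any records it was holding up. */
    dispatch(s, r);
    while (s->pending != NULL && s->pending->seqno == s->next_seqno) {
        r = s->pending;
        s->pending = r->next;
        r->next = NULL;
        dispatch(s, r);
    }
}
```

In this sketch, a record that arrives early simply waits on the sorted list, so the dispatch order always matches the sequence order regardless of arrival order.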
Gleb Smirnoff [Fri, 8 Oct 2021 19:56:24 +0000 (12:56 -0700)]
Remove in_ifaddr_lock acquisition to access in_ifaddrhead.
An IPv4 address is embedded into an ifaddr which is freed
via epoch. And the in_ifaddrhead is already a CK list. Use
the network epoch to protect against use after free.
Next step would be to CK-ify the in_addr hash and get rid of the...
Mark Johnston [Wed, 13 Oct 2021 00:11:02 +0000 (20:11 -0400)]
mount: Check for !VDIR mount points before handling -o emptydir
To implement -o emptydir, vfs_emptydir() checks that the passed
directory is empty. This should be done after checking whether the
vnode is of type VDIR, though, or vfs_emptydir() may end up calling
VOP_READDIR on a non-directory.
Reported by: syzbot+4006732c69fb0f792b2c@syzkaller.appspotmail.com
Reviewed by: kib, imp
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D32475
Kyle Evans [Wed, 13 Oct 2021 09:21:28 +0000 (04:21 -0500)]
native-xtools: avoid libllvm while populating the sysroot
Prior to 021385aba562, MK_CLANG=no was sufficient to avoid descending
into lib/clang, but the referenced change added a couple of other
enabling knobs. Turn those off, too, to continue avoiding libllvm.
With this change, we no longer end up with a libllvm using the wrong
default target triple; `poudriere jail -cx` works once again.
Hartmut Brandt [Sun, 10 Oct 2021 15:03:51 +0000 (17:03 +0200)]
Allow the BPF to be selected for write. This is needed for boost::asio
which otherwise fails to handle BPFs.
Reviewed by: ae
Differential Revision: https://reviews.freebsd.org/D31967
Rick Macklem [Wed, 13 Oct 2021 00:21:01 +0000 (17:21 -0700)]
nfscl: Fix another deadlock related to the NFSv4 clientID lock
Without this patch, it is possible to hang the NFSv4 client,
when a rename/remove is being done on a file where the client
holds a delegation, if pNFS is being used. For a delegation
to be returned, dirty data blocks must be flushed to the NFSv4
server. When pNFS is in use, a shared lock on the clientID
must be acquired while doing a write to the DS(s).
However, if rename/remove is doing the delegation return
an exclusive lock will be acquired on the clientID, preventing
the write to the DS(s) from acquiring a shared lock on the clientID.
This patch stops rename/remove from doing a delegation return
if pNFS is enabled. Since doing delegation return in the same
compound as rename/remove is only an optimization, not doing
so should not cause problems.
This problem was detected during a recent NFSv4 interoperability
testing event held by the IETF working group.
crt_malloc: Be more persistent when handling mmap() failure
When address space is limited and fragmented, it is possible for an
mmap() request in morecore() to fail when asking for the required
size + NPOOLPAGES, but succeed without the addend. Retry the allocation
without the addend in that case.
John Baldwin [Tue, 12 Oct 2021 21:03:07 +0000 (14:03 -0700)]
Stop creating socket aio kprocs during boot.
Create the initial pool of kprocs on demand when the first socket AIO
request is submitted instead. The pool of kprocs used for other AIO
requests is similarly created on first use.
This partially reverts e81e77c5a055, leaving the option both in
GENERICs on amd64/arm64/arm, and in the global NOTES file. Apparently
this better matches existing practice, where we do not try too hard
to make LINT and GENERIC complementary.
Requested and reviewed by: jhb
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Andrew Turner [Tue, 12 Oct 2021 11:39:14 +0000 (12:39 +0100)]
Stop reading the arm64 domain when it's known
There is no need to read the domain on arm64 when there is only one
in the ACPI tables. This can also happen when the table is missing
as it is unneeded.
Reported by: dch
Sponsored by: The FreeBSD Foundation
Kyle Evans [Sat, 2 Oct 2021 05:23:03 +0000 (00:23 -0500)]
fifos: delegate unhandled kqueue filters to underlying filesystem
This gives the vfs layer a chance to provide handling for EVFILT_VNODE,
for instance. Change pipe_specops to use the default vop_kqfilter to
accommodate fifoops that don't specify the method (i.e. all in-tree).
Rick Macklem [Tue, 12 Oct 2021 04:58:24 +0000 (21:58 -0700)]
nfscl: Fix a deadlock related to the NFSv4 clientID lock
Without this patch, it is possible for a process doing an NFSv4
Open/create of a file to block to allow another process
to acquire the exclusive lock on the clientID when holding
a shared lock on the clientID. As such, both processes
deadlock, with one wanting the exclusive lock, while the
other holds the shared lock. This deadlock is unlikely to occur
unless delegations are in use on the NFSv4 mount.
This patch fixes the problem by not deferring to the process
waiting for the exclusive lock when a shared lock (reference cnt)
is already held by the process.
This problem was detected during a recent NFSv4 interoperability
testing event held by the IETF working group.
Warner Losh [Mon, 11 Oct 2021 18:59:39 +0000 (12:59 -0600)]
forward declare struct thread
sys/sysctl.h moved struct thread forward declaration under #ifdef
_KERNEL and so this header fails when included from userland. Add a
forward declaration here.
Although the change worked locally, it breaks something in the CI
build for riscv64 (it makes no sense that only that build would break,
since we're building host tools to bootstrap at that point).
Warner Losh [Mon, 11 Oct 2021 17:14:51 +0000 (11:14 -0600)]
sysctl: make sys/sysctl.h self contained
sys/sysctl.h only needs u_int and size_t from sys/types.h. When the
sysctl interface was designed, having one or more prerequisites
(especially sys/types.h) was the norm. Times have changed, and to make
things more portable, make sys/types.h optional. We do this by including
sys/_types.h, defining size_t if needed, and changing u_int to 'unsigned
int' in a prototype for userland builds. For kernel builds, sys/types.h
is still required.
Warner Losh [Mon, 11 Oct 2021 17:13:39 +0000 (11:13 -0600)]
bootstrap: No need to disable shared libraries for bootstrap tools
There's no need to disable shared libraries when building the bootstrap
tools. This was added in 2000 (commit ad879ce9552c) when the perl
bootstrap was added (libperl and miniperl) and saved a fair amount of
time (perl took a long time to build on 2000-era hardware).
For many years now, however, we rarely build any libraries when
bootstrapping. Even when we do, the optimization saves at most a few
seconds when upgrading since the libraries built have been small. Shared
libraries are more robust across versions than static libraries due to
creeping dependencies (we aren't crossing versions of shared libraries,
though, just using what's on the host). In addition, Linux and macOS
have been building like this for some time because static binaries on
those systems are difficult to impossible.
last: improve non-UTF8 locale output after libxo support was added
Some strftime(3) conversion specifications will generate strings encoded
with the current locale, not necessarily UTF8. As per xo_format.5, use
the h string modifier so that libxo interprets it appropriately.
Reviewed by: eugen, philip
Differential Revision: https://reviews.freebsd.org/D32437
Alex Richardson [Mon, 11 Oct 2021 10:46:30 +0000 (11:46 +0100)]
Update OptionalObsoleteFiles.inc after 021385aba562
I forgot to update this file so make delete-old would incorrectly remove
the newly-installed LLVM binutils. While touching the file also update
for 8e1c989abbd1 since ObsoleteFiles.inc now includes the tablegen binaries.
Reported by: Herbert J. Skuhra <herbert@gojira.at>
Reviewed By: emaste, imp
Andrew Turner [Wed, 6 Oct 2021 16:38:22 +0000 (17:38 +0100)]
Only demote when needed in the arm64 pmap_change_props_locked
When changing page table properties there is no need to demote a
level 1 or level 2 block if we are changing the entire memory range the
block is mapping. In this case just change the block directly.
Reported by: alc, kib, markj
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D32339
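The "change the block directly" condition can be sketched as a small predicate. This is an illustration under stated assumptions, not the pmap code: block_fully_covered() is a hypothetical name, and the block size is passed in (for example 2 MiB for a level 2 block, assuming a 4 KiB translation granule).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical predicate: returns true when the range starting at va
 * and spanning `remaining` bytes covers the whole block that maps va,
 * so the block could be changed in place without demoting it.
 */
bool
block_fully_covered(uint64_t va, uint64_t remaining, uint64_t block_size)
{
    /* Block sizes are powers of two. */
    assert(block_size != 0 && (block_size & (block_size - 1)) == 0);

    /* va must sit on a block boundary and the range must span the block. */
    return ((va & (block_size - 1)) == 0 && remaining >= block_size);
}
```

When the predicate is false, only part of the block changes, so demotion to smaller mappings is still needed.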
Rick Macklem [Mon, 11 Oct 2021 01:46:02 +0000 (18:46 -0700)]
nfsd: Disable the NFSv4.2 Allocate operation by default
Some exported file systems, such as ZFS ones, cannot do VOP_ALLOCATE().
Since an NFSv4.2 server must either support the Allocate operation for
all file systems or not support it at all, define a sysctl called
vfs.nfsd.enable_v42allocate to enable the Allocate operation.
This sysctl is false by default and can only be set true if all
exported file systems (or all DSs for a pNFS server) can perform
VOP_ALLOCATE().
Unfortunately, there is no way to know if a ZFS file system will
be exported once the nfsd is operational, even if there are none
exported when the nfsd is started up, so enabling Allocate must
be done manually for a server configuration.
This problem was detected during a recent NFSv4 interoperability
testing event held by the IETF working group.
Rick Macklem [Sun, 10 Oct 2021 21:27:52 +0000 (14:27 -0700)]
nfscl: Fix NFS VOP_ALLOCATE for mounts without Allocate support
Without this patch, nfs_allocate() fell back on using vop_stdallocate()
for NFS mounts without Allocate operation support. This was incorrect,
since some file systems, such as ZFS, cannot do allocate via
vop_stdallocate(), which uses writes to try and allocate blocks.
Also, fix nfs_allocate() to return EINVAL when mounts cannot do Allocate,
since that is the correct error for posix_fallocate(2).
Note that Allocate is only supported by some NFSv4.2 servers.
Mark Peek [Sat, 9 Oct 2021 21:21:16 +0000 (14:21 -0700)]
vmci: fix panic due to freeing unallocated resources
Summary:
An error mapping PCI resources results in a panic due to unallocated
resources being freed up. This change puts the appropriate checks in
place to prevent the panic.
PR: 252445
Reported by: Marek Zarychta <zarychtam@plan-b.pwste.edu.pl>
Tested by: marcus
MFC after: 1 week
Sponsored by: VMware
Test Plan:
Along with user testing, also simulated error by inserting an ENXIO
return in vmci_map_bars().
Reviewed by: marcus
Subscribers: imp
Differential Revision: https://reviews.freebsd.org/D32016
Mark Johnston [Sat, 9 Oct 2021 15:36:19 +0000 (11:36 -0400)]
bhyve: Map the MSI-X table unconditionally for passthrough
It is possible for the PBA to reside in the same page as the MSI-X
table. And, while devices are not supposed to do this, at least some
Intel wifi devices place registers in a page shared with the MSI-X
table. To handle the first case we currently map the PBA page using
/dev/mem, and the second case is not handled.
Kill two birds with one stone: map the MSI-X table BAR using the
PCIOCBARMMAP ioctl instead of /dev/mem, and map the entire table so that
accesses beyond the bounds of the table can be emulated. Regions of the
BAR not containing the table are left unmapped.
Reviewed by: bz, grehan, jhb
MFC after: 3 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D32359
Tracking the number of unused holes in the trie and the range table
was a poor metric for triggering full trie and / or range rebuilds,
which ended up happening in vain far too frequently, particularly
with live BGP feeds.
Instead, track the total unused space inside the trie and range table
structures, and trigger rebuilds if the percentage of unused space
exceeds a sysctl-tunable threshold.
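The threshold test amounts to a percentage comparison. A minimal sketch, with hypothetical names (needs_rebuild(), threshold_pct) that are not taken from the actual code:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical check: returns true when unused space exceeds
 * threshold_pct percent of the total structure size.  Multiplying
 * instead of dividing avoids rounding and a division by zero when
 * total is 0; the sizes are assumed small enough that unused * 100
 * does not overflow 64 bits.
 */
bool
needs_rebuild(uint64_t unused, uint64_t total, unsigned int threshold_pct)
{
    assert(unused <= total);
    assert(threshold_pct <= 100);
    return (unused * 100 > (uint64_t)threshold_pct * total);
}
```

With a threshold of 25, for example, 30 unused bytes out of 100 would trigger a rebuild while exactly 25 would not.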
Devin Teske [Fri, 8 Oct 2021 23:26:21 +0000 (16:26 -0700)]
bsdconfig: Comments
My current style is to copy C for "/* NOTREACHED */" instead of spelling
out "Not reached". Make this one nominal change in this one file and the
others later.
While here, word-smith "Preload" into "Pre-load" as I believe that to
be more grammatically correct in this instance.
Also while here, fix a comment capitalization error.
In 526370fb85db4b659cff4625eb2f379acaa4a1a8 "net80211: proper ssid
length check in setmlme_assoc_adhoc()" we are checking the
sizeof on an array function parameter which leads to a warning that
it will return the size of the type of the array rather than the
array size itself. Use the defined length used both in the ioctl
and the sizing of the array function parameter instead.
Bjoern A. Zeeb [Fri, 1 Oct 2021 13:37:01 +0000 (13:37 +0000)]
USB: adjust the Generic XHCI ACPI probe return value
Change the probe return value from BUS_PROBE_DEFAULT to BUS_PROBE_GENERIC
given this is the "generic" attach method. This allows individual
drivers using XHCI generic but needing their own initialisation to
gain priority for attaching over the generic implementation.
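Why the lower return value yields priority can be sketched as follows. The two constants mirror the values in FreeBSD's sys/bus.h as I recall them (treat them as an assumption); struct candidate and pick_driver() are hypothetical and only imitate the rule that, among successful (non-positive) probe results, the highest value wins.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed to match sys/bus.h; redefined here to keep the sketch standalone. */
#define BUS_PROBE_DEFAULT   (-20)
#define BUS_PROBE_GENERIC   (-100)

/* Hypothetical record of one driver's probe result. */
struct candidate {
    const char *name;
    int probe_result;   /* <= 0 means success, > 0 is an error */
};

/* Pick the candidate with the highest non-positive probe result. */
const struct candidate *
pick_driver(const struct candidate *c, size_t n)
{
    const struct candidate *best = NULL;
    size_t i;

    assert(c != NULL || n == 0);
    for (i = 0; i < n; i++) {
        if (c[i].probe_result > 0)
            continue;   /* probe failed */
        if (best == NULL || c[i].probe_result > best->probe_result)
            best = &c[i];
    }
    return (best);
}
```

Under this rule a driver returning BUS_PROBE_DEFAULT is preferred over the generic attachment returning BUS_PROBE_GENERIC, and the generic one is still chosen when nothing more specific probes successfully.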
Bjoern A. Zeeb [Wed, 6 Oct 2021 18:09:39 +0000 (18:09 +0000)]
net80211: correct length check in ieee80211_ies_expand()
In ieee80211_ies_expand() we are looping over Elements
(also known as Information Elements or IEs).
The comment suggests that we assume well-formedness of
the IEs themselves.
Checking that the buffer length is at least 2 (1 byte Element ID and
1 byte Length fields) rather than just 1 before accessing ie[1]
is still good practice and can prevent an out-of-bounds read in
case the input is not behaving according to the comment.
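A minimal sketch of the header check, using a hypothetical count_elements() rather than the real ieee80211_ies_expand(). Note that the sketch also checks the payload length against the buffer, which goes further than the commit describes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical walker over a buffer of [ID][Length][payload...]
 * elements.  Returns the number of complete elements found and stops
 * at the first element that would run past the end of the buffer.
 */
size_t
count_elements(const uint8_t *ie, size_t ielen)
{
    size_t n = 0;

    assert(ie != NULL || ielen == 0);

    /* Both header bytes must be present before ie[1] is read. */
    while (ielen >= 2) {
        size_t len = 2 + (size_t)ie[1];

        if (len > ielen)
            break;      /* truncated payload */
        n++;
        ie += len;
        ielen -= len;
    }
    return (n);
}
```

With a loop condition of ielen > 0 or ielen >= 1 instead, a single trailing byte would cause ie[1] to be read beyond the buffer.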
Bjoern A. Zeeb [Wed, 6 Oct 2021 18:41:37 +0000 (18:41 +0000)]
net80211: proper ssid length check in setmlme_assoc_adhoc()
A user supplied SSID length is used without proper checks in
setmlme_assoc_adhoc() which can lead to copies beyond the end
of the user supplied buffer.
The ssid is a fixed size array for the ioctl and the argument
to setmlme_assoc_adhoc().
In addition to an ssid_len check of 0 also error in case the
ssid_len is larger than the size of the ssid array to prevent
problems.
PR: 254737
Reported by: Tommaso (cutesmilee.research protonmail.com)
MFC after: 3 days
Reviewed by: emaste, adrian
Differential Revision: https://reviews.freebsd.org/D32341
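The length validation can be sketched like this. SSID_MAX_LEN, struct target and set_ssid() are hypothetical names, and returning EINVAL is an assumption. The sketch compares against the defined constant because, inside the function, sizeof applied to an array parameter gives the size of a pointer, not of the array.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SSID_MAX_LEN    32  /* hypothetical name; 802.11 SSIDs are at most 32 octets */

/* Hypothetical destination for the validated SSID. */
struct target {
    uint8_t ssid[SSID_MAX_LEN];
    uint8_t ssid_len;
};

int
set_ssid(struct target *t, int ssid_len, const uint8_t ssid[SSID_MAX_LEN])
{
    assert(t != NULL && ssid != NULL);

    /*
     * The parameter "ssid" is really a pointer, so sizeof(ssid) would
     * not be 32 here.  Use the constant that sizes the array instead.
     */
    if (ssid_len <= 0 || ssid_len > SSID_MAX_LEN)
        return (EINVAL);

    memcpy(t->ssid, ssid, (size_t)ssid_len);
    t->ssid_len = (uint8_t)ssid_len;
    return (0);
}
```

Rejecting both zero and oversized lengths before the copy keeps memcpy() within the bounds of both the source and the destination arrays.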
Wakeup in vm_waitpfault() does not mean that the thread would get the
page on the next vm_page_alloc() call; another thread might steal the free
page we were waiting for. On the other hand, this wakeup might come much
earlier than just vm_pfault_oom_wait seconds, if the rate of the page
reclamation is high enough.
If wakeups come fast and we lose the allocation race enough times, OOM
could be undeservedly triggered much earlier than vm_pfault_oom_attempts
x vm_pfault_oom_wait seconds. Fix it by not counting the number of sleeps,
but measuring the time since the first allocation failure, and triggering
OOM when it is older than oom_attempts x oom_wait seconds.
Reviewed by: markj
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
Differential revision: https://reviews.freebsd.org/D32287
Michal Meloun [Thu, 7 Oct 2021 18:42:56 +0000 (20:42 +0200)]
dwmmc: Calculate the maximum transaction length correctly.
We should reserve two descriptors (not MMC_SECTORS) for potentially
unaligned (so bounced) buffer fragments, one for the starting fragment
and one for the ending fragment.