Mark Johnston [Sat, 16 Oct 2021 13:46:55 +0000 (09:46 -0400)]
timecounter: Lock the timecounter list
Timecounter registration is dynamic, i.e., there is no requirement that
timecounters must be registered during single-threaded boot. Loadable
drivers may in principle register timecounters (which can be switched to
automatically). Timecounters cannot be unregistered, though this could
be implemented.
Registered timecounters belong to a global linked list. Add a mutex to
synchronize insertions and the traversals done by (mpsafe) sysctl
handlers. No functional change intended.
Reviewed by: imp, kib
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D32511
Mark Johnston [Sat, 16 Oct 2021 13:44:40 +0000 (09:44 -0400)]
signal: Add SIG_FOREACH and refactor issignal()
Add a SIG_FOREACH macro that can be used to iterate over a signal set.
This is a bit cleaner and more efficient than calling sig_ffs() in a
loop. The implementation is based on BIT_FOREACH_ISSET(), except
that the bitset limbs are always 32 bits wide, and signal sets are
1-indexed rather than 0-indexed like bitset(9) sets.
issignal() cannot really be modified to use SIG_FOREACH() directly.
Take this opportunity to split the function into two explicit loops.
I've always found this function hard to read and think that this change
is an improvement.
Remove sig_ffs(), nothing uses it now.
Reviewed by: kib
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D32473
Mark Johnston [Mon, 18 Oct 2021 13:29:20 +0000 (09:29 -0400)]
amd64: Zero the PML5 PTI page when initializing a pmap
The root page is not zeroed at allocation time since with 4-level tables
each entry is copied from a template. However, with 5-level tables only
a single entry is filled, so the rest need to be cleared.
Reported by: alc
Reviewed by: alc, kib
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D32525
Rick Macklem [Mon, 18 Oct 2021 00:50:56 +0000 (17:50 -0700)]
nfscl: Modify Close RPC so that it does not use "owner" for NFSv4.1/4.2
This patch modifies the function that does the Close RPC (nfsrpc_closerpc)
so that it does not use the open_owner (nfso_own) for NFSv4.1/4.2.
Use of the seqid in the open_owner structure is only needed for NFSv4.0.
Same applies to a NFSERR_STALESTATEID reply, which should only happen
for NFSv4.0. This allows nfsrpc_closerpc() to be called when nfso_own
is no longer valid. This, in turn, allows nfsrpc_closerpc() to be called
after the shared lock on the clientID is released, for NFSv4.1/4.2.
This is being done to prepare the code for a future patch that fixes
the case where an NFSv4.1/4.2 server replies NFSERR_DELAY to a Close
operation.
Colin Percival [Sun, 17 Oct 2021 20:36:38 +0000 (13:36 -0700)]
TSLOG: Report final execname, not first
In cases such as daemons launched via limits(1), a process may call
exec multiple times; the last name of the last binary executed is
usually (always?) more informative.
Jessica Clarke [Sun, 17 Oct 2021 14:32:35 +0000 (15:32 +0100)]
LinuxKPI: Support lazy BAR allocation
Linux KPIs like pci_resource_start/len assume that BARs have been
allocated, but FreeBSD lazily allocates BARs if it cannot allocate the
firmware-allocated BARs. Thus using the Linux KPIs must force allocation
of the BARs rather than returning 0 for the start and length, which can
crash drm-kmod drivers that assume the BARs are valid. This is needed
for the AMDGPU driver to be able to attach on SiFive's HiFive Unmatched.
Jessica Clarke [Sun, 17 Oct 2021 14:32:20 +0000 (15:32 +0100)]
LinuxKPI: Implement _ioremap_attr for riscv
Now that riscv implements pmap_mapdev_attr we can enable the non-stub
implementation for riscv, which is needed for drm-kmod to not fail at
run time for drivers that need to map I/O regions.
Jessica Clarke [Sun, 17 Oct 2021 14:31:35 +0000 (15:31 +0100)]
riscv: Implement pmap_mapdev_attr
This is needed for LinuxKPI's _ioremap_attr. This reuses the generic
implementation introduced for aarch64, and itself requires implementing
pmap_kenter, which is trivial to do given riscv currently treats all
mapping attributes the same due to the Svpbmt extension not yet being
ratified and in hardware.
Make vmdaemon timeout configurable, so that one can adjust
how often it runs.
Here's a trick: set this to 1, then run 'limits -m 0 sh',
then run whatever you want with 'ktrace -it XXX', and observe
how the working set changes over time.
Fangrui Song [Sat, 16 Oct 2021 21:34:37 +0000 (14:34 -0700)]
rtld: Support DT_RELR relative relocation format
PIE and shared objects usually have many relative relocations. In
2017/2018, a compact relative relocation format RELR was proposed on
https://groups.google.com/g/generic-abi/c/bX460iggiKg/m/GxjM0L-PBAAJ
("Proposal for a new section type SHT_RELR") and is a pre-standard.
RELR usually takes 3% or smaller space than R_*_RELATIVE relocations.
The virtual memory size of a mostly statically linked PIE is typically
5~10% smaller.
ld.lld --pack-dyn-relocs=relr emits RELR relocations. DT_RELR has been
adopted by Android bionic, Linux kernel's arm64 port, Chrome OS (patched
glibc).
This patch adds DT_RELR support to FreeBSD rtld-elf.
Kristof Provost [Sat, 16 Oct 2021 16:53:39 +0000 (18:53 +0200)]
pf: don't drop packets when redirection information comes from a state
For some traffic there might be no matching rule in the current ruleset,
for example when a state was imported via pfsync from a sytem with a
different ruleset checksum. In this case pf_route uses s->rt_addr for
routing target instead of r->rpool.cur but r->rpool is checked anyway,
resulting in dropped packets.
Rick Macklem [Sat, 16 Oct 2021 22:49:38 +0000 (15:49 -0700)]
nfscl: Move release of the clientID lock into nfscl_doclose()
This patch moves release of the shared clientID lock from nfsrpc_close()
just after the nfscl_doclose() call to the end of nfscl_doclose() call.
This does make the code cleaner, since the shared lock is acquired at
the beginning of nfscl_doclose(). The only semantics change is that
the code no longer drops and reaquires the NFSCLSTATELOCK() mutex,
which I do not believe will have a negative effect on the NFSv4 client.
This is being done to prepare the code for a future patch that fixes
the case where an NFSv4.1/4.2 server replies NFSERR_DELAY to a Close
operation.
Dimitry Andric [Sat, 16 Oct 2021 21:16:46 +0000 (23:16 +0200)]
llvm-readobj: Add missed source file
In some configurations (e.g. powerpc64) the llvm-readobj tool also needs
contrib/llvm-project/llvm/BinaryFormat/MsgPackWriter.cpp, so add it to
libllvm.
Colin Percival [Sat, 16 Oct 2021 18:47:34 +0000 (11:47 -0700)]
Add userland boot profiling to TSLOG
On kernels compiled with 'options TSLOG', record for each process ID:
* The timestamp of the fork() which creates it and the parent
process ID,
* The first path passed to execve(), if any,
* The first path resolved by namei, if any, and
* The timestamp of the exit() which terminates the process.
Expose this information via a new sysctl, debug.tslog_user.
On kernels lacking 'options TSLOG' (the default), no information is
recorded and the sysctl does not exist.
Note that recording namei is needed in order to obtain the names of
rc.d scripts being launched, as the rc system sources them in a
subshell rather than execing the scripts.
With this commit it is now possible to generate flamecharts of the
entire boot process from the start of the loader to the end of
/etc/rc. The code needed to perform this processing is currently
found in github: https://github.com/cperciva/freebsd-boot-profiling
Kristof Provost [Sat, 16 Oct 2021 07:32:15 +0000 (09:32 +0200)]
pf: selecting pf_map_addr is not an error
When a redirection/nat IP address is selected by pf_map_addr it is
logged with PF_DEBUG_MISC level. This one according to the manual means
"Generate debug messages for various errors". Selecting an IP address is
not an error, it's a normal function of pf for route-to, nat and some
other operations. Therefore PF_DEBUG_NOISY level should be choosen which
is means "Generate debug messages for common conditions".
Kristof Provost [Wed, 13 Oct 2021 14:06:47 +0000 (16:06 +0200)]
pfctl: delay label macro expansion until after rule optimisation
We used to expand the $nr macro in labels into the rule number prior to
the optimisation step. This would occasionally produce incorrect rule
numbers in the labels.
Delay all macro expansion until after the optimisation step to ensure
that we expand the correct values.
Rick Macklem [Fri, 15 Oct 2021 21:25:38 +0000 (14:25 -0700)]
nfscl: Add an argument to nfscl_tryclose()
This patch adds a new argument to nfscl_tryclose() to indicate
whether or not it should loop when a NFSERR_DELAY reply is received
from the NFSv4 server. Since this new argument is always passed in
as "true" at this time, no semantics change should occur.
This is being done to prepare the code for a future patch that fixes
the case where an NFSv4.1/4.2 server replies NFSERR_DELAY to a Close
operation.
Ed Maste [Thu, 7 Oct 2021 00:42:40 +0000 (20:42 -0400)]
Add libcbor to the build
From https://github.com/PJK/libcbor:
libcbor is a C library for parsing and generating CBOR, the general-
purpose schema-less binary data format.
libcbor will be used by ssh to support FIDO/U2F keys. It is currently
intended only for use by ssh, and so is installed as a PRIVATELIB and is
placed in the ssh pkgbase package.
cbor_export.h and configuration.h were generated by the upstream CMake
build. We could create them with bmake rules instead (as NetBSD has
done) but this is a fine start.
This is currently disabled for the 32-bit library build as libfido2 is
not compatible with the COMPAT_32BIT hack in usb_ioctl.h, and there is
no need for libcbor without libfido2.
Reviewed by: kevans
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D32347
mixer(8): Fix mixer status line for /dev/dspX.vpY mixer devices.
In some cases when passing /dev/dspX.vpY as mixer devices, m->ci.longname and
m->ci.hw_info will be empty. Don't print any brackets and parentheses
in this case.
Dawid Gorecki [Wed, 13 Oct 2021 19:06:05 +0000 (21:06 +0200)]
libthr: Use kern.stacktop for thread stack calculation.
Use the new kern.stacktop sysctl to retrieve the address of stack top
instead of kern.usrstack. kern.usrstack does not have any knowledge
of the stack gap, so this can cause problems with thread stacks.
Using kern.stacktop sysctl should fix most of those problems.
kern.usrstack is used as a fallback when kern.stacktop cannot be read.
Rename usrstack variables to stacktop to reflect this change.
Fixes problems with firefox and thunderbird not starting with
stack gap enabled.
Dawid Gorecki [Wed, 13 Oct 2021 19:03:37 +0000 (21:03 +0200)]
kern_exec: Add kern.stacktop sysctl.
With stack gap enabled top of the stack is moved down by a random
amount of bytes. Because of that some multithreaded applications
which use kern.usrstack sysctl to calculate address of stacks for
their threads can fail. Add kern.stacktop sysctl, which can be used
to retrieve address of the stack after stack gap is applied to it.
Returns value identical to kern.usrstack for processes which have
no stack gap.
Dawid Gorecki [Wed, 13 Oct 2021 19:01:08 +0000 (21:01 +0200)]
setrlimit: Take stack gap into account.
Calling setrlimit with stack gap enabled and with low values of stack
resource limit often caused the program to abort immediately after
exiting the syscall. This happened due to the fact that the resource
limit was calculated assuming that the stack started at sv_usrstack,
while with stack gap enabled the stack is moved by a random number
of bytes.
Save information about stack size in struct vmspace and adjust the
rlim_cur value. If the rlim_cur and stack gap is bigger than rlim_max,
then the value is truncated to rlim_max.
Corvin Köhne [Fri, 15 Oct 2021 07:25:54 +0000 (09:25 +0200)]
bhyve: ignore low bits of CFGADR
Bhyve could emulate wrong PCI registers.
In the best case, the guest reads wrong registers and the device driver would
report some errors.
In the worst case, the guest writes to wrong PCI registers and could brick
hardware when using PCI passthrough.
According to Intels specification, low bits of CFGADR should be
ignored. Some OS like linux may rely on it. Otherwise, bhyve could
emulate a wrong PCI register.
E.g.
If linux would like to read 2 bytes from offset 0x02, following would
happen.
linux:
outl 0x80000002 at CFGADR
inw at CFGDAT + 2
bhyve:
cfgoff = 0x80000002 & 0xFF = 0x02
coff = cfgoff + (port - CFGDAT) = 0x02 + 0x02 = 0x04
Bhyve would emulate the register at offset 0x04 not 0x02.
Rick Macklem [Fri, 15 Oct 2021 00:28:01 +0000 (17:28 -0700)]
nfscl: Restructure nfscl_freeopen() slightly
This patch factors the unlinking of the nfsclopen structure out of
nfscl_freeopen() into a separate function called nfscl_unlinkopen().
It also adds a new argument to nfscl_freeopen() to conditionally do
the unlink. Since this new argument is always passed in as "true"
at this time, no semantics change should occur.
This is being done to prepare the code for a future patch that fixes
the case where an NFSv4.1/4.2 server replies NFSERR_DELAY to a Close
operation.
John Baldwin [Thu, 14 Oct 2021 22:48:34 +0000 (15:48 -0700)]
ktls: Defer creation of threads and zones until first use.
Run ktls_init() when the first KTLS session is created rather than
unconditionally during boot. This avoids creating unused threads and
allocating unused resources on systems which do not use KTLS.
John Baldwin [Thu, 14 Oct 2021 17:59:16 +0000 (10:59 -0700)]
cxgbe: Only run ktls_tick when NIC TLS is enabled.
Previously the body of ktls_tick was a nop when NIC TLS was disabled,
but the callout was still scheduled consuming power on otherwise-idle
systems with Chelsio T6 adapters. Now the callout only runs while NIC
TLS is enabled on at least one interface of an adapter.
Leandro Lupori [Thu, 14 Oct 2021 16:13:27 +0000 (13:13 -0300)]
powerpc64: make radix with superpages default
As Radix MMU with superpages enabled is now stable, make it the
default choice on supported hardware (POWER9 and above), since its
performance is greater than that of HPT MMU.
Reviewed by: alfredo, jhibbits
Sponsored by: Instituto de Pesquisas Eldorado (eldorado.org.br)
Differential Revision: https://reviews.freebsd.org/D30797
Warner Losh [Thu, 14 Oct 2021 14:44:37 +0000 (08:44 -0600)]
nvme: Reduce traffic to the doorbell register
Reduce traffic to doorbell register when processing multiple completion
events at once. Only write it at the end of the loop after we've
processed everything (assuming we found at least one completion,
even if that completion wasn't valid).
Leandro Lupori [Thu, 14 Oct 2021 13:39:52 +0000 (10:39 -0300)]
powerpc64: fix OFWFB with Radix MMU
Current implementation of Radix MMU doesn't support mapping
arbitrary virtual addresses, such as the ones generated by
"direct mapping" I/O addresses. This caused the system to hang, when
early I/O addresses, such as those used by OpenFirmware Frame Buffer,
were remapped after the MMU was up.
To avoid having to modify mmu_radix_kenter_attr just to support this
use case, this change makes early I/O map use virtual addresses from
KVA area instead (similar to what mmu_radix_mapdev_attr does), as
these can be safely remapped later.
Reviewed by: alfredo (earlier version), jhibbits (in irc)
MFC after: 2 weeks
Sponsored by: Instituto de Pesquisas Eldorado (eldorado.org.br)
Differential Revision: https://reviews.freebsd.org/D31232
unionfs: Ensure SAVENAME is set for unionfs vnode operations
"rm-style" system calls such as kern_frmdirat() and kern_funlinkat()
don't supply SAVENAME to preserve the pathname buffer for subsequent
vnode ops. For unionfs this poses an issue because the pathname may
be needed for a relookup operation in unionfs_remove()/unionfs_rmdir().
Currently unionfs doesn't check for this case, leading to a panic on
DIAGNOSTIC kernels and use-after-free of cn_nameptr otherwise.
The unionfs node's stored buffer would suffice as a replacement for
cnp->cn_nameptr in some (but not all) cases, but it's cleaner to just
ensure that unionfs vnode ops always have a valid cn_nameptr by setting
SAVENAME in unionfs_lookup().
While here, do some light cleanup in unionfs_lookup() and assert that
HASBUF is always present in the relevant relookup calls.
Rick Macklem [Wed, 13 Oct 2021 22:48:54 +0000 (15:48 -0700)]
nfscl: Make nfscl_getlayout() acquire the correct pNFS layout
Without this patch, if a pNFS read layout has already been acquired
for a file, writes would be redirected to the Metadata Server (MDS),
because nfscl_getlayout() would not acquire a read/write layout for
the file. This happened because there was no "mode" argument to
nfscl_getlayout() to indicate whether reading or writing was being done.
Since doing I/O through the Metadata Server is not encouraged for some
pNFS servers, it is preferable to get a read/write layout for writes
instead of redirecting the write to the MDS.
This patch adds a access mode argument to nfscl_getlayout() and
nfsrpc_getlayout(), so that nfscl_getlayout() knows to acquire a read/write
layout for writing, even if a read layout has already been acquired.
This patch only affects NFSv4.1/4.2 client behaviour when pNFS ("pnfs" mount
option against a server that supports pNFS) is in use.
This problem was detected during a recent NFSv4 interoperability
testing event held by the IETF working group.
John Baldwin [Wed, 13 Oct 2021 19:30:15 +0000 (12:30 -0700)]
ktls: Ensure FIFO encryption order for TLS 1.0.
TLS 1.0 records are encrypted as one continuous CBC chain where the
last block of the previous record is used as the IV for the next
record. As a result, TLS 1.0 records cannot be encrypted out of order
but must be encrypted as a FIFO.
If the later pages of a sendfile(2) request complete before the first
pages, then TLS records can be encrypted out of order. For TLS 1.1
and later this is fine, but this can break for TLS 1.0.
To cope, add a queue in each TLS session to hold TLS records that
contain valid unencrypted data but are waiting for an earlier TLS
record to be encrypted first.
- In ktls_enqueue(), check if a TLS record being queued is the next
record expected for a TLS 1.0 session. If not, it is placed in
sorted order in the pending_records queue in the TLS session.
If it is the next expected record, queue it for SW encryption like
normal. In addition, check if this new record (really a potential
batch of records) was holding up any previously queued records in
the pending_records queue. Any of those records that are now in
order are also placed on the queue for SW encryption.
- In ktls_destroy(), free any TLS records on the pending_records
queue. These mbufs are marked M_NOTREADY so were not freed when the
socket buffer was purged in sbdestroy(). Instead, they must be
freed explicitly.
Gleb Smirnoff [Fri, 8 Oct 2021 19:56:24 +0000 (12:56 -0700)]
Remove in_ifaddr_lock acquisiton to access in_ifaddrhead.
An IPv4 address is embedded into an ifaddr which is freed
via epoch. And the in_ifaddrhead is already a CK list. Use
the network epoch to protect against use after free.
Next step would be to CK-ify the in_addr hash and get rid of the...
Mark Johnston [Wed, 13 Oct 2021 00:11:02 +0000 (20:11 -0400)]
mount: Check for !VDIR mount points before handling -o emptydir
To implement -o emptydir, vfs_emptydir() checks that the passed
directory is empty. This should be done after checking whether the
vnode is of type VDIR, though, or vfs_emptydir() may end up calling
VOP_READDIR on a non-directory.
Reported by: syzbot+4006732c69fb0f792b2c@syzkaller.appspotmail.com
Reviewed by: kib, imp
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D32475
Kyle Evans [Wed, 13 Oct 2021 09:21:28 +0000 (04:21 -0500)]
native-xtools: avoid libllvm while populating the sysroot
Prior to 021385aba562, MK_CLANG=no was sufficient to avoid descending
into lib/clang, but the referenced change added a couple of other
enabling knobs. Turn those off, too, to continue avoiding libllvm.
With this change, we no longer end up with a libllvm using the wrong
default target triple; `poudriere jail -cx` works once again.
Hartmut Brandt [Sun, 10 Oct 2021 15:03:51 +0000 (17:03 +0200)]
Allow the BPF to be select for write. This is needed for boost:asio
which otherwise fails to handle BPFs.
Reviewed by: ae
Differential Revision: https://reviews.freebsd.org/D31967
Rick Macklem [Wed, 13 Oct 2021 00:21:01 +0000 (17:21 -0700)]
nfscl: Fix another deadlock related to the NFSv4 clientID lock
Without this patch, it is possible to hang the NFSv4 client,
when a rename/remove is being done on a file where the client
holds a delegation, if pNFS is being used. For a delegation
to be returned, dirty data blocks must be flushed to the NFSv4
server. When pNFS is in use, a shared lock on the clientID
must be acquired while doing a write to the DS(s).
However, if rename/remove is doing the delegation return
an exclusive lock will be acquired on the clientID, preventing
the write to the DS(s) from acquiring a shared lock on the clientID.
This patch stops rename/remove from doing a delegation return
if pNFS is enabled. Since doing delegation return in the same
compound as rename/remove is only an optimization, not doing
so should not cause problems.
This problem was detected during a recent NFSv4 interoperability
testing event held by the IETF working group.
crt_malloc: Be more persistent when handling mmap() failure
In the situation with limited address space, together with
fragmentation, it is possible for mmap() request in morecore() to fail
when asking for required size + NPOOLPAGES, but succeed without the
addend. Retry allocation there.
John Baldwin [Tue, 12 Oct 2021 21:03:07 +0000 (14:03 -0700)]
Stop creating socket aio kprocs during boot.
Create the initial pool of kprocs on demand when the first socket AIO
request is submitted instead. The pool of kprocs used for other AIO
requests is similarly created on first use.
This partially reverts e81e77c5a055, leaving the option both in
GENERICs on amd64/arm64/arm, and in global NOTES file. Apparently
this better matches existing practice, where we do not try to hard
to make LINT and GENERIC complimentary.
Requested and reviewed by: jhb
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Andrew Turner [Tue, 12 Oct 2021 11:39:14 +0000 (12:39 +0100)]
Stop reading the arm64 domain when it's known
There is no need to read the domain on arm64 when there is only one
in the ACPI tables. This can also happen when the table is missing
as it is unneeded.
Reported by: dch
Sponsored by: The FreeBSD Foundation
Kyle Evans [Sat, 2 Oct 2021 05:23:03 +0000 (00:23 -0500)]
fifos: delegate unhandled kqueue filters to underlying filesystem
This gives the vfs layer a chance to provide handling for EVFILT_VNODE,
for instance. Change pipe_specops to use the default vop_kqfilter to
accommodate fifoops that don't specify the method (i.e. all in-tree).
Rick Macklem [Tue, 12 Oct 2021 04:58:24 +0000 (21:58 -0700)]
nfscl: Fix a deadlock related to the NFSv4 clientID lock
Without this patch, it is possible for a process doing an NFSv4
Open/create of a file to block to allow another process
to acquire the exclusive lock on the clientID when holding
a shared lock on the clientID. As such, both processes
deadlock, with one wanting the exclusive lock, while the
other holds the shared lock. This deadlock is unlikely to occur
unless delegations are in use on the NFSv4 mount.
This patch fixes the problem by not deferring to the process
waiting for the exclusive lock when a shared lock (reference cnt)
is already held by the process.
This problem was detected during a recent NFSv4 interoperability
testing event held by the IETF working group.
Warner Losh [Mon, 11 Oct 2021 18:59:39 +0000 (12:59 -0600)]
forward declare struct thread
sys/sysctl.h moved struct thread forward declaration under #ifdef
_KERNEL and so this header fails when included from userland. Add a
forward declaration here.