Alan Somers [Wed, 4 Oct 2023 18:48:01 +0000 (12:48 -0600)]
fusefs: sanitize FUSE_READLINK results for embedded NULs
If VOP_READLINK returns a path that contains a NUL, it will trigger an
assertion in vfs_lookup. Sanitize such paths in fusefs, rejecting any
and warning the user about the misbehaving server.
Mateusz Guzik [Wed, 11 Oct 2023 09:42:12 +0000 (09:42 +0000)]
vfs: further speed up continuous free vnode recycle
The primary bottleneck *was* vnode_list mtx, which got artificially
worsened due to the following work done with the lock held:
1. the global heavily modified numvnodes counter was being read,
inducing massive cache line ping pong
2. should the value fit limits (which it normally did) there would be an
avoidable write to vn_alloc_cyclecount, which is being read outside
of the lock, once more inducing traffic
But if vn_alloc_cyclecount is 0, which it normally is even when facing
vnode shortage, there is no need to check numvnodes nor set it to 0 again.
Another problem was numvnodes adjustment (which made the locked read
much worse). While it fundamentally does not scale as it is not
distributed in any fashion, it was avoidably slow. When bumping over the
vnode limit, it would be modified with atomics 3 times: inc + dec to
backpedal in vn_alloc, then final inc in vn_alloc_hard.
One can let some slop persist over calls to vnlru_free instead.
In principle each thread in the system could get here and bump it, so a
limit is put in place to keep things sane.
Bench setup same as in prior commits: zfs, 20 separate directory trees
each with 1 million files in total and 20 find(1) processes stating them
in parallel (one per each tree).
Total run time (in seconds) goes down as follows:
vnode limit 8388608 400000
before ~20 ~35
after ~8 ~15
With this in place the primary bottleneck is now ZFS.
vfs: prefix regular vnlru with a special case for free vnodes
Works around severe performance problems in certain corner cases, see
the commentary added.
Modifying vnlru logic has proven rather error prone in the past and a
release is near, thus take the easy way out and fix it without having to
dig into the current machinery.
Mateusz Guzik [Tue, 10 Oct 2023 16:19:53 +0000 (16:19 +0000)]
vfs: consult freevnodes in vnlru_kick_cond
If the count is high enough there is no point trying to produce more.
Not going there reduces traffic on the vnode_list mtx.
This further shaves total real time in a test mentioned in: 74be676d87745eb7 ("vfs: drop one vnode list lock trip during vnlru free
recycle") -- 20 instances of find each creating 1 million vnodes, while
total limit is set to 400k.
Ed Maste [Fri, 29 Sep 2023 15:47:41 +0000 (11:47 -0400)]
freebsd-update: add a note about when files may be deleted
Files under /var/db/freebsd-update are required during the upgrade
process, and to support rollback. They may be deleted if no upgrade is
in progress and rollback will not be required.
PR: 273601
Reviewed by: bcr
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D42022
Kristof Provost [Thu, 5 Oct 2023 14:57:50 +0000 (16:57 +0200)]
pf: fix SCTP SDT probe
We want the return value of pf_test_rule(), i.e. the result of the
evaluation of the new state, not the result of the evaluation of the
original packet/state.
MFC after: 1 week
Sponsored by: Orange Business Services
Zhenlei Huang [Mon, 9 Oct 2023 10:30:22 +0000 (18:30 +0800)]
buf: Add sysctl flag CTLFLAG_TUN to loader tunable
The sysctl variable 'vfs.unmapped_buf_allowed' is actually a loader
tunable. Add sysctl flag CTLFLAG_TUN to it so that `sysctl -T` will
report it correctly.
Zhenlei Huang [Mon, 9 Oct 2023 10:30:22 +0000 (18:30 +0800)]
sockets: Add sysctl flag CTLFLAG_TUN to loader tunable
The sysctl variable 'kern.ipc.maxsockets' is actually a loader tunable.
Add sysctl flag CTLFLAG_TUN to it so that `sysctl -T` will report it
correctly.
Zhenlei Huang [Mon, 9 Oct 2023 10:30:21 +0000 (18:30 +0800)]
ddb: Add sysctl flag CTLFLAG_TUN to loader tunable
The sysctl variable 'debug.ddb.capture.bufsize' is actually a loader
tunable. Add sysctl flag CTLFLAG_TUN to it so that `sysctl -T` will
report it correctly.
Zhenlei Huang [Mon, 9 Oct 2023 10:30:21 +0000 (18:30 +0800)]
cam/scsi: Add sysctl flag CTLFLAG_TUN to loader tunable
The sysctl variable 'kern.cam.scsi_delay' is actually a loader tunable.
Add sysctl flag CTLFLAG_TUN to it so that `sysctl -T` will report it
correctly.
John Baldwin [Fri, 8 Sep 2023 23:30:35 +0000 (16:30 -0700)]
cxgbe tom: Call t4_rcvd_locked from do_rx_data to return RX credits
In particular, the kernel RPC layer used by the NFS client never
invokes pru_rcvd since it always reads data from the socket upcall
via MSG_SOCALLBCK which avoids calling pru_rcvd. As a result, on an
NFS client connection managed by t4_tom, RX credits were never
returned to the TOE connection to open the TCP window resulting in
connection hangs.
To fix, expand the set of conditions in do_rx_data where RX credits
are returned to match those in t4_rcvd_locked by calling the function
directly.
John Hay [Thu, 28 Sep 2023 21:08:08 +0000 (14:08 -0700)]
x86: Properly align interrupt vectors for MSI
MSI (not MSI-X) interrupt vectors must be allocated in groups that are
powers of 2, and the block of IDT vectors must be aligned to the size
of the request.
The code in native_apic_alloc_vectors() does an alignment check in the loop:
if ((vector & (align - 1)) != 0)
continue;
first = vector;
But it adds APIC_IO_INTS to the value it returns:
return (first + APIC_IO_INTS);
The problem is that APIC_IO_INTS is not a multiple of 32. It is 48:
As a result, a request for 32 vectors (the max supported by MSI), was
not always aligned. To fix, check the alignment of
'vector + APIC_IO_INTS' in the loop.
The AX88179A has two firmware modes, one of which is backward
compatible with existing AX88178A/179 driver. The active firmware mode
can be controlled through a register.
Update axge(4) man page to mention 179A support and ensure that, when
bound to a AX88179A, the driver activates the compatible firmware mode.
Ed Maste [Tue, 27 Jul 2021 16:44:45 +0000 (12:44 -0400)]
pkgbase: accommodate pkg < 1.17
6cafdee71d2b adapted the pkgbase build for 1.17, but broke Cirrus-CI's
use of PKG_FORMAT=tar (the quarterly package set still has pkg 1.16).
Because of this I disabled the pkgbase build and test in 2bfba2a04b05.
Now, check `pkg --version` and use the old logic for < 1.17.
To be reverted once we no longer encounter pkg 1.16 in Cirrus-CI (i.e.,
via GCP cloud images) to avoid keeping this extra complexity around.
PR: 257422
Reviewed by: manu
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D31324
Ed Maste [Fri, 1 Apr 2022 22:58:00 +0000 (18:58 -0400)]
Speed up *-old-* make targets by using sed instead of xargs
Targets like 'list-old-files' used "xargs -n1" to produce a list with
one file per line. Using xargs resulted in one fork+exec for each
Argument, resulting in rather long runtime. Instead, use sed to split
the list. On one machine `make list-old-files` took 30s wall clock time
with xargs and less than 1s with sed.
Reviewed by: jhb
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D34741
Mark Johnston [Mon, 2 Oct 2023 11:49:27 +0000 (07:49 -0400)]
swap_pager: Fix a race in swap_pager_swapoff_object()
When we disable swapping to a device, we scan the full VM object list
looking for objects with swap trie nodes that reference the device in
question. The pages corresponding to those nodes are paged in.
While paging in, we drop the VM object lock. Moreover, we do not hold a
reference for the object; swap_pager_swapoff_object() merely bumps the
paging-in-progress counter. vm_object_terminate() waits for this
counter to drain before proceeding and freeing pages.
However, swap_pager_swapoff_object() decrements the counter before
re-acquiring the VM object lock, which means that vm_object_terminate()
can race to acquire the lock and free the pages. Then,
swap_pager_swapoff_object() ends up unbusying a freed page. Fix the
problem by acquiring the lock before waking up sleepers.
David Sloan [Thu, 7 Sep 2023 16:22:21 +0000 (10:22 -0600)]
nvme: Fix memory leak in pt ioctl commands
When running nvme passthrough commands through the ioctl interface
memory is mapped with vmapbuf() but not unmapped. This results in leaked
memory whenever a process executes an nvme passthrough command with a
data buffer. This can be replicated with a simple c function (error
checks skipped for brevity):
Alan Somers [Wed, 22 Feb 2023 22:06:43 +0000 (15:06 -0700)]
mprutil: "fix user reply buffer (64)..." warnings
Depending on the card's firmware version, it may return different length
responses for MPI2_FUNCTION_IOC_FACTS. But the first part of the
response contains the length of the rest, so query it first to get the
length and then use that to size the buffer for the full response.
Also, correctly zero-initialize MPI2_IOC_FACTS_REQUEST. It only worked
by luck before.
Alan Somers [Wed, 20 Sep 2023 21:37:31 +0000 (15:37 -0600)]
fusefs: fix some bugs updating atime during close
When using cached attributes, we must update a file's atime during
close, if it has been read since the last attribute refresh. But,
* Don't update atime if we lack write permissions to the file or if the
file system is readonly.
* If the daemon fails our atime update request for any reason, don't
report this as a failure for VOP_CLOSE.
Michael Osipov [Tue, 3 Oct 2023 05:53:20 +0000 (07:53 +0200)]
libfetch: don't rely on ca_root_nss for certificate validation
Before certctl(8), there was no system trust store, and libfetch
relied on the CA certificate bundle from the ca_root_nss port to
verify peers.
We now have a system trust store and a reliable mechanism for
manipulating it (to explicitly add, remove, or revoke certificates),
but if ca_root_nss is installed, libfetch will still prefer that to
the system trust store.
With this change, unless explicitly overridden, libfetch will rely on
OpenSSL to pick up the default system trust store.
* Whenever possible, use strtonum() to parse numeric arguments.
* Improve usefulness and consistency of error messages.
* While here, fix some type and style issues.
* Like GNU split, turn autoextend back on if given -a0.
* Add a test case that verifies that -a<non-zero> turns autoextend off.
* Add a test case that verifies that -a0 turns autoextend back on.
libc: Rewrite quick_exit() and at_quick_exit() using C11 atomics.
Compiler memory barriers do not prevent the CPU from executing the code
out of order. Switch to C11 atomics. This also lets us get rid of the
mutex; instead, loop until the compare_exchange succeeds.
While here, change the return value of at_quick_exit() on failure to
the more traditional -1, matching atexit().
Mark Johnston [Wed, 27 Sep 2023 12:23:58 +0000 (08:23 -0400)]
hdac: Defer interrupt allocation in hdac_attach()
hdac_attach() registers an interrupt handler before allocating various
driver resources which are accessed by the interrupt handler. On some
platforms we observe what appear to be spurious interrupts upon a cold
boot, resulting in panics.
Partially work around the problem by deferring irq allocation until
after other resources are allocated. I think this is not a complete
solution, but is correct and sufficient to work around the problems
reported in the PR.
PR: 268393
Tested by: Alexander Sherikov <asherikov@yandex.com>
Tested by: Oleh Hushchenkov <o.hushchenkov@gmail.com>
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D41883
vfs: drop one vnode list lock trip during vnlru free recycle
vnlru_free_impl would take the lock prior to returning even though most
frequent caller does not need it.
Unsurprisingly vnode_list mtx is the primary bottleneck when recycling
and avoiding the useless lock trip helps.
Setting maxvnodes to 400000 and running 20 parallel finds each with a
dedicated directory tree of 1 million vnodes in total:
before: 4.50s user 1225.71s system 1979% cpu 1:02.14 total
after: 4.20s user 806.23s system 1973% cpu 41.059 total
That's 34% reduction in total real time.
With this the block *remains* the primary bottleneck when running on
ZFS.
Warner Losh [Mon, 21 Jun 2021 14:23:34 +0000 (08:23 -0600)]
basename: fix history
Basename(1) first appeared in the 7th edition. It was not in the 6th
edition, or PWB releases. It's on all the subsequent descendants.
Dirname(1) first appeared in System III, and was later picked up in
4.3-Reno and 8th edition research unix (though was not in 4.1BSD where
the bulk of 8th edition came from). In System III and V8 it was a shell
script, though the BSD version is in C.
Warner Losh [Mon, 21 Jun 2021 14:16:10 +0000 (08:16 -0600)]
banner: Correct history.
Banner appeared in the 6th edition of AT&T Research unit. It was
subsequently on all the Berkeley tapes, as well as PWB, System III and
System V. The PWB/AT&T and BSD banner programs were different, and the
current FreeBSD banner program shares many elements of the 3BSD one,
though the font has changed.
The manual page of gmirror describes how gmirror providers can be used
for kernel dumps. Unfortunately, the instruction references
/etc/rc.early, which is no longer a part of rc(8).
Remove references to rc.early and suggest creating an rc(8) service
script instead.
Future work: In the Problem Report on Bugzilla, Lawrence Chen suggested
adding example rc(8) scripts to the gmirror. However, those examples
need to be tested before they become official reference examples in the
base. Also, those scripts should probably land directly to /etc/rc.d,
/usr/share/examples/rc.d, or /usr/share/examples/gmirror instead of the
gmirror manual page.
PR: 178818
Reported by: Lawrence Chen <beastie@tardisi.com>
Fixes: dd2b024a336f Removal of early.sh
MFC after: 1 week