Randall Stewart [Wed, 14 Dec 2022 20:37:48 +0000 (15:37 -0500)]
Rack cannot be loaded without cc_newreno compiled into the kernel.
Right now rack will fail to load due to its hack in accessing symbol names
in cc_newreno. This was fine when newreno was always compiled into the
kernel but now ... not so much. Instead lets fix up rack to use the socket
option queries to get the information it wants and set the parameters. We
also fix the CC parameter so they are always settable.
* Separate interface creation from interface modification code
* Support setting some interface attributes (ifdescr, mtu, up/down, promisc)
* Improve interaction with the cloners requiring to parse/write custom
interface attributes
* Add bitmask-based way of checking if the attribute is present in the
message
* Don't use multipart RTM_GETLINK replies when searching for the
specific interface names
* Use ENODEV instead of ENOENT in case of failed RTM_GETLINK search
* Add python netlink test helpers
* Add some netlink interface tests
Andrew Gallatin [Wed, 14 Dec 2022 19:34:07 +0000 (14:34 -0500)]
vm: reduce lock contention when processing vm batchqueues
Rather than waiting until the batchqueue is full to acquire the lock &
process the queue, we now start trying to acquire the lock using trylocks
when the batchqueue is 1/2 full. This removes almost all contention on the
vm pagequeue mutex for for our busy sendfile() based web workload.
It also greadly reduces the amount of time a network driver ithread
remains blocked on a mutex, and eliminates some packet drops under
heavy load.
So that the system does not loose the benefit of processing large
batchqueues, I've doubled the size of the batchqueues. This way, when
there is no contention, we process the same batch size as before.
This has been run for several months on a busy Netflix server, as well
as on my personal desktop.
Andrew Gallatin [Wed, 14 Dec 2022 19:19:35 +0000 (14:19 -0500)]
allocate inpcb aligned to cachelines
The inpcb struct is one of the most heavily utilized in the kernel
on a busy network server. By aligning it to a cacheline
boundary, we can ensure that closely related fields in the inpcb
and tcbcb can be predictably located on the same cacheline. rrs
has already done a lot of this work to put related fields on the
same line for the tcbcb.
In combination with a forthcoming patch to align the start of the tcpcb,
we see a roughly 3% reduction in CPU use on a busy web server serving
traffic over roughly 50,000 TCP connections.
Gleb Smirnoff [Wed, 14 Dec 2022 18:02:44 +0000 (10:02 -0800)]
sockets: provide sousrsend() that does socket specific error handling
Sockets have special handling for EPIPE on a write, that was spread out
into several places. Treating transient errors is also special - if
protocol is atomic, than we should ignore any changes to uio_resid, a
transient error means the write had completely failed (see d2b3a0ed31e).
- Provide sousrsend() that expects a valid uio, and leave sosend() for
kernel consumers only. Do all special error handling right here.
- In dofilewrite() don't do special handling of error for DTYPE_SOCKET.
- For send(2), write(2) and aio_write(2) call into sousrsend() and remove
error handling for kern_sendit(), soo_write() and soaio_process_job().
Mark Johnston [Wed, 14 Dec 2022 14:32:17 +0000 (09:32 -0500)]
sys/conf: Remove an unneeded flag variable
After commit fac6dee9eb58 ("Remove tests for obsolete compilers in the
build system"), we always set -fdebug-prefix-map, so there's no point in
defining and testing _MAP_DEBUG_PREFIX. No functional change intended.
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
Mark Johnston [Wed, 14 Dec 2022 14:29:59 +0000 (09:29 -0500)]
pf: Fix definitions of pf_pfil_*_hooked
This use of "volatile" in the vnet definitions doesn't have any effect.
VNET_DEFINE_STATE(volatile int, ...) should work, but let's avoid using
"volatile" altogether and convert to atomic_load/atomic_store. Also
convert to bool while here.
Reviewed by: kp, mjg
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D37684
Nick Reilly [Wed, 30 Nov 2022 14:19:44 +0000 (15:19 +0100)]
pf: fix pfi_ifnet leak on interface removal
The detach of the interface and group were leaving pfi_ifnet memory
behind. Check if the kif still has references, and clean it up if it
doesn't
On interface detach, the group deletion was notified first and then a
change notification was sent. This would recreate the group in the kif
layer. Reorder the change to before the delete.
Kristof Provost [Mon, 5 Dec 2022 13:14:49 +0000 (14:14 +0100)]
if_ovpn: cleanup offsetof() use
Move the use of the `offsetof(struct ovpn_counters, fieldname) /
sizeof(uint64_t)` construct into a macro.
This removes a fair bit of code duplication and should make things a
little easier to read.
Kristof Provost [Fri, 2 Dec 2022 15:59:38 +0000 (16:59 +0100)]
if_ovpn: include peer counters in a OVPN_NOTIF_DEL_PEER message
When we remove a peer userspace can no longer retrieve its counters. To
ensure that userspace can get a full count of the entire session we now
include the counters in the deletion message.
Kristof Provost [Tue, 29 Nov 2022 11:06:32 +0000 (12:06 +0100)]
if_ovpn: allow peer lookup by vpn4/vpn6 address
Introduce two more RB_TREEs so that we can look up peers by their peer
id (already present) or vpn4 or vpn6 address.
This removes the last linear scan of the peer list.
Kristof Provost [Sat, 26 Nov 2022 12:52:40 +0000 (13:52 +0100)]
if_ovpn: remove OVPN_SEND_PKT
OpenVPN userspace no longer uses the ioctl interface to send control
packets. It instead uses the socket directly.
The use of OVPN_SEND_PKT was never released, so we can remove this
without worrying about compatibility.
Gleb Smirnoff [Wed, 14 Dec 2022 03:31:05 +0000 (19:31 -0800)]
tcp: fix counter leak for SYN_RCVD state when syncache_socket() fails
The SYN_RCVD state count is tricky here due to default code path and TFO
being so different. In the default case the count is incremented when a
syncache entry is added to the the database in syncache_insert(). Later
when connection transitions from syncache entry to a socket in
syncache_expand(), this counter is inherited by the tcpcb. If socket or
tcpcb allocation failed in syncache_socket() failed the syncache_expand()
is responsible for decrement. In the TFO case the syncache entry is not
inserted into database and count of SYN_RCVD is first incremented in the
syncache_tfo_expand() after successful socket allocation. Thus, inside
syncache_socket() we can't tell whether we need to decrement in a case of
a failure or not. The caller is responsible for this book keeping.
Kyle Evans [Wed, 14 Dec 2022 01:31:21 +0000 (19:31 -0600)]
diff: fix side-by-side output with tabbed input
The previous logic conflated some things... in this block:
- j: input characters rendered so far
- nc: number of characters in the line
- col: columns rendered so far
- hw: column width ((h)ard (w)idth?)
Comparing j to hw or col to nc are naturally wrong, as col and hw are
limits on their respective counters and nc is already brought down to hw
if the input line should be truncated to start with.
Right now, we end up easily truncating lines with tabs in them as we
count each tab for $tabwidth lines in the input line, but we really
should only be accounting for them in the column count. The problem is
most easily demonstrated by the two input files added for the tests,
the two tabbed lines lose at least a word or two even though there's
plenty of space left in the row for each side.
Mateusz Guzik [Tue, 13 Dec 2022 20:42:32 +0000 (20:42 +0000)]
kref: switch internal type to atomic_t and bring back const to kref_read
This unbreak drm-kmod build.
the const is part of Linux API
Unfortunately drm-kmod uses hand-rolled refcount* calls on a kref
object. For now go the easy route of keeping it operational by casting
stuff internally.
The general goal here is to make FreeBSD refcount API use an opaque
type, hence the ongoing removal of hand-rolled accesses.
Ed Maste [Mon, 2 Mar 2020 17:31:52 +0000 (12:31 -0500)]
Add deprecation notices to ce,cp sync serial drivers
And the related sconfig utility. Sync serial (e.g. E1/T1) interfaces
are obsolete, and nobody responded to several inquires on the mailing
lists about use of these drivers.
Relnotes: Yes
MFC after: 3 days
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D23928
Martin Matuska [Tue, 13 Dec 2022 19:21:13 +0000 (20:21 +0100)]
libarchive: merge from vendor branch
Libarchive 3.6.2
Important bug fixes:
rar5 reader: fix possible garbled output with bsdtar -O (#1745)
mtree reader: support reading mtree files with tabs (#1783)
various small fixes for issues found by CodeQL
Warner Losh [Tue, 13 Dec 2022 04:46:05 +0000 (21:46 -0700)]
stand: Make ioctl declaration consistent
It typically had two args with an optional third from the userland
declaration in sys/ioccom.h. However, the funciton definition used a
non-optional char * argument. This mismatch is UB behavior (but worked
due to the calling convetions of all our machines).
Instead, add a declaration for ioctl to stand.h, make the third arg
'void *' which is a better match to the ... declaration before. This
prevents the convert int * -> char * errors as well. Make the ioctl
user-space declaration truly user-space specific (omit it in the
stand-alone build).
Ed Maste [Mon, 12 Dec 2022 22:00:13 +0000 (17:00 -0500)]
ssh: remove note about local change to [Use]PrivilegeSeparation
We documented "[Use]PrivilegeSeparation defaults to sandbox" as one of
our modifications to ssh's server-side defaults, but this is not (any
longer) a difference from upstream.
Alan Cox [Sat, 8 Oct 2022 07:20:25 +0000 (02:20 -0500)]
pmap: standardize promotion conditions between amd64 and arm64
On amd64, don't abort promotion due to a missing accessed bit in a
mapping before possibly write protecting that mapping. Previously,
in some cases, we might not repromote after madvise(MADV_FREE) because
there was no write fault to trigger the repromotion. Conversely, on
arm64, don't pointlessly, yet harmlessly, write protect physical pages
that aren't part of the physical superpage.
Don't count aborted promotions due to explicit promotion prohibition
(arm64) or hardware errata (amd64) as ordinary promotion failures.
Chuck Silvers [Mon, 12 Dec 2022 16:14:17 +0000 (08:14 -0800)]
restore: fix restore of NFS4 ACLs
Changing the mode bits on a file with an NFS4 ACL results in the
NFS4 ACL being replaced by one matching the new mode bits being set,
so when restoring a file with an NFS4 ACL, set the owner/group/mode first
and then set the NFS4 ACL, so that setting the mode does not throw away
the ACL that we just set.
Mark Johnston [Sun, 11 Dec 2022 16:40:12 +0000 (11:40 -0500)]
bridge: Fix a potential memory leak in bridge_enqueue()
A comment at the beginning of the function notes that we may be
transmitting multiple fragments as distinct packets. So, the function
loops over all fragments, transmitting each mbuf chain. If if_transmit
fails, we need to free all of the fragments, but m_freem() only frees an
mbuf chain - it doesn't follow m_nextpkt.
Change the error handler to free each untransmitted packet fragment, and
count each fragment as a separate error since we increment OPACKETS once
per fragment when transmission is successful.
Mark Johnston [Sun, 11 Dec 2022 16:27:22 +0000 (11:27 -0500)]
libdtrace: Change the binding of USDT probe symbols to STB_WEAK
Otherwise, if multiple object files contain references to the same
probe, newish lld will refuse to link them by default, raising a
duplicate global symbol definition error. Previously, duplicate global
symbols with identical absolute st_values were permitted by both lld and
GNU ld.
Since dtrace has no use for probe function symbols after the relocation
performed by dtrace -G, make the symbols weak as well, following a
suggestion from MaskRay.
Reported by: dim
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
When the lower filesystem directory hierarchy is the same as the nullfs
mount point (admittedly not likely to be a useful situation in
practice), nullfs is subject to the exact deadlock between the busy
count drain and the covered vnode lock that VV_CROSSLOCK is intended
to address.
unionfs: allow recursion on covered vnode lock during mount/unmount
When taking the covered vnode lock during mount and unmount operations,
specify LK_CANRECURSE as the existing lock state of the covered vnode
is not guaranteed (AFAIK) either by assertion or documentation for
these code paths.
For the mount path, this is done only for completeness as the covered
vnode lock is not currently held when VFS_MOUNT() is called.
For the unmount path, the covered vnode is currently held across
VFS_UNMOUNT(), and the existing code only happens to work when unionfs
is mounted atop FFS because FFS sets LO_RECURSABLE on its vnode locks.
This of course doesn't cover a hypothetical case in which the covered
vnode may be held shared, but for the mount and unmount paths such a
scenario seems unlikely to materialize.
When VV_CROSSLOCK is present, the lock for the vnode at the current
stage of lookup must be held across the VFS_ROOT() call for the
filesystem mounted at the vnode. Since VV_CROSSLOCK implies that
the root vnode reuses the already-held lock, the possibility for
recursion should be made clear in the flags passed to VFS_ROOT().
For cases in which the lock is held exclusive, this means passing
LK_CANRECURSE. For cases in which the lock is held shared, it
means clearing LK_NODDLKTREAT to allow VFS_ROOT() to potentially
recurse on the shared lock even in the presence of an exclusive
waiter.
That the existing code works for unionfs is due to a coincidence
of the current unionfs implementation.
Mike Karels [Sat, 10 Dec 2022 19:40:55 +0000 (13:40 -0600)]
growfs(7): document addition of swap partition and growfs_fstab script
Add documentation of the growfs script's new ability to add a swap
partition, expanding on the previous functionality as well. Add the
growfs_fstab helper script, which runs separately. Add a description
of how to expand a file system a second time if swap had been added.
While here, fix a typo.
Mike Karels [Sat, 10 Dec 2022 19:39:59 +0000 (13:39 -0600)]
growfs_fstab: add new /etc/rc.d script to add swap to fstab
The growfs_fstab script is a helper for the growfs script to add any
new swap partition to /etc/fstab on first boot. If growfs adds a
swap partition, it sets growfs_swap_pdev in the kenv. In this case,
after the root file system is read/write, if there is no swap partition
in the fstab, growfs_fstab adds growfs_swap as a swap partition to the
fstab. Also, it runs dumpon to add the swap partition (as this
happened earlier in the startup sequence).
Mike Karels [Sat, 10 Dec 2022 19:38:36 +0000 (13:38 -0600)]
growfs script: add swap partition as well as growing root
Add the ability to create a swap partition in the course of growing
the root file system on first boot, enabling by default. The default
rules are: add swap if the disk is at least 15 GB (decimal), and the
existing root is less than 40% of the disk. The default size is 10%
of the disk, but is limited by the memory size. The limit is twice
memory size up to 4 GB, 8 GB up to 8 GB memory, and memory size over
8 GB memory. Swap size is clamped at vm.swap_maxpages/2 as well.
The new swap partition is labeled as "growfs_swap".
The default behavior can be overridden by setting growfs_swap_size in
/etc/rc.conf or in the kernel environment, with kenv taking priority.
A value of 0 inhibits the addition of swap, an empty value specifies
the default, and other values indicate a swap size in bytes.
By default, addition of swap is inhibited if a swap partition is found
in the output of the sysctl kern.geom.conftxt before the current root
partition, usually meaning that there is another disk present.
Swap space is not added if one is already present in /etc/fstab.
The root partition is read-only when growfs runs, so /etc/fstab can
not be modified. That step is handled by a new growfs_fstab script,
added in a separate commit. Set the value "growfs_swap_pdev" in kenv
to indicate that this should be done, as well as for internal use.
There is optional verbose output meant for debugging; it can only be
enabled by modifying the script (in two places, for sh and awk).
This should be removed before release, after testing on -current.
John Baldwin [Fri, 9 Dec 2022 18:26:23 +0000 (10:26 -0800)]
vmm: Don't lock a vCPU for VM_PPTDEV_MSI[X].
These are manipulating state in a ppt(4) device none of which is
vCPU-specific. Mark the vcpu fields in the relevant ioctl structures
as unused, but don't remove them for now.
Martin Matuska [Fri, 9 Dec 2022 16:26:30 +0000 (17:26 +0100)]
Update vendor/libarchive to libarchive/libarchive@ba80276cc
Important Bugfixes:
rar5 reader: fix possible garbled output with bsdtar -O (#1745)
mtree reader: support reading mtree files with tabs (#1783)
various small fixes for issues found by CodeQL
Warner Losh [Fri, 9 Dec 2022 14:55:42 +0000 (07:55 -0700)]
kboot: Use (void) instead of () for functiosn with no args
`int foo();` means 'a function that takes any number of arguments.`
not `a function that takes no arguemnts`, that's spelled `int foo(void);`
Adopt the latter.
Cy Schubert [Fri, 9 Dec 2022 14:06:04 +0000 (06:06 -0800)]
heimdal: kadm5_c_get_principal() should check return code
kadm5_c_get_principal() should check the return code from
kadm5_ret_principal_ent(). As it doesn't it assumes success when
there is none and can lead to potential vulnerability. Fix this.
Cy Schubert [Thu, 8 Dec 2022 23:22:43 +0000 (15:22 -0800)]
heimdal: Properly ix bus fault when zero-length request received
Zero length client requests result in a bus fault when attempting to
free malloc()ed pointers within the requests softc. Return an error
when the request is zero length.
This properly fixes PR/268062 without regressions.
PR: 268062
Reported by: Robert Morris <rtm@lcs.mit.edu>
MFC after: 3 days
tmpfs: for used pages, account really allocated pages, instead of file sizes
This makes tmpfs size accounting correct for the sparce files. Also
correct report st_blocks/va_bytes. Previously the reported value did not
accounted for the swapped out pages.
PR: 223015
Reviewed by: markj
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D37097
tmpfs: make vm_object point to the tmpfs node instead of vnode
The vnode could be reclaimed and allocated again during the lifecycle of
the node, but the node cannot. Also, referencing the node would allow
to reach it and tmpfs mount data from the object, regardless of the
state of the possibly absent vnode.
Still use swp_tmpfs for back-pointer, instead of using handle. Use of
named swap objects would incur taking the sw_alloc_sx on node allocation
and deallocation.
swp_tmpfs is renamed to swp_priv to remove the last bit of tmpfs in vm/.
Reviewed by: markj
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D37097
netlink: add interface notification on link status / flags change.
* Add link-state change notifications by subscribing to ifnet_link_event.
In the Linux netlink model, link state is reported in 2 places: first is
the IFLA_OPERSTATE, which stores state per RFC2863.
The second is an IFF_LOWER_UP interface flag. As many applications rely
on the latter, reserve 1 bit from if_flags, named as IFF_NETLINK_1.
This flag is mapped to IFF_LOWER_UP in the netlink headers. This is done
to avoid making applications think this flag is actually
supported / presented in non-netlink outputs.
* Add flag change notifications, by hooking into rt_ifmsg().
In the netlink model, notification should include the bitmask for the
change flags. Update rt_ifmsg() to include such bitmask.
Warner Losh [Fri, 9 Dec 2022 05:07:52 +0000 (22:07 -0700)]
kboot: Allow loading fdt from different sources
Linux has /sys/firmware/fdt and /proc/device-tree to publish the dtb for
the system. The former has it all in one file, while the latter breaks
it out. Prefer the former since it's the more modern interface, but
retain both since I don't have a PS3 to test to see if its kernel is new
enough for /sys/firmware or not.