John Baldwin [Wed, 21 Dec 2022 18:32:24 +0000 (10:32 -0800)]
bhyve: Remove some no-op code for setting RIP.
fbsdrun_addcpu() read the current vCPU's RIP register from the kernel
via vm_get_register() to pass along through some layers to vm_loop()
which then set the register via vm_set_register(). However, this is
just always setting the value back to itself.
John Baldwin [Wed, 21 Dec 2022 18:31:16 +0000 (10:31 -0800)]
bhyve: Simplify setting vCPU capabilities.
- Enable VM_CAP_IPI_EXIT in fbsdrun_set_capabilities along with other
capabilities enabled on all vCPUs.
- Don't call fbsdrun_set_capabilities a second time on the BSP in
spinup_vcpu.
- To preserve previous behavior, don't unconditionally enable
unrestricted guest mode on the BSP (this unbreaks single-vCPU guests
on Nehalem systems, though supporting such setups is of dubious
value). Other places that enable UG on the BSP are careful to check
the result of the operation and fail if it is not available.
- Don't set any capabilities in spinup_ap(). These are now all
redundant with earlier settings from spinup_vcpu().
- While here, axe a stale comment from fbsdrun_addcpu(). This
function is now always called from the main thread for all vCPUs.
Andrew Turner [Fri, 11 Nov 2022 08:55:59 +0000 (08:55 +0000)]
Add support for an array of hwresets
In some drivers we need to assert and deassert a group of hardware
resets in any order. To support this add a new hwreset_array that
manages all hwresets defined for a device.
Reviewed by: bz, manu, mmel
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D37357
Zhenlei Huang [Wed, 21 Dec 2022 01:04:30 +0000 (09:04 +0800)]
geom_part: Fix potential integer overflow when checking size of the table
`hdr_entries` and `hdr_entsz` are both uint32_t as defined in the UEFI spec.
The current spec sets no upper limit on the number of partition
entries or on the size of a partition entry, so a malicious or
corrupted GPT header read from an untrusted source may contain a very
large entry count or entry size.
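
A minimal sketch of the hazard described above, not the geom_part code itself: the product of two uint32_t values can wrap in 32-bit arithmetic, so the multiplication is widened to 64 bits before it is compared against a limit. The helper name `gpt_table_size_ok` and the `MAX_TABLE_BYTES` cap are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical cap on the table size, chosen only for illustration. */
#define MAX_TABLE_BYTES (1024u * 1024u)

/*
 * Return true when hdr_entries * hdr_entsz fits within MAX_TABLE_BYTES.
 * Casting one operand to uint64_t makes the multiplication 64-bit, so
 * the product of two 32-bit values cannot wrap around.
 */
static bool
gpt_table_size_ok(uint32_t hdr_entries, uint32_t hdr_entsz)
{
    uint64_t bytes = (uint64_t)hdr_entries * hdr_entsz;

    return (bytes <= MAX_TABLE_BYTES);
}
```

With a plain 32-bit multiply, 0x10000 entries of 0x10000 bytes would wrap to zero and pass a naive size check; the widened form rejects it.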
Justin Hibbits [Tue, 20 Dec 2022 20:08:34 +0000 (15:08 -0500)]
inet6: Fix LINT build
mli_delete_locked() is the only function that takes a const ifnet.
Since it's a static function there's no advantage to keeping it const.
Since `if_t` is not a const struct (currently) the compiler throws an
error passing the ifp around to ifnet functions.
John Baldwin [Tue, 20 Dec 2022 19:38:28 +0000 (11:38 -0800)]
ktls_tests: Ignore errors from close for receive error tests.
For tests that send invalid data to a TLS socket to trigger read
errors the kernel may end up dropping the connection before close is
called at the conclusion of the test resulting in spurious ECONNRESET
errors from close. Ignore any errors from close for these tests.
John Baldwin [Tue, 20 Dec 2022 19:38:07 +0000 (11:38 -0800)]
ktls_tests: Ignore spurious errors from shutdown(2).
For some of the "bad size" tests, the remote end can notice the error
and drop the connection before the test program returns from write to
call shutdown. In that case, shutdown fails with ENOTCONN. Permit
these ENOTCONN errors without failing the test.
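
The tolerance described above could look roughly like the following sketch. The helper name `shutdown_tolerant` is hypothetical and is not taken from the ktls test sources; it only shows the pattern of accepting ENOTCONN while still reporting other failures.

```c
#include <errno.h>
#include <stdbool.h>
#include <sys/socket.h>

/*
 * Call shutdown(2) and report success if it either succeeds or fails
 * with ENOTCONN (the peer already dropped the connection). Any other
 * errno is still treated as a failure.
 */
static bool
shutdown_tolerant(int fd, int how)
{
    if (shutdown(fd, how) == 0)
        return (true);
    return (errno == ENOTCONN);
}
```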
Justin Hibbits [Fri, 9 Dec 2022 20:54:51 +0000 (15:54 -0500)]
DrvAPI: Extend driver KPI with more accessors
Summary:
Add the following accessors to hide some more netstack details:
* if_get/setcapabilities2 and *bits analogue
* if_setdname
* if_getxname
* if_transmit - wrapper for call to ifp->if_transmit()
- This required changing the existing if_transmit to
if_transmit_default, since that's its purpose.
* if_getalloctype
* if_getindex
* if_foreach_addr_type - Like if_foreach_lladdr() but for any address
family type. Used by some drivers to iterate over all AF_INET
addresses.
* if_init() - wrapper for ifp->if_init() call
* if_setinputfn
* if_setsndtagallocfn
* if_togglehwassist
ufs/ffs: detect endian mismatch between machine and filesystem
Mounting a filesystem formatted for BE on an LE machine is not currently
supported. This adds a check for the superblock magic number using
swapped bytes to guess and warn the user that it may be a valid
superblock but with an incompatible endianness.
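
A rough sketch of the byte-swapped magic check described above. The magic values are believed to match FreeBSD's `<ufs/ffs/fs.h>` but should be treated as assumptions here; `swap32`, `classify_magic` and the enum are hypothetical names, not the code from the commit.

```c
#include <stdint.h>

/* UFS superblock magic numbers (assumed to match <ufs/ffs/fs.h>). */
#define FS_UFS1_MAGIC 0x011954u
#define FS_UFS2_MAGIC 0x19540119u

enum sb_magic_kind { SB_MAGIC_NATIVE, SB_MAGIC_SWAPPED, SB_MAGIC_UNKNOWN };

/* Reverse the byte order of a 32-bit value. */
static uint32_t
swap32(uint32_t v)
{
    return ((v >> 24) | ((v >> 8) & 0x0000ff00u) |
        ((v << 8) & 0x00ff0000u) | (v << 24));
}

/*
 * Classify a magic number read from disk: native byte order, the
 * opposite byte order (likely a valid superblock written on a machine
 * of the other endianness), or unrecognized.
 */
static enum sb_magic_kind
classify_magic(uint32_t magic)
{
    uint32_t swapped = swap32(magic);

    if (magic == FS_UFS1_MAGIC || magic == FS_UFS2_MAGIC)
        return (SB_MAGIC_NATIVE);
    if (swapped == FS_UFS1_MAGIC || swapped == FS_UFS2_MAGIC)
        return (SB_MAGIC_SWAPPED);
    return (SB_MAGIC_UNKNOWN);
}
```

A caller would print a warning and refuse the mount when the result is `SB_MAGIC_SWAPPED`.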
Ruslan Bukin [Mon, 19 Dec 2022 20:16:18 +0000 (20:16 +0000)]
Add support for ARM System Control and Management Interface (SCMI) v3.1.
The SCMI specification describes a set of standard interfaces for power,
performance and system management.
SCMI is extensible and provides interfaces to access functions which are
often implemented in firmwares in the System Control Processor (SCP).
This implements Shared Memory-based transfer, which is one of the ways on
how messages are exchanged between agents and the platform.
This includes a driver for ARM Message Handling Unit (MHU) Doorbell, which
is a mechanism that the caller can use to alert the callee of the presence
of a message.
The support implements clock management interface. For instance this allows
us to control HDMI pixel clock on ARM Morello Board.
Doug Rabson [Sun, 4 Dec 2022 15:53:07 +0000 (15:53 +0000)]
Allow realpath to work for file mounts
For file mounts, the directory vnode is not available from namei and this
prevents the use of vn_fullpath_hardlink. In this case, we can use the
vnode which was covered by the file mount with vn_fullpath.
This also disallows file mounts over files with link counts greater than
one to ensure a deterministic path to the mount point.
Doug Rabson [Wed, 23 Nov 2022 14:51:13 +0000 (14:51 +0000)]
Add support for mounting single files in nullfs
The main use-case for this is to support mounting config files and
secrets into OCI containers. My current workaround copies the files into
the container which is messy and risks secrets leaking into container
images if the cleanup fails.
This adds a VFCF flag to indicate whether the filesystem supports file
mounts and allows fspath to be either a directory or a file if the flag
is set.
Test Plan:
$ sudo mkdir -p /mnt
$ sudo touch /mnt/foo
$ sudo mount -t nullfs /COPYRIGHT /mnt/foo
Doug Rabson [Mon, 7 Nov 2022 16:56:09 +0000 (16:56 +0000)]
Add support for mounting single files in nullfs
My main use-case for this is to support mounting config files and secrets
into OCI containers. My current workaround copies the files into the
container which is messy and risks secrets leaking into container images
if the cleanup fails.
Jose Luis Duran [Mon, 19 Dec 2022 04:54:52 +0000 (05:54 +0100)]
xlocale(3): Link man pages
- provide various missing MLINKS for library functions
- update various SEE ALSO section to include the
new linked manual pages
- add various definitions of new functions like isideogram_l(3)
- document COMPATIBILITY for some functions
- bump man page dates
Rick Macklem [Sun, 18 Dec 2022 20:40:48 +0000 (12:40 -0800)]
krpc: Allow mountd/nfsd to optionally run in a jail
This patch modifies the kernel RPC so that it will allow
mountd/nfsd to run inside of a vnet jail. Running mountd/nfsd
inside a vnet jail will be enabled via a new kernel build
option called VNET_NFSD, which will be implemented in future
commits.
Although I suspect cr_prison can be set from the credentials
of the current thread unconditionally, I #ifdef'd the code
VNET_NFSD and only did this for the jailed case mainly to
document that it is only needed for use in a jail.
The TLS support code has not yet been modified to work in
a jail. That is planned as future development after the
basic VNET_NFSD support is in the kernel.
This patch should not result in any semantics change until
VNET_NFSD is implemented and used in a kernel configuration.
This can be eventually improved or simplified or fixed if necessary.
Following devices work with proper drivers and with the necessary clocks:
Native networking via eqos driver
USB3 and USB2
PCIe support is working but a bit picky about what hardware it supports (but so is Linux)
SD & (e)MMC
With the EDK2 loader video also works
Supported hardware is Quartz64, NanoPI R5S and Firefly Station P2, with more to come as DTS files get done.
Rick Macklem [Sat, 17 Dec 2022 21:54:33 +0000 (13:54 -0800)]
jail.8: Update the man page for allow.nfsd
Commit bba7a2e89602 added "allow.nfsd" to optionally allow
mountd/nfsd to be run inside a vnet prison when the kernel
is built with "options VNET_NFSD".
Rick Macklem [Sat, 17 Dec 2022 21:43:49 +0000 (13:43 -0800)]
kern_jail.c: Allow mountd/nfsd to optionally run in a jail
This patch adds "allow.nfsd" to the jail code based on a
new kernel build option VNET_NFSD. This will not work
until future patches fix nmount(2) to allow mountd to
run in a vnet prison and the NFS server code is patched
so that global variables are in a vnet.
The jail(8) man page will be patched in a future commit.
Rick Macklem [Fri, 16 Dec 2022 21:01:23 +0000 (13:01 -0800)]
vfs_mount.c: fix vfs_domount() for PRIV_VFS_MOUNT_EXPORTED
It appears that, prior to r158857 vfs_domount() checked
suser() when MNT_EXPORTED was specified.
r158857 appears to have broken this, since MNT_EXPORTED
was no longer set when mountd.c was converted to use nmount(2).
r164033 replaced the suser() check with
priv_check(td, PRIV_VFS_MOUNT_EXPORTED), which does the
same thing (ie. checks for effective uid == 0 assuming suser_enabled
is set).
This patch restores this check by setting MNT_EXPORTED when the
"export" mount option is specified to nmount().
I think this is reasonable since only mountd(8) should be setting
exports and I doubt any non-root mounted file system would
be setting its own exports.
Franco Fichtner [Fri, 16 Dec 2022 15:27:18 +0000 (10:27 -0500)]
debugnet: remove spurious message on boot
In non-INVARIANTS kernels, hide the warning message printed by debugnet
when an interface MTU is configured or link state changes, and debugnet
cannot infer the number of mbuf clusters to reserve. The warning isn't
really actionable and mostly serves to confuse users.
Mike Karels [Fri, 16 Dec 2022 15:13:31 +0000 (09:13 -0600)]
daily 440.status-mailq: avoid error from dma with submit queue
dma(8) supports mailq, but not mailq -Ac to print the submission
queue. Don't try to print that queue from the daily script if
mailq -Ac returns an error.
Mike Karels [Fri, 16 Dec 2022 15:13:07 +0000 (09:13 -0600)]
daily 150.clean-hoststat: suppress error when using dma
dma(8) does not have hoststat or purgestat, so this script produces
an error from the daily script. We could disable this script, but
that would mean yet another change to switch back to sendmail. Check
for purgestat in mailer.conf before attempting either hoststat or
purgestat.
John Baldwin [Thu, 15 Dec 2022 20:06:26 +0000 (12:06 -0800)]
ktls: Close a race with setting so_error when dropping a connection.
pr_abort calls tcp_usr_abort which calls tcp_drop with ECONNABORTED.
After pr_abort returns, the so_error is then set to a more specific
error. However, a reader can observe and return the ECONNABORTED
error before so_error is set to the desired error value. This is
resulting in spurious test failures of recently added tests for
invalid conditions such as invalid headers.
To fix, refactor the code to abort a connection to call tcp_drop
directly with the desired error value. ktls_reset_send_tag already
calls tcp_drop directly when it aborts a connection due to an error.
Randall Stewart [Wed, 14 Dec 2022 20:37:48 +0000 (15:37 -0500)]
Rack cannot be loaded without cc_newreno compiled into the kernel.
Right now rack will fail to load due to its hack in accessing symbol names
in cc_newreno. This was fine when newreno was always compiled into the
kernel but now ... not so much. Instead let's fix up rack to use the socket
option queries to get the information it wants and set the parameters. We
also fix the CC parameters so they are always settable.
* Separate interface creation from interface modification code
* Support setting some interface attributes (ifdescr, mtu, up/down, promisc)
* Improve interaction with the cloners requiring to parse/write custom
interface attributes
* Add bitmask-based way of checking if the attribute is present in the
message
* Don't use multipart RTM_GETLINK replies when searching for the
specific interface names
* Use ENODEV instead of ENOENT in case of failed RTM_GETLINK search
* Add python netlink test helpers
* Add some netlink interface tests
Andrew Gallatin [Wed, 14 Dec 2022 19:34:07 +0000 (14:34 -0500)]
vm: reduce lock contention when processing vm batchqueues
Rather than waiting until the batchqueue is full to acquire the lock &
process the queue, we now start trying to acquire the lock using trylocks
when the batchqueue is 1/2 full. This removes almost all contention on the
vm pagequeue mutex for our busy sendfile() based web workload.
It also greatly reduces the amount of time a network driver ithread
remains blocked on a mutex, and eliminates some packet drops under
heavy load.
So that the system does not lose the benefit of processing large
batchqueues, I've doubled the size of the batchqueues. This way, when
there is no contention, we process the same batch size as before.
This has been run for several months on a busy Netflix server, as well
as on my personal desktop.
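
The policy described above can be modeled in user space as follows. This is a sketch under stated assumptions, not the vm code: `BQ_SIZE`, the struct layouts and all function names are hypothetical, and a C11 `atomic_flag` stands in for the pagequeue mutex.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define BQ_SIZE 16  /* hypothetical capacity, not the kernel's value */

struct batchq {
    int items[BQ_SIZE];
    size_t cnt;
};

struct pagequeue {
    atomic_flag lock;   /* stand-in for the pagequeue mutex */
    size_t processed;
};

static bool
pq_trylock(struct pagequeue *pq)
{
    return (!atomic_flag_test_and_set_explicit(&pq->lock,
        memory_order_acquire));
}

static void
pq_lock(struct pagequeue *pq)
{
    while (!pq_trylock(pq))
        ;   /* spin; the kernel would block on the mutex instead */
}

static void
pq_unlock(struct pagequeue *pq)
{
    atomic_flag_clear_explicit(&pq->lock, memory_order_release);
}

/*
 * Queue one item. Below half full, just batch it. From half full on,
 * try the lock opportunistically and drain if it is free. Only a
 * completely full batch forces a blocking acquire.
 */
static void
bq_add(struct pagequeue *pq, struct batchq *bq, int item)
{
    bq->items[bq->cnt++] = item;
    if (bq->cnt < BQ_SIZE / 2)
        return;
    if (bq->cnt == BQ_SIZE)
        pq_lock(pq);
    else if (!pq_trylock(pq))
        return;
    pq->processed += bq->cnt;   /* drain the batch under the lock */
    bq->cnt = 0;
    pq_unlock(pq);
}
```

Doubling the capacity while draining at the halfway mark keeps the uncontended batch size unchanged, which is the trade-off the message describes.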
Andrew Gallatin [Wed, 14 Dec 2022 19:19:35 +0000 (14:19 -0500)]
allocate inpcb aligned to cachelines
The inpcb struct is one of the most heavily utilized in the kernel
on a busy network server. By aligning it to a cacheline
boundary, we can ensure that closely related fields in the inpcb
and tcpcb can be predictably located on the same cacheline. rrs
has already done a lot of this work to put related fields on the
same line for the tcpcb.
In combination with a forthcoming patch to align the start of the tcpcb,
we see a roughly 3% reduction in CPU use on a busy web server serving
traffic over roughly 50,000 TCP connections.
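
As a user-space illustration of the idea, and not the kernel allocator change itself, the sketch below aligns a structure to an assumed 64-byte line so that its leading fields share one cacheline. `struct pcb`, its fields, `CACHE_LINE` and `pcb_alloc` are all hypothetical.

```c
#include <stdalign.h>
#include <stdint.h>
#include <stdlib.h>

#define CACHE_LINE 64   /* assumed line size for this sketch */

/* Hypothetical control block; the hot fields sit together at the start. */
struct pcb {
    alignas(CACHE_LINE) uint32_t flags;
    uint32_t flowid;
    uint64_t hits;
    char cold[96];      /* rarely touched state */
};

/*
 * Allocate one cacheline-aligned pcb. aligned_alloc() wants a size that
 * is a multiple of the alignment; sizeof() of an aligned struct already
 * satisfies that because the compiler pads the tail.
 */
static struct pcb *
pcb_alloc(void)
{
    return (aligned_alloc(alignof(struct pcb), sizeof(struct pcb)));
}
```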
Gleb Smirnoff [Wed, 14 Dec 2022 18:02:44 +0000 (10:02 -0800)]
sockets: provide sousrsend() that does socket specific error handling
Sockets have special handling for EPIPE on a write, that was spread out
into several places. Treating transient errors is also special - if
protocol is atomic, then we should ignore any changes to uio_resid, a
transient error means the write had completely failed (see d2b3a0ed31e).
- Provide sousrsend() that expects a valid uio, and leave sosend() for
kernel consumers only. Do all special error handling right here.
- In dofilewrite() don't do special handling of error for DTYPE_SOCKET.
- For send(2), write(2) and aio_write(2) call into sousrsend() and remove
error handling for kern_sendit(), soo_write() and soaio_process_job().
Mark Johnston [Wed, 14 Dec 2022 14:32:17 +0000 (09:32 -0500)]
sys/conf: Remove an unneeded flag variable
After commit fac6dee9eb58 ("Remove tests for obsolete compilers in the
build system"), we always set -fdebug-prefix-map, so there's no point in
defining and testing _MAP_DEBUG_PREFIX. No functional change intended.
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
Mark Johnston [Wed, 14 Dec 2022 14:29:59 +0000 (09:29 -0500)]
pf: Fix definitions of pf_pfil_*_hooked
This use of "volatile" in the vnet definitions doesn't have any effect.
VNET_DEFINE_STATE(volatile int, ...) should work, but let's avoid using
"volatile" altogether and convert to atomic_load/atomic_store. Also
convert to bool while here.
Reviewed by: kp, mjg
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D37684
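
The volatile-to-atomic conversion can be illustrated with C11 `<stdatomic.h>` as a stand-in for the kernel's atomic_load/atomic_store primitives. The flag and helper names below are hypothetical, and relaxed ordering is an assumption made for this sketch.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical flag recording whether the hooks are installed. */
static atomic_bool hooks_installed;

/* Publish the new state with an explicit atomic store. */
static void
hooks_set(bool on)
{
    atomic_store_explicit(&hooks_installed, on, memory_order_relaxed);
}

/* Read the state with an explicit atomic load instead of "volatile". */
static bool
hooks_active(void)
{
    return (atomic_load_explicit(&hooks_installed, memory_order_relaxed));
}
```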
Nick Reilly [Wed, 30 Nov 2022 14:19:44 +0000 (15:19 +0100)]
pf: fix pfi_ifnet leak on interface removal
The detach of the interface and group were leaving pfi_ifnet memory
behind. Check if the kif still has references, and clean it up if it
doesn't.
On interface detach, the group deletion was notified first and then a
change notification was sent. This would recreate the group in the kif
layer. Reorder the change to before the delete.
Kristof Provost [Mon, 5 Dec 2022 13:14:49 +0000 (14:14 +0100)]
if_ovpn: cleanup offsetof() use
Move the use of the `offsetof(struct ovpn_counters, fieldname) /
sizeof(uint64_t)` construct into a macro.
This removes a fair bit of code duplication and should make things a
little easier to read.
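
A sketch of what such a macro might look like. The struct here is a hypothetical `demo_counters`, not the real `struct ovpn_counters`, and the macro and helper names are likewise made up for illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical counter layout; every member must be a uint64_t. */
struct demo_counters {
    uint64_t rx_pkts;
    uint64_t rx_bytes;
    uint64_t tx_pkts;
    uint64_t tx_bytes;
};

/* Index of a named field when the struct is viewed as a uint64_t array. */
#define DEMO_COUNTER_IDX(field) \
    (offsetof(struct demo_counters, field) / sizeof(uint64_t))

/* Number of uint64_t slots in the struct. */
#define DEMO_COUNTER_COUNT \
    (sizeof(struct demo_counters) / sizeof(uint64_t))

/* Add n to the slot at idx in a flat array of counters. */
static void
demo_counter_add(uint64_t *slots, size_t idx, uint64_t n)
{
    slots[idx] += n;
}
```

A call site then reads `demo_counter_add(slots, DEMO_COUNTER_IDX(tx_pkts), 1)` instead of repeating the offsetof arithmetic.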
Kristof Provost [Fri, 2 Dec 2022 15:59:38 +0000 (16:59 +0100)]
if_ovpn: include peer counters in an OVPN_NOTIF_DEL_PEER message
When we remove a peer userspace can no longer retrieve its counters. To
ensure that userspace can get a full count of the entire session we now
include the counters in the deletion message.
Kristof Provost [Tue, 29 Nov 2022 11:06:32 +0000 (12:06 +0100)]
if_ovpn: allow peer lookup by vpn4/vpn6 address
Introduce two more RB_TREEs so that we can look up peers by their peer
id (already present) or vpn4 or vpn6 address.
This removes the last linear scan of the peer list.
Kristof Provost [Sat, 26 Nov 2022 12:52:40 +0000 (13:52 +0100)]
if_ovpn: remove OVPN_SEND_PKT
OpenVPN userspace no longer uses the ioctl interface to send control
packets. It instead uses the socket directly.
The use of OVPN_SEND_PKT was never released, so we can remove this
without worrying about compatibility.
Gleb Smirnoff [Wed, 14 Dec 2022 03:31:05 +0000 (19:31 -0800)]
tcp: fix counter leak for SYN_RCVD state when syncache_socket() fails
The SYN_RCVD state count is tricky here because the default code path and
TFO are so different. In the default case the count is incremented when a
syncache entry is added to the database in syncache_insert(). Later,
when the connection transitions from a syncache entry to a socket in
syncache_expand(), this counter is inherited by the tcpcb. If socket or
tcpcb allocation fails in syncache_socket(), syncache_expand() is
responsible for the decrement. In the TFO case the syncache entry is not
inserted into the database and the count of SYN_RCVD is first incremented
in syncache_tfo_expand() after successful socket allocation. Thus, inside
syncache_socket() we can't tell whether we need to decrement in case of
a failure or not. The caller is responsible for this bookkeeping.