Mark Johnston [Wed, 3 Apr 2024 17:45:06 +0000 (13:45 -0400)]
libvmmapi: Conditionalize compilation of some functions
Hide definitions of several functions that currently don't have
implementations in the arm64 vmm port. In particular, add a
WITH_VMMAPI_SNAPSHOT preprocessor variable that can be used to enable
compilation of save/restore functions, and conditionalize compilation of
some functions only used by amd64 bhyve. If in the long term they
remain amd64-only, they can move to vmmapi_machdep.c, but for now it's
not clear to me that that's the right thing to do.
Mark Johnston [Wed, 3 Apr 2024 17:44:40 +0000 (13:44 -0400)]
bhyve: Push option parsing down into bhyverun_machdep.c
After a couple of attempts I think this is the cleanest approach despite
the expense of some code duplication. Quite a few of the single-letter
bhyve options are x86-specific.
I think that going forward we should strongly discourage the addition of
new options and instead configure guests using the more general
configuration file syntax.
Mark Johnston [Wed, 3 Apr 2024 17:11:37 +0000 (13:11 -0400)]
bhyve: Add PCI mappings for arm64
- The extended config space and BAR ranges are listed in the FDT.
- Avoid referencing I/O ports in ACPI tables. Currently the arm64 port
does not support ACPI in any case.
Mark Johnston [Wed, 3 Apr 2024 17:11:24 +0000 (13:11 -0400)]
bhyve: Do not compile PCI passthrough support on arm64
Some required kernel functionality is not yet implemented.
For now this means that one cannot specify host PCI register values, but
that functionality is only used by amd64-specific device models for now.
Note that this limitation is rather artificial; it arises only because
pci_host_read_config() lives in pci_passthru.c.
Mark Johnston [Wed, 3 Apr 2024 17:09:32 +0000 (13:09 -0400)]
libvmmapi: Make vm_raise_msi() a common function
Currently, bhyve PCI emulation uses vm_lapic_msi() to raise an MSI in
the guest. The arm64 port has a similar function, vm_raise_msi().
Add vm_raise_msi() on amd64 as well and have it simply call
vm_lapic_msi() so that bhyve can use a common, generically named
function.
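As a rough illustration of the wrapper described above, a minimal sketch follows. The body of struct vmctx, the stubbed vm_lapic_msi(), and the bus/slot/func parameters are assumptions made so that the example compiles on its own; they are not taken from libvmmapi.

```c
#include <stdint.h>

/* Stand-in for the opaque libvmmapi context; an assumption for this sketch. */
struct vmctx {
	uint64_t last_addr;
	uint64_t last_msg;
};

/* Stub for the amd64-specific call; the real one talks to the vmm device. */
static int
vm_lapic_msi(struct vmctx *ctx, uint64_t addr, uint64_t msg)
{
	ctx->last_addr = addr;
	ctx->last_msg = msg;
	return (0);
}

/*
 * Generically named entry point. The bus/slot/func arguments are assumed
 * to exist for platforms that need to identify the device; this amd64-style
 * version ignores them and forwards to vm_lapic_msi().
 */
int
vm_raise_msi(struct vmctx *ctx, uint64_t addr, uint64_t msg,
    int bus, int slot, int func)
{
	(void)bus;
	(void)slot;
	(void)func;
	return (vm_lapic_msi(ctx, addr, msg));
}
```

Callers in the PCI emulation would then use only the generic name, leaving the choice of interrupt controller to the library.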
Mark Johnston [Wed, 3 Apr 2024 17:01:31 +0000 (13:01 -0400)]
libvmmapi: Make memory segment handling a bit more abstract
libvmmapi leaves a hole at [3GB, 4GB) in the guest physical address
space. This hole is not used in the arm64 port, which maps everything
above 4GB. This change makes the code a bit more general to accommodate
arm64 more naturally. In particular:
- Remove vm_set_lowmem_limit(): it is unused and doesn't have
well-defined constraints, e.g., nothing prevents a consumer from
setting a lowmem limit above the highmem base.
- Define a constant for the highmem base and use that everywhere that
the base is currently hard-coded.
- Make the lowmem limit a compile-time constant instead of a vmctx field.
- Store segment info in an array.
- Add vm_get_highmem_base(), for use in bhyve since the current value is
hard-coded in some places.
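The layout described above can be sketched as follows. This is only an illustration of the idea of compile-time limits plus a segment array; the macro, enum, struct, and function names are hypothetical and do not mirror the libvmmapi sources.

```c
#include <stdint.h>

#define GB		(1024ULL * 1024 * 1024)
#define LOWMEM_LIMIT	(3 * GB)	/* end of the low segment, start of the hole */
#define HIGHMEM_BASE	(4 * GB)	/* first address above the hole */

enum { SEG_LOWMEM, SEG_HIGHMEM, SEG_COUNT };

/* One entry per segment, kept in an array rather than in separate fields. */
struct memseg {
	uint64_t base;
	uint64_t size;
};

/* Split memsize bytes of guest RAM around the [3GB, 4GB) hole. */
static void
layout_guest_memory(uint64_t memsize, struct memseg segs[SEG_COUNT])
{
	uint64_t low = memsize < LOWMEM_LIMIT ? memsize : LOWMEM_LIMIT;

	segs[SEG_LOWMEM].base = 0;
	segs[SEG_LOWMEM].size = low;
	segs[SEG_HIGHMEM].base = HIGHMEM_BASE;
	segs[SEG_HIGHMEM].size = memsize - low;
}
```

A port that maps everything above 4GB could, under the same scheme, define its low limit as zero so that all memory lands in the high segment.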
Mark Johnston [Wed, 3 Apr 2024 16:56:22 +0000 (12:56 -0400)]
libvmmapi: Move PCI passthrough ioctl wrappers into a separate file
The arm64 port doesn't implement PCI passthrough and in particular
doesn't define the ioctls used by these wrappers. It might be that the
ppt ioctl interface will require modification to support arm64. Until
that's sorted out one way or another, put this code in a separate file
so that it's easy to conditionally compile.
Mark Johnston [Wed, 3 Apr 2024 16:55:54 +0000 (12:55 -0400)]
libvmmapi: Split the ioctl list into MI and MD lists
To enable use in capability mode, libvmmapi needs a list of all the
ioctls that might be invoked on the vmm device handle. Some of these
ioctls are amd64-specific. Move the ioctl list to vmmapi_machdep.c and
define a list of MI ioctls so that the arm64 port can build its own list
without duplicating common ioctls. No functional change intended.
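One way to picture the split is a pair of list macros that each port concatenates. The sketch below uses made-up macro names and command numbers purely for illustration; the real values come from the vmm ioctl headers.

```c
#include <stddef.h>

/* Hypothetical command numbers, not real vmm ioctls. */
#define DEMO_MI_IOCTLS	0x01UL, 0x02UL, 0x03UL	/* shared by every port */
#define DEMO_MD_IOCTLS	0x81UL, 0x82UL		/* this architecture only */

/* Each port's machdep file builds its full list from the shared part. */
static const unsigned long demo_ioctl_cmds[] = {
	DEMO_MI_IOCTLS,
	DEMO_MD_IOCTLS,
};

static const size_t demo_ioctl_ncmds =
    sizeof(demo_ioctl_cmds) / sizeof(demo_ioctl_cmds[0]);
```

The resulting array is what would be handed to the capability-mode ioctl limit; only the MD macro differs between architectures.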
Mark Johnston [Wed, 3 Apr 2024 16:52:25 +0000 (12:52 -0400)]
libvmmapi: Move some ioctl wrappers to vmmapi_machdep.c
ioctls relating to segments and various x86-specific interrupt
controllers are easy candidates to move to vmmapi_machdep.c.
In vmmapi.h I'm just ifdefing MD prototypes for now. We could instead
split vmmapi.h into multiple headers, e.g., vmmapi.h and
vmmapi_machdep.h, but it's not obvious to me yet that that's the right
approach.
Mark Johnston [Wed, 3 Apr 2024 16:50:21 +0000 (12:50 -0400)]
bhyve: Add FDT building code for arm64
fdt.c provides some basic routines which let platform initialization
code build the FDT that gets passed into the guest. For now this is not
very generic; we declare info about CPUs, memory, a single UART
(specified by -o console), a PCIe controller (used for virtio devices),
an interrupt controller and the platform timer.
Co-authored-by: andrew
Reviewed by: corvink, jhb
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D40996
Mark Johnston [Wed, 3 Apr 2024 16:48:45 +0000 (12:48 -0400)]
bhyve: Provide optional libfdt linking
The arm64 port currently does not support ACPI; it instead builds up an
FDT which is exported to the guest. This mechanism will not be used on
amd64 but isn't really arm64-specific either, so provide an opt-in
mechanism to link libfdt.
sys_procctl(): Make it clear that negative commands are invalid
An initial reading of the preamble of sys_procctl() gives the impression
that no test prevents a malicious user from passing a negative command
index (in 'uap->com'), which is soon used as an index into the static
array procctl_cmds_info[].
However, a closer examination leads to the conclusion that the existing
code is technically correct. Indeed, the comparison of 'uap->com' to
the nitems() expression, which expands to a ratio of sizeof(), leads to
a conversion of 'uap->com' to an 'unsigned int' as per Usual Arithmetic
Conversions/Integer Promotions applied by '<=', because sizeof() returns
'size_t' values, and we define 'size_t' as an equivalent of 'unsigned
int' (which is not mandated by the standard, the latter allowing, e.g.,
integers of lower ranks).
With this conversion, negative values of 'uap->com' are automatically
ruled out since they are converted to very big unsigned integers which
are caught by the test. An analysis of assembly code produced by LLVM
16 on amd64 and practical tests confirm that no exploitation is possible.
However, the guard code as written is misleading to readers and might
trip up static analysis tools. Make sure that negative values are
explicitly excluded so that it is immediately clear that EINVAL will be
returned in this case.
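The difference between the implicit and the explicit guard can be seen in a small standalone sketch. The table and function names here are hypothetical, and the check is written as a "reject" predicate for clarity; it is not the code from sys_procctl().

```c
#include <stddef.h>

#define nitems(x)	(sizeof((x)) / sizeof((x)[0]))

/* Hypothetical command table with 8 slots. */
static const int cmds_info[8];

/*
 * Implicit guard: 'com' is converted to the unsigned type of the
 * nitems() expression before the comparison, so a negative value
 * becomes a very large one and is rejected.
 */
static int
com_rejected_implicit(int com)
{
	return (com >= nitems(cmds_info));
}

/* Explicit guard: the negative case is spelled out for the reader. */
static int
com_rejected_explicit(int com)
{
	return (com < 0 || (size_t)com >= nitems(cmds_info));
}
```

Both predicates agree for every input; the second simply does not rely on the reader knowing the conversion rules, and avoids signed/unsigned comparison warnings.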
Build tested with clang 16 and GCC 12.
Approved by: markj (mentor)
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
The exit(3) man page renders __cxa_atexit(3,) instead of __cxa_atexit(3),
in one particular section: the comma ends up inside the parentheses.
With an extra space in the source, it renders as expected.
Colin Percival [Wed, 10 Apr 2024 03:27:19 +0000 (20:27 -0700)]
release: Don't reuse disc1/bootonly directories
The disc1 and bootonly directories have files distributed into them
for use in "full" and "mini" images; the former are disc1.iso and
memstick.img, and the latter is bootonly.iso and mini-memstick.img.
Unfortunately the scripts which package a directory tree into an ISO
or memory stick image also modify the directory, for example to
create an appropriate /etc/fstab file; so creating two images at the
same time breaks.
Resolve this by copying disc1 to disc1-disc1 and disc1-memstick,
and copying bootonly to bootonly-bootonly and bootonly-memstick,
before using those directories for constructing the ISO+memstick
images.
Colin Percival [Wed, 10 Apr 2024 03:25:34 +0000 (20:25 -0700)]
release: make -j compat: cd inside subshell
Place instances of "cd foo && bar" inside subshells for compatibility
with modern make(8) which uses a single shell for the duration of a
makefile target.
bcm2838_xhci(4) is a shim for the XHCI controller on the Raspberry Pi 4B
SoC. It loads the controller's firmware before passing control to the
normal xhci(4) driver.
When xhci(4) is built as a module (and not in the kernel), bcm2838_xhci
is not built at all and the RPi4's XHCI controller won't attach due to
missing firmware.
To fix this, build a new module, bcm2838_xhci.ko, which depends on
xhci.ko. For the dependency to work correctly, also modify xhci to
provide the 'xhci' module in addition to the 'xhci_pci' module it
already provided.
Since bcm2838_xhci is specific to a quirk of the RPi4 SoC, only build
the module for AArch64.
If userpath is not SHM_ANON, then copy it in early so ktrace(2) can
record it. Without this change, ktrace(2) will attempt to strcpy a
userspace string and trigger a page fault.
John Baldwin [Tue, 9 Apr 2024 22:02:58 +0000 (15:02 -0700)]
NOTES: Tidy entries for SATA controllers
- Add typical comments after device entries (copied from amd64
GENERIC)
- Add an entry for 'device ada'. Normally this is pulled in via
'device sd', but is documented in ada(4) and can be used to include
ATA/SATA disk support in a kernel without SCSI disk support.
periodic/daily/801.trim-zfs: Add a daily zfs trim script
As mentioned in zpoolprops(7), on some SSDs, it may not be desirable to
use ZFS autotrim because a large number of trim requests can degrade
disk performance; instead, the pool should be manually trimmed at
regular intervals.
Add a new daily periodic script for this purpose, 801.trim-zfs. If
enabled (daily_trim_zfs_enable=YES; the default is NO), it will run a
'zpool trim' operation on all online pools, or on the pools listed in
'daily_trim_zfs_pools'.
The trim is not started if the pool is degraded (which matches the
behaviour of the existing 800.scrub-zfs script) or if a trim is already
running on that pool. Having autotrim enabled does not inhibit the
periodic trim; it's sometimes desirable to run periodic trims even with
autotrim enabled, because autotrim can elide trims for very small
regions.
John Baldwin [Tue, 9 Apr 2024 21:55:40 +0000 (14:55 -0700)]
pci_host_generic: Tolerate range resource allocation failures
QEMU for armv7 includes a PCI memory range whose CPU address is
greater than 4GB. This falls outside the range of armv7's global
mem_rman used by the nexus driver. As a result, pcib0 fails to
attach, blocking all PCI devices.
Instead, change the driver to be a bit more tolerant. If allocating a
resource for a range fails, don't fail attaching the entire driver,
but do skip adding the associated PCI range to the relevant rman in
the pcib driver. This will prevent child devices from using BARs that
allocate from this range. In the case of QEMU on armv7, devices can
still allocate from an earlier PCI memory range that is within the
32-bit address space (and in fact none of the firmware-assigned memory
BARs use addresses from the upper range).
While here, reorder the operations on I/O ranges a bit: 1) print the
range under bootverbose first (rather than last) so that the range is
printed before any relevant errors for the range, 2) move
rman_manage_region last after the parent resource has been set and
allocated.
Alan Cox [Mon, 8 Apr 2024 05:05:54 +0000 (00:05 -0500)]
arm64 pmap: Add ATTR_CONTIGUOUS support [Part 2]
Create ATTR_CONTIGUOUS mappings in pmap_enter_object(). As a result,
when the base page size is 4 KB, the read-only data and text sections
of large (2 MB+) executables, e.g., clang, can be mapped using 64 KB
pages. Similarly, when the base page size is 16 KB, the read-only
data section of large executables can be mapped using 2 MB pages.
Rename pmap_enter_2mpage(). Given that we have grown support for 16 KB
base pages, we should no longer include page sizes that may vary, e.g.,
2mpage, in pmap function names.
Requested by: andrew
Rick Macklem [Tue, 9 Apr 2024 01:58:40 +0000 (18:58 -0700)]
mountd.8: Document the new -A mountd option
Commit fefb7c399b39 added warning messages noting
that administrative controls that exported directories
that are not local server file system mount points actually
export the entire local server file system.
This commit also added a new command line option "-A" that
silences these warnings.
Historically, BSD cp has followed symbolic links in the destination
when copying recursively, while GNU cp has not. POSIX is somewhat
vague on the topic, but both interpretations are within bounds.
In 33ad990ce974, cp was changed to apply the same logic for symbolic
links in the destination as for symbolic links in the source: follow
if not recursing (which is moot, as this situation can only arise
while recursing) or if the `-L` option was given. There is no support
for this in POSIX. We can either switch back, or go all the way.
Having carefully weighed the kind of trouble you can run into by
following unexpected symlinks up against the kind of trouble you can
run into by not following symlinks you expected to follow, we choose
to go all the way.
Note that this means we need to stat the destination twice: once,
following links, to check if it is or references the same file as the
source, and a second time, not following links, to set the dne flag
and determine the destination's type.
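The two-pass check described above might be sketched like this. The struct dest_info type and the probe_destination() helper are hypothetical names for illustration, not the code in cp; only stat(2) and lstat(2) are real interfaces.

```c
#define _XOPEN_SOURCE 700
#include <stdbool.h>
#include <sys/stat.h>

/* Hypothetical summary of what the copy logic needs to know. */
struct dest_info {
	bool same_file;	/* destination is, or references, the source */
	bool dne;	/* destination does not exist */
	mode_t type;	/* S_IFMT bits of the destination itself */
};

static void
probe_destination(const struct stat *src, const char *to,
    struct dest_info *di)
{
	struct stat sb;

	/* First pass follows links, to detect copying a file onto itself. */
	di->same_file = stat(to, &sb) == 0 &&
	    sb.st_dev == src->st_dev && sb.st_ino == src->st_ino;

	/* Second pass does not follow links, so a symlink is seen as one. */
	di->dne = lstat(to, &sb) != 0;
	di->type = di->dne ? 0 : (sb.st_mode & S_IFMT);
}
```

With this shape, a destination that is a symlink to the source reports same_file as true while type reports S_IFLNK, which is the distinction the second stat exists to make.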
While here, remove a needless complication in the dne logic. We don't
need to explicitly reject overwriting a directory with a non-directory,
because it will fail anyway.
Finally, add test cases for copying a directory to a symlink and
overwriting a directory with a non-directory.
MFC after: never
Relnotes: yes
Sponsored by: Klara, Inc.
Reviewed by: kevans
Differential Revision: https://reviews.freebsd.org/D44578
unix: new implementation of unix/stream & unix/seqpacket
Provide protocol specific pr_sosend and pr_soreceive for PF_UNIX
SOCK_STREAM sockets and implement SOCK_SEQPACKET sockets as an extension
of SOCK_STREAM. The change meets three goals: get rid of unix(4) specific
stuff in the generic socket code, provide faster and more robust unix/stream
sockets, and bring unix/seqpacket much closer to the specification. Highlights
follow:
- The send buffer now is truly bypassed. Previously it was always empty,
but the send(2) still needed to acquire its lock and do a variety of
tricks to be woken up at the right time while sleeping on it. Now the
only two things we care about in the send buffer are the I/O sx(9) lock
that serializes operations and the value of so_snd.sb_hiwat, which we can read
without obtaining a lock. The sleep of a send(2) happens on the mutex of
the receive buffer of the peer. A bulk send/recv of data with large
socket buffers will make both syscalls just bounce between owning the
receive buffer lock and copyin(9)/copyout(9), no other locks would be
involved.
- The implementation uses the new mchain structure to manipulate mbuf chains.
Note that this required converting to mchain two functions that are shared
with unix/dgram: unp_internalize() and unp_addsockcred() as well as adding
a new shared one, uipc_process_kernel_mbuf(). This induces some non-
functional changes in the unix/dgram code as well. There is room for
improvement here, as right now it is a mix of mchain and manually managed
mbuf chains.
- unix/seqpacket, previously marked as PR_ADDR & PR_ATOMIC and thus treated
as a datagram socket by the generic socket code, now becomes a true stream
socket with record markers.
- unix/stream loses the sendfile(2) support. This can be brought back,
but requires some work. Let's first see if there is any interest in this
feature beyond the purely academic.
mbuf: provide mc_uiotomc(), a function to copy from uio(9) to mchain
Implement m_uiotombuf() as a wrapper around mc_uiotomc(). The M_EXTPG is
left untouched. The m_uiotombuf() is left as a compat KPI. New code
should use either mc_uiotomc() or m_uiotombuf_nomap().
mbuf: add mc_split() that works on two struct mchain
It preserves tail pointers and all length/memory accounting, so that the caller
doesn't need to do any extra traversals. It doesn't respect M_PKTHDR but
it may be improved if needed. It respects M_EOR, though. The first consumer
will be the new unix(4) SOCK_STREAM and SOCK_SEQPACKET.
Also provide a much simpler mc_concat() that glues two chains back together.
mbuf: provide new type for mbuf manipulation - mbuf chain
It tracks both the first mbuf and last mbuf, making it handy to use inside
functions that are interested in both. It also tracks length of data and
memory usage. It can be allocated on stack and passed to an mbuf
allocation or another mbuf manipulation function. It can be embedded into
some kernel facility internal structure representing most simple data
buffer. It uses modern queue(3) based linkage, but is also compatible with
old style m_next linkage. Transitioning older code to new type can be done
gradually - code that doesn't understand the chain yet can be supplied
with STAILQ_FIRST(&mc.mc_q). So you can have a mix of old style and new
style code in one function as a temporary solution.
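To make the idea concrete, here is a toy version of such a chain built on queue(3). The struct buf and struct chain types and the chain_*() helpers are simplified stand-ins invented for this sketch; they are not struct mbuf or the real mchain KPI.

```c
#include <stddef.h>
#include <sys/queue.h>

/* Toy buffer standing in for an mbuf; only the fields the sketch needs. */
struct buf {
	STAILQ_ENTRY(buf) link;
	size_t len;
};

/* Toy chain: first and last element via STAILQ, plus a running length. */
struct chain {
	STAILQ_HEAD(, buf) q;
	size_t len;
};

static void
chain_init(struct chain *c)
{
	STAILQ_INIT(&c->q);
	c->len = 0;
}

/* O(1) append: the queue head tracks the tail, so no traversal is needed. */
static void
chain_append(struct chain *c, struct buf *b)
{
	STAILQ_INSERT_TAIL(&c->q, b, link);
	c->len += b->len;
}

/* Hand the chain to older code that only wants the first buffer. */
static struct buf *
chain_first(struct chain *c)
{
	return (STAILQ_FIRST(&c->q));
}
```

Because the chain is a plain structure, it can live on the caller's stack and be passed by pointer, which is the usage pattern the commit message describes.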
sendfile: mark it explicitly as a TCP only feature
Back in 2015 when it turned non-blocking, it was working with PF_UNIX
and it may still work. However, the usefulness of such an application
of sendfile(2) is questionable. Disable the feature while unix/stream
is under refactoring.
tests/unix_seqpacket: test send(2) to a closed or aborted peer socket
In both cases the kernel returns EPIPE and delivers SIGPIPE, unless
blocked or disabled. The test isn't specific to SOCK_SEQPACKET, it is the
same for SOCK_STREAM. Put the test into this file, since it has all
primitives to write this test tersely.
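The behaviour under test can be reproduced outside the test suite with a few lines. This is a standalone sketch, not the ATF test itself; the helper name is hypothetical, and SIGPIPE is ignored so that the error is visible as an errno value.

```c
#include <errno.h>
#include <signal.h>
#include <sys/socket.h>
#include <unistd.h>

/*
 * Returns the errno seen when sending to a peer that has been closed,
 * 0 if the send unexpectedly succeeds, or -1 if socketpair(2) fails.
 */
static int
send_to_closed_peer(int type)
{
	int sv[2], error;
	char c = 'x';

	/* Ignore SIGPIPE so the failure shows up as an errno, not a signal. */
	signal(SIGPIPE, SIG_IGN);
	if (socketpair(AF_UNIX, type, 0, sv) != 0)
		return (-1);
	close(sv[1]);
	error = send(sv[0], &c, sizeof(c), 0) == -1 ? errno : 0;
	close(sv[0]);
	return (error);
}
```

Calling send_to_closed_peer(SOCK_STREAM) is expected to yield EPIPE; per the commit message the same holds for SOCK_SEQPACKET.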
tests/unix_seqpacket: provide random data pumping test with MSG_EOR
Allocate a big chunk of randomly initialized memory. Send it to the peer
in random sized chunks, throwing MSG_EOR at randomly initialized offsets.
Receive into random sized chunks setting MSG_WAITALL randomly. Check that
MSG_EORs are where they should be, and check that MSG_WAITALL is obeyed but
overridden by MSG_EOR. And finally memcmp() what we receive.
David Marker [Mon, 8 Apr 2024 17:48:22 +0000 (10:48 -0700)]
ng_bridge: allow to automatically assign numbers to new hooks
This will allow userland machinery that orchestrates a bridge (e.g. a
jail or vm manager) to avoid duplicating the number allocation logic. See bug
278130 for a longer description and examples.
Kristof Provost [Thu, 18 Jan 2024 19:44:47 +0000 (20:44 +0100)]
netinet: add a probe point for IP, IP6, ICMP, ICMP6, UDP and TCP stats counters
When debugging network issues one common clue is an unexpectedly
incrementing error counter. This is helpful, in that it gives us an
idea of what might be going wrong, but often these counters may be
incremented in different functions.
Add a static probe point for them so that we can use dtrace to get
further information (e.g. a stack trace).
For example:
dtrace -n 'mib:ip:count: { printf("%d", arg0); stack(); }'
This can be disabled by setting the following kernel option:
options KDTRACE_NO_MIB_SDT
Rob Norris [Mon, 8 Apr 2024 13:07:32 +0000 (13:07 +0000)]
bhyvectl: generate usage from options table
The usage text had fallen out of sync with the actually available
options. Rather than keep them in sync by hand, just generate usage from
the available options.
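The general technique looks roughly like the sketch below, which walks a getopt_long(3) option table to produce the usage line. The option names and the format_usage() helper are illustrative assumptions, not bhyvectl's actual table or output format.

```c
#include <getopt.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical option table; the entries are illustrative only. */
static const struct option opts[] = {
	{ "vm",      required_argument, NULL, 0 },
	{ "create",  no_argument,       NULL, 0 },
	{ "destroy", no_argument,       NULL, 0 },
	{ NULL,      0,                 NULL, 0 }
};

/* Build the usage text from the same table that getopt_long() parses. */
static int
format_usage(const char *prog, const struct option *o, char *buf, size_t len)
{
	int n;

	n = snprintf(buf, len, "Usage: %s", prog);
	for (; o->name != NULL && n >= 0 && (size_t)n < len; o++) {
		n += snprintf(buf + n, len - n, " [--%s%s]", o->name,
		    o->has_arg == required_argument ? "=<value>" : "");
	}
	return (n);
}
```

Since the text is derived from the table, adding or removing an option updates the usage output without a second edit.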
LinuxKPI: Move [SU](8|16|32|64)_(MAX|MIN) defines to linux/limits.h
Some source files get them from linux/limits.h directly rather than from
linux/kernel.h.
While here, replace Linux constant values with the ones provided by sys/stdint.h.