Jessica Clarke [Mon, 30 Jan 2023 23:55:03 +0000 (23:55 +0000)]
libc: Fix longjmp/_longjmp(buf, 0) for MIPS
Like AArch64 and RISC-V in the past, MIPS fails to handle this special
case, and will cause the corresponding setjmp/_setjmp to return 0 rather
than 1. Fix this so the newly-added regression tests pass.
This is a direct commit to stable/13 as mips no longer exists in main.
Jessica Clarke [Mon, 9 Jan 2023 18:34:43 +0000 (18:34 +0000)]
libc: Fix longjmp/_longjmp(buf, 0) for AArch64 and RISC-V
These architectures fail to handle this special case, and will cause the
corresponding setjmp/_setjmp to return 0 rather than 1. Fix this and add
regression tests (also committed upstream).
Jake Freeland [Thu, 19 Jan 2023 22:24:44 +0000 (22:24 +0000)]
Makefile: Avoid sanitizing PATH on non-FreeBSD systems
Allow the build process to find host binaries during the host-symlinks target when
cross-building on non-FreeBSD systems. Whilst most non-FreeBSD systems have all
the needed tools in /sbin:/bin:/usr/sbin:/usr/bin:/usr/local/bin (the final
path added by host-symlinks itself), Homebrew for macOS on Arm defaults to
/opt/homebrew/bin, other more niche systems may also deviate and users may
expect tools in a customised PATH to be picked up, unlike on FreeBSD where we
want to ensure everything comes from base. In particular, (un)xz are needed
from Homebrew on macOS, and thus cannot be found on Arm without this.
Note that non-FreeBSD builds enforce BUILD_WITH_STRICT_TMPPATH, and so the
actual main build steps will still use a sanitised PATH.
freebsd32: Make sendmsg match native ABI for unpadded final control message
The API says that CMSG_SPACE should be used for msg_controllen, but in
practice the native ABI allows you to only use CMSG_LEN for the final
(typically only) control message, and real-world software does this,
including Wayland. For freebsd32, this is in practice mostly harmless,
since control messages are generally used to carry file descriptors,
which are already 4 bytes in size and thus no padding is needed, but
they can carry other quantities that may not result in an aligned
length. This was discovered after CheriBSD's freebsd64 equivalent was
updated to match the freebsd32 implementation, as that uses 8 byte
alignment which does break the file descriptor use case, and thus
Wayland.
This used to be addressed by aligning buflen before the first iteration,
but that allowed unwanted invalid inputs and was lost in 1b1428dcc82b,
with no safer equivalent put in its place.
Reviewed by: brooks, kib, markj
Obtained from: CheriBSD
Fixes: 1b1428dcc82b ("Fix a TOCTOU vulnerability in freebsd32_copyin_control().")
Differential Revision: https://reviews.freebsd.org/D36554
Brooks Davis [Wed, 24 Aug 2022 17:34:39 +0000 (18:34 +0100)]
freebsd32_sendmsg: fix control message ABI
When a freebsd32 caller uses all or most allowed space for control
messages (MCLBYTES == 2K) then the message may no longer fit when
the messages are padded for 64-bit alignment. Historically we've just
shrugged and said there is no ABI guarantee. We ran into this on
CheriBSD where a capsicumized 64-bit nm would fail when called with more
than 64 files.
Fix this by not gratutiously capping size of mbuf data we'll allocate
to MCLBYTES and let m_get2 allocate up to MJUMPAGESIZE (4K or larger).
Instead of hard-coding a length check, let m_get2 do it and check for a
NULL return.
Mark Johnston [Fri, 13 Jan 2023 15:01:00 +0000 (10:01 -0500)]
kvmclock: Fix initialization when EARLY_AP_STARTUP is not defined
To attach to the hypervisor, kvmclock needs to write a per-CPU MSR.
When EARLY_AP_STARTUP is not defined, device attach happens too early:
APs are not yet spun up, so smp_rendezvous only runs the callback on the
local CPU. As a result, the timecounter only gets initialized on the
BSP, and then timekeeping is broken on SMP systems.
Implement handling for !EARLY_AP_STARTUP kernels: keep track of the CPU
on which device attach ran, and then use a SI_SUB_SMP SYSINIT to
register the rest of the CPUs with the hypervisor.
Reported by: Shrikanth R Kamath <kshrikanth@juniper.net>
Reviewed by: kib, jhb (earlier versions)
Sponsored by: Klara, Inc.
Sponsored by: Juniper Networks, Inc.
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D37705
Jiajie Chen [Mon, 23 Jan 2023 16:36:59 +0000 (00:36 +0800)]
Add kf_file_nlink field to kf_file and populate it
This will allow user-space programs (e.g. lsof) to locate deleted files
whose nlink equals zero. Prior to this commit, programs has to use
stat(kf_path) to get nlink, but that will fail if the file is deleted.
* Automatically use IPv6 when IPv6 addresses are used, --ip6 is not needed.
* Building of ping requests and parsing of ping replies is done layer by
layer. This way most arguments are available both for IPv6 and IPv4,
for ICMP and TCP.
* Use argument groups for improved readability.
* Change ToS and TTL argument name to TC and HL to reflect the modern
IPv6 nomenclature. The argument still set related IPv4 header fields
properly.
* Instead of sniffing for the very specific case of duplicated packets,
allow for sniffing on multiple interfaces.
* Report which sniffer has failed by setting bits of error code.
* Raise meaningful exceptions when irrecoverable errors happen.
* Make IPv4 fragmentation flags configurable.
* Make IPv6 HL / IPv4 TTL configurable.
* Make TCP MSS configurable.
* Make TCP sequence number configurable.
* Make ICMP payload size configurable.
* Add debug output.
* Move command line argument parsing out of network functions.
* Make the code somehow PEP-8 compliant.
* Remove ambiguity of configuring recvif, it must be now explicitly specified.
* Don't catch exceptions around creating the sniffer, let it properly
fail and display the whole stack trace.
* Count correct packets so that duplicates can be found.
Rick Macklem [Sun, 15 Jan 2023 22:07:40 +0000 (14:07 -0800)]
nfsserver: Fix handling of SP4_NONE
For NFSv4.1/4.2, when the client specifies SP4_NONE for
state protection in the ExchangeID operation arguments,
the server MUST allow the state management operations for
any user credentials. (I misread the RFC and thought that
SP4_NONE meant "at the server's discression" and not MUST
be allowed.)
This means that the "sec=XXX" field of the "V4:" exports(5)
line only applies to NFSv4.0.
This patch fixes the server to always allow state management
operations for SP4_NONE, which is the only state management
option currently supported. (I have patches that add support
for SP4_MACH_CRED to the server. These will be in a future commit.)
In practice, this bug does not seem to have caused
interoperability problems.
Kyle Evans [Wed, 4 Jan 2023 05:21:10 +0000 (23:21 -0600)]
grep: properly switch EOL indicator with -z
-z is supposed to use only the NUL byte as EOL, but we were
inadvertently using both newline and NUL due to REG_NEWLINE in cflags.
The odds of anyone relying on this bsdgrep-specific bug are quite low,
so let's just fix it. At least one port in the wild has been reported
to expect the intended behavior.
Reported by: Hill Ma <maahiuzeon@gmail.com>
Triaged by: the self-proclaimed peanut gallery on Discord
Kristof Provost [Sat, 31 Dec 2022 18:23:15 +0000 (19:23 +0100)]
pf tests: test fast port re-use with syncookies
When a src/dst ip/port tuple is re-used before the pf state fully
expires we clean up the state and create a new one, unless syncookies
are enabled.
Test this, by running two back-to-back nc sessions, with a fixed source
port. Move the interface and IP to a different (vnet) jail, to trick the
network stack into letting us do this.
Kristof Provost [Sat, 31 Dec 2022 14:59:10 +0000 (15:59 +0100)]
pf: fix syncookies in conjunction with tcp fast port reuse
Basic scenario: we have a closed connection (In TCPS_FIN_WAIT_2), and
get a new connection (i.e. SYN) re-using the tuple.
Without syncookies we look at the SYN, and completely unlink the old,
closed state on the SYN.
With syncookies we send a generated SYN|ACK back, and drop the SYN,
never looking at the state table.
So when the ACK (i.e. the third step in the three way handshake for
connection setup) turns up, we’ve not actually removed the old state, so
we find it, and don’t do the syncookie dance, or allow the new
connection to get set up.
Explicitly check for this in pf_test_state_tcp(). If we find a state in
TCPS_FIN_WAIT_2 and the syncookie is valid we delete the existing state
so we can set up the new state.
Note that when we verify the syncookie in pf_test_state_tcp() we don't
decrement the number of half-open connections to avoid an incorrect
double decrement.
Kristof Provost [Fri, 13 Jan 2023 03:34:20 +0000 (04:34 +0100)]
pf: fix panic on deferred packets
The pfsync_defer_tmo() callout needs to set the correct vnet before it
can transmit packets. It used the rcvif in the mbuf to get this vnet,
but that doesn't work for locally originated traffic. In that case the
rcvif pointer is NULL, and the dereference leads to a panic.
Instead use the sc_sync_if, which is always set (if pfsync is enabled,
at least).
Alan Somers [Fri, 13 Jan 2023 20:19:03 +0000 (13:19 -0700)]
cal: don't print terminal control characters unless stdout is a TTY
A similar change was made in svn r223931, but it was incomplete, working
only when the utility was invoked as "ncal". Fix the same issue when
invoking as "cal".
Alan Somers [Sat, 7 Jan 2023 01:54:23 +0000 (18:54 -0700)]
fsx: bounds check the inputs
In particular, don't allow the user to specify a file size that can't be
expressed as an int, since fsx's random-number generator only has a 32
bit range.
Jose Luis Duran [Sun, 20 Nov 2022 02:56:49 +0000 (23:56 -0300)]
ping(8): man page cleanup
* Appease mandoc -T lint and igor
* Use example.com for documentation
* Update the IPv4 TTL section.
Update the IPv4 TTL section specifically for FreeBSD.
FreeBSD changed the default TTL to 64 in 5639e86bdd7ea151958776264bf5a67e60a54d68. NetBSD and OpenBSD still
use 255. Remove some references of extinct operating systems.
Alan Somers [Thu, 1 Dec 2022 16:49:57 +0000 (09:49 -0700)]
improvements to cap_sysctl.3
* Correct some function prototypes which were documented with the wrong
pointer type.
* Clarify return values and requirements for freeing the limit handle.
ena: Switch driver owners from semihalf to amazon in man file
1. Update ena.4 manual file to include amazon owner emails.
2. State that the driver is developed by amazon but leave
that it was originally written by Semihalf, similarly to other
drivers in the /share/man/ directory of the FreeBSD source code.
3. Advance year in copyright notice to 2022.
David Arinzon [Sun, 4 Dec 2022 10:37:32 +0000 (12:37 +0200)]
ena: Remove timer service re-arm on ena_restore_device failure
In case the reset sequence fails (ena_destroy_device() followed by
ena_restore_device() calls) during ena_restore_device(), the driver
resources are being freed. After the clean-up, the timer service is
re-armed in order to try and re-initialize the driver state.
But, such an attempt would fail given that the resources are freed.
Moreover, this would actually cause either the system to fail or a
panic.
When the driver fails in ena_restore_device() procedure, the only
recovery is either unloading and loading the driver or instance
reboot.
This change removes the timer service re-arm in case of failure
in ena_restore_device().
MFC after: 2 weeks
Sponsored by: Amazon, Inc. Fixes: 78554d0c707c ("ena: start timer service on attach")
(cherry picked from commit c4a85b8d684d3db9dc4d3d01d966130e21390529)
Commit [1] first added the ena_tx_buffer.print_once member,
so that a message about a missing tx completion is printed only
once per packet (and not every second when the watchdog runs).
In this commit print_once is initialized to true, and is set back
to false after detecting a missing tx completion and printing
a warning about it to dmesg.
Commit [2] incorrectly reverses the values assigned to print_once.
The variable is initialized to be true but is checked to be false
when a missing tx completion is detected. This is never true, and
therefore the warning print for each missing tx completion is never
printed since this commit.
Commit [3] added time passed since last TX cleanup to the missing
tx completions per-packet print. However, due to the issue in commit
[2], this time is never printed.
This commit reverses back the values assigned to ena_tx_buffer.print_once
erroneously by commit [2], bringing back to life the missing tx
completion per-packet print.
Also add a space after "." in the missing tx completion print.
[1] - 9b8d05b8ac78 ("Add support for Amazon Elastic Network Adapter (ENA) NIC")
[2] - 74dba3ad7851 ("Split function checking for missing TX completion in ENA driver")
[3] - d8aba82b5ca7 ("ena: Store ticks of last Tx cleanup")
Fixes: 74dba3ad7851 ("Split function checking for missing TX completion in ENA driver") Fixes: d8aba82b5ca7 ("ena: Store ticks of last Tx cleanup")
MFC after: 2 weeks
Sponsored by: Amazon, Inc.
Netlink is a communication protocol defined in RFC 3549. It is async,
TLV-based protocol, providing 1-1 and 1-many communications between kernel
and userland. Netlink is currently used in Linux kernel to modify, read and
subscribe for nearly all networking states. Interface state, addresses, routes,
firewall, rules, fibs, etc, are controlled via Netlink.
Netlink support was added in D36002. It has got a number of improvements and
first customers since then:
* net/bird2 got netlink support, enabling route multipath in FreeBSD
* netlink-based devd notifications are being worked on ( D37574 ).
* linux(4) fully supports and depends on Netlink
Enabling Netlink in GENERIC targets two goals.
The first one is to provide stability for the third-party userland applications,
so they can rely on the fact that netlink always exists since 14.0 and potentially 13.2.
Loadable module makes life of the app delepers harder. For example, `net/bird2` can be
either build with netlink or rtsock support, but not both.
The second goal is to enable gradual conversion of the base userland tools
to use netlink(4) interfaces. Converting tools like netstat (D36529), route,
ifconfig one-by-one simplifies testing and addressing the feedback.
Othewise, switching all base to use netlink at once may be too big of a leap.
This change targets amd64, the other architectures will follow soon.
Robert Wing [Mon, 5 Dec 2022 17:22:45 +0000 (08:22 -0900)]
bhyveload: open guest boot disk image O_RDWR
When a boot environment has been booted via the bootonce feature,
userboot clears the bootonce value from an nvlist but fails to write the
updated nvlist back to disk.
The failure occurs because bhyveload opens the guest boot disk image
O_RDONLY, fix this by opening it O_RDWR.
John Baldwin [Fri, 20 Jan 2023 17:58:38 +0000 (09:58 -0800)]
bhyve: Fix a buffer overread in the PCI hda device model.
The sc->codecs array contains HDA_CODEC_MAX (15) entries. The
guest-supplied cad field in the verb provided to hda_send_command is a
4-bit field that was used as an index into sc->codecs without any
bounds checking. The highest value (15) would overflow the array.
Other uses of sc->codecs in the device model used sc->codecs_no to
determine which array indices have been initialized, so use a similar
check to reject requests for uninitialized or invalid cad indices in
hda_send_command.
PR: 264582
Reported by: Robert Morris <rtm@lcs.mit.edu>
Reviewed by: corvink, markj, emaste
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D38128
John Baldwin [Fri, 20 Jan 2023 17:57:45 +0000 (09:57 -0800)]
bhyve: Fix a global buffer overread in the PCI hda device model.
hda_write did not validate the relative register offset before using
it as an index into the hda_set_reg_table array to lookup a function
pointer to execute after updating the register's value.
PR: 264435
Reported by: Robert Morris <rtm@lcs.mit.edu>
Reviewed by: corvink, markj, emaste
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D38127
John Baldwin [Thu, 19 Jan 2023 18:30:18 +0000 (10:30 -0800)]
bhyve: Remove vmctx argument from PCI device model methods.
Most of these arguments were unused. Device models which do need
access to the vmctx in one of these methods can obtain it from the
pi_vmctx member of the pci_devinst argument instead.
John Baldwin [Thu, 26 Jan 2023 22:24:55 +0000 (14:24 -0800)]
bhyve: Fix a mismerge in the PCI passthrough device model.
This block of code was removed in stable/13 commit 7ea16192a01e. It
was accidentally re-added in commit aa5eea98b99c (probably due to
resolving a conflict during the merge).
XHCI port and slot numbers are 1-based rather than 0-based. To handle
this, bhyve was subtracting one item from the pointers saved in the
softc so that index 1 accessed index 0 of the allocated array.
However, this is UB and confused GCC 12. The compiler noticed that
the calls to free() were using an offset and emitted a warning.
Rather than storing UB pointers in the softc, push the decrement
operation into the existing macros that wrap accesses to the relevant
arrays.
John Baldwin [Wed, 21 Dec 2022 18:33:18 +0000 (10:33 -0800)]
bhyve: Tidy vCPU pthread startup.
Set the thread affinity in fbsdrun_start_thread next to where the
thread name is set. This keeps all the pthread initialization
operations at the start of a thread in one place.
John Baldwin [Wed, 21 Dec 2022 18:32:24 +0000 (10:32 -0800)]
bhyve: Remove some no-op code for setting RIP.
fbsdrun_addcpu() read the current vCPU's RIP register from the kernel
via vm_get_register() to pass along through some layers to vm_loop()
which then set the register via vm_set_register(). However, this is
just always setting the value back to itself.
John Baldwin [Wed, 21 Dec 2022 18:31:16 +0000 (10:31 -0800)]
bhyve: Simplify setting vCPU capabilities.
- Enable VM_CAP_IPI_EXIT in fbsdrun_set_capabilities along with other
capabilities enabled on all vCPUs.
- Don't call fbsdrun_set_capabilities a second time on the BSP in
spinup_vcpu.
- To preserve previous behavior, don't unconditionally enable
unrestricted guest mode on the BSP (this unbreaks single-vCPU guests
on Nehalem systems, though supporting such setups is of dubious
value). Other places that enbale UG on the BSP are careful to check
the result of the operation and fail if it is not available.
- Don't set any capabilities in spinup_ap(). These are now all
redundant with earlier settings from spinup_vcpu().
- While here, axe a stale comment from fbsdrun_addcpu(). This
function is now always called from the main thread for all vCPUs.
John Baldwin [Fri, 9 Dec 2022 18:26:23 +0000 (10:26 -0800)]
vmm: Don't lock a vCPU for VM_PPTDEV_MSI[X].
These are manipulating state in a ppt(4) device none of which is
vCPU-specific. Mark the vcpu fields in the relevant ioctl structures
as unused, but don't remove them for now.
John Baldwin [Wed, 30 Nov 2022 21:06:46 +0000 (13:06 -0800)]
vmm: Remove stale comment for vm_rendezvous.
Support for rendezvous outside of a vcpu context (vcpuid of -1) was
removed in commit 949f0f47a4e7, and the vm, vcpuid argument pair was
replaced by a single struct vcpu pointer in commit d8be3d523dd5.
John Baldwin [Fri, 18 Nov 2022 18:05:35 +0000 (10:05 -0800)]
vmm: Allocate vCPUs on first use of a vCPU.
Convert the vcpu[] array in struct vm to an array of pointers and
allocate vCPUs on first use. This avoids always allocating VM_MAXCPU
vCPUs for each VM, but instead only allocates the vCPUs in use. A new
per-VM sx lock is added to serialize attempts to allocate vCPUs on
first use. However, a given vCPU is never freed while the VM is
active, so the pointer is read via an unlocked read first to avoid the
need for the lock in the common case once the vCPU has been created.
Some ioctls need to lock all vCPUs. To prevent races with ioctls that
want to allocate a new vCPU, these ioctls also lock the sx lock that
protects vCPU creation.
John Baldwin [Fri, 18 Nov 2022 18:05:10 +0000 (10:05 -0800)]
vmm: Use a cpuset_t for vCPUs waiting for STARTUP IPIs.
Retire the boot_state member of struct vlapic and instead use a cpuset
in the VM to track vCPUs waiting for STARTUP IPIs. INIT IPIs add
vCPUs to this set, and STARTUP IPIs remove vCPUs from the set.
STARTUP IPIs are only reported to userland for vCPUs that were removed
from the set.
In particular, this permits a subsequent change to allocate vCPUs on
demand when the vCPU may not be allocated until after a STARTUP IPI is
reported to userland.
Robert Wing [Fri, 20 Jan 2023 11:10:53 +0000 (11:10 +0000)]
vmm: take exclusive mem_segs_lock in vm_cleanup()
The consumers of vm_cleanup() are vm_reinit() and vm_destroy().
The vm_reinit() call path is, here vmmdev_ioctl() takes mem_segs_lock:
vmmdev_ioctl()
vm_reinit()
vm_cleanup(destroy=false)
The call path for vm_destroy() is (mem_segs_lock not taken):
sysctl_vmm_destroy()
vmmdev_destroy()
vm_destroy()
vm_cleanup(destroy=true)
Fix this by taking mem_segs_lock in vm_cleanup() when destroy == true.
Reviewed by: corvink, markj, jhb Fixes: 67b69e76e8ee ("vmm: Use an sx lock to protect the memory map.")
Differential Revision: https://reviews.freebsd.org/D38071
John Baldwin [Fri, 18 Nov 2022 18:04:37 +0000 (10:04 -0800)]
vmm: Use an sx lock to protect the memory map.
Previously bhyve obtained a "read lock" on the memory map for ioctls
needing to read the map by locking the last vCPU. This is now
replaced by a new per-VM sx lock. Modifying the map requires
exclusively locking the sx lock as well as locking all existing vCPUs.
Reading the map requires either locking one vCPU or the sx lock.
This permits safely modifying or querying the memory map while some
vCPUs do not exist which will be true in a future commit.
John Baldwin [Fri, 18 Nov 2022 18:04:11 +0000 (10:04 -0800)]
vmm vmx: Allocate vpids on demand as each vCPU is initialized.
Compared to the previous version this does mean that if the system as
a whole runs out of dedicated vPIDs you might end up with some vCPUs
within a single VM using dedicated vPIDs and others using shared
vPIDs, but this should not break anything.
John Baldwin [Fri, 18 Nov 2022 18:03:52 +0000 (10:03 -0800)]
vmm: Lookup vcpu pointers in vmmdev_ioctl.
Centralize mapping vCPU IDs to struct vcpu objects in vmmdev_ioctl and
pass vcpu pointers to the routines in vmm.c. For operations that want
to perform an action on all vCPUs or on a single vCPU, pass pointers
to both the VM and the vCPU using a NULL vCPU pointer to request
global actions.
John Baldwin [Fri, 18 Nov 2022 18:02:09 +0000 (10:02 -0800)]
vmm: Use struct vcpu in the instruction emulation code.
This passes struct vcpu down in place of struct vm and and integer
vcpu index through the in-kernel instruction emulation code. To
minimize userland disruption, helper macros are used for the vCPU
arguments passed into and through the shared instruction emulation
code.
A few other APIs used by the instruction emulation code have also been
updated to accept struct vcpu in the kernel including
vm_get/set_register and vm_inject_fault.
John Baldwin [Fri, 18 Nov 2022 18:01:57 +0000 (10:01 -0800)]
vmm: Add vm_gpa_hold_global wrapper function.
This handles the case that guest pages are being held not on behalf of
a virtual CPU but globally. Previously this was handled by passing a
vcpuid of -1 to vm_gpa_hold, but that will not work in the future when
vm_gpa_hold is changed to accept a struct vcpu pointer.
John Baldwin [Fri, 18 Nov 2022 18:00:00 +0000 (10:00 -0800)]
vmm: Remove the per-vm cookie argument from vmmops taking a vcpu.
This requires storing a reference to the per-vm cookie in the
CPU-specific vCPU structure. Take advantage of this new field to
remove no-longer-needed function arguments in the CPU-specific
backends. In particular, stop passing the per-vm cookie to functions
that either don't use it or only use it for KTR traces.
John Baldwin [Fri, 18 Nov 2022 17:59:21 +0000 (09:59 -0800)]
vmm: Refactor storage of CPU-dependent per-vCPU data.
Rather than storing static arrays of per-vCPU data in the CPU-specific
per-VM structure, adopt a more dynamic model similar to that used to
manage CPU-specific per-VM data.
That is, add new vmmops methods to init and cleanup a single vCPU.
The init method returns a pointer that is stored in 'struct vcpu' as a
cookie pointer. This cookie pointer is now passed to other vmmops
callbacks in place of the integer index. The index is now only used
in KTR traces and when calling back into the CPU-independent layer.