Kristof Provost [Sat, 31 Dec 2022 18:23:15 +0000 (19:23 +0100)]
pf tests: test fast port re-use with syncookies
When a src/dst ip/port tuple is re-used before the pf state fully
expires we clean up the state and create a new one, unless syncookies
are enabled.
Test this, by running two back-to-back nc sessions, with a fixed source
port. Move the interface and IP to a different (vnet) jail, to trick the
network stack into letting us do this.
Kristof Provost [Sat, 31 Dec 2022 14:59:10 +0000 (15:59 +0100)]
pf: fix syncookies in conjunction with tcp fast port reuse
Basic scenario: we have a closed connection (In TCPS_FIN_WAIT_2), and
get a new connection (i.e. SYN) re-using the tuple.
Without syncookies we look at the SYN, and completely unlink the old,
closed state on the SYN.
With syncookies we send a generated SYN|ACK back, and drop the SYN,
never looking at the state table.
So when the ACK (i.e. the third step in the three way handshake for
connection setup) turns up, we’ve not actually removed the old state, so
we find it, and don’t do the syncookie dance, or allow the new
connection to get set up.
Explicitly check for this in pf_test_state_tcp(). If we find a state in
TCPS_FIN_WAIT_2 and the syncookie is valid we delete the existing state
so we can set up the new state.
Note that when we verify the syncookie in pf_test_state_tcp() we don't
decrement the number of half-open connections to avoid an incorrect
double decrement.
Kristof Provost [Fri, 13 Jan 2023 03:34:20 +0000 (04:34 +0100)]
pf: fix panic on deferred packets
The pfsync_defer_tmo() callout needs to set the correct vnet before it
can transmit packets. It used the rcvif in the mbuf to get this vnet,
but that doesn't work for locally originated traffic. In that case the
rcvif pointer is NULL, and the dereference leads to a panic.
Instead use the sc_sync_if, which is always set (if pfsync is enabled,
at least).
Alan Somers [Fri, 13 Jan 2023 20:19:03 +0000 (13:19 -0700)]
cal: don't print terminal control characters unless stdout is a TTY
A similar change was made in svn r223931, but it was incomplete, working
only when the utility was invoked as "ncal". Fix the same issue when
invoking as "cal".
Alan Somers [Sat, 7 Jan 2023 01:54:23 +0000 (18:54 -0700)]
fsx: bounds check the inputs
In particular, don't allow the user to specify a file size that can't be
expressed as an int, since fsx's random-number generator only has a 32
bit range.
Jose Luis Duran [Sun, 20 Nov 2022 02:56:49 +0000 (23:56 -0300)]
ping(8): man page cleanup
* Appease mandoc -T lint and igor
* Use example.com for documentation
* Update the IPv4 TTL section.
Update the IPv4 TTL section specifically for FreeBSD.
FreeBSD changed the default TTL to 64 in 5639e86bdd7ea151958776264bf5a67e60a54d68. NetBSD and OpenBSD still
use 255. Remove some references of extinct operating systems.
Alan Somers [Thu, 1 Dec 2022 16:49:57 +0000 (09:49 -0700)]
improvements to cap_sysctl.3
* Correct some function prototypes which were documented with the wrong
pointer type.
* Clarify return values and requirements for freeing the limit handle.
ena: Switch driver owners from semihalf to amazon in man file
1. Update ena.4 manual file to include amazon owner emails.
2. State that the driver is developed by amazon but leave
that it was originally written by Semihalf, similarly to other
drivers in the /share/man/ directory of the FreeBSD source code.
3. Advance year in copyright notice to 2022.
David Arinzon [Sun, 4 Dec 2022 10:37:32 +0000 (12:37 +0200)]
ena: Remove timer service re-arm on ena_restore_device failure
In case the reset sequence fails (ena_destroy_device() followed by
ena_restore_device() calls) during ena_restore_device(), the driver
resources are being freed. After the clean-up, the timer service is
re-armed in order to try and re-initialize the driver state.
But, such an attempt would fail given that the resources are freed.
Moreover, this would actually cause either the system to fail or a
panic.
When the driver fails in ena_restore_device() procedure, the only
recovery is either unloading and loading the driver or instance
reboot.
This change removes the timer service re-arm in case of failure
in ena_restore_device().
MFC after: 2 weeks
Sponsored by: Amazon, Inc. Fixes: 78554d0c707c ("ena: start timer service on attach")
(cherry picked from commit c4a85b8d684d3db9dc4d3d01d966130e21390529)
Commit [1] first added the ena_tx_buffer.print_once member,
so that a message about a missing tx completion is printed only
once per packet (and not every second when the watchdog runs).
In this commit print_once is initialized to true, and is set back
to false after detecting a missing tx completion and printing
a warning about it to dmesg.
Commit [2] incorrectly reverses the values assigned to print_once.
The variable is initialized to be true but is checked to be false
when a missing tx completion is detected. This is never true, and
therefore the warning print for each missing tx completion is never
printed since this commit.
Commit [3] added time passed since last TX cleanup to the missing
tx completions per-packet print. However, due to the issue in commit
[2], this time is never printed.
This commit reverses back the values assigned to ena_tx_buffer.print_once
erroneously by commit [2], bringing back to life the missing tx
completion per-packet print.
Also add a space after "." in the missing tx completion print.
[1] - 9b8d05b8ac78 ("Add support for Amazon Elastic Network Adapter (ENA) NIC")
[2] - 74dba3ad7851 ("Split function checking for missing TX completion in ENA driver")
[3] - d8aba82b5ca7 ("ena: Store ticks of last Tx cleanup")
Fixes: 74dba3ad7851 ("Split function checking for missing TX completion in ENA driver") Fixes: d8aba82b5ca7 ("ena: Store ticks of last Tx cleanup")
MFC after: 2 weeks
Sponsored by: Amazon, Inc.
Netlink is a communication protocol defined in RFC 3549. It is async,
TLV-based protocol, providing 1-1 and 1-many communications between kernel
and userland. Netlink is currently used in Linux kernel to modify, read and
subscribe for nearly all networking states. Interface state, addresses, routes,
firewall, rules, fibs, etc, are controlled via Netlink.
Netlink support was added in D36002. It has got a number of improvements and
first customers since then:
* net/bird2 got netlink support, enabling route multipath in FreeBSD
* netlink-based devd notifications are being worked on ( D37574 ).
* linux(4) fully supports and depends on Netlink
Enabling Netlink in GENERIC targets two goals.
The first one is to provide stability for the third-party userland applications,
so they can rely on the fact that netlink always exists since 14.0 and potentially 13.2.
Loadable module makes life of the app delepers harder. For example, `net/bird2` can be
either build with netlink or rtsock support, but not both.
The second goal is to enable gradual conversion of the base userland tools
to use netlink(4) interfaces. Converting tools like netstat (D36529), route,
ifconfig one-by-one simplifies testing and addressing the feedback.
Othewise, switching all base to use netlink at once may be too big of a leap.
This change targets amd64, the other architectures will follow soon.
Robert Wing [Mon, 5 Dec 2022 17:22:45 +0000 (08:22 -0900)]
bhyveload: open guest boot disk image O_RDWR
When a boot environment has been booted via the bootonce feature,
userboot clears the bootonce value from an nvlist but fails to write the
updated nvlist back to disk.
The failure occurs because bhyveload opens the guest boot disk image
O_RDONLY, fix this by opening it O_RDWR.
John Baldwin [Fri, 20 Jan 2023 17:58:38 +0000 (09:58 -0800)]
bhyve: Fix a buffer overread in the PCI hda device model.
The sc->codecs array contains HDA_CODEC_MAX (15) entries. The
guest-supplied cad field in the verb provided to hda_send_command is a
4-bit field that was used as an index into sc->codecs without any
bounds checking. The highest value (15) would overflow the array.
Other uses of sc->codecs in the device model used sc->codecs_no to
determine which array indices have been initialized, so use a similar
check to reject requests for uninitialized or invalid cad indices in
hda_send_command.
PR: 264582
Reported by: Robert Morris <rtm@lcs.mit.edu>
Reviewed by: corvink, markj, emaste
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D38128
John Baldwin [Fri, 20 Jan 2023 17:57:45 +0000 (09:57 -0800)]
bhyve: Fix a global buffer overread in the PCI hda device model.
hda_write did not validate the relative register offset before using
it as an index into the hda_set_reg_table array to lookup a function
pointer to execute after updating the register's value.
PR: 264435
Reported by: Robert Morris <rtm@lcs.mit.edu>
Reviewed by: corvink, markj, emaste
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D38127
John Baldwin [Thu, 19 Jan 2023 18:30:18 +0000 (10:30 -0800)]
bhyve: Remove vmctx argument from PCI device model methods.
Most of these arguments were unused. Device models which do need
access to the vmctx in one of these methods can obtain it from the
pi_vmctx member of the pci_devinst argument instead.
John Baldwin [Thu, 26 Jan 2023 22:24:55 +0000 (14:24 -0800)]
bhyve: Fix a mismerge in the PCI passthrough device model.
This block of code was removed in stable/13 commit 7ea16192a01e. It
was accidentally re-added in commit aa5eea98b99c (probably due to
resolving a conflict during the merge).
XHCI port and slot numbers are 1-based rather than 0-based. To handle
this, bhyve was subtracting one item from the pointers saved in the
softc so that index 1 accessed index 0 of the allocated array.
However, this is UB and confused GCC 12. The compiler noticed that
the calls to free() were using an offset and emitted a warning.
Rather than storing UB pointers in the softc, push the decrement
operation into the existing macros that wrap accesses to the relevant
arrays.
John Baldwin [Wed, 21 Dec 2022 18:33:18 +0000 (10:33 -0800)]
bhyve: Tidy vCPU pthread startup.
Set the thread affinity in fbsdrun_start_thread next to where the
thread name is set. This keeps all the pthread initialization
operations at the start of a thread in one place.
John Baldwin [Wed, 21 Dec 2022 18:32:24 +0000 (10:32 -0800)]
bhyve: Remove some no-op code for setting RIP.
fbsdrun_addcpu() read the current vCPU's RIP register from the kernel
via vm_get_register() to pass along through some layers to vm_loop()
which then set the register via vm_set_register(). However, this is
just always setting the value back to itself.
John Baldwin [Wed, 21 Dec 2022 18:31:16 +0000 (10:31 -0800)]
bhyve: Simplify setting vCPU capabilities.
- Enable VM_CAP_IPI_EXIT in fbsdrun_set_capabilities along with other
capabilities enabled on all vCPUs.
- Don't call fbsdrun_set_capabilities a second time on the BSP in
spinup_vcpu.
- To preserve previous behavior, don't unconditionally enable
unrestricted guest mode on the BSP (this unbreaks single-vCPU guests
on Nehalem systems, though supporting such setups is of dubious
value). Other places that enbale UG on the BSP are careful to check
the result of the operation and fail if it is not available.
- Don't set any capabilities in spinup_ap(). These are now all
redundant with earlier settings from spinup_vcpu().
- While here, axe a stale comment from fbsdrun_addcpu(). This
function is now always called from the main thread for all vCPUs.
John Baldwin [Fri, 9 Dec 2022 18:26:23 +0000 (10:26 -0800)]
vmm: Don't lock a vCPU for VM_PPTDEV_MSI[X].
These are manipulating state in a ppt(4) device none of which is
vCPU-specific. Mark the vcpu fields in the relevant ioctl structures
as unused, but don't remove them for now.
John Baldwin [Wed, 30 Nov 2022 21:06:46 +0000 (13:06 -0800)]
vmm: Remove stale comment for vm_rendezvous.
Support for rendezvous outside of a vcpu context (vcpuid of -1) was
removed in commit 949f0f47a4e7, and the vm, vcpuid argument pair was
replaced by a single struct vcpu pointer in commit d8be3d523dd5.
John Baldwin [Fri, 18 Nov 2022 18:05:35 +0000 (10:05 -0800)]
vmm: Allocate vCPUs on first use of a vCPU.
Convert the vcpu[] array in struct vm to an array of pointers and
allocate vCPUs on first use. This avoids always allocating VM_MAXCPU
vCPUs for each VM, but instead only allocates the vCPUs in use. A new
per-VM sx lock is added to serialize attempts to allocate vCPUs on
first use. However, a given vCPU is never freed while the VM is
active, so the pointer is read via an unlocked read first to avoid the
need for the lock in the common case once the vCPU has been created.
Some ioctls need to lock all vCPUs. To prevent races with ioctls that
want to allocate a new vCPU, these ioctls also lock the sx lock that
protects vCPU creation.
John Baldwin [Fri, 18 Nov 2022 18:05:10 +0000 (10:05 -0800)]
vmm: Use a cpuset_t for vCPUs waiting for STARTUP IPIs.
Retire the boot_state member of struct vlapic and instead use a cpuset
in the VM to track vCPUs waiting for STARTUP IPIs. INIT IPIs add
vCPUs to this set, and STARTUP IPIs remove vCPUs from the set.
STARTUP IPIs are only reported to userland for vCPUs that were removed
from the set.
In particular, this permits a subsequent change to allocate vCPUs on
demand when the vCPU may not be allocated until after a STARTUP IPI is
reported to userland.
Robert Wing [Fri, 20 Jan 2023 11:10:53 +0000 (11:10 +0000)]
vmm: take exclusive mem_segs_lock in vm_cleanup()
The consumers of vm_cleanup() are vm_reinit() and vm_destroy().
The vm_reinit() call path is, here vmmdev_ioctl() takes mem_segs_lock:
vmmdev_ioctl()
vm_reinit()
vm_cleanup(destroy=false)
The call path for vm_destroy() is (mem_segs_lock not taken):
sysctl_vmm_destroy()
vmmdev_destroy()
vm_destroy()
vm_cleanup(destroy=true)
Fix this by taking mem_segs_lock in vm_cleanup() when destroy == true.
Reviewed by: corvink, markj, jhb Fixes: 67b69e76e8ee ("vmm: Use an sx lock to protect the memory map.")
Differential Revision: https://reviews.freebsd.org/D38071
John Baldwin [Fri, 18 Nov 2022 18:04:37 +0000 (10:04 -0800)]
vmm: Use an sx lock to protect the memory map.
Previously bhyve obtained a "read lock" on the memory map for ioctls
needing to read the map by locking the last vCPU. This is now
replaced by a new per-VM sx lock. Modifying the map requires
exclusively locking the sx lock as well as locking all existing vCPUs.
Reading the map requires either locking one vCPU or the sx lock.
This permits safely modifying or querying the memory map while some
vCPUs do not exist which will be true in a future commit.
John Baldwin [Fri, 18 Nov 2022 18:04:11 +0000 (10:04 -0800)]
vmm vmx: Allocate vpids on demand as each vCPU is initialized.
Compared to the previous version this does mean that if the system as
a whole runs out of dedicated vPIDs you might end up with some vCPUs
within a single VM using dedicated vPIDs and others using shared
vPIDs, but this should not break anything.
John Baldwin [Fri, 18 Nov 2022 18:03:52 +0000 (10:03 -0800)]
vmm: Lookup vcpu pointers in vmmdev_ioctl.
Centralize mapping vCPU IDs to struct vcpu objects in vmmdev_ioctl and
pass vcpu pointers to the routines in vmm.c. For operations that want
to perform an action on all vCPUs or on a single vCPU, pass pointers
to both the VM and the vCPU using a NULL vCPU pointer to request
global actions.
John Baldwin [Fri, 18 Nov 2022 18:02:09 +0000 (10:02 -0800)]
vmm: Use struct vcpu in the instruction emulation code.
This passes struct vcpu down in place of struct vm and and integer
vcpu index through the in-kernel instruction emulation code. To
minimize userland disruption, helper macros are used for the vCPU
arguments passed into and through the shared instruction emulation
code.
A few other APIs used by the instruction emulation code have also been
updated to accept struct vcpu in the kernel including
vm_get/set_register and vm_inject_fault.
John Baldwin [Fri, 18 Nov 2022 18:01:57 +0000 (10:01 -0800)]
vmm: Add vm_gpa_hold_global wrapper function.
This handles the case that guest pages are being held not on behalf of
a virtual CPU but globally. Previously this was handled by passing a
vcpuid of -1 to vm_gpa_hold, but that will not work in the future when
vm_gpa_hold is changed to accept a struct vcpu pointer.
John Baldwin [Fri, 18 Nov 2022 18:00:00 +0000 (10:00 -0800)]
vmm: Remove the per-vm cookie argument from vmmops taking a vcpu.
This requires storing a reference to the per-vm cookie in the
CPU-specific vCPU structure. Take advantage of this new field to
remove no-longer-needed function arguments in the CPU-specific
backends. In particular, stop passing the per-vm cookie to functions
that either don't use it or only use it for KTR traces.
John Baldwin [Fri, 18 Nov 2022 17:59:21 +0000 (09:59 -0800)]
vmm: Refactor storage of CPU-dependent per-vCPU data.
Rather than storing static arrays of per-vCPU data in the CPU-specific
per-VM structure, adopt a more dynamic model similar to that used to
manage CPU-specific per-VM data.
That is, add new vmmops methods to init and cleanup a single vCPU.
The init method returns a pointer that is stored in 'struct vcpu' as a
cookie pointer. This cookie pointer is now passed to other vmmops
callbacks in place of the integer index. The index is now only used
in KTR traces and when calling back into the CPU-independent layer.
John Baldwin [Fri, 18 Nov 2022 17:58:56 +0000 (09:58 -0800)]
vmm vmx: Add a global bool to indicate if the host has the TSC_AUX MSR.
A future commit will remove direct access to vCPU structures from
struct vmx, so add a dedicated boolean for this rather than checking
the capabilities for vCPU 0.
John Baldwin [Fri, 18 Nov 2022 17:58:41 +0000 (09:58 -0800)]
vmm: Rework snapshotting of CPU-specific per-vCPU data.
Previously some per-vCPU state was saved in vmmops_snapshot and other
state was saved in vmmops_vcmx_snapshot. Consolidate all per-vCPU
state into the latter routine and rename the hook to the more generic
'vcpu_snapshot'. Note that the CPU-independent per-vCPU data is still
stored in a separate blob as well as the per-vCPU local APIC data.
John Baldwin [Fri, 18 Nov 2022 17:57:29 +0000 (09:57 -0800)]
vmm: Simplify saving of absolute TSC values in snapshots.
Read the current "now" TSC value and use it to compute absolute time
saved value in vm_snapshot_vcpus rather than iterating over vCPUs
multiple times in vm_snapshot_vm.
John Baldwin [Tue, 29 Nov 2022 01:10:30 +0000 (17:10 -0800)]
bhyve: Avoid passing a possible garbage pointer to free().
All of the error paths in pci_vtcon_sock_add free the sock pointer.
However, sock is not initialized until part way through the function.
An early error would pass stack garbage to free().
John Baldwin [Tue, 29 Nov 2022 01:10:07 +0000 (17:10 -0800)]
bhyve: Appease warning about a potentially unaligned pointer.
When initializing the device model for a PCI pass through device that
uses MSI-X, bhyve reads the MSI-X capability from the real device to
save a copy in the emulated PCI config space. It also saves a copy in
a local struct msixcap on the stack. Since struct msixcap is packed,
GCC complains that casting a pointer to the struct to a uint32_t
pointer may result in an unaligned pointer.
This path is not performance critical, so to appease the compiler,
simply change the pointer to a char * and use memcpy to copy the 4
bytes read in each iteration of the loop.
John Baldwin [Tue, 29 Nov 2022 01:09:15 +0000 (17:09 -0800)]
bhyve: Avoid unlikely truncation of the blockif ident strings.
The ident string for NVMe and VirtIO block deivces do not contain the
bus, and the various fields can potentially use up to three characters
when printed as unsigned values (full range of uint8_t) even if not
likely in practice.
John Baldwin [Tue, 29 Nov 2022 01:08:09 +0000 (17:08 -0800)]
bhyve: Fix sign compare warnings in the e1000 device model.
Adding a bare constant to a uint16_t promotes to a signed int which
triggers these warnings. Changing the constant to be explicitly
unsigned instead promotes the expression to unsigned int.
Mark Johnston [Fri, 18 Nov 2022 19:10:33 +0000 (14:10 -0500)]
bhyve: Enable the default compiler warnings
Disable -Wcast-align for now since we have many instances of that
warning (I fixed some but not most of them) and platforms on which bhyve
runs don't particularly care about unaligned accesses.
John Baldwin [Thu, 26 Jan 2023 20:19:12 +0000 (12:19 -0800)]
bhyve: Mark pci_de_vinput as static.
Originally in main this was fixed in commit 37045dfa891a. However,
when that commit was merged to stable/13 in commit 976ed044fbbb,
pci_virtio_input was not yet merged to stable/13. This is a direct
commit to complete the earlier MFC.
Mark Johnston [Fri, 18 Nov 2022 19:07:20 +0000 (14:07 -0500)]
bhyve: Let BASL compile with raised warnings
- Make basl_dump() as unused.
- Avoid arithmetic on a void pointer.
- Avoid a signed/unsigned comparison with
BASL_TABLE_CHECKSUM_LEN_FULL_TABLE.
- Ignore warnings about unused parameters from stuff pulled in by
acpi.h. In particular, any prototype wrapped by
ACPI_DBG_DEPENDENT_RETURN_VOID() will raise such parameters unless
ACPI_DEBUG_OUTPUT is defined.
Mark Johnston [Fri, 18 Nov 2022 19:07:38 +0000 (14:07 -0500)]
bhyve: Avoid using a packed struct for xhci port registers
I believe the __packed annotation is there only because
pci_xhci_portregs_read() is treating the register set as an array of
uint32_t. clang warns about taking the address of portregs->portsc
because it is a packed member and thus might not have expected
alignment.
Fix the problem by simply selecting the field to read with a switch
statement. This mimics pci_xhci_portregs_write(). While here, switch
to using some symbolic constants.
There is a small semantic change here in that pci_xhci_portregs_read()
would silently truncate unaligned offsets. For consistency with
pci_xhci_portregs_write(), which does not do that, return all ones for
unaligned reads instead.