Warner Losh [Wed, 16 Aug 2023 07:42:14 +0000 (01:42 -0600)]
glob.h: Remove $FreeBSD$
This likely documented where this file was copied, but the $FreeBSD$
tag was lost as soon as it was committed. Just remove it. Also remove
the one that looked like it was intended to track versions. That will
simplify the MFC.
Dmitry Chagin [Mon, 14 Aug 2023 12:46:12 +0000 (15:46 +0300)]
linux(4): Fix MSG_CTRUNC handling in recvmsg()
The MSG_CTRUNC flag of the msg_flags member of the message header is
set upon successful completion if the control data was truncated.
Upon return from a successful call msg_controllen should contain the
length of the control message sequence.
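The behaviour described above can be observed from userland with a small sketch. It is not the linuxulator code: the helper name ctrunc_reported() is hypothetical, and it only assumes the standard socketpair(2), sendmsg(2) and recvmsg(2) interfaces. It passes one descriptor over a UNIX socket and reports whether the receiver saw MSG_CTRUNC for a given control buffer size.

```c
#include <sys/types.h>
#include <sys/socket.h>
#include <sys/uio.h>
#include <string.h>
#include <unistd.h>

/*
 * Hypothetical helper: send one descriptor over a socketpair and receive
 * it with a control buffer of ctl_len bytes.  Returns 1 if recvmsg()
 * reported MSG_CTRUNC, 0 if it did not, and -1 on any error.
 */
static int
ctrunc_reported(size_t ctl_len)
{
	union {
		struct cmsghdr hdr;	/* forces suitable alignment */
		char buf[CMSG_SPACE(sizeof(int))];
	} sctl, rctl;
	struct msghdr msg;
	struct iovec iov;
	struct cmsghdr *cm;
	char byte = 'x';
	int sv[2], fd, result = -1;

	if (ctl_len > sizeof(rctl.buf) ||
	    socketpair(AF_UNIX, SOCK_STREAM, 0, sv) != 0)
		return (-1);

	/* Sender: one data byte plus sv[0] as an SCM_RIGHTS payload. */
	memset(&sctl, 0, sizeof(sctl));
	memset(&msg, 0, sizeof(msg));
	iov.iov_base = &byte;
	iov.iov_len = 1;
	msg.msg_iov = &iov;
	msg.msg_iovlen = 1;
	msg.msg_control = sctl.buf;
	msg.msg_controllen = sizeof(sctl.buf);
	cm = CMSG_FIRSTHDR(&msg);
	cm->cmsg_level = SOL_SOCKET;
	cm->cmsg_type = SCM_RIGHTS;
	cm->cmsg_len = CMSG_LEN(sizeof(int));
	memcpy(CMSG_DATA(cm), &sv[0], sizeof(int));
	if (sendmsg(sv[0], &msg, 0) != 1)
		goto out;

	/* Receiver: offer only ctl_len bytes of control space. */
	memset(&rctl, 0, sizeof(rctl));
	memset(&msg, 0, sizeof(msg));
	msg.msg_iov = &iov;
	msg.msg_iovlen = 1;
	msg.msg_control = ctl_len != 0 ? rctl.buf : NULL;
	msg.msg_controllen = ctl_len;
	if (recvmsg(sv[1], &msg, 0) != 1)
		goto out;
	result = (msg.msg_flags & MSG_CTRUNC) != 0;

	/* If a descriptor did arrive, close it so the sketch does not leak. */
	cm = CMSG_FIRSTHDR(&msg);
	if (cm != NULL && cm->cmsg_level == SOL_SOCKET &&
	    cm->cmsg_type == SCM_RIGHTS &&
	    cm->cmsg_len >= CMSG_LEN(sizeof(int))) {
		memcpy(&fd, CMSG_DATA(cm), sizeof(fd));
		close(fd);
	}
out:
	close(sv[0]);
	close(sv[1]);
	return (result);
}
```

Calling it with no control buffer should report truncation, while a buffer of CMSG_SPACE(sizeof(int)) bytes should not.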
Dmitry Chagin [Mon, 14 Aug 2023 12:46:11 +0000 (15:46 +0300)]
linux(4): Fix control message size calculation again
It looks like Linux recvmsg() accepts a msg_controllen smaller than the
CMSG_SPACE of the buffer, at least in the case of a single cmsghdr. The
glibc misc/tst-scm_rights test succeeds on Ubuntu 23.04.
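The difference between the two sizes can be made concrete with the standard CMSG_LEN and CMSG_SPACE macros. This is only an illustration of the size relationship for a single cmsghdr, not the linuxulator's calculation; the helper names are hypothetical.

```c
#include <sys/types.h>
#include <sys/socket.h>
#include <stddef.h>

/* Bytes occupied by one cmsghdr carrying nfds descriptors, no trailing pad. */
static size_t
cmsg_len_for_fds(size_t nfds)
{
	return (CMSG_LEN(nfds * sizeof(int)));
}

/* The same message including the trailing alignment padding. */
static size_t
cmsg_space_for_fds(size_t nfds)
{
	return (CMSG_SPACE(nfds * sizeof(int)));
}

/*
 * Assumption taken from the commit message: for a single cmsghdr, a
 * buffer of CMSG_LEN bytes is enough, even if it is below CMSG_SPACE.
 */
static int
fits_single_cmsg(size_t buflen, size_t nfds)
{
	return (buflen >= cmsg_len_for_fds(nfds));
}
```

A receiver that sizes its buffer with CMSG_LEN rather than CMSG_SPACE therefore passes this check while still being smaller than the padded size on ABIs where the two differ.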
Ed Maste [Mon, 14 Aug 2023 20:35:34 +0000 (16:35 -0400)]
pci: return 0 for pci_remap_intr_method MSI-X non-error case
When remapping an MSI-X vector, we would always return ENOENT, even if
successful. This didn't really matter, as the sole caller of
BUS_REMAP_INTR also didn't check for errors.
Return 0 if there's no error, so that we can start handling (or at least
warning about) actual failures.
Reviewed by: jhb
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D41449
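The shape of the pci_remap_intr_method fix can be sketched in isolation. The structure and function names below are hypothetical and much simpler than the real PCI code; the point is only that the success path must return 0 instead of falling through to ENOENT.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for an MSI-X vector table entry. */
struct demo_vector {
	unsigned int irq;
	int remapped;
};

/* Old shape: the match is handled, but the function still returns ENOENT. */
static int
remap_buggy(struct demo_vector *v, size_t n, unsigned int irq)
{
	for (size_t i = 0; i < n; i++) {
		if (v[i].irq == irq) {
			v[i].remapped = 1;
			break;
		}
	}
	return (ENOENT);
}

/* Fixed shape: return 0 on success, ENOENT only if nothing matched. */
static int
remap_fixed(struct demo_vector *v, size_t n, unsigned int irq)
{
	for (size_t i = 0; i < n; i++) {
		if (v[i].irq == irq) {
			v[i].remapped = 1;
			return (0);
		}
	}
	return (ENOENT);
}
```

With the fixed shape a caller can finally distinguish a real failure from a successful remap.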
Historically, tftpd disallowed write requests to existing files
that are not publicly writable. Such a requirement is questionable at best.
Let us make it possible to run tftpd in a chrooted environment
while keeping files non-world-writable.
The new -S option enables write requests to existing files
in a chrooted run, subject to the generic file permissions.
It is ignored unless tftpd runs chrooted.
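The rule described above can be written as a small decision function. This is a sketch of the policy as stated in the commit message, not tftpd's actual code; the function name and the perms_allow parameter (standing in for "the ordinary permission check would succeed") are hypothetical.

```c
#include <sys/types.h>
#include <sys/stat.h>

/*
 * Decide whether a write request to an existing file is accepted.
 *   mode        - st_mode of the existing file
 *   chrooted    - nonzero if tftpd runs chrooted
 *   opt_S       - nonzero if -S was given
 *   perms_allow - nonzero if generic file permissions permit the write
 */
static int
overwrite_allowed(mode_t mode, int chrooted, int opt_S, int perms_allow)
{
	/* -S only takes effect in a chrooted run. */
	if (chrooted && opt_S)
		return (perms_allow);
	/* Historical rule: the file must be world-writable. */
	return ((mode & S_IWOTH) != 0);
}
```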
Shailend Chand [Fri, 2 Jun 2023 18:58:24 +0000 (11:58 -0700)]
Add gve, the driver for Google Virtual NIC (gVNIC)
gVNIC is a virtual network interface designed specifically for
Google Compute Engine (GCE). It is required to support per-VM Tier_1
networking performance, and for using certain VM shapes on GCE.
The NIC supports TSO, Rx and Tx checksum offloads, and RSS.
It does not currently do hardware LRO, and thus the software-LRO
in the host is used instead. It also supports jumbo frames.
For each queue, the driver negotiates a set of pages with the NIC to
serve as a fixed bounce buffer; this precludes the use of iflib.
Mike Karels [Tue, 8 Aug 2023 14:09:03 +0000 (09:09 -0500)]
md driver compat32: fix structure padding for arm, powerpc
Because the 32-bit md_ioctl structure contains 64-bit members, arm
and powerpc add padding to a multiple of 8. i386 doesn't do this.
The md_ioctl32 definition was correct for amd64/i386 without padding,
but wrong for arm64 and powerpc64. Make __packed__ conditional on
__amd64__, and test for the expected size on non-amd64. Note that
mdconfig is used in the ATF test suite. I verified the structure
size for powerpc, but was unable to test it.
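The padding effect can be reproduced with a two-member structure. The structures below are hypothetical and are not the md_ioctl32 layout; they only show how a 64-bit member changes the total size on ABIs that align it to 8 bytes, and how a packed attribute (GCC/Clang syntax assumed) removes that padding.

```c
#include <stddef.h>
#include <stdint.h>

/* Natural layout: ABIs that align uint64_t to 8 pad the tail to 16 bytes. */
struct demo_natural {
	uint64_t size;
	uint32_t flags;
};

/* Packed layout: always 12 bytes, matching an ABI without the padding. */
struct demo_packed {
	uint64_t size;
	uint32_t flags;
} __attribute__((__packed__));

/* The packed variant has a fixed size that can be checked at build time. */
_Static_assert(sizeof(struct demo_packed) == 12, "unexpected packed size");
```

A compile-time size check of this kind is one way to test for the expected size on each architecture.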
Corvin Köhne [Mon, 16 Aug 2021 07:50:15 +0000 (09:50 +0200)]
bhyve: add bootindex option for several devices
The bootindex option creates an entry in the "bootorder" fwcfg file.
This file can be picked up by the guest firmware to determine the
bootorder. Nevertheless, it's not guaranteed that the guest firmware
uses the bootorder. At the moment, our OVMF ignores the bootorder. This
will change in the future.
If guest firmware supports the "bootorder" fwcfg file and no device uses
the bootindex option, the boot order is determined by the firmware
itself. If one or more devices specify a bootindex, the first bootable
device with the lowest bootindex will be booted. It's not guaranteed that
devices without a bootindex will be recognized as bootable by the
firmware in that case.
Corvin Köhne [Mon, 16 Aug 2021 07:47:53 +0000 (09:47 +0200)]
bhyve: add helper to create a bootorder
Qemu's fwcfg allows defining a boot order. To do so, the hypervisor has
to create a fwcfg item named bootorder, which contains a newline-separated
list of boot entries. Qemu's OVMF will pick up the bootorder and apply
it.
At the moment, bhyve's OVMF doesn't support a custom bootorder provided
via qemu's fwcfg. However, in the future bhyve will gain support for
qemu's OVMF. Additionally, we can port relevant parts from qemu's to
bhyve's OVMF implementation.
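As a rough illustration of the file described here, the sketch below orders entries by bootindex (lowest first) and joins them into a newline-separated list. The structure, the function names and the entry strings are hypothetical; bhyve's real helper and the actual device path format are not shown in this log.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical boot entry: a bootindex plus an opaque entry string. */
struct boot_entry {
	int bootindex;
	const char *path;
};

static int
boot_entry_cmp(const void *a, const void *b)
{
	const struct boot_entry *x = a, *y = b;

	return ((x->bootindex > y->bootindex) - (x->bootindex < y->bootindex));
}

/*
 * Sort entries by ascending bootindex and write them to out, one per
 * line.  Returns 0 on success, -1 if out is too small.
 */
static int
build_bootorder(struct boot_entry *e, size_t n, char *out, size_t outlen)
{
	size_t used = 0;

	if (outlen == 0)
		return (-1);
	out[0] = '\0';
	qsort(e, n, sizeof(*e), boot_entry_cmp);
	for (size_t i = 0; i < n; i++) {
		size_t len = strlen(e[i].path);

		/* Need room for the entry, a newline and the terminator. */
		if (len + 2 > outlen - used)
			return (-1);
		memcpy(out + used, e[i].path, len);
		out[used + len] = '\n';
		used += len + 1;
		out[used] = '\0';
	}
	return (0);
}
```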
Corvin Köhne [Wed, 10 May 2023 11:44:28 +0000 (13:44 +0200)]
bhyve: pass address of OpRegion to the guest
Don't allow access to the physical ASLS register. It contains a host
address which is meaningless for the guest. Additionally, it allows the
guest to safely rewrite this register.
This is the last commit required for GVT-d. Nevertheless, it might not
work due to missing firmware support.
Corvin Köhne [Wed, 10 May 2023 11:39:56 +0000 (13:39 +0200)]
bhyve: copy OpRegion into guest memory
This makes the OpRegion accessible by the guest. However, the guest
doesn't know the address of the OpRegion. This will be fixed by an
upcoming commit.
The range of the OpRegion is added to the e820 table. This allows the
guest firmware to easily pick up this range and to reserve it properly.
Corvin Köhne [Wed, 10 May 2023 11:38:02 +0000 (13:38 +0200)]
bhyve: read OpRegion address and size for GVT-d
The OpRegion provides some configuration bits and ACPI methods used by
some Intel drivers. The guest needs access to it. In the first step,
we're reading its address and size.
Corvin Köhne [Thu, 11 May 2023 09:18:56 +0000 (11:18 +0200)]
bhyve: emulate graphics stolen memory register
This register contains a host physical address. This address is
meaningless for the guest. We have to emulate it and set it to a valid
guest physical address.
Corvin Köhne [Thu, 11 May 2023 09:10:07 +0000 (11:10 +0200)]
bhyve: allocate guest memory for graphics stolen memory
The graphics stolen memory is only GPU accessible. So, we don't have to
copy any data to it as the guest will be unable to access it anyway. We
just have to allocate and reserve some memory. That's done by adding an
E820 entry for the graphics stolen memory. The guest firmware will pick
up the E820 and reserve this range.
Note that we try to reuse the host address as Intel states that newer
Tiger Lake platforms need this [1].
Corvin Köhne [Thu, 11 May 2023 08:53:15 +0000 (10:53 +0200)]
bhyve: read out graphics stolen memory address and size
This is the first step to emulate the graphics stolen memory register.
Note that the graphics stolen memory is somewhat confusing. On the one
hand the Intel Open Source HD Graphics Programmers' Reference Manual
states that it's only GPU accessible. As the CPU can't access the area,
the guest shouldn't need it. On the other hand, the Intel GOP driver
refuses to work properly if it's not set to a proper address.
Intel itself maps it into the guest by EPT [1]. At the moment, we're not
aware of any situation where this EPT mapping is required, so we don't
do it yet.
Intel also states that the Windows driver for Tiger Lake reads the
address of the graphics stolen memory [2]. As the GVT-d code doesn't
support Tiger Lake in its first implementation, we can't check how it
behaves. We should keep an eye on it.
Corvin Köhne [Wed, 10 May 2023 10:22:33 +0000 (12:22 +0200)]
bhyve: add helper for passthru specific mmio ranges
Intel GPUs have two special memory regions. They are called Graphics
Stolen Memory and OpRegion. bhyve has to emulate both of them. In order
to keep track of those special regions, add generic mmio ranges to the
passthru emulation.
A TPM has an event log. Therefore, qemu adds a FwCfg item and adds it to
an ACPI table. We would like to use the same OVMF driver as qemu, so we should
do the same. This commit adds the ability to basl to do it.
bhyve: error out if fwcfg user file isn't read completely
At the moment, fwcfg reads the file once at startup and passes these
data to the guest. Therefore, we should always read the whole file.
Otherwise we should error out.
Additionally, GCC12 complains that the comparison whether
fwcfg_file->size is lower than 0 is always false due to the limited
range of data type.
Reviewed by: markj
Fixes: ca14781c8170f3517ae79e198c0c880dbc3142dd ("bhyve: add cmdline option for user defined fw_cfg items")
MFC after: 1 week
Sponsored by: Beckhoff Automation GmbH & Co. KG
Differential Revision: https://reviews.freebsd.org/D40076
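The "read the whole file or error out" rule can be sketched with a plain read(2) loop. This is not bhyve's fwcfg code and the function name is hypothetical. Keeping the byte count in a size_t and the read(2) result in an ssize_t also avoids comparing an unsigned size against 0.

```c
#include <sys/types.h>
#include <errno.h>
#include <stddef.h>
#include <unistd.h>

/*
 * Read exactly size bytes from fd into buf.  Returns 0 if the whole
 * amount was read and -1 on error or premature end of file.
 */
static int
read_whole(int fd, void *buf, size_t size)
{
	size_t done = 0;

	while (done < size) {
		ssize_t n = read(fd, (char *)buf + done, size - done);

		if (n < 0) {
			if (errno == EINTR)
				continue;	/* interrupted, retry */
			return (-1);
		}
		if (n == 0)
			return (-1);	/* EOF before the expected size */
		done += (size_t)n;
	}
	return (0);
}
```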
vmm and libvmmapi already have handlers for that. When adding debug
cpus, they were only used for the debug stub. Over time, they were
reused by other parts like snapshots or idle APs.
Vitaliy Gusev [Mon, 15 May 2023 14:28:45 +0000 (14:28 +0000)]
bhyve: add bus, slot and func to device name
Each device needs a unique identifier to store and restore snapshots
properly. Adding the pci bsf information to the device name creates a
unique identifier as a bsf can't be occupied twice.
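A minimal sketch of such a name follows. The "%s-%d:%d:%d" layout and the function names are assumptions made for illustration; the commit message does not state the exact format bhyve uses.

```c
#include <stdio.h>
#include <string.h>

/*
 * Compose "<emulation>-<bus>:<slot>:<func>" into buf.
 * Returns 0 on success, -1 if the name does not fit.
 */
static int
make_dev_name(char *buf, size_t len, const char *emu, int bus, int slot,
    int func)
{
	int n = snprintf(buf, len, "%s-%d:%d:%d", emu, bus, slot, func);

	return ((n < 0 || (size_t)n >= len) ? -1 : 0);
}

/* Two devices collide only if their composed names are identical. */
static int
dev_names_equal(const char *a, const char *b)
{
	return (strcmp(a, b) == 0);
}
```

Two devices of the same emulation type in different slots then get distinct names, which is what snapshot save and restore needs.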
Kyle Evans [Thu, 10 Aug 2023 17:32:33 +0000 (12:32 -0500)]
kern: osd: avoid dereferencing freed slots
If a slot is freed that isn't the last one, we'll set its destructor to
NULL to indicate that it's been freed and leave a hole in the slot map.
Check osd_destructors in osd_call() to avoid dereferencing a method that
is potentially from a module that's been unloaded.
This scenario would most commonly surface when two modules are loaded
that osd_register(), then the earlier one deregisters and an osd_call()
is made after the fact. In the specific report that triggered the
investigation, kldload if_wg -> kldload linux* -> kldunload if_wg ->
destroy a jail -> panic.
Noted in the review, but left for follow-up work, is that the realloc
that may happen in osd_deregister() should likely go away and the
assumption that reallocating to a smaller size cannot fail is actually
not correct.
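The check can be sketched with a simplified slot table. This is not the kernel's osd implementation: the table layout, the fixed slot count and all names below are hypothetical. A freed slot keeps its stale method pointer, standing in for a pointer into an unloaded module, and is skipped because its destructor is NULL.

```c
#include <stddef.h>

#define DEMO_NSLOTS 4

typedef void (*demo_destructor_t)(void *);
typedef int (*demo_method_t)(void *obj, void *data);

/* A NULL destructor marks a freed slot; its method pointer may be stale. */
static demo_destructor_t demo_destructors[DEMO_NSLOTS];
static demo_method_t demo_methods[DEMO_NSLOTS];

static void
demo_register(size_t slot, demo_method_t m, demo_destructor_t d)
{
	demo_methods[slot] = m;
	demo_destructors[slot] = d;
}

/* Free a slot in the middle: only the destructor is cleared. */
static void
demo_deregister(size_t slot)
{
	demo_destructors[slot] = NULL;
}

/* Call every live slot's method; stop at the first error. */
static int
demo_call(void *obj, void *data)
{
	int error = 0;

	for (size_t i = 0; i < DEMO_NSLOTS; i++) {
		if (demo_destructors[i] == NULL)
			continue;	/* freed slot, do not touch its method */
		if (demo_methods[i] != NULL &&
		    (error = demo_methods[i](obj, data)) != 0)
			break;
	}
	return (error);
}

/* Helpers used to exercise the sketch. */
static int demo_live_calls;
static void demo_noop_destructor(void *p) { (void)p; }
static int demo_live_method(void *o, void *d)
{
	(void)o; (void)d;
	demo_live_calls++;
	return (0);
}
static int demo_stale_method(void *o, void *d)
{
	(void)o; (void)d;
	return (-1);	/* stands in for a call into unloaded code */
}
```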
Most em(4) devices now enjoy TSO and TSO6, matching NetBSD and Linux
defaults.
A prior commit automatically masks TSO on 10/100 Ethernet due to errata,
and other IPv6 bugs were fixed recently, which makes this possible.
Mike Karels identified a performance anomaly on Intel 82574L devices.
These are multiqueue enabled on FreeBSD since the conversion to
iflib. I am investigating whether this can be fixed; in the meantime,
MSI-X with checksum offloads remains the default.
i219 SPT devices have an erratum that downclocks the DMA engine, which
results in TSO not being able to achieve line rate. Therefore, it is
disabled on:
* Intel(R) I219-LM and I219-V SPT
* Intel(R) I219-LM and I219-V SPT-H (2)
* Intel(R) I219-LM and I219-V LBG (3)
* Intel(R) I219-LM and I219-V SPT (4)
* Intel(R) I219-LM and I219-V SPT (5)
Many lem(4) devices enjoy TSO, exceptions being 82542, 82543, 82547.
TSO6 may be possible for some chipsets but I am still working through
my testing matrix and that is hidden behind hw.em.unsupported_tso.
If you encounter issues, you may disable TSO with for example:
ifconfig em0 -tso -tso6.
I ask to be informed of any deviations from normal operation requiring
this.
Thanks to cc@ for access to emulab.net.
On a sample I219 system it saves about 16% CPU on IPv4 and 19% on IPv6.
Kevin Bowling [Tue, 15 Aug 2023 21:37:43 +0000 (14:37 -0700)]
e1000: Fix off by one ipcse
This has been off by one in the FreeBSD drivers as far back as I've
looked. Empirically, the HW and SW emulations I have available don't seem
to mind. Noticed while debugging other issues.
Ed Maste [Tue, 8 Aug 2023 23:42:09 +0000 (19:42 -0400)]
msi: report error for attempt to use APIC ID > 255
The MSI/MSI-X address includes 8 bits to encode the Destination ID.
Previously IDs over 255 overlapped with the fixed portion of the
address, resulting in an invalid value (and a nonfunctional interrupt).
Instead, print an error message and return EINVAL. The interrupt will
still not work, but the user will have a clue as to why.
PR: 273022
Reviewed by: jhb
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D41395
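The overlap is easy to see with the x86 MSI address layout, where the fixed 0xFEE prefix occupies bits 31:20 and the 8-bit Destination ID occupies bits 19:12. The sketch below is illustrative only; the macro and function names are hypothetical and not the kernel's.

```c
#include <errno.h>
#include <stdint.h>

#define DEMO_MSI_ADDR_BASE	0xfee00000u	/* fixed bits 31:20 */
#define DEMO_MSI_DEST_SHIFT	12		/* Destination ID in bits 19:12 */
#define DEMO_MSI_DEST_MAX	255u		/* 8 bits available */

/* Unchecked composition: IDs above 255 spill into the fixed prefix. */
static uint32_t
msi_addr_unchecked(uint32_t apic_id)
{
	return (DEMO_MSI_ADDR_BASE | (apic_id << DEMO_MSI_DEST_SHIFT));
}

/* Checked composition: reject IDs that do not fit, as the commit does. */
static int
msi_addr_checked(uint32_t apic_id, uint32_t *addr)
{
	if (apic_id > DEMO_MSI_DEST_MAX)
		return (EINVAL);
	*addr = msi_addr_unchecked(apic_id);
	return (0);
}
```

For APIC ID 256 the unchecked form yields 0xfef00000, which no longer carries the 0xFEE prefix, so the interrupt cannot be delivered.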
John Baldwin [Mon, 12 Jun 2023 10:47:35 +0000 (12:47 +0200)]
bhyve: Remove vestigial support for setting max vCPUs.
The kernel part of the hypervisor is not going to support per-VM maxcpu
limits. The topology is only used to control the values returned by
CPUID leaves for which max vCPUs is not relevant.
Bojan Novković [Tue, 9 May 2023 07:02:04 +0000 (09:02 +0200)]
bhyve: fix vCPU single-stepping on VMX
This patch fixes virtual machine single stepping on VMX hosts.
Currently, when using bhyve's gdb stub, each attempt at single-stepping
a vCPU lands in a timer interrupt. The current single-stepping mechanism
uses the Monitor Trap Flag feature to cause VMEXIT after a single
instruction is executed. Unfortunately, the SDM states that MTF causes
VMEXITs for the next instruction that gets executed, which is often not
what the person using the debugger expects. [1]
This patch adds a new VM capability that masks interrupts on a vCPU by
blocking interrupt injection and modifies the gdb stub to use the newly
added capability while single-stepping a vCPU.
Corvin Köhne [Tue, 9 May 2023 12:32:33 +0000 (14:32 +0200)]
bhyve: don't panic if e820 finds no available memory
The GVT-d emulation tries to allocate some specific memory. It could
happen that this address doesn't exist. In that case, GVT-d will fall
back to allocate any address. Nevertheless, this only works if the e820
fails with an error instead of exiting on an assertion.
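The fallback only works if the allocator reports failure to its caller. A simplified sketch of that pattern follows; the range table and all function names are hypothetical and unrelated to bhyve's real e820 code, and for brevity a whole range is consumed by each allocation.

```c
#include <stddef.h>
#include <stdint.h>

struct demo_range {
	uint64_t base;
	uint64_t len;
	int available;
};

/* Try to allocate len bytes at addr.  Returns addr, or 0 on failure. */
static uint64_t
demo_alloc_at(struct demo_range *r, size_t n, uint64_t addr, uint64_t len)
{
	for (size_t i = 0; i < n; i++) {
		if (r[i].available && len <= r[i].len &&
		    addr >= r[i].base && addr - r[i].base <= r[i].len - len) {
			r[i].available = 0;
			return (addr);
		}
	}
	return (0);	/* report failure instead of asserting */
}

/* Allocate len bytes anywhere.  Returns the base, or 0 on failure. */
static uint64_t
demo_alloc_any(struct demo_range *r, size_t n, uint64_t len)
{
	for (size_t i = 0; i < n; i++) {
		if (r[i].available && len <= r[i].len) {
			r[i].available = 0;
			return (r[i].base);
		}
	}
	return (0);
}

/* Prefer a specific address, then fall back to any address. */
static uint64_t
demo_alloc_preferred(struct demo_range *r, size_t n, uint64_t addr,
    uint64_t len)
{
	uint64_t got = demo_alloc_at(r, n, addr, len);

	if (got == 0)
		got = demo_alloc_any(r, n, len);
	return (got);
}
```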
Yan Ka Chiu [Tue, 23 May 2023 20:39:22 +0000 (16:39 -0400)]
ifconfig(8): Teach ifconfig to attach and run itself in a jail
Add a -j <jail> flag to ifconfig to allow it to attach to and run inside a
jail. This allows a parent to configure the network interfaces of its
children even if ifconfig is not available in the child's tree (e.g. Linux
jails).