Gleb Smirnoff [Wed, 7 Dec 2022 17:00:48 +0000 (09:00 -0800)]
tcp: embed inpcb into tcpcb
For the TCP protocol inpcb storage specify allocation size that would
provide space to most of the data a TCP connection needs, embedding
into struct tcpcb several structures, that previously were allocated
separately.
The most important one is the inpcb itself. With embedding we can provide
strong guarantee that with a valid TCP inpcb the tcpcb is always valid
and vice versa. Also we reduce number of allocs/frees per connection.
The embedded inpcb is placed in the beginning of the struct tcpcb,
since in_pcballoc() requires that. However, later we may want to move
it around for cache line efficiency, and this can be done with a little
effort. The new intotcpcb() macro is ready for such move.
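The embedding described above can be sketched in a few lines of C. This is only an illustration under stated assumptions: the `demo_` structures and macros are hypothetical stand-ins, not the kernel definitions. The sketch shows why an offset-based conversion macro keeps working if the embedded member is later moved.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the kernel structures; fields are illustrative. */
struct demo_inpcb {
	int inp_flags;
};

struct demo_tcpcb {
	struct demo_inpcb t_inpcb;	/* embedded, currently the first member */
	int t_state;
};

/* Mirrors the stated allocator requirement that the inpcb comes first. */
static_assert(offsetof(struct demo_tcpcb, t_inpcb) == 0,
    "embedded inpcb is expected at the start of the tcpcb");

/*
 * Recover the enclosing tcpcb from a pointer to its embedded inpcb.
 * Subtracting offsetof() instead of casting directly keeps the macro
 * correct even if t_inpcb no longer sits at offset zero.
 */
#define	demo_intotcpcb(inp)						\
	((struct demo_tcpcb *)((char *)(inp) -				\
	    offsetof(struct demo_tcpcb, t_inpcb)))

/* The reverse direction is simply the address of the member. */
#define	demo_tptoinpcb(tp)	(&(tp)->t_inpcb)
```

Because both objects live in one allocation, a valid pointer to either one implies the other is valid too, which is the guarantee the commit message refers to.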
The congestion algorithm data, the TCP timers and osd(9) data are
also embedded into tcpcb, and the temporary struct tcpcb_mem goes away.
There was no extra allocation here, but we went through extra pointer
every time we accessed this data.
One interesting side effect is that now TCP data is allocated from
SMR-protected zone. Potentially this allows the TCP stacks or other
TCP related modules to utilize that for their own synchronization.
Large part of the change was done with sed script:
Doug Rabson [Wed, 7 Dec 2022 13:40:18 +0000 (13:40 +0000)]
Fix a typo in the binmisc option name
This should be spelt IMGACT_BINMISC to match the filename. The option
name does not appear outside of sys/conf and this module is typically
used via the kernel module imgact_binmisc.ko.
Notable upstream pull request merges:
#13782 Fix setting the large_block feature after receiving a snapshot
#14157 FreeBSD: stop using buffer cache-only routines on sync
#14172 zed: post a udev change event from spa_vdev_attach()
#14181 zed: unclean disk attachment faults the vdev
#14190 Bump checksum error counter before reporting to ZED
#14196 Remove atomics from zh_refcount
#14197 Don't leak packed received properties
#14198 Switch dnode stats to wmsums
#14199 Remove few pointer dereferences in dbuf_read()
#14200 Micro-optimize zrl_remove()
#14204 Lua: Fix bad bitshift in lua_strx2number()
#14212 Zstd fixes
#14218 Avoid a null pointer dereference in zfs_mount() on FreeBSD
#14235 nopwrites on dmu_sync-ed blocks can result in a panic
#14236 zio can deadlock during device removal
#14247 Micro-optimize fletcher4 calculations
#14261 FreeBSD: zfs_register_callbacks() must implement error check
correctly
Andrew Gallatin [Tue, 6 Dec 2022 16:35:18 +0000 (11:35 -0500)]
ixl: silence runtime warning when PCI_IOV is not enabled
When PCI_IOV is not enabled, do not attempt to call
iflib_softirq_alloc_generic(...IFLIB_INTR_IOV), as it results
in boot-time warnings similar to:
taskqgroup_attach_cpu: qid not found for iov cpu=2
ixl2: taskqgroup_attach_cpu failed 22
Instead, make it conditional on PCI_IOV like the other
SR-IOV related code.
Austin Shafer [Tue, 6 Dec 2022 15:25:53 +0000 (16:25 +0100)]
linuxkpi: Fix return value of dma_map_sgtable
dma_map_sgtable internally uses the dma_map_sg_attrs helper. The problem is
that dma_map_sg_attrs returns the number of entries mapped, whereas
dma_map_sgtable returns nonzero on failure. This leads to dma_map_sgtable
returning non-zero-but-positive values which tricks other areas of the stack
into thinking nents is a valid pointer.
This checks if nents is valid and returns zero if so, updating the nents field
in sgt. This fixes PRIME render offload with nvidia-drm.
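The mismatch between "number of entries mapped" and "zero on success" can be illustrated with a small, self-contained sketch. The `demo_` types and functions are hypothetical simplifications, not the LinuxKPI code, and the failure convention of the lower-level helper (a negative value) is an assumption made for the example.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified scatter/gather table. */
struct demo_sg_table {
	int nents;	/* number of mapped entries */
	int orig_nents;	/* number of entries supplied by the caller */
};

/*
 * Hypothetical stand-in for the lower-level helper: returns the number
 * of entries mapped, or a negative value on failure (assumed here).
 */
static int
demo_map_sg(int orig_nents)
{
	return (orig_nents > 0 ? orig_nents : -1);
}

/* Wrapper with "zero on success, nonzero on failure" semantics. */
static int
demo_map_sgtable(struct demo_sg_table *sgt)
{
	int nents;

	assert(sgt != NULL);
	nents = demo_map_sg(sgt->orig_nents);
	if (nents < 0)
		return (nents);	/* propagate the failure */
	sgt->nents = nents;	/* record the count in the table */
	return (0);		/* success is reported as zero */
}
```

Returning the positive count directly would make every successful mapping look like an error to callers that test the result against zero.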
Corvin Köhne [Wed, 30 Nov 2022 14:46:19 +0000 (15:46 +0100)]
bhyve: build SPCR ACPI table
OVMF ships some static ACPI tables. This worked in the past but won't
work in the future when we support devices like TPMs. They require a TPM
ACPI table. So, we have to dynamically create ACPI tables depending on
the bhyve configuration.
Bhyve has much more information about the system than OVMF. Therefore,
it's easier for bhyve to build up some ACPI tables. For that reason, it
would be much better to use the ACPI tables provided by bhyve instead of
building some tables by OVMF.
At the moment, OVMF always creates a SPCR table. Maybe someone depends
on it. So, we have to build it by bhyve too before we can patch OVMF to
install the tables provided by bhyve.
Warner Losh [Mon, 5 Dec 2022 23:57:58 +0000 (16:57 -0700)]
Revert "newbus: Change attach failure behavior"
This reverts commit 68c3f0302106643207dcdfe3b414810e245228e5. There are
some weird crashes when KVMs switch caused by this, so revert this
commit until they are sorted out.
Warner Losh [Mon, 5 Dec 2022 17:40:15 +0000 (10:40 -0700)]
stand: update prototypes for md_load and md_load64
These are declared as extern in a number of files (some with the wrong
return type). Centralize this in modinfo.h and remove a few extra stray
declarations as well that are no longer used. No functional change.
Note: I've not tried to cope with the bi_load() functions which are the
same logical thing. These will be handled separately.
Richard Yao [Mon, 5 Dec 2022 19:00:34 +0000 (14:00 -0500)]
Micro-optimize fletcher4 calculations
When processing abds, we execute 1 `kfpu_begin()`/`kfpu_end()` pair on
every page in the abd. This is wasteful and slows down checksum
performance versus what the benchmark claimed. We correct this by moving
those calls to the init and fini functions.
Also, we always check the buffer length against 0 before calling the
non-scalar checksum functions. This means that we do not need to execute
the loop condition for the first loop iteration. That allows us to
micro-optimize the checksum calculations by switching to do-while loops.
Note that we do not apply that micro-optimization to the scalar
implementation because there is no check in
`fletcher_4_incremental_native()`/`fletcher_4_incremental_byteswap()`
against 0 sized buffers being passed.
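The loop-shape change can be sketched as follows. This is plain C used only to show the do-while form; the commit applies it to the non-scalar implementations, and the `demo_` names are hypothetical. The sketch assumes the caller performs the zero-length check once, as the message describes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* The four running sums of a Fletcher-4 checksum. */
struct demo_fletcher4 {
	uint64_t a, b, c, d;
};

/*
 * Inner loop. The caller guarantees nwords > 0, so the loop condition
 * does not need to be evaluated before the first iteration.
 */
static void
demo_fletcher4_nonempty(struct demo_fletcher4 *ctx, const uint32_t *ip,
    size_t nwords)
{
	const uint32_t *end = ip + nwords;

	assert(nwords > 0);
	do {
		ctx->a += *ip;
		ctx->b += ctx->a;
		ctx->c += ctx->b;
		ctx->d += ctx->c;
	} while (++ip < end);
}

/* Caller-side wrapper that performs the length check exactly once. */
static void
demo_fletcher4(struct demo_fletcher4 *ctx, const uint32_t *ip, size_t nwords)
{
	if (nwords == 0)
		return;
	demo_fletcher4_nonempty(ctx, ip, nwords);
}
```

Without the caller-side check, the do-while form would read one word from an empty buffer, which is why the message excludes code paths that lack that check.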
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Closes #14247
Richard Yao [Mon, 5 Dec 2022 18:16:50 +0000 (13:16 -0500)]
FreeBSD: zfs_register_callbacks() must implement error check correctly
I read the following article and noticed a couple of ZFS bugs mentioned:
https://pvs-studio.com/en/blog/posts/cpp/0377/
I decided to search for them in the modern OpenZFS codebase and then
found one that matched the description of the first one:
V593 Consider reviewing the expression of the 'A = B != C' kind. The
expression is calculated as following: 'A = (B != C)'. zfs_vfsops.c 498
The consequence of this is that the error value is replaced with `1`
when there is an error. When there is no error, 0 is correctly passed.
This is a very minor issue that is unlikely to cause any real problems.
The incorrect error code would either be returned to the mount command
on a failure or any of `zfs receive`, `zfs recv`, `zfs rollback` or `zfs
upgrade`.
The second one has already been fixed.
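The precedence problem behind the V593 warning can be reproduced in isolation. The sketch below uses hypothetical `demo_` functions and a made-up error code; it is not the zfs_vfsops.c code. The buggy variant is written with explicit parentheses to show how the compiler parses `error = f() != 0`.

```c
#include <assert.h>

#define	DEMO_EINVAL	22	/* hypothetical error code */

static_assert(DEMO_EINVAL != 1,
    "the demo needs an error code that is distinguishable from 1");

/* Pretend that registration always fails with DEMO_EINVAL. */
static int
demo_register(void)
{
	return (DEMO_EINVAL);
}

/* How `error = demo_register() != 0` is parsed: the comparison binds first. */
static int
demo_buggy(void)
{
	int error;

	error = (demo_register() != 0);	/* error becomes 1, the code is lost */
	return (error);
}

/* Intended form: assign first, then compare. */
static int
demo_fixed(void)
{
	int error;

	if ((error = demo_register()) != 0)
		return (error);
	return (0);
}
```

Both variants still signal failure with a nonzero value, which matches the message's assessment that the bug only affects which error code the caller sees.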
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Damian Szuberski <szuberskidamian@gmail.com>
Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Closes #14261
Robert Wing [Mon, 5 Dec 2022 17:22:45 +0000 (08:22 -0900)]
bhyveload: open guest boot disk image O_RDWR
When a boot environment has been booted via the bootonce feature,
userboot clears the bootonce value from an nvlist but fails to write the
updated nvlist back to disk.
The failure occurs because bhyveload opens the guest boot disk image
O_RDONLY, fix this by opening it O_RDWR.
Kristof Provost [Thu, 1 Dec 2022 15:20:24 +0000 (16:20 +0100)]
if_ovpn: extend notifications with a reason
Extend peer deleted notifications (which are the only type right now) to
include the reason the peer was deleted. This can be either because
userspace requested it, or because the peer timed out.
John Baldwin [Mon, 5 Dec 2022 00:31:05 +0000 (16:31 -0800)]
posixshm_test: Fix sign mismatches in ?: results.
GCC 12's -Wsign-compare complains if the two alternative results of
the ?: operator are differently signed. Cast the small, sub-page
off_t values to size_t to quiet the warning.
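A minimal sketch of the kind of expression involved, assuming a hypothetical helper `demo_chunk()` that is not taken from the test itself: one arm of `?:` is a signed `off_t`, the other an unsigned `size_t`, and casting the known-small, non-negative `off_t` gives both arms the same type.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/types.h>

/*
 * Return the smaller of a sub-page offset and a buffer length.
 * `off` is signed (off_t) while `len` is unsigned (size_t); the casts
 * make the comparison and both arms of ?: consistently unsigned.
 */
static size_t
demo_chunk(off_t off, size_t len)
{
	assert(off >= 0);	/* the cast is only safe for non-negative values */
	return ((size_t)off < len ? (size_t)off : len);
}
```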
John Baldwin [Mon, 5 Dec 2022 00:27:22 +0000 (16:27 -0800)]
libsa: Disable -Wdangling-pointer for zfs.c.
GCC 12 warns about a dangling pointer to 'objid' in
zfs_bootenv_initial(). However, this appears to be a false positive
as the pointer to 'objid' is only passed to zfs_lookup_dataset() but
not saved anywhere that outlives the lifetime of the
zfs_bootenv_initial() function.
John Baldwin [Wed, 30 Nov 2022 22:56:19 +0000 (14:56 -0800)]
Explicitly set CXXSTD to c++11 for old C++ code using std::auto_ptr<>.
GCC 12 defaults to C++17 which removes (not just deprecates)
std::auto_ptr<>. Trying to use CXXSTD of c++03 doesn't work with
libc++ headers, but c++11 does.
Warner Losh [Sun, 4 Dec 2022 23:22:43 +0000 (16:22 -0700)]
newbus: Change attach failure behavior
In the rare case that we succeed in probing, but fail to attach, flip
the default to be to disable the device.
hw.bus.disable_failed_devices=false is now required to restore
the old behavior. The old behavior dates from a time when dynamic
control of devices wasn't yet present (devctl didn't exist). Now that
one can retry probe/attach of the device with devctl, the old default no
longer makes sense: the more desirable behavior is to have stable device
numbers when one has several instances of the same device in a system
(common for NICs or HBAs).
Warner Losh [Sun, 4 Dec 2022 23:20:24 +0000 (16:20 -0700)]
newbus: Create a knob to disable devices that fail to attach.
Normally, when a device fails to attach, we tear down the newbus state
for that device so that future driver loads can try again (maybe with a
different driver, or maybe with a re-loaded and fixed kld).
Sometimes, however, it is desirable to have the device fail
permanently. We do this by calling device_disable() on a failed
attach, as well as keeping the device in DS_ATTACHING forever. This
prevents retries on that device. This is enabled via
hw.bus.disable_failed_devices=1 in either a hint via the loader, or at
runtime with a sysctl setting. Setting from 1 -> 0 at runtime will not
affect previously disabled devices: they remain disabled.
They can be re-enabled manually with devctl enable, however.
The comment indicated -Wno-deprecated-declarations was used to avoid
warnings about deprecated auto_ptr and various deprecated function
objects from <functional>. libdevdctl (now) does not use auto_ptr,
so don't mention it in the comment.
Warner Losh [Sun, 4 Dec 2022 05:46:21 +0000 (22:46 -0700)]
stand: aarch64 has different nlinks than amd64
Some typedefs are system dependent, so move them into stat_arch.h where
they are used. On amd64, nlinks is a int64_t, while on aarch64 it's an
int (or int32_t).
Alexander Motin [Sat, 3 Dec 2022 17:05:05 +0000 (12:05 -0500)]
CTL: Allow userland to supply tags via ioctl frontend.
Before this, the ioctl frontend always replaced tags with sequential ones.
This was done for ctladm, which cannot keep track of the global tag list.
But in case of virtio-scsi in bhyve we can pass provided tags as-is.
It should be on virtio-scsi initiator to provide us valid tags. It
should allow proper task management, error reporting, etc. In case
of several virtio-scsi devices, they should use different CTL ports
or initiator IDs to avoid conflicts, but this is expected by design.
Alexander Motin [Sat, 3 Dec 2022 15:23:29 +0000 (10:23 -0500)]
CTL: Increase maximum SCSI tag size from 32 to 64 bits.
SAM-5 specification states maximum size of command identifier (tag),
defined by specific transports, should not be larger than 64 bits.
While most of supported transports use 32 bits or less, it was
reported that virtio-scsi uses 64 bits. Truncation to 32 bits in
bhyve code caused false tag conflict errors reported and possibly
other issues.
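How truncation produces a false tag conflict can be shown with a small sketch. The `demo_` tag table below is hypothetical and much simpler than CTL's bookkeeping; the `truncate` flag models the old 32-bit comparison.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define	DEMO_MAX_TAGS	8

/* Hypothetical table of tags for commands currently in flight. */
struct demo_tag_table {
	uint64_t tags[DEMO_MAX_TAGS];
	size_t count;
};

/*
 * Add a tag unless it conflicts with one already in flight.
 * With `truncate` set, tags are compared after narrowing to 32 bits,
 * which models the old behavior.
 */
static bool
demo_tag_add(struct demo_tag_table *t, uint64_t tag, bool truncate)
{
	assert(t->count < DEMO_MAX_TAGS);
	for (size_t i = 0; i < t->count; i++) {
		bool same = truncate ?
		    (uint32_t)t->tags[i] == (uint32_t)tag :
		    t->tags[i] == tag;
		if (same)
			return (false);	/* reported as a tag conflict */
	}
	t->tags[t->count++] = tag;
	return (true);
}
```

Two tags that differ only in their upper 32 bits are distinct commands, yet the truncating comparison reports them as a conflict.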
This changes CTL ABI and HA protocol, so CTL_HA_VERSION is bumped.
While we make HA protocol incompatible, increase default maximum
number of ports in CTL from 256 to 1024, matching number of LUNs.
There are many reports from people who need many iSCSI targets with
only one LUN each. Increased memory consumption should be less of
a problem these days.
George Wilson [Sat, 3 Dec 2022 01:46:29 +0000 (19:46 -0600)]
zio can deadlock during device removal
When doing a device removal on a pool with gang blocks, the zio pipeline
can deadlock when trying to free blocks from a device which is being
removed with a stack similar to this:
In the illustration above we are processing frees but because of gang
blocks we have to read the constituent blocks. Once we finish the READ
in the zio pipeline we will execute the parent. In this case the parent
is a FREE but the zio taskq is a READ and we continue to process the
pipeline leading to the stack above. In the stack above, we are blocked
waiting for the svr_lock so as a result a READ interrupt taskq thread
is now consumed. Eventually, all of the READ taskq threads end up
blocked and we're unable to complete any read requests.
In zio_notify_parent there is an optimization to continue to use
the taskq thread to execute the parent's pipeline. To resolve the
deadlock above, we only allow this optimization if the parent's
zio type matches the child which just completed.
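The rule in the last paragraph amounts to a single type comparison. The sketch below uses hypothetical `demo_` types rather than the real zio structures, and only shows the decision itself, not the pipeline dispatch around it.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum demo_zio_type { DEMO_ZIO_READ, DEMO_ZIO_WRITE, DEMO_ZIO_FREE };

/* Hypothetical, heavily reduced zio. */
struct demo_zio {
	enum demo_zio_type type;
	struct demo_zio *parent;
};

/*
 * Decide whether the thread that just completed `child` may keep
 * running the parent's pipeline. Only allowed when the types match,
 * so a READ taskq thread never ends up executing a FREE.
 */
static bool
demo_can_run_parent_inline(const struct demo_zio *child)
{
	assert(child != NULL);
	return (child->parent != NULL && child->parent->type == child->type);
}
```

When the check fails, the parent would instead be handed to a taskq of its own type, which is what keeps the READ threads free to complete reads.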
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Signed-off-by: George Wilson <gwilson@delphix.com>
External-issue: DLPX-80130
Closes #14236
George Wilson [Sat, 3 Dec 2022 01:45:33 +0000 (19:45 -0600)]
nopwrites on dmu_sync-ed blocks can result in a panic
After a device has been removed, any nopwrites for blocks on that
indirect vdev should be ignored and a new block should be allocated. The
original code attempted to handle this but used the wrong block pointer
when checking for indirect vdevs and failed to check all DVAs.
This change corrects both of these issues and modifies the test case
to ensure that it properly tests nopwrites with device removal.
Reviewed-by: Prakash Surya <prakash.surya@delphix.com>
Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: George Wilson <gwilson@delphix.com>
Closes #14235
Rob Wing [Mon, 14 Nov 2022 07:57:53 +0000 (07:57 +0000)]
ZTS: test reported checksum errors for ZED
Test checksum error reporting to ZED via the call paths
vdev_raidz_io_done_unrecoverable() and zio_checksum_verify().
Sponsored-by: Seagate Technology LLC
Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rob Wing <rob.wing@klarasystems.com>
Closes #14190
Rob Wing [Mon, 14 Nov 2022 07:40:38 +0000 (07:40 +0000)]
Bump checksum error counter before reporting to ZED
The checksum error counter is incremented after reporting to ZED. This
leads ZED to receiving a checksum error report with 0 checksum errors.
To avoid this, bump the checksum error counter before reporting to ZED.
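The ordering issue can be sketched with a hypothetical `demo_vdev` that remembers the count it last reported; none of these names come from the ZFS sources.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical vdev with an error counter and a record of the last report. */
struct demo_vdev {
	uint64_t checksum_errors;
	uint64_t last_reported;
};

/* Stand-in for posting an event: the report carries the current count. */
static void
demo_report(struct demo_vdev *vd)
{
	vd->last_reported = vd->checksum_errors;
}

/* Bump the counter first so the report never carries a stale count. */
static void
demo_checksum_error(struct demo_vdev *vd)
{
	assert(vd != NULL);
	vd->checksum_errors++;
	demo_report(vd);
}
```

With the two statements in the opposite order, the first report would carry a count of 0 even though an error had just occurred.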
Sponsored-by: Seagate Technology LLC
Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rob Wing <rob.wing@klarasystems.com>
Closes #14190
Xinliang Liu [Sat, 3 Dec 2022 01:39:48 +0000 (09:39 +0800)]
autoconf: add support for openEuler
Add config support for openEuler, so that it sets the right sysconfig
dir for openEuler.
And DEFAULT_INIT_SCRIPT is no longer needed since commit "2a34db1bd
Base init scripts for SYSV systems".
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Xinliang Liu <xinliang.liu@linaro.org>
Closes #14241
libpmc used -Wno-deprecated-declarations to silence warnings about usage
of deprecated std::auto_ptr, but there is (now) no use of auto_ptr in
libpmc.
Reviewed by: mhorne
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D37576
Gleb Smirnoff [Fri, 2 Dec 2022 22:10:55 +0000 (14:10 -0800)]
Retire trpt(8).
trpt(8) was a utility to pull TCP debugging data from the kernel,
dating back to 4.2BSD. It is not used nowadays by TCP
developers. We have more powerful debugging facilities, e.g.
DTrace probes, TCP black box logging and siftr.
Discussed with: rscheff, tuexen, rrs, jtl and others
Store user-supplied source protocol in the nexthops and nexthop groups.
Protocol specification helps routing daemons like bird to quickly
identify self-originated routes after a crash or restart.
Example:
```
10.2.0.0/24 via 10.0.0.2 dev vtnet0 proto bird
10.3.0.0/24 proto bird
nexthop via 10.0.0.2 dev vtnet0 weight 3
nexthop via 10.0.0.3 dev vtnet0 weight 4
```
Warner Losh [Fri, 2 Dec 2022 19:41:01 +0000 (12:41 -0700)]
kboot: Use unsigned long long.
For the 64-bit platforms, this is a nop. Currently kboot only supports
64-bit platforms, though. If we support 32-bit in the future, this will
become important.
Warner Losh [Fri, 2 Dec 2022 18:28:08 +0000 (11:28 -0700)]
kboot: Enhance hostdisk
Added missing functionality to allow us to boot off of things like
/dev/nvme0n1p2 successfully. And to list all available devices and
partitions with 'lsdev'.
Warner Losh [Fri, 2 Dec 2022 18:10:42 +0000 (11:10 -0700)]
kboot: amd64 use /sys/firmware/memmap to find free memory
Use the system's firmware memory map to find a good place to put the
kernel that won't stomp on anything else. While this uses ostensibly MI
interfaces to get this data, arm64 doesn't have this, nor does
powerpc64, so place it here.
Warner Losh [Fri, 2 Dec 2022 17:47:32 +0000 (10:47 -0700)]
full-test: Start of a full testing suite.
full-test.sh aims to be a test suite generator for the boot loader. It
tries to grab artifacts from the web and then constructs minimal boot
environments from them, as well as writing scripts that use
qemu-system-* to facilitate testing all the ways we can boot... At least
all the ways that we can boot that qemu can emulate.
This is very much a work in progress, and likely could use a good
cleanup at some point.
Warner Losh [Fri, 2 Dec 2022 17:47:22 +0000 (10:47 -0700)]
devd: Warn for deprecated 'kern' system type
One year ago, I deprecated 'kern' in favor of 'kernel' for the system
name for some power events. I'm about to remove it from the kernel, but
realized there's been no warning generated for users. Preserve POLA by
converting on the fly here and issuing a warning for 14.x, and a fatal
error after we branch 15. Make compiling it an error on 16 to remove
the gross hack after we branch.
Cy Schubert [Thu, 1 Dec 2022 00:11:18 +0000 (16:11 -0800)]
heimdal: Fix bus fault when zero-length request received
Zero length client requests result in a bus fault when attempting to
free malloc()ed pointers within the request's softc. Return an error
when the request is zero length.
PR: 268062
Reported by: Robert Morris <rtm@lcs.mit.edu>
MFC after: 3 days
linuxkpi: Add `seqcount_mutex_t` support in <linux/seqlock.h>
To achieve that, the header uses the C11 type generic selection keyword
_Generic() because the macros are supposed to work with seqcount_t
and seqcount_mutex_t.
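A minimal sketch of type-based dispatch with `_Generic`, assuming two hypothetical types that stand in for the two seqcount flavors. The names and the read operation are illustrative only and do not reproduce the header.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the two seqcount flavors. */
typedef struct { unsigned seq; } demo_seqcount_t;
typedef struct { demo_seqcount_t seqc; /* mutex omitted */ } demo_seqcount_mutex_t;

static unsigned
demo_read_plain(const demo_seqcount_t *s)
{
	assert(s != NULL);
	return (s->seq);
}

static unsigned
demo_read_mutex(const demo_seqcount_mutex_t *s)
{
	assert(s != NULL);
	return (s->seqc.seq);
}

/* One macro name; the argument's pointer type selects the function. */
#define	demo_read_seqcount(s)	_Generic((s),				\
	demo_seqcount_t *: demo_read_plain,				\
	const demo_seqcount_t *: demo_read_plain,			\
	demo_seqcount_mutex_t *: demo_read_mutex,			\
	const demo_seqcount_mutex_t *: demo_read_mutex)(s)
```

The selection happens at compile time, so passing a pointer of any other type is a compile error rather than a runtime surprise.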
Jakub Kolodziej [Thu, 1 Dec 2022 08:02:52 +0000 (09:02 +0100)]
lm75: Refactor code to fix io error
Use the correct resolution from the compat table. If the dtb is not defined, use the default 9-bit mode.
11-bit detection is attempted if 9-bit mode is used.
A sysctl resolution variable is added so the resolution can be changed if needed.
Some sensors didn't pull ACK while reading from nonexistent registers, which caused an I2C
read error and a detection failure, so a detection failure no longer breaks the driver.
lib/googletest used -Wno-deprecated-declarations to silence warnings
about usage of deprecated std::auto_ptr, but auto_ptr is not (now) used
anywhere in googletest.
Reviewed by: jhb
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D37561
Warner Losh [Wed, 30 Nov 2022 23:28:51 +0000 (16:28 -0700)]
arm64/machdep: Reserve memory when we find Linux EFI reserved memory table
When Linux loads a new kernel via kexec, sometimes it must reserve memory
for devices that are still active (and typically can't be reset or
shutdown). When present, this table is a linked list of ranges that are
still in use that the OS must avoid using.
Mark these areas as reserved.
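Walking such a linked list of ranges might look like the sketch below. It is an illustration under loud assumptions: the `demo_` structures are hypothetical, nodes are linked with ordinary pointers, and the callback stands in for whatever marks a range as reserved. Locating and mapping the real table is omitted entirely.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical in-use range. */
struct demo_range {
	uint64_t base;
	uint64_t size;
};

/* Hypothetical list node holding a batch of ranges. */
struct demo_memreserve {
	size_t count;
	const struct demo_range *entries;
	const struct demo_memreserve *next;
};

typedef void (*demo_reserve_fn)(uint64_t base, uint64_t size, void *arg);

/* Visit every range in every node; returns the number of ranges seen. */
static size_t
demo_walk_reserved(const struct demo_memreserve *head, demo_reserve_fn fn,
    void *arg)
{
	size_t total = 0;

	assert(fn != NULL);
	for (; head != NULL; head = head->next) {
		for (size_t i = 0; i < head->count; i++) {
			fn(head->entries[i].base, head->entries[i].size, arg);
			total++;
		}
	}
	return (total);
}

/* Example callback: add up the number of reserved bytes. */
static void
demo_sum_bytes(uint64_t base, uint64_t size, void *arg)
{
	(void)base;
	*(uint64_t *)arg += size;
}
```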
This is part of the GICv3 workaround code where we must use the PA
addresses already programmed into the GICv3 when we take over. This part
ensures we don't allocate the memory for anything else.