Bjoern A. Zeeb [Thu, 5 May 2022 20:43:34 +0000 (20:43 +0000)]
LinuxKPI: skbuff: add memlimit tunable for 64bit systems
Some drivers, such as Realtek's rtw88, require 32bit DMA in
a single segment. busdma(9) has a hard time providing this
currently for 3-ish pages at large quantities
(see lkpi_pci_nseg1_fail in linux_pci.c e86707418c8e8).
Work around this for now by allowing a tunable to enforce
physical addresses allocation limits on 64bit platforms (ignoring PAE)
using "old-school" contigmalloc(9) to avoid bouncing.
A patch needing a custom kernel compiled was tested in the last weeks
by rtw88 users providing the 32bit limit only hardcoded. The 36bit
limit can be found in iwlwifi so is added as a testing option along.
This is put in as a bandaid for now, so people no longer need to patch
and compile their own kernels to use rtw88 and to allow us to MFC the
driver as well before the amounts of commits to track increases by
much more.
Sponsored by: The FreeBSD Foundation
MFC after: 3 days
Kristof Provost [Thu, 27 Jan 2022 21:01:09 +0000 (22:01 +0100)]
mbuf: do not restore dying interfaces
When we remove an interface it is first removed from the interface list
V_ifnet (by if_unlink_ifnet()) and marked as IFF_DYING. We then wait for
any possible references to stop being used (i.e.
epoch_wait/epoch_drain_callbacks) before we tear it fully down.
However, the index in ifindex_table is not removed, so m_rcvif_restore()
can still find the (now dying) interface.
This results in panics, for example when dummynet restores the rcvif
pointer and passes a packet to ip6_input() we can panic because the
AF_INET6 domain has already been removed (so we end up dereferencing a
NULL pointer there).
Check that the interface is not dying before we restore it, which is
equivalent to checking its presence in V_ifnet, and thus ensures that
future accesses (while in NET_EPOCH) are safe.
Gleb Smirnoff [Thu, 27 Jan 2022 05:58:50 +0000 (21:58 -0800)]
dummynet: use m_rcvif_serialize/restore when queueing packets
This fixed panic with interface being removed while packet
was sitting on a queue. This allows to pass all dummynet
tests including forthcoming dummynet:ipfw_interface_removal
and dummynet:pf_interface_removal and demonstrates use of
m_rcvif_serialize() and m_rcvif_restore().
Gleb Smirnoff [Thu, 27 Jan 2022 05:58:44 +0000 (21:58 -0800)]
ifnet: make if_index global
Now that ifindex is static to if.c we can unvirtualize it. For lifetime
of an ifnet its index never changes. To avoid leaking foreign interfaces
the net.link.generic.system.ifcount sysctl and the ifnet_byindex() KPI
filter their returned value on curvnet. Since if_vmove() no longer
changes the if_index, inline ifindex_alloc() and ifindex_free() into
if_alloc() and if_free() respectively.
API wise the only change is that now minimum interface index can be
greater than 1. The holes in interface indexes were always allowed.
Jessica Clarke [Thu, 5 May 2022 18:07:54 +0000 (19:07 +0100)]
release: Use full window size for installer over serial lines
When running over a serial line we end up defaulting to 80x24, which is
rather cramped for many dialog boxes and occupies very little screen
space for most modern terminals. Thus, run resizewin -z to set the
terminal size if not already known before starting the installer, just
as we do for csh and sh login shells already in their default dotfiles.
Kristof Provost [Thu, 5 May 2022 07:21:32 +0000 (09:21 +0200)]
pf: clear PF_TAG_DUMMYNET for dummynet fast path
ip_dn_io_ptr() (i.e. dummynet_io()) can return the mbuf immediately (as
opposed to owning it and later passing it through dummynet_send(), which
returns it to pf_test()). In that case we must clear the PF_TAG_DUMMYNET
flag to ensure we don't skip any subsequent firewall passes.
This can happen if we process a packet in PFIL_IN, set PF_TAG_DUMMYNET
on it, pass it to ip_dn_io_ptr() but have it returned immediately. The
packet continues its normal path, eventually hitting
pf_test(dir=PFIL_OUT), where we'd skip when we're not supposed to.
Warner Losh [Thu, 5 May 2022 02:28:00 +0000 (20:28 -0600)]
iosched: remove stray debug
This printf was designed to catch misqueued bio requests. Prior to
supporting read_bias == 0, we couldn't get anything but reads and writes
in this queue. However, for read_bias == 0 we queue everything except
BIO_DELETE to this queue, so remove the printf. We don't need to update
any statistics.
Rick Macklem [Wed, 4 May 2022 20:58:22 +0000 (13:58 -0700)]
nfsd: Add a sanity check for Owner/OwnerGroup string length
Robert Morris reported that, if a client sends an absurdly
large Owner/OwnerGroup string, the kernel malloc() for the
large size string can block forever.
This patch adds a sanity limit for Owner/OwnerGroup string
length. Since the RFCs do not specify any limit and FreeBSD
can handle a group name greater than 1Kbyte, the limit is
set at a generous 10Kbytes.
Rick Macklem [Wed, 4 May 2022 20:52:33 +0000 (13:52 -0700)]
nfsd: Fix handling of Open/Create for the pNFS server
When the MDS of a pNFS service receives an Open/Create
and the file already exists, it must do a Setattr of
size == 0. Without this patch, this was eroneously
done via a VOP_SETAATR() call, which would set the
length of the MDS file to 0 (which is already is,
since all data lives on the DSs).
This patch fixes the problem by doing a nfsvno_setattr()
instead of VOP_SETATTR(), which knows to do a proxied
Setattr on the DSs.
For a non-pNFS server, the change has no effect, since
nfsvno_setattr() only does a VOP_SETATTR() for that case.
This was found during a recent IETF NFSv4 testing event.
John Baldwin [Wed, 4 May 2022 20:08:36 +0000 (13:08 -0700)]
OpenSSL: KTLS: Enable KTLS for receiving as well in TLS 1.3
This removes a guard condition that prevents KTLS being enabled for
receiving in TLS 1.3. Use the correct sequence number and BIO for
receive vs transmit offload.
John Baldwin [Wed, 4 May 2022 20:08:27 +0000 (13:08 -0700)]
OpenSSL: KTLS: Handle TLS 1.3 in ssl3_get_record.
- Don't unpad records, check the outer record type, or extract the
inner record type from TLS 1.3 records handled by the kernel. KTLS
performs all of these steps and returns the inner record type in the
TLS header.
- When checking the length of a received TLS 1.3 record don't allow
for the extra byte for the nested record type when KTLS is used.
- Pass a pointer to the record type in the TLS header to the
SSL3_RT_INNER_CONTENT_TYPE message callback. For KTLS, the old
pointer pointed to the last byte of payload rather than the record
type. For the non-KTLS case, the TLS header has been updated with
the inner type before this callback is invoked.
John Baldwin [Wed, 4 May 2022 20:08:17 +0000 (13:08 -0700)]
OpenSSL: KTLS: Add using_ktls helper variable in ssl3_get_record().
When KTLS receive is enabled, pending data may still be present due to
read ahead. This data must still be processed the same as records
received without KTLS. To ease readability (especially in
consideration of additional checks which will be added for TLS 1.3),
add a helper variable 'using_ktls' that is true when the KTLS receive
path is being used to receive a record.
John Baldwin [Wed, 4 May 2022 20:08:03 +0000 (13:08 -0700)]
OpenSSL: KTLS: Check for unprocessed receive records in ktls_configure_crypto.
KTLS implementations currently assume that the start of the in-kernel
socket buffer is aligned with the start of a TLS record for the
receive side. The socket option to enable KTLS specifies the TLS
sequence number of this initial record.
When read ahead is enabled, data can be pending in the SSL read buffer
after negotiating session keys. This pending data must be examined to
ensurs that the kernel's socket buffer does not contain a partial TLS
record as well as to determine the correct sequence number of the
first TLS record to be processed by the kernel.
In preparation for enabling receive kernel offload for TLS 1.3, move
the existing logic to handle read ahead from t1_enc.c into ktls.c and
invoke it from ktls_configure_crypto().
John Baldwin [Wed, 4 May 2022 20:07:36 +0000 (13:07 -0700)]
OpenSSL: Cleanup record length checks for KTLS
In some corner cases the check for packets
which exceed the allowed record length was missing
when KTLS is initially enabled, when some
unprocessed packets are still pending.
Marko Zec [Wed, 4 May 2022 04:19:46 +0000 (06:19 +0200)]
tests: vnet tests started failing in CI, disable temporarily
As a fallout of backing out 91f44749c6fe, vnet tests started
failing in CI. Temporarily broadly disable vnet tests until
specific cases can be resolved, and file a bug.
Devirtualization of V_if_index and V_ifindex_table was rushed into
the tree lacking proper context, discussion, and declaration of intent,
so I'm backing it out as harmful to VNET on the following grounds:
1) The change repurposed the decades-old and stable if_index KBI for
new, unclear goals which were omitted from the commit note.
2) The change opened up a new resource exhaustion vector where any vnet
could starve the system of ifnet indices, including vnet0.
3) To circumvent the newly introduced problem of separating ifnets
belonging to different vnets from the globalized ifindex_table, the
author introduced sysctl_ifcount() which does a linear traversal over
the (potentially huge) global ifnet list just to return a simple upper
bound on existing ifnet indices.
4) The change effectively led to nonuniform ifnet index allocation
among vnets.
5) The commit note clearly stated that the patch changed the implicit
if_index ABI contract where ifnet indices were assumed to be starting
from one. The commit note also included a correct observation that
holes in interface indices were always allowed, but failed to declare
that the userland-observable ifindex tables could now include huge
empty spans even under modest operating conditions.
6) The author had an earlier proposal in the works which did not
affect per-vnet ifnet lists (D33265) but which he abandoned without
providing the rationale behind his decision to do so, at the expense
of sacrificing the vnet isolation contract and if_index ABI / KBI.
Furthermore, the author agreed to back out his changes himself and
to follow up with a proposal for a less intrusive alternative, but
later silently declined to act. Therefore, I decided to resolve the
status-quo by backing this out myself. This in no way precludes a
future proposal aiming to mitigate ifnet-removal related system
crashes or panics to be accepted, provided it would not unnecessarily
compromise the goal of as strict as possible isolation between vnets.
Rick Macklem [Tue, 3 May 2022 14:22:15 +0000 (07:22 -0700)]
nfscl: Acquire a refcount on "cred" for mirrored pNFS RPCs
When the NFSv4.1/4.2 client is doing a pnfs mount to
mirrored DS(s), asynchronous threads are used to do the
RPCs against the DS(s) concurrently. If a DS is slow
to reply, it is possible for the "cred" to be free'd
before the asynchronous thread is done with it, causing
a panic/crash.
This patch fixes the problem by acquiring a refcount on
the "cred" while it is being used by the asynchronous thread
for a DS RPC. This bug was found during a recent IETF
NFSv4 testing event.
This bug only affects "pnfs" mounts to mirrored pNFS
servers.
Corvin Köhne [Tue, 3 May 2022 14:01:22 +0000 (16:01 +0200)]
bsdinstall/script: umount before zpool export
When running zpool export first, boot/efi and dev is still mounted so
zpool export fails. By running bsdinstall umount first the pool can be
cleanly exported.
Reviewed by: emaste
Differential Revision: https://reviews.freebsd.org/D35114
Sponsored by: Beckhoff Automation GmbH & Co. KG
MFC After: 3 days
Corvin Köhne [Tue, 3 May 2022 14:00:09 +0000 (16:00 +0200)]
bsdinstall: stop messing with file descriptors
Throughout the bsdinstall script fd 3 is used by f_dprintf (set through
$TERMINAL_STDOUT_PASSTHRU). By closing file descriptor 3 here, the
final f_dprintf "Installation Completed ... does not work anymore.
By putting the code into a subshell, file descriptors can be edited
without interference with the calling script.
Reviewed by: emaste
Differential Revision: https://reviews.freebsd.org/D35113
Sponsored by: Beckhoff Automation GmbH & Co. KG
MFC after: 3 days
mlx5en(4): Use hard-coded 4K page size for RQ/SQ/CQ.
The page size specified for RQ, SQ and CQ is always in units of 4KBytes.
Make sure we subtract MLX5_ADAPTER_PAGE_SHIFT, 12, instead of PAGE_SHIFT
which may vary. This fixes support for using the mlx5en driver on systems
having non-4K page size.
build target triple variable from sys/conf/newvers.sh
Retrieve FreeBSD revision number directly from sys/conf/newvers.sh
when building the compiler target triple value, avoiding manual
intervention on other files every new release.
Reviewed by: imp
MFC after: 2 months
Sponsored by: Instituto de Pesquisas Eldorado (eldorado.org.br)
Differential Revision: https://reviews.freebsd.org/D34429
Andrew Turner [Thu, 7 Apr 2022 16:24:46 +0000 (17:24 +0100)]
Remove PAGE_SIZE from libthr
In libthr we use PAGE_SIZE when allocating memory with mmap and to check
various structs will fit into a single page so we can use this allocator
for them.
Ask the kernel for the page size on init for use by the page allcator
and add a new machine dependent macro to hold the smallest page size
the architecture supports to check the structure is small enough.
This allows us to use the same libthr on arm64 with either 4k or 16k
pages.
Reviewed by: kib, markj, imp
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D34984
Rick Macklem [Mon, 2 May 2022 19:45:42 +0000 (12:45 -0700)]
nfsd: Fix session slot freeing for NFSv4.1/4.2
Without this patch the NFSv4.1/4.2 server erroneously
always frees session slot zero for callbacks. This only
affects 4.1/4.2 mounts if the server has delegations
enabled or is a pNFS configuration. Even for those
cases, the effect is mainly to only use slot 0 for
callbacks, serializing all of them. There is a slight
chance that callbacks will fail if the client performs
them in a different order than received on the TCP
connection.
If this bug affects your server, you will see console
messages like:
newnfs_request: Bad session slot
This patch fixes the problem. Found during a recent
IETF NFSv4 testing event.
New sysctl allows to mark transmitted PPPoE LCP Control
ethernet frames with needed 3-bit Priority Code Point (PCP) value.
Confirming driver like if_vlan(4) uses the value to fill
IEEE 802.1p class of service field.
This is similar to Cisco IOS "control-packets vlan cos priority"
command.
It helps to avoid premature disconnection of user sessions
due to control frame drops (LCP Echo etc.)
if network infrastructure has a botteleck at a switch
or the xdsl DSLAM.
See also:
https://sourceforge.net/p/mpd/discussion/44692/thread/c7abe70e3a/
Tested by: Klaus Fokuhl at SourceForge
MFC after: 2 weeks
Warner Losh [Sun, 1 May 2022 18:04:13 +0000 (12:04 -0600)]
cam_xpt: Prefer bool to int where it's a boolean
In the places where we set an integer to 0 or 1 and then use it like a
boolean, replace int with bool and 0/1 with false/true. Left alone
places where this is a function argument or return value. No functional
changes intended.
Warner Losh [Sun, 1 May 2022 17:18:27 +0000 (11:18 -0600)]
cam: add hw.cam.iosched.read_bias
Allow a global setting for the read_bias for the dynamic io
scheduler. This allows global policy to be set, in addition to the
existing per-drive policy. kern.cam.iosched.read_bias is a new tunable.
Warner Losh [Sun, 1 May 2022 17:18:23 +0000 (11:18 -0600)]
cam iosched: default to no read bias in dynamic ioscheduling
When we're doing dynamic I/O scheduling, don't default to a read bias of
100. Default it to 0 so turning on dynamic scheduling only does
scheduling tweaks that are requested. The other limiters are off by
default, and need no further adjustment.
Warner Losh [Sun, 1 May 2022 17:18:18 +0000 (11:18 -0600)]
cam iosched: Remove write bias when read bias = 0
Change the meaning of read bias == 0 in the dynamic I/O scheduler. Prior
to this change, a read bias of 0 would mean prefer writes. Now, when
read bias is 0, we queue all requests to the same queue removing the
bias. When it's non-zero, we still separate the queues we use so we can
bias reads vs writes for workloads that are read centric. These changes
restore the typical bias you get from disksort or ordered insertion at
the end of the list.
Warner Losh [Sun, 1 May 2022 17:13:18 +0000 (11:13 -0600)]
stand: Initial kboot support on amd64
Get amd64 compiling. However, the current kboot supports an old way of
enumerating memory and the new way needs to be incorporated as well. The
powerpc folks could use either, it seems and newer powerpc platforms
need some changes for kboot to work anyway.
This commit includes the linker script, trampoline code to start the new
kernel, Linux system calls and the necessary configuration glue needed
to build the binaries.
This includes a quick hack to get multiboot support, but we need to
really share these defines. The multiiboot2.h is the minimum needed to
build. We have multiboot information in three places now, so a
refactoring is in order.
This should be considered, at best, preliminary and experimental for
anybody wishing to try it out.
Warner Losh [Sun, 1 May 2022 16:39:04 +0000 (10:39 -0600)]
ada: Retry commands with retries left on CAM_SEL_TIMEOUT
The AHCI and ATA SIMs will return CAM_SEL_TIMEOUT when an underlying
device has stopped responding. This is usually seen after a timeouted
out command and can be a transient event. Rather than fail the
peripheral immediately after seeing this, queue a retry. For transient
events, this allows drives to continue to provide data, though with some
added latency, just like we do when we have some other kind of retriable
error. If the error isn't transient (the drive is truly gone), then
we'll discover that eventually and fail the transaction and invalidate
the drive like we do today.
This helps us avoid a panic at the end of camperiphfree when
CAM_PERIPH_NEW_DEV_FOUND is set. However, the deferred callback should
be queued to xpt_async_td instead of being made inline there. This issue
will be solved in a different patch that does that. PR 263703.
This also helps us avoid another bug where we can drop all references to
the device (causing us to go through camperiphfree and destroy the path)
while we have an I/O pending in the ata_da state machine (usually in
state ADA_STATE_RAHEAD with ATA_SETFEATURES ATA_SF_ENAB_RCACHE
command). It's not clear why the reference that we take out to do the
reprobe isn't effective at blocking this. By retrying this condition,
though we avoid this bug (at least more often, I don't have a good
reproduction test case, I just see this panic a few times a month at
work on systems that have transient disk errors on ahci connected SATA
SSDs). PR 263704. It's too soon to know how much this helps us avoid
this bug.
Eugene Grosbein [Sun, 1 May 2022 16:34:08 +0000 (23:34 +0700)]
ng_pppoe: introduce new sysctl net.graph.pppoe.lcp_pcp
New sysctl allows to mark transmitted PPPoE LCP Control
ethernet frames with needed 3-bit Priority Code Point (PCP) value.
Confirming driver like if_vlan(4) uses the value to fill
IEEE 802.1p class of service field.
This is similar to Cisco IOS "control-packets vlan cos priority"
command.
It helps to avoid premature disconnection of user sessions
due to control frame drops (LCP Echo etc.)
if network infrastructure has a botteleck at a switch
or the xdsl DSLAM.
See also:
https://sourceforge.net/p/mpd/discussion/44692/thread/c7abe70e3a/
Tested by: Klaus Fokuhl at SourceForge
MFC after: 2 weeks
Rick Macklem [Sat, 30 Apr 2022 20:49:23 +0000 (13:49 -0700)]
nfscl: Add support for a NFSv4 AppendWrite RPC
For IO_APPEND VOP_WRITE()s, the code first does a
Getattr RPC to acquire the file's size, before it
can do the Write RPC.
Although NFS does not have an append write operation,
an NFSv4 compound can use a Verify operation to check
that the client's notion of the file's size is
correct, followed by the Write operation.
This patch modifies the NFSv4 client to use an Appendwrite
RPC, which does a Verify to check the file's size before
doing the Write. This avoids the need for a Getattr RPC
to preceed this RPC and reduces the RPC count by half for
IO_APPEND writes, so long as the client knows the file's
size.
The nfsd structure was moved from the stack to be malloc()'d,
since the kernel stack limit was being exceeded.
While here, fix the types of a few variables, although
there should not be any semantics change caused by these
type changes.
Ed Maste [Sat, 30 Apr 2022 19:25:35 +0000 (15:25 -0400)]
Update WITH_PROFILE description as it is yet removed
GCC still wants to link against (for example) libc_p.a when -pg is in
use, and it's unclear when and how this will be addressed. Change the
WITH_PROFILE option description to claim that it may be removed from an
unspecified future version of FreeBSD, rather than FreeBSD 14.
Reported by: Steve Kargl
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
Warner Losh [Sat, 30 Apr 2022 18:51:19 +0000 (12:51 -0600)]
stand: Install libsa.3
Turns out there is a libsa.3. It's a bit out of date, but we reference
it in a number of places so we should install it. We need to do the DO32
dance because this Makefile is included twice and we don't want it
installing twice.
After an installation restart (for error or choice) dhclient does not
rebuild resolv.conf so `dialog --mixedform' of "Resolver Configuration"
in bsdinstall/scripts/netconfig draws empty forms. It causes a bad UX,
to see PR262262. Fixed resetting the interface before to run dhclient.