Gleb Smirnoff [Sat, 5 Feb 2022 21:25:38 +0000 (13:25 -0800)]
dmesg: detect wrapped msgbuf on the kernel side and if so, skip first line
Since 59f256ec35d3, dmesg(8) always skips the first line of the message
buffer, because it might be incomplete. The problem is that in most cases
it is complete, valid and contains the "---<<BOOT>>---" marker. This
skip can be disabled with '-a', but that would also unhide all non-kernel
messages. Move this functionality from dmesg(8) to the kernel, since the
kernel actually knows whether a wrap has happened or not.
The main motivation for the change is not actually the value of the
"---<<BOOT>>---" marker. The real problem is that the skip breaks unit
tests that clear the message buffer, perform a test and then check the
message buffer for a result. An example of such a test is
sys/kern/sonewconn_overflow.
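A minimal sketch of the idea, using a toy ring buffer rather than the kernel's real msgbuf(9) code: the writer remembers whether it has ever overwritten old data, and the reader drops the possibly truncated first line only in that case. The names `ring`, `ring_put` and `ring_read` and the 16-byte size are hypothetical, chosen for illustration.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define RING_SIZE 16	/* hypothetical, tiny for illustration */

/* Hypothetical simplified message buffer; not the kernel's struct msgbuf. */
struct ring {
	char	buf[RING_SIZE];
	size_t	wpos;		/* total bytes ever written */
	bool	wrapped;	/* true once old data has been overwritten */
};

static void
ring_put(struct ring *r, const char *s)
{
	for (; *s != '\0'; s++) {
		if (r->wpos >= RING_SIZE)
			r->wrapped = true;	/* about to overwrite the oldest byte */
		r->buf[r->wpos % RING_SIZE] = *s;
		r->wpos++;
	}
}

/*
 * Copy the contents, oldest first, into out (at least RING_SIZE + 1 bytes).
 * Only if the writer wrapped can the first line be a tail fragment, so only
 * then skip past the first newline.
 */
static void
ring_read(const struct ring *r, char *out)
{
	size_t len = r->wpos < RING_SIZE ? r->wpos : RING_SIZE;
	size_t start = r->wpos - len;
	size_t i = 0, n = 0;

	if (r->wrapped) {
		while (i < len && r->buf[(start + i) % RING_SIZE] != '\n')
			i++;
		if (i < len)
			i++;		/* also skip the newline itself */
	}
	for (; i < len; i++)
		out[n++] = r->buf[(start + i) % RING_SIZE];
	out[n] = '\0';
}
```

Without a wrap, a leading marker line survives intact; after a wrap, everything up to the first newline is discarded.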
Stefan Eßer [Sat, 5 Feb 2022 12:33:53 +0000 (13:33 +0100)]
libc: add helper function to set sysctl() user.* variables
Testing had revealed that trying to retrieve the user.localbase
variable into too small a buffer would return the correct error code,
but would not fill the available buffer space with a partial result.
A partial result is of no use, but this is still a violation of the
documented behavior, which has been fixed in the previous commit to
this function.
I just checked the code for "user.cs_path" and found that it had the
same issue.
Instead of fixing the logic for each user.* sysctl string variable
individually, this commit adds a helper function set_user_str() that
implements the semantics specified in the sysctl() man page.
It is currently only used for "user.cs_path" and "user.localbase",
but it will offer a significant simplification when further such
variables are added (as I intend to do).
Kristof Provost [Tue, 1 Feb 2022 17:33:42 +0000 (18:33 +0100)]
pf tests: Only do post-test logging when specifically enabled
The pf tests have the ability to log state information (pf rules, pf
states, interfaces, ...) on exit (i.e. on success or on error).
This is useful, but only in specific cases. When it's not needed it may
get in the way of clear output.
Test scripts can add 'debug' to the pft_init call to enable this for the
specified test.
Kristof Provost [Tue, 1 Feb 2022 17:25:57 +0000 (18:25 +0100)]
pf: deal with tables gaining or losing counters
When we create a table without counters, add an entry and later
re-define the table to have counters we wound up trying to read
non-existent counters.
We now cope with this by attempting to add them if needed, removing them
when they're no longer needed and not trying to read from counters that
are not present.
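One way to express the three rules above, as a sketch with a hypothetical `struct tbl_entry` rather than pf's real table structures: allocate counters when the table gains them, free them when it loses them, and treat absent counters as reading zero (that last choice is an assumption, not taken from the commit).

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

#define NCOUNTERS 4	/* hypothetical number of counters per entry */

/* Hypothetical table entry; counters exist only when the table wants them. */
struct tbl_entry {
	uint64_t	*counters;	/* NULL when the table has no counters */
};

/* Bring an existing entry in line with the table's current counter setting. */
static int
entry_sync_counters(struct tbl_entry *e, bool want)
{
	if (want && e->counters == NULL) {
		e->counters = calloc(NCOUNTERS, sizeof(*e->counters));
		if (e->counters == NULL)
			return (-1);
	} else if (!want && e->counters != NULL) {
		free(e->counters);
		e->counters = NULL;
	}
	return (0);
}

/* Never dereference counters that are not present; report zero instead. */
static uint64_t
entry_read_counter(const struct tbl_entry *e, int idx)
{
	if (e->counters == NULL || idx < 0 || idx >= NCOUNTERS)
		return (0);
	return (e->counters[idx]);
}
```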
Ed Maste [Sat, 5 Feb 2022 02:02:44 +0000 (21:02 -0500)]
elfctl: update man page example for 'no' prefix
Reported by: Mark Millard on freebsd-current@
Fixes: c763f99d11fd ("elfctl: prefix disable flags with "no"")
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
John Baldwin [Fri, 4 Feb 2022 23:38:49 +0000 (15:38 -0800)]
cxgbei: Rework parsing of pre-offload PDUs.
sbcut() returns mbufs in reverse order and so is not suitable for reading
data from the socket buffer. Instead, check for already-received data
in the receive worker thread before passing offload PDUs up to the
iSCSI layer. This uses soreceive() to read data from the socket and
is also able to use M_WAITOK since it now runs from a worker thread instead
of an interrupt thread.
Also, fix decoding of the data segment length for pre-offload PDUs.
Reported by: Jithesh Arakkan @ Chelsio
Fixes: a8c4147edcdc cxgbei: Parse all PDUs received prior to enabling offload mode.
Sponsored by: Chelsio Communications
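For reference, a sketch of decoding DataSegmentLength from an iSCSI Basic Header Segment as laid out in RFC 7143 (byte 4 is TotalAHSLength, bytes 5 to 7 are the 24-bit length in network byte order). This illustrates the field format only; the commit does not describe the driver's exact decoding error, and the function names are hypothetical.

```c
#include <stdint.h>

/*
 * Decode the 24-bit DataSegmentLength from a 48-byte iSCSI BHS.
 * Bytes 5..7 hold the length, most significant byte first.
 */
static uint32_t
bhs_data_segment_length(const uint8_t bhs[48])
{
	return (((uint32_t)bhs[5] << 16) | ((uint32_t)bhs[6] << 8) | bhs[7]);
}

/* On the wire the data segment is padded to a 4-byte boundary. */
static uint32_t
bhs_padded_data_length(const uint8_t bhs[48])
{
	return ((bhs_data_segment_length(bhs) + 3) & ~(uint32_t)3);
}
```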
Alan Somers [Mon, 3 Jan 2022 00:16:09 +0000 (17:16 -0700)]
fusefs: require FUSE_NO_OPENDIR_SUPPORT for NFS exporting
FUSE file systems that do not set FUSE_NO_OPENDIR_SUPPORT do not
guarantee that d_off will be valid after closing and reopening a
directory. That conflicts with NFS's statelessness, which results in
unresolvable bugs when NFS reads large directories, if:
* The file system _does_ change the d_off field for the last directory
entry previously returned by VOP_READDIR, or
* The file system deletes the last directory entry previously seen by
NFS.
Rather than doing a poor job of exporting such file systems, it's better
just to refuse.
Even though this is technically a breaking change, 13.0-RELEASE's
NFS-FUSE support was bad enough that an MFC should be allowed.
Alan Somers [Sun, 2 Jan 2022 22:29:50 +0000 (15:29 -0700)]
fusefs: optimize NFS readdir for FUSE_NO_OPENDIR_SUPPORT
In its lowest common denominator, FUSE does not require that a directory
entry's d_off field is valid outside of the lifetime of the directory's
FUSE file handle. But since NFS is stateless, it must reopen the
directory on every call to VOP_READDIR. That means reading the
directory all the way from the first entry. Not only does this create
an O(n^2) condition for large directories, but it can also result in
incorrect behavior if either:
* The file system _does_ change the d_off field for the last directory
entry previously seen by NFS, or
* The file system deletes the last directory entry previously seen by
NFS.
Handily, for file systems that set FUSE_NO_OPENDIR_SUPPORT, d_off is
guaranteed to be valid for the lifetime of the directory entry, so there
is no need to read the directory from the start.
Alan Somers [Sun, 2 Jan 2022 17:18:47 +0000 (10:18 -0700)]
Fix NFS exports of FUSE file systems for big directories
The FUSE protocol does not require that a directory entry's d_off field
outlive the lifetime of its directory's file handle. Since the NFS
server must reopen the directory on every VOP_READDIR call, that means
it can't pass uio->uio_offset down to the FUSE server. Instead, it must
read the directory from 0 each time. It may need to issue multiple
FUSE_READDIR operations until it finds the d_off field that it's looking
for. That was the intention behind SVN r348209 and r297887, but a logic
bug prevented subsequent FUSE_READDIR operations from ever being issued,
rendering large directories incompletely browseable.
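A toy model of the scan the NFS path needs, assuming every FUSE_READDIR round trip returns at most `batch` entries and that entry i carries d_off == i + 1. `fake_readdir()` and `scan_to_cookie()` are hypothetical; the point is the outer loop, which keeps issuing reads until the wanted cookie is found, where the buggy code never issued a second read.

```c
#include <stddef.h>
#include <stdint.h>

#define MAX_BATCH 64

/* Hypothetical directory entry; d_off is the cookie of the NEXT entry. */
struct dent {
	uint64_t	d_off;
};

/*
 * Hypothetical stand-in for one FUSE_READDIR round trip.  In this model
 * entry i has d_off == i + 1, so a read at cookie 'off' starts at entry 'off'.
 */
static size_t
fake_readdir(const struct dent *dir, size_t n, uint64_t off,
    struct dent *out, size_t max)
{
	size_t cnt = 0;

	for (uint64_t i = off; i < n && cnt < max; i++)
		out[cnt++] = dir[i];
	return (cnt);
}

/*
 * Re-read the directory from cookie 0 until the entry whose d_off equals
 * 'target' has been consumed, issuing as many reads as necessary.
 * Returns the number of reads issued, or -1 if EOF came first.
 */
static int
scan_to_cookie(const struct dent *dir, size_t n, uint64_t target,
    size_t batch)
{
	struct dent buf[MAX_BATCH];
	uint64_t off = 0;
	int reads = 0;

	if (batch == 0 || batch > MAX_BATCH)
		batch = MAX_BATCH;
	while (off != target) {
		size_t got = fake_readdir(dir, n, off, buf, batch);

		reads++;
		if (got == 0)
			return (-1);	/* hit EOF without finding the cookie */
		for (size_t k = 0; k < got && off != target; k++)
			off = buf[k].d_off;
	}
	return (reads);
}
```

With 12 entries and batches of 4, resuming at cookie 10 takes three reads; a single read would stop at cookie 4.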
Stefan Eßer [Fri, 4 Feb 2022 22:37:12 +0000 (23:37 +0100)]
whereis: fix fetching of user.cs_path sysctl variable
The current implementation of sysctlbyname() does not support the user
sub-tree. It returns 0, but sets the
passed string buffer to an empty string.
As a result, the whereis program did not use the value of the sysctl
variable "user.cs_path", but only the value of the environment
variable "PATH".
This update makes whereis use the sysctl function with a fixed OID,
which already supports the user sub-tree.
Warner Losh [Fri, 4 Feb 2022 22:38:15 +0000 (15:38 -0700)]
style(9): Default to omitting $FreeBSD$
Advise people to omit $FreeBSD$ (in both comments and macros) unless the
code is definitely going to be merged to stable/12. This strengthens
previous statements and is appropriate now that stable/11 is no longer
supported. If people are wrong and things are unexpectedly merged to 12,
tags can be added before that merge. There is no sense in adding a tag
that will never be expanded and will later be removed, on the off chance
it might wind up in stable/12.
The next step is likely to weaken this to apply just to mergemaster
managed files, but not today.
Kirk McKusick [Fri, 4 Feb 2022 19:46:36 +0000 (11:46 -0800)]
Have fsck_ffs(8) properly correct superblock check-hash failures.
Part of the problem was that fsck_ffs would read the superblock
multiple times, complaining about and repairing the superblock check
hash each time, and then at the end fail to write out the superblock
with the corrected check hash. This fix reads the superblock just
once and, if the check hash is corrected, ensures that the fixed
superblock gets written.
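A sketch of the "read once, repair once, write if repaired" pattern, using a hypothetical simplified superblock and a toy hash in place of the real FFS check hash. Names such as `sb_load` and `sb_flush` are illustrative only.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, heavily simplified superblock. */
struct sblock {
	uint8_t		data[32];
	uint32_t	ckhash;
};

/* Toy hash standing in for the real superblock check hash. */
static uint32_t
sb_hash(const struct sblock *sb)
{
	uint32_t h = 0;

	for (size_t i = 0; i < sizeof(sb->data); i++)
		h = h * 31 + sb->data[i];
	return (h);
}

/* Hypothetical in-memory copy, read from disk exactly once. */
static struct sblock sb_cached;
static bool sb_dirty;

static void
sb_load(const struct sblock *ondisk)
{
	sb_cached = *ondisk;	/* single read */
	if (sb_hash(&sb_cached) != sb_cached.ckhash) {
		/* Repair once and remember that the fix must be written. */
		sb_cached.ckhash = sb_hash(&sb_cached);
		sb_dirty = true;
	}
}

/* At exit: write the superblock back only if it was corrected. */
static bool
sb_flush(struct sblock *ondisk)
{
	if (!sb_dirty)
		return (false);
	*ondisk = sb_cached;
	sb_dirty = false;
	return (true);
}
```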
Ed Maste [Sun, 16 Jan 2022 19:22:05 +0000 (14:22 -0500)]
compiler-rt: re-exec with ASLR disabled when necessary
Some sanitizers (at least msan) currently require ASLR to be disabled.
When we detect that ASLR is enabled, re-exec with it disabled rather
than exiting with an error. See LLVM GitHub issue 53256 for more
detail: https://github.com/llvm/llvm-project/issues/53256
No objection: dim
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D33934
Ed Maste [Wed, 19 Jan 2022 18:08:18 +0000 (13:08 -0500)]
compiler-rt: support ReExec() on FreeBSD
Based on getMainExecutable() in llvm/lib/Support/Unix/Path.inc.
This will need a little more work for an upstream change as it must
support older FreeBSD releases that lack elf_aux_info() / AT_EXEC_PATH.
No objection: dim
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D33934
Stefan Eßer [Fri, 4 Feb 2022 12:44:20 +0000 (13:44 +0100)]
libc: return partial sysctl() result if buffer is too small
Testing of a new feature revealed that calling sysctl() to retrieve
the value of the user.localbase variable passing too low a buffer size
could leave the result buffer unchanged.
The behavior in the normal case of a sufficiently large buffer was
correct.
All known callers pass a sufficiently large buffer and have thus not
been affected by this issue. If a non-default value had been assigned
to this variable, the result was as documented, too.
Fix the function to fill the buffer with a partial result, if the
passed in buffer size is too low to hold the full result.
Atomics have significant uses besides providing in-system
primitives for safe memory updates. They are also used to implement
communication with out-of-system software or hardware following some
protocol.
For instance, even a UP kernel might require a protocol using atomics to
communicate with a software-emulated device on an SMP hypervisor. Or
real hardware might need atomic accesses as part of its
management protocol.
Another point is that UP configurations on x86 are extinct, so the slight
performance hit from unconditionally using proper atomics is not important.
It is compensated by less code clutter, which in fact improves the
UP/i386 lifetime expectations.
Adrian Chadd [Sun, 30 Jan 2022 03:04:19 +0000 (19:04 -0800)]
ar40xx_switch: add initial switch for the IPQ4018/IPQ4019.
Summary:
This switch is based off of the AR8327/AR8337 external switch/PHY.
However, unlike the AR8327/AR8337, it doesn't itself have any PHYs;
instead an external PHY connects to it using the PSGMII port.
Differential Revision: https://reviews.freebsd.org/D34112
Reviewed by: manu
This code is inspired by the ar40xx code in openwrt, which itself
is based on the Qualcomm QCA-SSDK. Both of these sources are, amusingly,
BSD licensed - and thus I have included some of the comments in the
hardware workaround paths to document some of the magic numbers.
Adrian Chadd [Sun, 30 Jan 2022 02:27:58 +0000 (18:27 -0800)]
qcom_mdio: add initial IPQ4018 MDIO support
This adds support for the IPQ4018/IPQ4019 MDIO bus. This is used to
talk to external PHYs and switches. (There's an internal switch
in the IPQ4018/IPQ4019 as well, but it's accessible via MMIO/AXI.)
Differential Revision: https://reviews.freebsd.org/D34110
Reviewed by: manu
Justin Hibbits [Thu, 3 Feb 2022 23:20:36 +0000 (17:20 -0600)]
powerpc/atomic: Fix atomic_testand_*_long on powerpc64
After b5d227b0 FreeBSD was panicking on boot with "Duplicate free" in
UMA. Analyzing the asm, the '1' mask was treated as an integer, rather
than a long, causing 'slw' (shift left word) to be used for the shifting
instruction, not 'sld' (shift left double). This means the upper bits
of the bitfield were not getting used, resulting in corruption of the
bitfield.
While fixing this, also note that the result of the 'and' with the mask
does not need to be recorded in CR0, so don't record it (drop the '.').
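The width problem can be shown in C, as a sketch: `mask64()` widens the constant before shifting, while `mask_word_shift()` roughly emulates a 32-bit word shift such as slw, which produces 0 for shift counts of 32 or more. Both helpers are hypothetical and are not the atomic(9) implementation.

```c
#include <stdint.h>

/* Mask for bit 'bit' (0..63) of a 64-bit long: widen first, then shift. */
static uint64_t
mask64(unsigned bit)
{
	return ((uint64_t)1 << (bit & 63));
}

/*
 * Rough emulation of the buggy case: a 32-bit word shift yields 0 for
 * shift counts of 32..63, so the upper half of the long is never reached.
 */
static uint64_t
mask_word_shift(unsigned bit)
{
	bit &= 63;
	return (bit >= 32 ? 0 : (uint64_t)((uint32_t)1 << bit));
}
```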
John Baldwin [Thu, 3 Feb 2022 18:48:18 +0000 (10:48 -0800)]
iwlwifi: Disable -Wformat when building with GCC.
GCC's -Wformat complains about NULL format strings passed to
iwl_fw_dbg_collect_trig (though the function handles NULL format
strings). It is curious that upstream iwlwifi in Linux is built with GCC
and explicitly opts into this warning via the __printf() attribute.
Alexander Motin [Thu, 3 Feb 2022 15:48:19 +0000 (10:48 -0500)]
CTL: Fix mode page truncation on HA synchronization.
Due to the variable size of struct ctl_ha_msg_mode, ctl_isc_announce_mode()
sent only the first 4 bytes of the modified mode page to the other HA side,
which caused its corruption there, noticeable only after failover.
I've found a similar bug in ctl_isc_announce_lun() too, but there it was
sending slightly more than needed, which is a smaller problem.
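A sketch of how a variable-size message length can go wrong, assuming (hypothetically) a struct that ends in a small fixed-size array while the real payload is longer. Sizing the message with sizeof() sends only the declared bytes; the fix is to add the actual page length to the header size. The struct and field names are illustrative, not CTL's.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical message whose trailing array is really page_len bytes. */
struct msg_mode {
	uint32_t	lun;
	uint16_t	page_len;	/* bytes actually present in data[] */
	uint8_t		data[4];	/* declared small, used as variable-size */
};

/* Buggy: the size of the struct as declared, ignoring page_len. */
static size_t
msg_len_buggy(const struct msg_mode *m)
{
	(void)m;
	return (sizeof(struct msg_mode));
}

/* Fixed: the header up to data[] plus the real page length. */
static size_t
msg_len(const struct msg_mode *m)
{
	return (offsetof(struct msg_mode, data) + m->page_len);
}
```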
Kyle Evans [Thu, 3 Feb 2022 16:05:06 +0000 (10:05 -0600)]
kern: harvest entropy from callouts
74cf7cae4d22 ("softclock: Use dedicated ithreads for running callouts.")
switched callouts away from the swi infrastructure. It turns out that
this was a major source of entropy in early boot, which we've now lost.
As a result, first boot on hardware without a 'fast' entropy source
would block waiting for fortuna to be seeded with little hope of
progressing without manual intervention.
Let's resolve it by explicitly harvesting entropy in callout_process()
if we've handled any callouts. cc/curthread/now seem to be reasonable
sources of entropy, so use those.
TCP RACK can cache the IP header while preparing
a new TCP packet for transmission. Thus all the
IP ECN codepoint bits need to be assigned, without
assuming a clear field beforehand.
Reviewed By: tuexen, kbowling, #transport
MFC after: 3 days
Sponsored by: NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D34148
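A minimal sketch of the difference, using the RFC 3168 ECN codepoints in the low two bits of the TOS byte. The macro and function names are hypothetical. OR-ing a codepoint into a cached header only works if the field was clear; masking first works regardless of what the cached header held.

```c
#include <stdint.h>

/* Two-bit ECN field in the IP TOS byte (RFC 3168). */
#define ECN_MASK	0x03
#define ECN_NOT_ECT	0x00
#define ECN_ECT1	0x01
#define ECN_ECT0	0x02
#define ECN_CE		0x03

/* Buggy pattern: assumes the ECN bits are still clear in a reused header. */
static uint8_t
set_ecn_or(uint8_t tos, uint8_t ecn)
{
	return ((uint8_t)(tos | ecn));
}

/* Fixed pattern: clear the whole field first, then set the codepoint. */
static uint8_t
set_ecn(uint8_t tos, uint8_t ecn)
{
	return ((uint8_t)((tos & ~ECN_MASK) | (ecn & ECN_MASK)));
}
```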
tcp: Access all 12 TCP header flags via inline function
In order to consistently provide access to all
(including reserved) TCP header flag bits,
use the accessor functions tcp_get_flags and
tcp_set_flags. Also expand any flag variable from
uint8_t / char to uint16_t.
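A sketch of the bit layout such accessors cover, modeled on raw header bytes rather than FreeBSD's struct tcphdr: the eight classic flags live in byte 13, and the four reserved bits in the low nibble of byte 12 (next to the data offset) supply the other four. `get_flags12()` and `set_flags12()` are hypothetical names, not tcp_get_flags()/tcp_set_flags().

```c
#include <stdint.h>

/* Hypothetical view of TCP header bytes 12 and 13. */
struct tcp_off_flags {
	uint8_t	off_x2;	/* data offset << 4 | four reserved flag bits */
	uint8_t	flags;	/* CWR ECE URG ACK PSH RST SYN FIN */
};

/* Return all 12 flag bits: reserved nibble in bits 8..11. */
static uint16_t
get_flags12(const struct tcp_off_flags *h)
{
	return ((uint16_t)(((h->off_x2 & 0x0f) << 8) | h->flags));
}

/* Store all 12 flag bits without disturbing the data offset. */
static void
set_flags12(struct tcp_off_flags *h, uint16_t flags)
{
	h->off_x2 = (uint8_t)((h->off_x2 & 0xf0) | ((flags >> 8) & 0x0f));
	h->flags = (uint8_t)(flags & 0xff);
}
```

Widening flag variables to uint16_t follows directly: a uint8_t cannot hold bits 8 to 11.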
Mark Johnston [Thu, 3 Feb 2022 14:41:17 +0000 (09:41 -0500)]
filemon: Reject FILEMON_SET_FD commands when the fd is a kqueue
When FILEMON_SET_FD is used, the filemon handle effectively wraps the
passed file. In particular, the handle may be inherited by a child
process, or transferred over a unix domain socket, so we must verify
that the backing file permits this.
Reported by: syzbot+36e6be9e02735fe66ca8@syzkaller.appspotmail.com
Reviewed by: emaste
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D34128
Michael Tuexen [Wed, 2 Feb 2022 08:20:43 +0000 (09:20 +0100)]
tcp: cleanup functions related to socket option handling
Consistently only pass the inp and the sopt around. Don't pass the
so around, since in an upcoming commit tcp_ctloutput_set() will be
called from a context different from setsockopt(). Also expect
the inp to be locked when calling tcp_ctloutput_[gs]et(), this is
also required for the upcoming use by tcpsso, a command line tool
to set socket options.
Reviewed by: glebius, rscheff
Sponsored by: Netflix, Inc.
Differential Revision: https://reviews.freebsd.org/D34151
Warner Losh [Wed, 2 Feb 2022 21:36:49 +0000 (14:36 -0700)]
mps: Use 64-bit chain structures
According to Broadcom, mixing 64-bit SGEs with 32-bit chain entries can
lead to IOC Fault code 0x40000d04. This fault code has been observed to
suddenly increase on certain machines when the OCA firmware images are
deployed. The hardware interprets all elements of a 64-bit SGE, even
ones marked as 32-bit. Depending on the other bits, this will just work,
but sometimes generate the above fault. Broadcom recommends this
practice, and the Linux and NetBSD drivers follow it.
Rework the chaining code to use MPI2_SGE_CHAIN64 instead of
MPI2_SGE_CHAIN32. Adjust MPS_SGC_SIZE from 8 to 12 to match the size of
the new structure. Flag the structure as being 64-bits now. Since
MPS_SGE64_SIZE and MPS_SGC_SIZE are the same now, mps_push_sge could be
simplified (in the same fashion as mpr). The various cases collapse to
whether or not there's room for the segments and, if not, a chain is
needed. However, these changes haven't been made yet, as the current
code handles those cases properly with the new defines.
Make chain_busaddr 64 bits, even though we ask for all allocations to be
below 4GB for this tag. Use it to set both parts of the CHAIN64 address
rather than baking in the 4GB assumption. Add asserts around the allocation
to detect any BUSDMA bugs in allocation.
Remove asserts and associated comment in mpi_pre_fw_download and
mpi_pre_fw_upload. The code does not, it seems, depend on this
invariant. The mpr driver has similar code, no asserts and also doesn't
depend on this.
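As a sketch of setting both parts of a 64-bit chain address, here is the split of a bus address into two little-endian 32-bit halves. The struct and field names are hypothetical and do not match the MPI2 headers.

```c
#include <stdint.h>

/* Hypothetical 64-bit chain address stored as two little-endian halves. */
struct chain_addr64 {
	uint8_t	low[4];
	uint8_t	high[4];
};

static void
put_le32(uint8_t out[4], uint32_t v)
{
	out[0] = (uint8_t)v;
	out[1] = (uint8_t)(v >> 8);
	out[2] = (uint8_t)(v >> 16);
	out[3] = (uint8_t)(v >> 24);
}

/* Fill both halves from the full bus address instead of assuming high == 0. */
static void
set_chain_addr(struct chain_addr64 *ca, uint64_t busaddr)
{
	put_le32(ca->low, (uint32_t)busaddr);
	put_le32(ca->high, (uint32_t)(busaddr >> 32));
}
```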
unionfs: do not force LK_NOWAIT if VI_OWEINACT is set
I see no apparent need to avoid waiting on the lock just because
vinactive() may be called on another thread while the thread that
cleared the vnode refcount has the lock dropped. In fact, this
can at least lead to a panic of the form "vn_lock: error <errno>
incompatible with flags" if LK_RETRY was passed to VOP_LOCK().
In this case LK_NOWAIT may cause the underlying FS to return an
error which is incompatible with LK_RETRY.
unionfs: allow lock recursion when reclaiming the root vnode
The unionfs root vnode will always share a lock with its lower vnode.
If unionfs was mounted with the 'below' option, this will also be the
vnode covered by the unionfs mount. During unmount, the covered vnode
will be locked by dounmount() while the unionfs root vnode will be
locked by vgone(). This effectively requires recursion on the same
underlying lock, albeit through two different vnodes.
VOP_LOCK() may be handed a vnode that is concurrently reclaimed.
unionfs_lock() accounts for this by checking for empty vnode private
data under the interlock. But it incorrectly asserts that the vnode
is using the unionfs dispatch table before making this check.
Reverse the order, and also update KASSERT_UNIONFS_VNODE() to provide
more useful information.
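A simplified sketch of the reordered check, using a hypothetical vnode with only a dispatch table pointer and private data. The reclaimed case is tested first; only a live vnode is expected to use the unionfs table. Return codes stand in for the KASSERT so the logic can be exercised outside the kernel.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical, heavily simplified vnode. */
struct vnode {
	const void	*v_op;		/* dispatch table */
	void		*v_data;	/* private data; NULL once reclaimed */
};

/* Hypothetical tag standing in for the unionfs dispatch table. */
static const int unionfs_vnodeops_tag;

/*
 * Check for reclamation first: a reclaimed vnode may already use a
 * different dispatch table, so checking v_op before this would fire
 * spuriously.
 */
static int
lock_sketch(const struct vnode *vp)
{
	if (vp->v_data == NULL)
		return (ENOENT);	/* reclaimed: no unionfs private data */
	if (vp->v_op != &unionfs_vnodeops_tag)
		return (EINVAL);	/* the assertion, checked on live vnodes only */
	return (0);
}
```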