Cy Schubert [Tue, 4 Jan 2022 03:04:04 +0000 (19:04 -0800)]
ipfilter userland: Fix branch mismerge
The work to ANSIfy and adjust returns to style(9) resulted in a mismerge
of a stash when ipfilter was moved from contrib to sbin. An older file
replaced WIP at the time, resulting in a regression.
The majority of this work was done in 2018, saved as git stashes within
a git-svn tree, and migrated to the git tree. The regression occurred
when the various stashes were sequentially merged to create individual
commits, following the ipfilter move to netpfil and sbin.
SDT probe frb_natv4in is only available when an error is encountered.
Make it also available when no error is encountered, i.e. NATed and
not translated.
Cy Schubert [Tue, 21 Dec 2021 17:22:10 +0000 (09:22 -0800)]
ipfilter: INLINE --> inline
Replace the INLINE macro with inline. Some ancient compilers supported
__inline__ instead of inline. The INLINE hack compensated for it.
Ancient compilers are history.
Cy Schubert [Mon, 20 Dec 2021 17:07:20 +0000 (09:07 -0800)]
ipfilter: ANSIfy userland function declarations
Convert ipfilter userland function declarations from K&R to ANSI. This
syncs our function declarations with NetBSD hg commit 75edcd7552a0
(apply our changes). Though not copied from NetBSD, this change was
partially inspired by NetBSD's work and inspired by style(9).
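As an illustration of what a K&R to ANSI conversion looks like, here is a
minimal sketch. The function count_set() is hypothetical and is not taken
from ipfilter; the old-style form is kept in a comment because old-style
definitions were removed from the language in C23.

```c
#include <assert.h>
#include <stddef.h>

/*
 * K&R (old-style) definition of a hypothetical helper, shown only as a
 * comment:
 *
 *	int
 *	count_set(flags, n)
 *		unsigned int *flags;
 *		size_t n;
 *	{
 *		...
 *	}
 */

/* ANSI (prototype-style) definition of the same hypothetical helper. */
int
count_set(unsigned int *flags, size_t n)
{
	int set = 0;

	assert(flags != NULL || n == 0);
	for (size_t i = 0; i < n; i++) {
		if (flags[i] != 0)
			set++;
	}
	return (set);
}
```

With the prototype form the compiler can check argument types at every
call site, which the old-style form does not allow.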
Cy Schubert [Mon, 20 Dec 2021 16:43:49 +0000 (08:43 -0800)]
ipfilter: ANSIfy kernel function declarations
Convert ipfilter kernel function declarations from K&R to ANSI. This
syncs our function declarations with NetBSD hg commit 75edcd7552a0
(apply our changes). Though not copied from NetBSD, this change was
partially inspired by NetBSD's work and inspired by style(9).
Robert Wing [Tue, 4 Jan 2022 01:21:58 +0000 (16:21 -0900)]
cam: don't lock while handling an AC_UNIT_ATTENTION
Don't take the device_mtx lock in daasync() when handling an
AC_UNIT_ATTENTION. Instead, assert the lock is held before modifying the
periph's softc flags.
The device_mtx lock is taken in xptdevicetraverse() before daasync()
is eventually called in xpt_async_bcast().
Navdeep Parhar [Mon, 3 Jan 2022 22:35:45 +0000 (14:35 -0800)]
cxgbe(4): Update firmwares to 1.26.6.0.
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
CHANGES
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Version : 1.26.6.0
Date : 01/03/2022
================================================================================
Fixes
-----
BASE:
- Fixed one module eeprom read failure.
- Fixed an issue with speed selection when 40G and 25G are advertised and
supported.
- Fixed a random traffic hang when T5 receives invalid ets BW in dcbx
messages from a switch.
- Fixed very long link up time with few switches.
================================================================================
Navdeep Parhar [Mon, 3 Jan 2022 21:31:46 +0000 (13:31 -0800)]
cxgbe(4): Fix stats collection for ports with port_id != tx_chan
This fixes a driver panic during stats collection when a port's id does
not match its tx channel. The bug affected only the T580 card running
with a non-default VPD.
Alan Cox [Wed, 29 Dec 2021 07:50:05 +0000 (01:50 -0600)]
arm64: Implement final level only TLB invalidations
A feature of arm64's instruction for TLB invalidation is the ability
to determine whether cached intermediate entries, i.e., L{0,1,2}_TABLE
entries, are invalidated in addition to the final entry, e.g., an
L3_PAGE entry.
Update pmap_invalidate_{page,range}() to support both types of
invalidation, allowing the caller to determine which type of
invalidation is performed.
Update the callers to request the appropriate type of invalidation.
Eliminate redundant TLB invalidations in pmap_abort_ptp() and
pmap_remove_l3_range().
Add a comment to pmap_invalidate_all() making clear that it always
invalidates entries at all levels.
As expected, these changes result in a tiny yet measurable
performance improvement.
Gleb Smirnoff [Mon, 3 Jan 2022 18:15:22 +0000 (10:15 -0800)]
inpcb: use global UMA zones for protocols
Provide structure inpcbstorage, that holds zones and lock names for
a protocol. Initialize it with global protocol init using macro
INPCBSTORAGE_DEFINE(). Then, at VNET protocol init supply it as
the main argument to in_pcbinfo_init(). Each VNET pcbinfo uses
its private hash, but they all use the same zone to allocate and the
same SMR section to synchronize.
Note: there is kern.ipc.maxsockets sysctl, which controls UMA limit
on the socket zone, which was always global. Historically the same
maxsockets value is also applied to every PCB zone. Important fact:
you can't create a pcb without a socket! A pcb may outlive its socket,
however. Given that there are multiple protocols, and only one socket
zone, the per pcb zone limits seem to have little value. Under very
special conditions it may trigger a little bit earlier than socket zone
limit, but in most setups the socket zone limit will be triggered
earlier. When VIMAGE was added to the kernel, PCB zones became per-VNET.
This magnified the existing imbalance further: now we have multiple pcb
zones in multiple vnets limited to maxsockets, but every pcb requires a
socket allocated from the global zone also limited by maxsockets.
IMHO, this per pcb zone limit doesn't bring any value, so this patch
drops it. If anybody explains the value of this limit, it can be restored
very easily - just a 2-line change to in_pcbstorage_init().
Gleb Smirnoff [Mon, 3 Jan 2022 18:15:21 +0000 (10:15 -0800)]
protocols: init with standard SYSINIT(9) or VNET_SYSINIT
The historical BSD network stack loop that rolls over domains and
over protocols has no advantages over more modern SYSINIT(9).
While doing the sweep, split global and per-VNET initializers.
Getting rid of pr_init achieves several things:
o Get rid of ifdef's that protect against double foo_init() when
both INET and INET6 are compiled in.
o Isolate initializers statically to the module they init.
o Make the code easier to understand and maintain.
largepage_mprotect maps a superpage and later extends the mapping. This
occasionally fails with ASLR disabled. To fix this, first try to
reserve a sufficiently large virtual address region.
Reported by: Jenkins
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
Mark Johnston [Mon, 3 Jan 2022 15:14:41 +0000 (10:14 -0500)]
x86: Skip late calibration if our reference timer has low quality
Some AMD Geode-based systems end up using the 8254 PIT to calibrate the
TSC during late calibration, which doesn't work because that
timecounter's mask (65535) is much smaller than its frequency (1193182).
Moreover, early calibration is done against the 8254 timer anyway.
Work around the problem by simply using early calibration results if no
high-quality timecounters exist.
PR: 260868
Fixes: 22875f88799e ("x86: Implement deferred TSC calibration")
Reported and tested by: mike@sentex.net, Stefan Hegnauer <stefan.hegnauer@gmx.ch>
Reviewed by: imp, kib
MFC after: 3 days
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D33730
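To make the mask versus frequency argument concrete, the following sketch
computes how long a free-running counter takes to wrap. The helper name
wrap_period_us() is hypothetical; only the two constants (mask 65535 and
frequency 1193182) come from the text above.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Time, in microseconds, before a free-running counter with the given
 * mask wraps around when it ticks at freq_hz.
 */
uint64_t
wrap_period_us(uint32_t mask, uint64_t freq_hz)
{
	assert(freq_hz != 0);
	return (((uint64_t)mask + 1) * 1000000 / freq_hz);
}
```

wrap_period_us(65535, 1193182) evaluates to 54925, i.e. the counter wraps
roughly every 55 ms. Any measurement window longer than that cannot tell
how many wraps occurred, which is consistent with the 8254 being a poor
reference for late calibration.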
Jessica Clarke [Mon, 3 Jan 2022 17:09:42 +0000 (17:09 +0000)]
arm64: Check for intrng-reported errors in gicv3_its
Currently, any errors when adding a PIC child handler are ignored,
instead just continuing on to registering that PIC as an MSI, and
ignoring any errors that occur for that too.
Reviewed by: andrew
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D33342
Jessica Clarke [Mon, 3 Jan 2022 17:08:44 +0000 (17:08 +0000)]
intrng: Use less confusing return value for intr_pic_add_handler
Currently intr_pic_add_handler either returns the PIC you gave it (which
is useless and risks causing confusion about whether it's creating
another PIC) or, on error, NULL. Instead, convert it to return an int
error code as one would expect.
Note that the only consumer of this API, arm64's gicv3_its, does not use
the return value, so no uses need updating to work with the revised API.
Ed Maste [Mon, 3 Jan 2022 16:32:52 +0000 (11:32 -0500)]
ar: accept but ignore 'T' option
In previous versions of BSD ar -T was an alias for -f -- use only the
first 15 characters of archive member names. In GNU ar and LLVM ar -T
creates a thin archive.
The -f / old BSD ar -T functionality is not particularly useful, and
ignoring -T still results in a usable and compatible (but not thin)
archive.
An exp-run found a few ports invoking ar -T but they all expect thin
archives. In addition, -T will be used to specify thin archives after
a migration to LLVM-ar.
PR: 260523 [exp-run]
Reviewed by: markj
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D33553
Corvin Köhne [Mon, 3 Jan 2022 13:19:39 +0000 (14:19 +0100)]
bhyve: add more slop to 64 bit BARs
Bhyve allocates small 64 bit BARs below 4 GB and generates ACPI tables
based on this allocation. If the guest decides to relocate those BARs
above 4 GB, it could lead to mismatching ACPI tables. This is
especially likely to cause issues when using OVMF with bus
enumeration enabled, since OVMF relocates all 64 bit BARs above 4 GB.
The guest OS
may be unable to recover from this situation and disables some PCI
devices because their BARs are located outside of the MMIO space
reported by ACPI. Avoid this situation by giving the guest more
space for relocating BARs.
Let's be paranoid. The available space for BARs below 4 GB is 512 MB
large. Use a slop of 512 MB. It'll allow the guest to relocate all
BARs below 4 GB to an address above 4 GB. We could run into issues
when we exceed the memlimit above 4 GB. However, this space has
a size of 32 GB. Even when using many PCI devices with large BARs
like framebuffers or when using multiple PCI buses, it's very
unlikely that we run out of space due to the large slop.
Additionally, this situation will occur on startup and not at runtime
which is much better.
Corvin Köhne [Mon, 3 Jan 2022 13:18:31 +0000 (14:18 +0100)]
bhyve: allow reading of fwctl signature multiple times
At the moment, you only have one single chance to read the fwctl
signature. At boot bhyve is in the state IDENT_WAIT. It's then
possible to switch to IDENT_SEND. After bhyve sends the signature,
it switches to REQ. From now on it's impossible to switch back to
IDENT_SEND to read the signature. For that reason, only a single
driver can read the signature. A guest can't use two drivers to
identify that fwctl is present. It gets even worse when using
OVMF. OVMF uses a library to access fwctl. Therefore, every single
OVMF driver would try to read the signature. Currently, only a
single OVMF driver accesses the fwctl. So, there's no issue with
it yet. However, no OS driver would have a chance to detect fwctl when
using OVMF because its signature was already consumed by OVMF.
Corvin Köhne [Mon, 3 Jan 2022 13:16:59 +0000 (14:16 +0100)]
bhyve: enumerate BARs by size
Framebuffers, for example, can require a lot of space, and BARs need
to be aligned to their size. If BARs aren't allocated in order of size,
the MMIO space becomes badly fragmented. Order the BAR allocation by
size to reduce fragmentation and the risk of OUT_OF_MMIO_SPACE issues.
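The effect of ordering can be shown with a small model. This is a sketch
under simplifying assumptions, not bhyve's allocator: BAR sizes are powers
of two, each BAR is placed at the next address aligned to its own size, and
the names place_bars() and sort_bars_by_size() are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Round addr up to the next multiple of align (a power of two). */
static uint64_t
roundup_pow2(uint64_t addr, uint64_t align)
{
	return ((addr + align - 1) & ~(align - 1));
}

/*
 * Place each BAR, in array order, at the next address aligned to its own
 * size. Returns the first free address after the last BAR.
 */
uint64_t
place_bars(uint64_t base, const uint64_t *sizes, size_t n)
{
	uint64_t next = base;

	for (size_t i = 0; i < n; i++) {
		assert(sizes[i] != 0 && (sizes[i] & (sizes[i] - 1)) == 0);
		next = roundup_pow2(next, sizes[i]) + sizes[i];
	}
	return (next);
}

/* qsort(3) comparator: largest size first. */
static int
cmp_size_desc(const void *a, const void *b)
{
	uint64_t x = *(const uint64_t *)a;
	uint64_t y = *(const uint64_t *)b;

	return ((x < y) - (x > y));
}

/* Sort the BAR sizes so that the largest BARs are placed first. */
void
sort_bars_by_size(uint64_t *sizes, size_t n)
{
	qsort(sizes, n, sizeof(sizes[0]), cmp_size_desc);
}
```

For the sizes 4 KB, 16 MB, 4 KB, 16 MB placed from address 0, array order
ends at 0x4000000 because each 16 MB BAR leaves an alignment hole behind a
4 KB one. After sort_bars_by_size() the same BARs end at 0x2002000.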
Corvin Köhne [Mon, 3 Jan 2022 14:48:10 +0000 (14:48 +0000)]
bhyve: only init MSI-X table if passthru device supports it
Some passthru devices only support MSI instead of MSI-X. For those
devices the initialization of MSI-X table will fail. Re-add the
check erroneously removed in f1442847c9404d4bc5f5524a0c3362dd39cb14f9.
Warner Losh [Sun, 2 Jan 2022 01:08:21 +0000 (18:08 -0700)]
src.opts.mk: Remove most of the mips support
Mips had a number of special cases that disabled features that didn't
work. Remove them all. However, retain the llvm mips bits because that
requires a lot more effort to unwind and will be done separately.
The implementation simply passes the text ref to the appropriate
underlying vnode. Without this, the default [un]set_text
implementation will only manage the text ref on the unionfs vnode,
causing it to be out of sync with the underlying filesystems and
potentially allowing corruption of executable file contents.
On INVARIANTS kernels, it also readily produces a panic on process
termination because the VM object representing the executable mapping
is backed by the underlying vnode, not the unionfs vnode.
Use atomics to track the writecount granted to the underlying FS,
and avoid holding the vnode interlock while calling the underlying FS'
VOP_ADD_WRITECOUNT(). This also fixes a WITNESS warning about nesting
the same lock type. Also add comments explaining why we need to track
the writecount on the unionfs vnode in the first place. Finally,
simplify writecount management to only use the upper vnode and assert
that we shouldn't have an active writecount on the lower vnode through
unionfs.
Gleb Smirnoff [Mon, 3 Jan 2022 02:32:30 +0000 (18:32 -0800)]
sshd: update the libwrap patch to drop connections early
OpenSSH dropped libwrap support in OpenSSH 6.7p in 2014
(f2719b7c in github.com/openssh/openssh-portable) and we have
maintained the patch ourselves since 2016 (a0ee8cc636cd).
Over the years, the libwrap support has deteriorated, and that was
probably the reason for its removal upstream. The original idea of
libwrap was to drop illegitimate connections as soon as possible, but
over the years the code was pushed further and further down and ended
up in the forked client connection handler.
The negative effect of late dropping is an increased attack surface
for hosts that are to be dropped anyway. Apart from hypothetical
future vulnerabilities in connection handling, today a malicious
host listed in /etc/hosts.allow can still trigger sshd to enter
connection throttling mode, which is enabled by default (see
MaxStartups in sshd_config(5)), effectively mounting a DoS attack.
Note that on OpenBSD this attack isn't possible, since they enable
MaxStartups together with UseBlacklist.
The only negative effect of the early drop that I can imagine is that
the main listener now parses a file in /etc, and if our root filesystem
goes bad, it would get stuck. But you'd be unlikely to be able to log in
in that case anyway.
Implementation details:
- For brevity we reuse the same struct request_info. This isn't
a documented feature of libwrap, but code review, viewing data
in a debugger and real life testing show that if we clear
RQ_CLIENT_NAME and RQ_CLIENT_ADDR every time, it works as intended.
- We set SO_LINGER on the socket to force immediate connection reset.
- We log the message exactly as libwrap's refuse() would do.
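The SO_LINGER item can be sketched as follows. This is not the sshd patch
itself; drop_with_reset() is a hypothetical helper that only shows the
standard socket option involved: enabling linger with a zero timeout makes
close() abort the connection instead of performing an orderly shutdown.

```c
#include <assert.h>
#include <sys/socket.h>
#include <unistd.h>

/*
 * Enable SO_LINGER with a zero timeout and close the socket, so that a
 * connected TCP socket is reset rather than shut down gracefully.
 * Returns 0 on success, -1 if setsockopt() or close() failed.
 */
int
drop_with_reset(int fd)
{
	struct linger l = { .l_onoff = 1, .l_linger = 0 };
	int error = 0;

	assert(fd >= 0);
	if (setsockopt(fd, SOL_SOCKET, SO_LINGER, &l, sizeof(l)) == -1)
		error = -1;
	if (close(fd) == -1)
		error = -1;
	return (error);
}
```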
sched_get/setaffinity(): try to be more compatible with Linux
in handling the cpuset sizes different from sizeof(cpuset_t).
For both cases, cpuset size shorter than sizeof(cpuset_t) results
in EINVAL on Linux.
For sched_getaffinity(), be more permissive and accept cpuset size
larger than our cpuset_t, by clipping the syscall argument and zeroing
the rest of the output buffer. For sched_setaffinity(), we should allow
shorter cpusets than current ABI size, again zeroing the rest of the bits.
With this change, python os.sched_get/setaffinity functions work.
Reported by: se
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
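The size handling described above can be modelled in userland as follows.
This is a sketch only, not the kernel code: TOY_CPUSET_SIZE stands in for
sizeof(cpuset_t), both function names are hypothetical, and two behaviours
are assumptions rather than statements from the text: the get path rejects
short buffers (mirroring the Linux behaviour mentioned), and the set path
rejects buffers larger than the kernel set.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

#define TOY_CPUSET_SIZE 32	/* hypothetical sizeof(cpuset_t), in bytes */

/*
 * Get path: accept a user buffer at least as large as the kernel set,
 * copy the kernel mask and zero the rest of the output buffer.
 */
int
toy_getaffinity(const unsigned char *kmask, unsigned char *ubuf, size_t ulen)
{
	assert(kmask != NULL && ubuf != NULL);
	if (ulen < TOY_CPUSET_SIZE)
		return (EINVAL);	/* assumption: as on Linux */
	memcpy(ubuf, kmask, TOY_CPUSET_SIZE);
	memset(ubuf + TOY_CPUSET_SIZE, 0, ulen - TOY_CPUSET_SIZE);
	return (0);
}

/*
 * Set path: accept a user mask shorter than the kernel set and zero the
 * remaining bits of the kernel mask.
 */
int
toy_setaffinity(unsigned char *kmask, const unsigned char *ubuf, size_t ulen)
{
	assert(kmask != NULL && ubuf != NULL);
	if (ulen > TOY_CPUSET_SIZE)
		return (EINVAL);	/* assumption: not covered by the text */
	memcpy(kmask, ubuf, ulen);
	memset(kmask + ulen, 0, TOY_CPUSET_SIZE - ulen);
	return (0);
}
```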
Alan Somers [Mon, 3 Jan 2022 01:00:30 +0000 (18:00 -0700)]
geom_gate: ensure readprov is null-terminated
With crafted input to the G_GATE_CMD_CREATE ioctl, geom_gate can be made
to print kernel memory to the system console, potentially revealing
sensitive data from whatever was previously in that memory page.
But but but: this is a case of the sys admin misconfiguring, and you'd
need root privileges to do this.
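The class of bug is a fixed-size character field copied in from userland
and later printed as a C string. A minimal sketch of the defensive pattern
follows; the struct, the field size and both helper names are hypothetical,
and the actual geom_gate fix may validate the field differently.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define TOY_PROV_SIZE 16	/* hypothetical field size */

struct toy_create_req {
	char readprov[TOY_PROV_SIZE];
};

/* Return 1 if the fixed-size field contains a NUL byte, 0 otherwise. */
int
field_is_terminated(const char *field, size_t size)
{
	return (memchr(field, '\0', size) != NULL);
}

/* Force termination so later string functions stay inside the field. */
void
force_terminate(char *field, size_t size)
{
	assert(field != NULL && size > 0);
	field[size - 1] = '\0';
}
```

Without one of these checks, printing readprov with a %s format would keep
reading past the end of the field until it happened to find a NUL byte.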
Robert Wing [Sun, 2 Jan 2022 21:07:18 +0000 (12:07 -0900)]
skip test case nvlist_send_recv__send_many_fds__dgram
If I'm not mistaken, the underlying sendmsg() for nvlist_send() is
failing with ENOBUFS. In turn, nvlist_recv() returns NULL because it
didn't receive the expected number of file descriptors.
Adjusting net.local.dgram.recvspace worked on my local machine, but on
CI the test still fails consistently.
Colin Percival [Thu, 30 Dec 2021 19:47:50 +0000 (11:47 -0800)]
Fix variable name: freq_khz -> freq
An earlier version of this code computed the TSC frequency in kHz.
When the code was changed to compute the frequency more accurately,
the variable name was not updated.
Jessica Clarke [Sun, 2 Jan 2022 20:55:49 +0000 (20:55 +0000)]
ufs: Avoid subobject overflow in snapshot expunge code
The code here tries to be smart and zeroes out both di_db and di_ib with
a single bzero call, thereby overrunning the di_db subobject. This is
fine on most architectures, if a little dodgy. However, on CHERI, the
compiler can optionally restrict the bounds on pointers to subobjects to
just that subobject, in order to mitigate intra-object buffer overflows,
and this is enabled in CheriBSD's pure-capability kernels.
Instead, use separate bzero calls for each array, and let the compiler
optimise it as it sees fit; even if it's not generating inline zeroing
code, Clang will happily optimise two consecutive bzero's to a single
larger call.
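A sketch of the per-array zeroing, using a toy structure rather than the
real UFS dinode. The array lengths and the struct name are hypothetical,
and memset() is used here in place of the kernel's bzero().

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define TOY_NDADDR 12	/* hypothetical number of direct blocks */
#define TOY_NIADDR 3	/* hypothetical number of indirect blocks */

struct toy_dinode {
	int64_t di_db[TOY_NDADDR];
	int64_t di_ib[TOY_NIADDR];
	int64_t di_after;	/* must not be touched */
};

/*
 * Zero each array through its own pointer and its own size, so that no
 * access runs past the end of the subobject it started in.
 */
void
clear_block_pointers(struct toy_dinode *dp)
{
	assert(dp != NULL);
	memset(dp->di_db, 0, sizeof(dp->di_db));
	memset(dp->di_ib, 0, sizeof(dp->di_ib));
}
```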
Jessica Clarke [Sun, 2 Jan 2022 20:55:36 +0000 (20:55 +0000)]
ufs: Rework shortlink handling to avoid subobject overflows
Shortlinks occupy the space of both di_db and di_ib when used. However,
everywhere that wants to read or write a shortlink takes a pointer to
di_db and promptly runs off the end of it into di_ib. This is fine on
most architectures, if a little dodgy. However, on CHERI, the compiler
can optionally restrict the bounds on pointers to subobjects to just
that subobject, in order to mitigate intra-object buffer overflows, and
this is enabled in CheriBSD's pure-capability kernels.
Instead, clean this up by inserting a union such that a new di_shortlink
can be added with the right size and element type, avoiding the need to
cast and allowing the use of the DIP macro to access the field. This
also mirrors how the ext2fs code implements extents support, with the
exact same structure other than having a uint32_t i_data[] instead of a
char di_shortlink[].
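The union layout can be sketched like this. It is a toy model, not the real
dinode definition: the array lengths, the struct name and set_shortlink()
are hypothetical, and it relies on C11 anonymous structs and unions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define TOY_NDADDR 12	/* hypothetical number of direct blocks */
#define TOY_NIADDR 3	/* hypothetical number of indirect blocks */

struct toy_inode {
	union {
		struct {
			int64_t di_db[TOY_NDADDR];
			int64_t di_ib[TOY_NIADDR];
		};
		/* Same storage, with the right size and element type. */
		char di_shortlink[(TOY_NDADDR + TOY_NIADDR) * sizeof(int64_t)];
	};
};

/*
 * Store a NUL-terminated link target in the shortlink area.
 * Returns 0 on success, -1 if the target does not fit.
 */
int
set_shortlink(struct toy_inode *ip, const char *target)
{
	size_t len;

	assert(ip != NULL && target != NULL);
	len = strlen(target);
	if (len >= sizeof(ip->di_shortlink))
		return (-1);
	memcpy(ip->di_shortlink, target, len + 1);
	return (0);
}
```

Because di_shortlink is a member in its own right, a pointer to it carries
bounds that cover the whole area, and no cast from di_db is needed.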
Doug Moore [Sun, 2 Jan 2022 18:37:05 +0000 (12:37 -0600)]
busdma: _bus_dmamap_addseg repaired
A recent change introduced an off-by-one error into the test that
allows coalescing chunks into segments, which broke a check in
_bus_dmamap_addseg on many architectures. This change fixes that error
and makes it clear that it is not a particular range that is being
boundary-checked, but the proposed union of the two adjacent ranges.
Reported by: se
Reviewed by: se
Fixes: c606ab59e7f9 vm_extern: use standard address checkers everywhere
Differential Revision: https://reviews.freebsd.org/D33715
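The idea of boundary-checking the union of two adjacent ranges can be
sketched as below. This is an illustration, not the busdma code: both
function names are hypothetical, a boundary of 0 is taken to mean "no
restriction", and other constraints such as a maximum segment size are
left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* True if [start, start + size) lies inside one boundary-sized window. */
static bool
fits_in_window(uint64_t start, uint64_t size, uint64_t boundary)
{
	uint64_t last;

	assert(size != 0);
	if (boundary == 0)
		return (true);
	assert((boundary & (boundary - 1)) == 0);
	last = start + size - 1;	/* last byte, not one past the end */
	return (((start ^ last) & ~(boundary - 1)) == 0);
}

/*
 * A chunk [addr, addr + len) may be merged into the segment
 * [seg_start, seg_start + seg_len) only if it starts exactly where the
 * segment ends and the merged range still fits in one window.
 */
bool
can_coalesce(uint64_t seg_start, uint64_t seg_len, uint64_t addr,
    uint64_t len, uint64_t boundary)
{
	if (addr != seg_start + seg_len)
		return (false);
	return (fits_in_window(seg_start, seg_len + len, boundary));
}
```

With a 0x1000 boundary, a segment at 0x0f00 of length 0x80 can absorb a
0x80-byte chunk at 0x0f80, since the union ends at byte 0x0fff, but not a
0x81-byte chunk, whose union would reach into the next window.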
Warner Losh [Sun, 2 Jan 2022 07:32:30 +0000 (00:32 -0700)]
iicbb: Always build ofw_bus_if.h
Always make ofw_bus_if.h. While it's only used when option FDT is in the
kernel, it can always be generated. In theory we could omit it if option
FDT isn't present, but none of the rest of sys/modules does that. That
fine-grained control likely won't be reliable w/o a redesign of the
kernel/module config system.
Bjoern A. Zeeb [Sat, 1 Jan 2022 18:08:31 +0000 (18:08 +0000)]
iwlwifi: clarify page update
Based on some feedback, clarify the man page regarding
- how to load the driver currently
- status of the driver with respect to iwm(4)
and leave a comment to (automatically) add a full list of chipsets
to the man page.
Sponsored by: The FreeBSD Foundation
MFC after: 3 days
Reviewed by: debdrup
Differential Revision: https://reviews.freebsd.org/D33713
Alan Somers [Thu, 2 Dec 2021 02:50:47 +0000 (19:50 -0700)]
fusefs: fix .. lookups when the parent has been reclaimed.
By default, FUSE file systems are assumed not to support lookups for "."
and "..". They must opt-in to that. To cope with this limitation, the
fusefs kernel module caches every fuse vnode's parent's inode number,
and uses that during VOP_LOOKUP for "..". But if the parent's vnode has
been reclaimed, that won't be possible. Previously we panicked in this
situation. Now, we'll return ESTALE instead. Or, if the file system
has opted into ".." lookups, we'll just do that instead.
This commit also fixes VOP_LOOKUP to respect the cache timeout for ".."
lookups, if the FUSE file system specified a finite timeout.
Alan Somers [Thu, 2 Dec 2021 02:38:04 +0000 (19:38 -0700)]
fusefs: in the tests, always assume debug.try_reclaim_vnode is available
In an earlier version of the revision that created that sysctl (D20519)
the sysctl was gated by INVARIANTS, so the test had to check for it.
But in the committed version it is always available.
Alan Somers [Mon, 29 Nov 2021 02:17:34 +0000 (19:17 -0700)]
Fix a race in fusefs that can corrupt a file's size.
VOPs like VOP_SETATTR can change a file's size, with the vnode
exclusively locked. But VOPs like VOP_LOOKUP look up the file size from
the server without the vnode locked. So a race is possible. For
example:
1) One thread calls VOP_SETATTR to truncate a file. It locks the vnode
and sends FUSE_SETATTR to the server.
2) A second thread calls VOP_LOOKUP and fetches the file's attributes from
the server. Then it blocks trying to acquire the vnode lock.
3) FUSE_SETATTR returns and the first thread releases the vnode lock.
4) The second thread acquires the vnode lock and caches the file's
attributes, which are now out-of-date.
Fix this race by recording a timestamp in the vnode of the last time
that its filesize was modified. Check that timestamp during VOP_LOOKUP
and VFS_VGET. If it's newer than the time at which FUSE_LOOKUP was
issued to the server, ignore the attributes returned by FUSE_LOOKUP.