Fix error return of kern.ipc.posix_shm_list, which caused it (and thus
"posixshmcontrol ls") to fail for all jails that didn't happen to own
the last shm object in the list.
Andrew Turner [Thu, 23 Sep 2021 15:00:55 +0000 (15:00 +0000)]
Add the arm64 table attributes and use them
Add the table page table attributes on arm64 and use them to add
restrictions to the block and page entries below them. This ensures
we are unable to increase the permissions in these last level entries
without also changing them in the upper levels.
Use the attributes to ensure the kernel can't execute from userspace
memory and vice versa, userspace has no access to read or write kernel
memory, and that the DMAP region is non-executable.
Reviewed by: alc, kib
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D32081
Warner Losh [Wed, 29 Sep 2021 16:19:51 +0000 (10:19 -0600)]
f00f: We don't need giant to create IDT for workaround.
We don't need to assert we have Giant here. All machines that require
the F00F workaround are UP and interrupts are disabled. Since we are
single threaded, it's safe to allocate the IDT area with pmap_trm_alloc,
interact with the current idt table and replace the IDT table without
any Giant locking.
Warner Losh [Wed, 29 Sep 2021 15:21:17 +0000 (09:21 -0600)]
loader: create separate man pages for each of the loaders
Create a man page per loader. Loader(8) will have information common to
all of them, while loader_${INTERP}(8) will have information relevant to
that specific loader. Rewrite loader(8) to give an overview and point to
the appropriate man page. Rewrite each of the loader_${INTER}(8) man
pages to contain only the relevant information to that loader. Put all
the common commands, environment variables, etc in loader_simp(8) and
refernce that from the loader_lua or loader_4th man pages. The
loader_lua(8) could use more details about the Lua
integration. Additional organization may be benefitial.
sdhci_xenon: split driver file into generic file and fdt parts
This patch splits driver code into two seperate files sdhci_xenon.c
and sdhci_xenon_fdt.c. This will allow future implementation of ACPI
discovery of sdhci on Xenon chips.
Ed Maste [Wed, 29 Sep 2021 13:59:10 +0000 (09:59 -0400)]
mgb: Use MGB_DEBUG instead of DEBUG
The debug register dump routine is not hooked up and is really only
useful to driver developers, so put it under an mgb-specific MGB_DEBUG
rather than general DEBUG.
MFC after: 1 week
Fixes: 8890ab7758b8 ("Introduce if_mgb driver...")
Sponsored by: The FreeBSD Foundation
Use atomic counters to ensure that we correctly track the number of half
open states and syncookie responses in-flight.
This determines if we activate or deactivate syncookies in adaptive
mode.
mmc: Fix regression in 8a8166e5bcfb breaking Stratix 10 boot
The refactoring in 8a8166e5bcfb introduced a functional change that
breaks booting on the Stratix 10, hanging when it should be attaching
da0. Previously OF_getencprop was called with a pointer to host->f_max,
so if it wasn't present then the existing value was left untouched, but
after that commit it will instead clobber the value with 0. The dwmmc
driver, as used on the Stratix 10, sets a default value before calling
mmc_fdt_parse and so was broken by this functional change. It appears
that aw_mmc also does the same thing, so was presumably also broken on
some boards.
Coherent is lower 32bit only by default in Linux and our only default
dma mask is 64bit currently which violates expectations unless
dma_set_coherent_mask() was called explicitly with a different mask.
Implement coherent by creating a second tag, and storing the tags in the
objects and use the tag from the object wherever possible.
This currently does not update the scatterlist or pool (both could be
converted but S/G cannot be MFCed as easily).
There is a 2nd change embedded in the updated logic of
linux_dma_alloc_coherent() to always zero the allocation as
otherwise some drivers get cranky on uninialised garbage.
Sponsored by: The FreeBSD Foundation
MFC after: 7 days
Reviewed by: hselasky
Differential Revision: https://reviews.freebsd.org/D32164
mvneta_find_ethernet_prop_switch() is file-local static to
if_mvneta_fdt.c. Normally we would not need a function declararion
but in case MVNETA_DEBUG is set it becomes public. Move the
function declaration from if_mvneta.c to if_mvneta_fdt.c to avoid
a warning during each compile.
Warner Losh [Wed, 29 Sep 2021 03:21:50 +0000 (21:21 -0600)]
nvme: Sanity check completion id
Make sure the completion ID is in the range of [0..num_trackers) since
the values past the end of the act_tr array are never going to be valid
trackers and will lead to pain and suffering if we try to dereference
them to get the tracker or to set the tracker back to NULL as we
complete the I/O.
Warner Losh [Wed, 29 Sep 2021 03:11:17 +0000 (21:11 -0600)]
nvme: Add sanity check for phase on startup.
The proper phase for the qpiar right after reset in the first interrupt
is 1. For it, make sure that we're not still in phase 0. This is an
illegal state to be processing interrupts and indicates that we've
failed to properly protect against a race between initializing our state
and processing interrupts. Modify stat resetting code so it resets the
number of interrpts to 1 instead of 0 so we don't trigger a false
positive panic.
Warner Losh [Wed, 29 Sep 2021 03:09:45 +0000 (21:09 -0600)]
nvme: start qpair in state RECOVERY_WAITING
An interrupt happens on the admin queue right away after the reset, so
as soon as we enable interrupts, we'll get a call to our interrupt
handler. It is safe to ignore this interrupt if we're not yet
initialized, or to process it if we are. If we are initialized, we'll
see there's no completion records and return. If we're not, we'll
process no completion records and return. Either way, nothing is
processed and nothing is lost.
Until we've completely setup the qpair, we need to avoid processing
completion records. Start the qpair in the waiting recovery state so we
return immediately when we try to process completions. The code already
sets it to 'NONE' when we're initialization is complete. It's safe to
defer completion processing here because we don't send any commands
before the initialization of the software state of the qpair is
complete. And even if we were to somehow send a command prior to that
completing, the completion record for that command would be processed
when we send commands to the admin qpair after we've setup the software
state. There's no good central point to add an assert for this last
condition.
This fixes an KASSERT "received completion for unknown cmd" panic on
boot.
Colin Percival [Tue, 28 Sep 2021 18:39:59 +0000 (11:39 -0700)]
loader: Set twiddle globaldiv to 16 by default
Booting FreeBSD on an EC2 c5.xlarge instance, the loader "twiddles"
810 times over the course of 510 ms, a rate of 1.59 kHz. Even accepting
that many systems are slower than this particular VM and will take
longer to boot (especially if using spinning-rust disks), this seems
like an unhelpfully large amount of twiddling when compared to the
~60 Hz frame rate of many displays; printing the twiddles also consumes
roughly 10% of the boot time on the aforementioned VM.
Setting the default globaldiv to 16 dramatically reduces the time spent
printing twiddles to the console while still twiddling at roughly 100
Hz; this should be ample even for systems which take longer to boot and
consequently twiddle slower.
Note that this can adjusted via the twiddle_divisor variable in
loader.conf, but that file is not processed until nearly halfway
through the loader's runtime.
Ed Maste [Tue, 28 Sep 2021 20:27:28 +0000 (16:27 -0400)]
mgb: Update man page wrt state of the driver
Be explicit that the driver has caveats and limitations, and remove the
note about not being connected to the build: I plan to connect it soon.
(Also the note serves no real purpose in a man page that is not
installed.)
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
Ian Lepore [Tue, 28 Sep 2021 19:29:10 +0000 (13:29 -0600)]
Fix busdma resource leak on usb device detach.
When a usb device is detached, usb_pc_dmamap_destroy() called
bus_dmamap_destroy() while the map was still loaded. That's harmless on x86
architectures, but on all other platforms it causes bus_dmamap_destroy() to
return EBUSY and leak away any memory resources (including bounce buffers)
associated with the mapping, as well as any allocated map structure itself.
This change introduces a new is_loaded flag to the usb_page_cache struct to
track whether a map is loaded or not. If the map is loaded,
bus_dmamap_unload() is called before bus_dmamap_destroy() to avoid leaking
away resources.
MFC after: 7 days
Differential Revision: https://reviews.freebsd.org/D32208
Fix memory deadlock when GELI partition is used for swap.
When we get low on memory, the VM system tries to free some by swapping
pages. However, if we are so low on free pages that GELI allocations block,
then the swapout operation cannot complete. This keeps the VM system from
being able to free enough memory so the allocation can complete.
To alleviate this, keep a UMA pool at the GELI layer which is used for data
buffer allocation in the fast path, and reserve some of that memory for swap
operations. If an IO operation is a swap, then use the reserved memory. If
the allocation still fails, return ENOMEM instead of blocking.
For non-swap allocations, change the default to using M_NOWAIT. In general,
this *should* be better, since it gives upper layers a signal of the memory
pressure and a chance to manage their failure strategy appropriately. However,
a user can set the kern.geom.eli.blocking_malloc sysctl/tunable to restore
the previous M_WAITOK strategy.
Externalize nsw_cluster_max and initialize it early.
GEOM_ELI needs to know the value, cause it will soon have special
memory handling for IO operations associated with swap.
Move initialization to swap_pager_init(), which is executed at
SI_SUB_VM, unlike swap_pager_swap_init(), which would be executed
only when a swap is configured. GEOM_ELI might need the value at
SI_SUB_DRIVERS, when disks are tasted by GEOM.
Ian Lepore [Tue, 28 Sep 2021 18:19:44 +0000 (12:19 -0600)]
qoriq_therm.c: avoid a segfault on the error exit path.
If anything goes wrong during attach() it is handled with a 'goto fail'
which calls sysctl_ctx_free(). But the sysctl context doesn't get
initialized until very late in attach(), so almost any error just results
in a segfault. Move the sysctl_ctx_init() call to the beginning of the
attach() function, so that it is done before any errors can happen that
will lead to freeing the context.
Ed Maste [Tue, 28 Sep 2021 16:06:04 +0000 (12:06 -0400)]
mgb: enable multicast in mgb_init
Receive Filtering Engine (RFE) configuration is not yet implemented,
and mgb intended to enable all broadcast, multicast, and unicast.
However, MGB_RFE_ALLOW_MULTICAST was missed (MGB_RFE_ALLOW_UNICAST was
included twice).
MFC after: 1 week
Fixes: 8890ab7758b8 ("Introduce if_mgb driver...")
Sponsored by: The FreeBSD Foundation
This function was renamed to kern_reboot() in 2010, but the man page has
failed to keep in sync. Bring it up to date on the rename, add the
shutdown hooks to the synopsis, and document the (obvious) fact that
kern_reboot() does not return.
Fix an outdated reference to the old name in kern_reboot(), and leave a
reference to the man page so future readers might find it before any
large changes.
Reviewed by: imp, markj
MFC after: 3 days
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D32085
Michael Tuexen [Tue, 28 Sep 2021 03:25:58 +0000 (05:25 +0200)]
sctp: fix usage of stream scheduler functions
sctp_ss_scheduled() should only be called for streams that are
scheduled. So call sctp_ss_remove_from_stream() before it.
This bug was uncovered by the earlier cleanup.
Andrew Turner [Tue, 28 Sep 2021 11:36:42 +0000 (12:36 +0100)]
Use mtx_lock_spin in the gic driver
The mutex was changed to a spin lock when the MSI/MSI-X handling was
moved from the gicv2m to the gic driver. Update the calls to lock
and unlock the mutex to the spin variant.
Submitted by: jrtc27 ("Change all the mtx_(un)lock(&sc->mutex) to be the _spin versions.")
Reported by: mw, antranigv@freebsd.am
Sponsored by: The FreeBSD Foundation
Avoid "consumer not attached in g_io_request" panic when disk lost
while using a UFS snapshot.
The UFS filesystem supports snapshots. Each snapshot is a file whose
contents are a frozen image of the disk partition on which the filesystem
resides. Each time an existing block in the filesystem is modified,
the filesystem checks whether that block was in use at the time that
the snapshot was taken. If so, and if it has not already been copied,
a new block is allocated from among the blocks that were not in use
at the time that the snapshot was taken and placed in the snapshot file
to replace the entry that has not yet been copied. The previous contents
of the block are copied to the newly allocated snapshot file block,
and the write to the original is then allowed to proceed.
The block allocation is done using the usual UFS_BALLOC() routine
which allocates the needed block in the snapshot and returns a
buffer that is set up to write data into the newly allocated block.
In usual filesystem operation, the contents for the new block is
copied from user space into the buffer and the buffer is then written
to the file using bwrite(), bawrite(), or bdwrite(). In the case of a
snapshot the new block must be filled from the disk block that is about
to be rewritten. The snapshot routine has a function readblock() that
it uses to read the `about to be rewritten' disk block.
/*
* Read the specified block into the given buffer.
*/
static int
readblock(snapvp, bp, lbn)
struct vnode *snapvp;
struct buf *bp;
ufs2_daddr_t lbn;
{
struct inode *ip;
struct bio *bip;
struct fs *fs;
When the underlying disk fails, its GEOM module is removed.
Subsequent attempts to access it should return the ENXIO error.
The functionality of checking for the lost disk and returning
ENXIO is handled by the g_vfs_strategy() routine:
Only after checking that the device is present does it construct
the "bio" request and call g_io_request(). When readblock()
constructs its own "bio" request and calls g_io_request() directly
it panics with "consumer not attached in g_io_request" when the
underlying device no longer exists.
The fix is to have readblock() call g_vfs_strategy() rather than
constructing its own "bio" request:
/*
* Read the specified block into the given buffer.
*/
static int
readblock(snapvp, bp, lbn)
struct vnode *snapvp;
struct buf *bp;
ufs2_daddr_t lbn;
{
struct inode *ip;
struct fs *fs;
Here it uses the buffer that will eventually be written to the disk.
The g_vfs_strategy() routine uses four parts of the buffer: b_bcount,
b_iocmd, b_iooffset, and b_data.
The b_bcount field is already correctly set for the buffer. It is
safe to set the b_iocmd and b_iooffset fields as they are set
correctly when the later write is done. The write path will also
clear the B_DONE flag that our use of the buffer will set. The
b_iodone callback has to be set to bdone() which will do just
notification that the I/O is done in bufdone(). The rest of
bufdone() includes things like processing the softdeps associated
with the buffer should not be done until the buffer has been
written. Bufdone() will set b_iodone back to NULL after using it,
so the full bufdone() processing will be done when the buffer is
written. The final change from the previous version of readblock()
is that it used the b_data for the destination of the read while
g_vfs_strategy() uses the bdata2bio() function to take advantage
of VMIO when it is available.
LinuxKPI: dma-mapping.h unify "mask" and "dma_mask"
In some places we are using "mask" and others "dma_mask" for the
same thing. Harmonize the various places to "dma_mask" as used in
linux_pci.c. For the declaration remove the argument names to
avoid the entire problem.
This is in preparation for an upcoming change.
No functional changes intended.
Sponsored by: The FreeBSD Foundation
MFC after: 5 days
David Bright [Mon, 27 Sep 2021 13:18:46 +0000 (06:18 -0700)]
ntb_hw_intel: fix xeon NTB gen3 bar disable logic
In NTB gen3 driver, it was supposed to disable NTB bar access by
default, but due to incorrect register access method, the bar disable
logic does not work as expected. Those registers should be modified
through NTB bar0 rather than PCI configuration space.
Besides, we'd better to protect ourselves from a bad buddy node so
ingress disable logic should be implemented together.
As reported by multiple people testing iwlwifi, device_release_driver()
can lead to a panic on secondary errors (usually during attach).
Disable device_release_driver() for the short-term to prevent the panic
but leave it in place so it can be re-worked and fixed properly for
the long-term more easily.
Sponsored by: The FreeBSD Foundation
MFC after: 3 days
rman: fix overflow in rman_reserve_resource_bound()
If the default range of [0, ~0] is given, then (~0 - 0) + 1 == 0. This
in turn will cause any allocation of non-zero size to fail. Zero-sized
allocations are prohibited, so add a KASSERT to this effect.
History indicates it is part of the original rman code. This bug may in
fact be older than some contributors.
In D21096 BUS_TRANSLATE_RESOURCE was introduced to allow LinuxKPI to get
physical addresses in pci_resource_start for PowerPC and implemented
in ofw_pci.
When the translation was implemented in pci_host_generic in 372c142b4fc,
this method was not implemented; instead a local static function was
added for a similar purpose.
Rename the static function to "_common" and implement the bus function
as a wrapper around that. With this a LinuxKPI driver using
physical addresses correctly finds the configuration registers of
the GPU.
This unbreaks amdgpu on NXP Layerscape LX2160A SoC (SolidRun HoneyComb
LX2K workstation) which has a Translation Offset in ACPI for
below-4G PCI addresses.
Drop fpstate only after copying out xfpustate from the thread usermode
save area. Otherwise a context switch between get_fpcontext(), which now
returns the pointer directly into user save area, and copyout, would
cause reinit of the save area, loosing user registers.
Reported, reviewed, and tested by: markj
Sponsored by: The FreeBSD Foundation
MFC after: 3 days
Differential revision: https://reviews.freebsd.org/D32159
Andrew Turner [Mon, 27 Sep 2021 11:22:15 +0000 (12:22 +0100)]
Check cpu_softc is not NULL before dereferencing
In the acpi_cpu_postattach SYSINIT function cpu_softc may be NULL, e.g.
on arm64 when booting from FDT. Check it is not NULL at the start of
the function so we don't try to dereference a NULL pointer.
Rick Macklem [Mon, 27 Sep 2021 01:37:25 +0000 (18:37 -0700)]
nfscl: Add a check for "has acquired a delegation" to nfscl_removedeleg()
Commit 5e5ca4c8fc53 added a flag to a NFSv4 mount point that is set when
the first delegation is acquired from the NFSv4 server.
For a common case where delegations are not being issued by the
NFSv4 server, the nfscl_removedeleg() code acquires the mutex lock for
open/lock state, finds the delegation list empty, then just unlocks the
mutex and returns. This patch adds a check of the flag to avoid the
need to acquire the mutex for this common case.
This change appears to be performance neutral for a small number
of opens, but should reduce lock contention for a large number of opens
for the common case where server is not issuing delegations.
This commit should not affect the high level semantics of delegation
handling.
Alexander Motin [Sun, 26 Sep 2021 16:03:05 +0000 (12:03 -0400)]
sched_ule(4): Fix hang with steal_thresh < 2.
e745d729be60 caused infinite loop with interrupts disabled in load
stealing code if steal_thresh set below 2. Such configuration should
not generally be used, but appeared some people are using it to
workaround some problems.
To fix the problem explicitly pass to sched_highest() minimum number
of transferrable threads, supported by the caller, instead of guessing.
John Baldwin [Sat, 25 Sep 2021 18:24:35 +0000 (11:24 -0700)]
kernel: Disable errors for -Walloca-larger-than for GCC.
GCC complains about the use of alloca() with variable sizes (for XSAVE
state len) in sendsig() for i386. Modern XSAVE state is probably
getting a bit large for the i386 kstack, but downgrade the error to a
warning.