Colin Percival [Tue, 28 Sep 2021 18:39:59 +0000 (11:39 -0700)]
loader: Set twiddle globaldiv to 16 by default
Booting FreeBSD on an EC2 c5.xlarge instance, the loader "twiddles"
810 times over the course of 510 ms, a rate of 1.59 kHz. Even accepting
that many systems are slower than this particular VM and will take
longer to boot (especially if using spinning-rust disks), this seems
like an unhelpfully large amount of twiddling when compared to the
~60 Hz frame rate of many displays; printing the twiddles also consumes
roughly 10% of the boot time on the aforementioned VM.
Setting the default globaldiv to 16 dramatically reduces the time spent
printing twiddles to the console while still twiddling at roughly 100
Hz; this should be ample even for systems which take longer to boot and
consequently twiddle slower.
Note that this can adjusted via the twiddle_divisor variable in
loader.conf, but that file is not processed until nearly halfway
through the loader's runtime.
Ed Maste [Tue, 28 Sep 2021 20:27:28 +0000 (16:27 -0400)]
mgb: Update man page wrt state of the driver
Be explicit that the driver has caveats and limitations, and remove the
note about not being connected to the build: I plan to connect it soon.
(Also the note serves no real purpose in a man page that is not
installed.)
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
Ian Lepore [Tue, 28 Sep 2021 19:29:10 +0000 (13:29 -0600)]
Fix busdma resource leak on usb device detach.
When a usb device is detached, usb_pc_dmamap_destroy() called
bus_dmamap_destroy() while the map was still loaded. That's harmless on x86
architectures, but on all other platforms it causes bus_dmamap_destroy() to
return EBUSY and leak away any memory resources (including bounce buffers)
associated with the mapping, as well as any allocated map structure itself.
This change introduces a new is_loaded flag to the usb_page_cache struct to
track whether a map is loaded or not. If the map is loaded,
bus_dmamap_unload() is called before bus_dmamap_destroy() to avoid leaking
away resources.
MFC after: 7 days
Differential Revision: https://reviews.freebsd.org/D32208
Fix memory deadlock when GELI partition is used for swap.
When we get low on memory, the VM system tries to free some by swapping
pages. However, if we are so low on free pages that GELI allocations block,
then the swapout operation cannot complete. This keeps the VM system from
being able to free enough memory so the allocation can complete.
To alleviate this, keep a UMA pool at the GELI layer which is used for data
buffer allocation in the fast path, and reserve some of that memory for swap
operations. If an IO operation is a swap, then use the reserved memory. If
the allocation still fails, return ENOMEM instead of blocking.
For non-swap allocations, change the default to using M_NOWAIT. In general,
this *should* be better, since it gives upper layers a signal of the memory
pressure and a chance to manage their failure strategy appropriately. However,
a user can set the kern.geom.eli.blocking_malloc sysctl/tunable to restore
the previous M_WAITOK strategy.
Externalize nsw_cluster_max and initialize it early.
GEOM_ELI needs to know the value, cause it will soon have special
memory handling for IO operations associated with swap.
Move initialization to swap_pager_init(), which is executed at
SI_SUB_VM, unlike swap_pager_swap_init(), which would be executed
only when a swap is configured. GEOM_ELI might need the value at
SI_SUB_DRIVERS, when disks are tasted by GEOM.
Ian Lepore [Tue, 28 Sep 2021 18:19:44 +0000 (12:19 -0600)]
qoriq_therm.c: avoid a segfault on the error exit path.
If anything goes wrong during attach() it is handled with a 'goto fail'
which calls sysctl_ctx_free(). But the sysctl context doesn't get
initialized until very late in attach(), so almost any error just results
in a segfault. Move the sysctl_ctx_init() call to the beginning of the
attach() function, so that it is done before any errors can happen that
will lead to freeing the context.
Ed Maste [Tue, 28 Sep 2021 16:06:04 +0000 (12:06 -0400)]
mgb: enable multicast in mgb_init
Receive Filtering Engine (RFE) configuration is not yet implemented,
and mgb intended to enable all broadcast, multicast, and unicast.
However, MGB_RFE_ALLOW_MULTICAST was missed (MGB_RFE_ALLOW_UNICAST was
included twice).
MFC after: 1 week
Fixes: 8890ab7758b8 ("Introduce if_mgb driver...")
Sponsored by: The FreeBSD Foundation
This function was renamed to kern_reboot() in 2010, but the man page has
failed to keep in sync. Bring it up to date on the rename, add the
shutdown hooks to the synopsis, and document the (obvious) fact that
kern_reboot() does not return.
Fix an outdated reference to the old name in kern_reboot(), and leave a
reference to the man page so future readers might find it before any
large changes.
Reviewed by: imp, markj
MFC after: 3 days
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D32085
Michael Tuexen [Tue, 28 Sep 2021 03:25:58 +0000 (05:25 +0200)]
sctp: fix usage of stream scheduler functions
sctp_ss_scheduled() should only be called for streams that are
scheduled. So call sctp_ss_remove_from_stream() before it.
This bug was uncovered by the earlier cleanup.
Andrew Turner [Tue, 28 Sep 2021 11:36:42 +0000 (12:36 +0100)]
Use mtx_lock_spin in the gic driver
The mutex was changed to a spin lock when the MSI/MSI-X handling was
moved from the gicv2m to the gic driver. Update the calls to lock
and unlock the mutex to the spin variant.
Submitted by: jrtc27 ("Change all the mtx_(un)lock(&sc->mutex) to be the _spin versions.")
Reported by: mw, antranigv@freebsd.am
Sponsored by: The FreeBSD Foundation
Avoid "consumer not attached in g_io_request" panic when disk lost
while using a UFS snapshot.
The UFS filesystem supports snapshots. Each snapshot is a file whose
contents are a frozen image of the disk partition on which the filesystem
resides. Each time an existing block in the filesystem is modified,
the filesystem checks whether that block was in use at the time that
the snapshot was taken. If so, and if it has not already been copied,
a new block is allocated from among the blocks that were not in use
at the time that the snapshot was taken and placed in the snapshot file
to replace the entry that has not yet been copied. The previous contents
of the block are copied to the newly allocated snapshot file block,
and the write to the original is then allowed to proceed.
The block allocation is done using the usual UFS_BALLOC() routine
which allocates the needed block in the snapshot and returns a
buffer that is set up to write data into the newly allocated block.
In usual filesystem operation, the contents for the new block is
copied from user space into the buffer and the buffer is then written
to the file using bwrite(), bawrite(), or bdwrite(). In the case of a
snapshot the new block must be filled from the disk block that is about
to be rewritten. The snapshot routine has a function readblock() that
it uses to read the `about to be rewritten' disk block.
/*
* Read the specified block into the given buffer.
*/
static int
readblock(snapvp, bp, lbn)
struct vnode *snapvp;
struct buf *bp;
ufs2_daddr_t lbn;
{
struct inode *ip;
struct bio *bip;
struct fs *fs;
When the underlying disk fails, its GEOM module is removed.
Subsequent attempts to access it should return the ENXIO error.
The functionality of checking for the lost disk and returning
ENXIO is handled by the g_vfs_strategy() routine:
Only after checking that the device is present does it construct
the "bio" request and call g_io_request(). When readblock()
constructs its own "bio" request and calls g_io_request() directly
it panics with "consumer not attached in g_io_request" when the
underlying device no longer exists.
The fix is to have readblock() call g_vfs_strategy() rather than
constructing its own "bio" request:
/*
* Read the specified block into the given buffer.
*/
static int
readblock(snapvp, bp, lbn)
struct vnode *snapvp;
struct buf *bp;
ufs2_daddr_t lbn;
{
struct inode *ip;
struct fs *fs;
Here it uses the buffer that will eventually be written to the disk.
The g_vfs_strategy() routine uses four parts of the buffer: b_bcount,
b_iocmd, b_iooffset, and b_data.
The b_bcount field is already correctly set for the buffer. It is
safe to set the b_iocmd and b_iooffset fields as they are set
correctly when the later write is done. The write path will also
clear the B_DONE flag that our use of the buffer will set. The
b_iodone callback has to be set to bdone() which will do just
notification that the I/O is done in bufdone(). The rest of
bufdone() includes things like processing the softdeps associated
with the buffer should not be done until the buffer has been
written. Bufdone() will set b_iodone back to NULL after using it,
so the full bufdone() processing will be done when the buffer is
written. The final change from the previous version of readblock()
is that it used the b_data for the destination of the read while
g_vfs_strategy() uses the bdata2bio() function to take advantage
of VMIO when it is available.
LinuxKPI: dma-mapping.h unify "mask" and "dma_mask"
In some places we are using "mask" and others "dma_mask" for the
same thing. Harmonize the various places to "dma_mask" as used in
linux_pci.c. For the declaration remove the argument names to
avoid the entire problem.
This is in preparation for an upcoming change.
No functional changes intended.
Sponsored by: The FreeBSD Foundation
MFC after: 5 days
David Bright [Mon, 27 Sep 2021 13:18:46 +0000 (06:18 -0700)]
ntb_hw_intel: fix xeon NTB gen3 bar disable logic
In NTB gen3 driver, it was supposed to disable NTB bar access by
default, but due to incorrect register access method, the bar disable
logic does not work as expected. Those registers should be modified
through NTB bar0 rather than PCI configuration space.
Besides, we'd better to protect ourselves from a bad buddy node so
ingress disable logic should be implemented together.
As reported by multiple people testing iwlwifi, device_release_driver()
can lead to a panic on secondary errors (usually during attach).
Disable device_release_driver() for the short-term to prevent the panic
but leave it in place so it can be re-worked and fixed properly for
the long-term more easily.
Sponsored by: The FreeBSD Foundation
MFC after: 3 days
rman: fix overflow in rman_reserve_resource_bound()
If the default range of [0, ~0] is given, then (~0 - 0) + 1 == 0. This
in turn will cause any allocation of non-zero size to fail. Zero-sized
allocations are prohibited, so add a KASSERT to this effect.
History indicates it is part of the original rman code. This bug may in
fact be older than some contributors.
In D21096 BUS_TRANSLATE_RESOURCE was introduced to allow LinuxKPI to get
physical addresses in pci_resource_start for PowerPC and implemented
in ofw_pci.
When the translation was implemented in pci_host_generic in 372c142b4fc,
this method was not implemented; instead a local static function was
added for a similar purpose.
Rename the static function to "_common" and implement the bus function
as a wrapper around that. With this a LinuxKPI driver using
physical addresses correctly finds the configuration registers of
the GPU.
This unbreaks amdgpu on NXP Layerscape LX2160A SoC (SolidRun HoneyComb
LX2K workstation) which has a Translation Offset in ACPI for
below-4G PCI addresses.
Drop fpstate only after copying out xfpustate from the thread usermode
save area. Otherwise a context switch between get_fpcontext(), which now
returns the pointer directly into user save area, and copyout, would
cause reinit of the save area, loosing user registers.
Reported, reviewed, and tested by: markj
Sponsored by: The FreeBSD Foundation
MFC after: 3 days
Differential revision: https://reviews.freebsd.org/D32159
Andrew Turner [Mon, 27 Sep 2021 11:22:15 +0000 (12:22 +0100)]
Check cpu_softc is not NULL before dereferencing
In the acpi_cpu_postattach SYSINIT function cpu_softc may be NULL, e.g.
on arm64 when booting from FDT. Check it is not NULL at the start of
the function so we don't try to dereference a NULL pointer.
Rick Macklem [Mon, 27 Sep 2021 01:37:25 +0000 (18:37 -0700)]
nfscl: Add a check for "has acquired a delegation" to nfscl_removedeleg()
Commit 5e5ca4c8fc53 added a flag to a NFSv4 mount point that is set when
the first delegation is acquired from the NFSv4 server.
For a common case where delegations are not being issued by the
NFSv4 server, the nfscl_removedeleg() code acquires the mutex lock for
open/lock state, finds the delegation list empty, then just unlocks the
mutex and returns. This patch adds a check of the flag to avoid the
need to acquire the mutex for this common case.
This change appears to be performance neutral for a small number
of opens, but should reduce lock contention for a large number of opens
for the common case where server is not issuing delegations.
This commit should not affect the high level semantics of delegation
handling.
Alexander Motin [Sun, 26 Sep 2021 16:03:05 +0000 (12:03 -0400)]
sched_ule(4): Fix hang with steal_thresh < 2.
e745d729be60 caused infinite loop with interrupts disabled in load
stealing code if steal_thresh set below 2. Such configuration should
not generally be used, but appeared some people are using it to
workaround some problems.
To fix the problem explicitly pass to sched_highest() minimum number
of transferrable threads, supported by the caller, instead of guessing.
John Baldwin [Sat, 25 Sep 2021 18:24:35 +0000 (11:24 -0700)]
kernel: Disable errors for -Walloca-larger-than for GCC.
GCC complains about the use of alloca() with variable sizes (for XSAVE
state len) in sendsig() for i386. Modern XSAVE state is probably
getting a bit large for the i386 kstack, but downgrade the error to a
warning.
unzip: sync with NetBSD upstream to add passphrase support
- Add support for password protected zip archives.
We use memset_s() rather than explicit_bzero() for more portable
(See PR).
- Use success/failure macro in exit()
- Mention ZIPX format in unzip(1)
Submitted by: Mingye Wang and Alex Kozlov (ak@)
PR: 244181
Reviewed by: mizhka
Obtained from: NetBSD
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D28892
ng_ether: Create netgraph nodes for bridge interfaces.
Create netgraph nodes for bridge interfaces when the ng_ether module
is loaded. If a bridge interface is created after loading the ng_ether
module, a netgraph node is created via ether_ifattach().
mmc: fix 1-byte reallocs (when it should have been sizeof device_t)
Reported by KASAN:
panic: ASan: Invalid access, 8-byte write at 0xfffffe00f0992610, RedZonePartial(1)
panic() at panic+0xb5/frame 0xffffffff86a595b0
__asan_store8_noabort() at __asan_store8_noabort+0x376/frame 0xffffffff86a59670
mmc_go_discovery() at mmc_go_discovery+0x6c61/frame 0xffffffff86a5a790
mmc_delayed_attach() at mmc_delayed_attach+0x35/frame 0xffffffff86a5a7b0
[snip]
Mark Johnston [Sat, 25 Sep 2021 14:15:31 +0000 (10:15 -0400)]
amd64: Avoid copying td_frame from kernel procs
When creating a new thread, we unconditionally copy td_frame from the
creating thread. For threads which never return to user mode, this is
unnecessary since td_frame just points to the base of the stack or a
random interrupt frame.
If KASAN is configured this copying may also trigger false positives
since the td_frame region may contain poisoned stack regions. It was
not noticed before since thread0 used a dummy proc0_tf trapframe, and
kernel procs are generally created by thread0. Since commit df8dd6025af88a99d34f549fa9591a9b8f9b75b1, though, we call
cpu_thread_alloc(&thread0) when initializing FPU state, which
reinitializes thread0.td_frame.
Work around the problem by not copying the frame unless the copying
thread came from user mode. While here, de-duplicate the copying and
remove redundant re(initialization) of td_frame.
Reported by: syzbot+2ec89312bffbf38d9aec@syzkaller.appspotmail.com
Reviewed by: kib
Fixes: df8dd6025af8
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D32057
Mark Johnston [Sat, 25 Sep 2021 14:13:56 +0000 (10:13 -0400)]
cam: Avoiding waking up doneq threads if we're dumping
Depending on the state of the target doneq thread at the time of the
panic, the wakeup can hang indefinitely in thread_lock_block_wait().
That function should likely be modified to return immediately if the
scheduler is stopped, but it is also preferable to avoid wakeups in
general after a panic.
Reported by: pho
Reviewed by: mav, imp
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D32126
x86 bounce_bus_dmamem_alloc(): use malloc_aligned() only when possible
malloc_domainset_aligned() requires that alignment is less than
page size. Fall back to other allocation methods, most likely
kmem_alloc_contig(), when malloc_aligned() cannot fullfill the driver
request.
Reported by: Loic F <loic.f@hardenedbsd.org>
Reviewed by: markj
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D32127
For alignment we do not need to do anything to make it operational.
For size, upgrade zero sized request to one byte so that we do not
request insane amount of memory for placeholder.
Reviewed by: markj
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D32127
Marko Zec [Sat, 25 Sep 2021 04:29:48 +0000 (06:29 +0200)]
[fib_algo][dxr] Split unused range chunk list in multiple buckets
Traversing a single list of unused range chunks in search for a block
of optimal size was suboptimal.
The experience with real-world BGP workloads has shown that on average
unused range chunks are tiny, mostly in length from 1 to 4 or 5, when
DXR is configured with K = 20 which is the current default (D16X4R).
Therefore, introduce a limited amount of buckets to accomodate descriptors
of empty blocks of fixed (small) size, so that those can be found in O(1)
time. If no empty chunks of the requested size can be found in fixed-size
buckets, the search continues in an unsorted list of empty chunks of
variable lengths, which should only happen infrequently.
This change should permit us to manage significantly more empty range
chunks without sacrifying the speed of incremental range table updating.
Alexander Motin [Sat, 25 Sep 2021 03:25:46 +0000 (23:25 -0400)]
Make CPU children explicitly share parent unit numbers.
Before this device unit number match was coincidental and broke if I
disabled some CPU device(s). Aside of cosmetics, for some drivers
(may be considered broken) it caused talking to wrong CPUs.
Colin Percival [Sat, 25 Sep 2021 03:20:33 +0000 (20:20 -0700)]
loader printf: Profile with TSLOG
Now that the loader tslog code doesn't call printf, we can profile
printf using TSLOG. On an EC2 c5.xlarge instance, we spend roughly
45 ms here (out of roughly 500 ms), presumably due to the time spent
writing output to the console.
Kyle Evans [Wed, 27 Jan 2021 18:12:33 +0000 (12:12 -0600)]
makesyscalls: sprinkle some assert() on standard function calls
Improves our error reporting, ensuring that we aren't just ignoring
errors in the common case.
Note specifically the boundary where we have to change up our error
handling approach. It's fine to error() out up until we create the
tempdir, then the rest should try to handle it gracefully and abort().
A future change will clean this up further by pcall'ing all of the bits
that cannot currently error() without cleaning up.
This was previously needed only for CloudABI, which used it to generate
its capenabled from syscalls.master. CloudABI was removed in cf0ee8738e31, so we don't need to support this anymore. Others looking
to do similar things should come up with a more integrated technique,
such as a .conf flag or pattern/glob support. brooks suggests that it
could be done in modern makesyscalls.lua by adding a config flag to
specify always-on/initial flags (CAPENABLED).
Reviewed by: brooks, imp
MFC after: never
Differential Revision: https://reviews.freebsd.org/D32095
Alexander Motin [Sat, 25 Sep 2021 01:03:02 +0000 (21:03 -0400)]
acpi_cpu: Make device unit numbers match OS CPU IDs.
There are already APIC ID, ACPI ID and OS ID for each CPU. In perfect
world all of those may match, but at least for SuperMicro server boards
none of them do. Plus none of them match the CPU devices listing order
by ACPI. Previous code used the ACPI device listing order to number
cpuX devices. It looked nice from NewBus perspective, but introduced
4th different set of IDs. Extremely confusing one, since in some places
the device unit numbers were treated as OS CPU IDs (coretemp), but not
in others (sysctl dev.cpu.X.%location).
Alexander Motin [Sat, 25 Sep 2021 00:27:10 +0000 (20:27 -0400)]
bus: Cleanup device_probe_child()
When device driver probe method returns 0, i.e. absolute priority, do
not remove its class from the device just to set it back few lines
later, that may change the device unit number, etc. and after which
we'd better call the probe again.
If during search we found some driver with absolute priority, we do
not need to set device driver and class since we haven't removed them
before.
It should not happen, but if second probe method call failed, remove
the driver and possibly the class from the device as it was when we
started.
When WITHOUT_INET6 is selected we generate a null if-then-else blocks
due to incorrect placment of #if statments. Move the #if statements
reducing unnecessary runtime comparisons WITHOUT_INET6.