The current value causes significant artificial slowdown during mass
parallel file removal, which can be observed both on FreeBSD and Linux
when running real workloads.
Sample results from Linux doing make -j 96 clean after an allyesconfig
modules build:
before: 4.14s user 6.79s system 48% cpu 22.631 total
after: 4.17s user 6.44s system 153% cpu 6.927 total
Apply llvm fix for assertion/crash building math/vtk
Merge commit 307ace7f20d5 from llvm git (by David Sherwood):
[LoopVectorize] Ensure the VPReductionRecipe is placed after all it's inputs
When vectorising ordered reductions we call a function
LoopVectorizationPlanner::adjustRecipesForReductions to replace the
existing VPWidenRecipe for the fadd instruction with a new
VPReductionRecipe. We attempt to insert the new recipe in the same
place, but this is wrong because createBlockInMask may have
generated new recipes that VPReductionRecipe now depends upon. I
have changed the insertion code to append the recipe to the
VPBasicBlock instead.
Added a new RUN with tail-folding enabled to the existing test:
The SAN interceptors simply take a bus_space_tag_t, so we're
dropping qualifiers here. const semantics with a ptr typedef mean we'd
have to drop or change the bus_space_tag_t abstraction used in the SAN
sanitizers in order to make a compatible change there, which likely
isn't worth it.
Reviewed by: andrew, markj
Sponsored by: Juniper Networks, Inc.
Sponsored by: Klara, Inc.
Differential Revision: https://reviews.freebsd.org/D36764
Doug Moore [Tue, 27 Sep 2022 18:30:31 +0000 (13:30 -0500)]
sysctl_search_oid: remove useless tests
sysctl_search_old makes several tests in a loop that can be removed.
The first test in the loop is only ever true on the first loop
iteration, and is always true on that iteration, so its work can be
done before the loop begins.
The upper and lower bounds on the loop variable 'indx' are each tested
on each iteration, but 'indx' is changed in one direction or the other
only once within the loop, so only one bound needs to be checked.
Two ways remain in the loop that nodes[indx] can change (after one of
them is put before the loop start), and one of them applies exactly
when indx has been incremented, so no separate test for that case
requires testing.
Restructure and add comments that makes clearer that this is a basic
depth-first search.
Randall Stewart [Tue, 27 Sep 2022 17:38:20 +0000 (13:38 -0400)]
Tcp progress timeout
Rack has had the ability to timeout connections that just sit idle automatically. This
feature of course is off by default and requires the user set it on (though the socket option
has been missing in tcp_usrreq.c). Lets get the progress timeout fully supported in
the base stack as well as rack.
Specifically, when we receive a violation and we're configured to panic,
kasan_enabled gets unset before we descend into panic(). At this point,
there's no longer any reason to allow marking as kasan_shadow_check() is
disabled -- we have some inherent risk of faulting or panicking if the
system's in a bad enough state with no benefit.
Reviewed by: markj
Sponsored by: Juniper Networks, Inc.
Sponsored by: Klara, Inc.
Differential Revision: https://reviews.freebsd.org/D36742
When taking a snapshot on a UFS/FFS filesystem, it must be mounted.
The "update" mount option must be specified when the "snapshot"
mount option is used. Return EINVAL if the "snapshot" option is
specified without the "update" option also requested.
Reported by: Robert Morris
Reviewed by: kib
PR: 265362
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Andrew Turner [Tue, 23 Aug 2022 09:50:18 +0000 (10:50 +0100)]
arm64 pmap: batch chunk removal in pmap_remove_pages
As with amd64 batch chunk removal in pmap_remove_pages to move it out
of the pv list lock. This is one of the main contested locks when
running poudriere on a 160 core Ampere Altra server.
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D36305
if_epair: refactor interface creation and enqueue code.
* Factor out queue selection (epair_select_queue()) and mbuf
preparation (epair_prepare_mbuf()) from epair_menq(). It simplifies
epair_menq() implementation and reduces the amount of dependencies
on the neighbouring epair.
* Use dedicated epair_set_state() instead of 2-lines copy-paste
* Factor out unit selection code (epair_handle_unit()) from
epair_clone_create(). It simplifies the clone creation logic.
netinet6: factor interface addition code to the dedicated function
Summary:
Move SIOCAIFADDR_IN6 (current "primary" ioctl to add an IPv6
interface address) handling code to the dedicated in6_addifaddr()
function and make it a part of KPI. This allows in-kernel users to
add/delete interfaces addresses without relying on ioctl interface.
Andrew Turner [Mon, 26 Sep 2022 13:56:09 +0000 (14:56 +0100)]
Allow the arm64 pmap table bootstrap to work in more places
Rework the pmap_bootstrap table generation so we can use it with
partially filled out page tables after the DMAP has been bootstrapped.
This allows it to be reused by the later bootstrap code.
* change current NTP services offered by the FreeBSD Installer;
* no longer offer ntpdate to be enabled and started on boot;
* start offering the option to make ntpd set the date and time on boot itself.
The motivation for this change comes from the ntpdate(8) manpage:
Note: The functionality of this program is now available in the ntpd(8)
program. See the -q command line option in the ntpd(8) page. After a
suitable period of mourning, the ntpdate utility is to be retired from
this distribution.
Alan Somers [Wed, 7 Sep 2022 14:12:49 +0000 (08:12 -0600)]
Fix O(n^2) behavior in sysctl
Sysctl OIDs were internally stored in linked lists, triggering O(n^2)
behavior when userland iterates over many of them. The slowdown is
noticeable for MIBs that have > 100 children (for example, vm.uma). But
it's unignorable for kstat.zfs when a pool has > 1000 datasets.
Convert the linked lists into RB trees. This produces a ~25x speedup
for listing kstat.zfs with 4100 datasets, and no measurable penalty for
small dataset counts.
John Baldwin [Mon, 26 Sep 2022 21:58:06 +0000 (14:58 -0700)]
cxgbe: Use secq(9) to manage the timestamp generations.
This is mostly cosmetic, but it also doesn't leave a gap of time where
no structures are valid. Instead, we permit the ISR to continue to
use the previous structure if the write to update cal_current isn't
yet visible.
John Baldwin [Mon, 26 Sep 2022 21:57:26 +0000 (14:57 -0700)]
cxgbe: Compute timestamps via sbintime_t.
This uses fixed-point math already used elsewhere in the kernel for
sub-second time values. To avoid overflows this does require updating
the calibration once a second rather than once every 30 seconds. Note
that the cxgbe driver already queries multiple registers once a second
for the statistics timers. This version also uses fewer instructions
with no branches (for the math portion) in the per-packet fast path.
LinuxKPI: add struct dmi_header and unsupported dmi_walk()
Add a structure definition as well as a dummy dmi_walk for now
which returns an error as not supported. Our current dmi implementation
is special but does not give access to all details but rather only
information from kenv which does not suffice all use cases.
LinuxKPI: add the "dummy" includes directory to builds
While we could add the dummy includes directory manually to only the
drivers needing it, it seems a lot easier to simply add it to all
without any expected harm.
This is needed for more drivers (and to remove some #ifdef in current
ones) with empty header files being present not yielding errors.
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Reviewed by: hselasky, imp
Differential Revision: https://reviews.freebsd.org/D36684
LinuxKPI: define LINUXKPI_INCLUDES for module builds as well
While for in-kernel we already have LINUXKPI_INCLUDES in kern.pre.mk
for kmod builds we've not had a common define to use leading to various
spellings of include paths.
In order for the include list to be expanded more easily in the future,
e.g., adding the "dummy" includes (for all) and to harmonize code,
duplicate LINUXKPI_INCLUDES to kmod.mk and use it for all module Makefiles.
pci_host_generic: stop address translation in bus_alloc_resource
Translating the provided range prior to rman_reserve_resource(9) is
decidedly wrong; the caller may be trying to do a wildcard allocation,
for which the implementation is expected to DTRT and clamp the range to
what's actually feasible.
We don't use the resulting translation here anyways, so just remove it
entirely -- the rman in the default implementation is derived from
sc->ranges, so the translation should trivially succeed every time as
long as the reservation succeeded. If something has gone awry in a
derived driver, we'll detect it when we translate prior to activation,
so there's likely no diagnostic value in retaining the translation after
reservation either.
Reviewed by: andrew
Noticed by: jhb
Differential Revision: https://reviews.freebsd.org/D36618
Randall Stewart [Mon, 26 Sep 2022 19:20:18 +0000 (15:20 -0400)]
TCP complete end status work.
The ending of a connection can tell us a lot about what happened i.e. did
it fail to setup, did it timeout, was it a normal close. Often times this is
useful information to help analyze and debug issues. Rack has had
end status for some time but the base stack as not. Lets go a ahead
and add in the missing bits to populate the end status.
Randall Stewart [Mon, 26 Sep 2022 19:12:03 +0000 (15:12 -0400)]
TCP rack does not work properly with cubic.
Right now if you use rack with cubic (the new default cc) you will have
improper results. This is because rack uses different variables than
the base stack (or bbr) and thus tcp_compute_pipe() always returns
so that cubic will choose a 30% backoff not the 50% backoff it should
when it is newreno compatibility mode. The fix is to allow a stack (rack)
to override its own compute_pipe.
Brooks Davis [Mon, 26 Sep 2022 17:56:51 +0000 (18:56 +0100)]
telnetd: fix two-byte input crash
Move initialization of the slc table earlier so it doesn't get
accessed before that happens.
For details on the issue, see:
https://pierrekim.github.io/blog/2022-08-24-2-byte-dos-freebsd-netbsd-telnetd-netkit-telnetd-inetutils-telnetd-kerberos-telnetd.html
if_ovpn: fix address family check when traffic class bits are set
When the tunneled (IPv6) traffic had traffic class bits set (but only >=
16) the packet got lost on the receive side.
This happened because the address family check in ovpn_get_af() failed
to mask correctly, so the version check didn't match, causing us to drop
the packet.
While here also extend the existing 6-in-6 test case to trigger this
issue.
Michael Tuexen [Mon, 26 Sep 2022 10:30:50 +0000 (12:30 +0200)]
tcp: make RACK loadable again using the default configuration
Without this patch, loading the RACK stack required the newreno
CC module to be compiled into the kernel. This is not the case
anymore since CUBIC is the default now.
Reviewed by: rscheff@
Sponsored by: Netflix, Inc.
Differential Revision: https://reviews.freebsd.org/D36707
Wei Hu [Mon, 26 Sep 2022 06:30:37 +0000 (06:30 +0000)]
arm64: Enabling new hypercalls using HvCallSetVpRegisters and HvCallGetVpRegisters
Enabling HvCallSetVpRegisters and HvCallGetVpRegisters for hypercalls to
read and write to specific MSRs. This is required for implementing wrmsr
and rdmsr, which is required for Hyper-V vmbus driver for ARM64.
Also we need to use arm smccc hvc 1.2 version as we need to access
registers beyond X0-X3 for HvCallGetVpRegisters. Currently scoping
it only for Hyper-V.
In f697b9432d9c7aa4c5ab5f5445ef5dc1bd40ce00 the name of the PSEUDOFS
was changed from debugsfs to lindebugfs but the in-tree consumers
were not updated now leaving the drivers not loading if compiled
with debugfs support due to missing dependencies.
iwlwifi: enforce FreeBSD specific (expected) behaviour
iwlwifi can return early from probe (in FreeBSD attach) while a separate
thread is still grinding loading the firmware and setting things up.
For us this means that kldload succeeded but we may not have a physical
wireless interface (com) yet but the rc framework might already try to
configure a vap on one.
Wait until we get a firmware completion event from the other thread
(on success or error) and block returning. That way we can ensure that
the "hw" (or com in net80211 terms) is there when we return from attach
matching the expected FreeBSD driver behaviour.
Reported by: J.R. Oldroyd (jr opal.com)
Reported by: probably inderectly showing as other problem
Tested by: J.R. Oldroyd (jr opal.com)
Sponsored by: The FreeBSD Foundation
MFC after: 3 days
Improvements and changes to integrate bsddialog(1) with scripts in BASE.
Overview:
* New options. --and-widget, --keep-tite, --calendar.
* Change output format. Menus and --print-maxsize.
* Redefine sizing. Fixed rows, cols and menurows became at the most.
* Add DIAGNOSTICS. Error messages for bad arguments and options.
* Add keys. Space for --menu, fast keys for --msgbox and --yesno.
* Text. Change default text modification, add --cr-wrap.
See /usr/src/contrib/bsddialog/CHANGELOG '2022-09-24 Version 0.4'
for more detailed information.
Improvements and changes to integrate bsddialog(1) with scripts in BASE.
Overview:
* New options. --and-widget, --keep-tite, --calendar.
* Change output format. Menus and --print-maxsize.
* Redefine sizing. Fixed rows, cols and menurows became at the most.
* Add DIAGNOSTICS. Error messages for bad arguments and options.
* Add keys. Space for --menu, fast keys for --msgbox and --yesno.
* Text. Change default text modification, add --cr-wrap.
See /usr/src/contrib/bsddialog/CHANGELOG '2022-09-24 Version 0.4'
for more detailed information.
Philip Paeps [Sun, 25 Sep 2022 05:42:18 +0000 (13:42 +0800)]
share/zoneinfo: don't build obsolete SystemV zones
The /usr/share/zoneinfo/SystemV directory has been empty on FreeBSD
since 2006. The upstream source file was removed in 2020. Also stop
passing yearisdate to zic(8). This has not been necessary for years.
The script has been removed upstream since 2020.
if_clone: add ifc_link_ifp() / ifc_unlink_ifp() to the KPI
Factor cloner ifp addition/deletion into separate functions and
make them public. This change simlifies the current cloner code
and paves the way to the other upcoming cloner / epair changes.
The vn_rlimit_fsizex() function:
- checks that the write does not exceed RLIMIT_FSIZE limit and fs
maximum supported file size
- truncates write length if it exceeds the RLIMIT_FSIZE or max file size,
but there are some bytes to write
- sends SIGXFSZ if RLIMIT_FSIZE would be exceed otherwise
POSIX mandates the truncated write in case when some bytes can be
written but whole write request fails the RLIMIT_FSIZE check.
The function is supposed to be used from VOP_WRITE()s. Due to
pecularity in the VFS generic write syscall layer, uio_resid must
correctly reflect the written amount (noted by markj). Provide the dual
vn_rlimit_fsizex_res() function to correct uio_resid after the clamp
done in vn_rlimit_fsizex() on VOP_WRITE() return.
Mark Johnston [Sat, 24 Sep 2022 13:20:48 +0000 (09:20 -0400)]
amd64: Ignore 1GB mappings in pmap_advise()
This assertion can be triggered by usermode since vm_map_madvise()
doesn't force advice to be applied to an entire largepage mapping. I
can't see any reason not to permit it, however, since MADV_DONTNEED and
_FREE are advisory and we can simply do nothing when a 1GB mapping is
encountered.
Mark Johnston [Sat, 24 Sep 2022 13:19:21 +0000 (09:19 -0400)]
amd64: Make it possible to grow the KERNBASE region of KVA
pmap_growkernel() may be called when mapping a region above KERNBASE,
typically for a kernel module. If we have enough PTPs left over from
bootstrap, pmap_growkernel() does nothing. However, it's possible to
run out, and in this case pmap_growkernel() will try to grow the kernel
map all the way from kernel_vm_end to somewhere past KERNBASE, which can
easily run the system out of memory. This happens with large kernel
modules such as the nvidia GPU driver. There is also a WIP dtrace
provider which needs to map KVA in the region above KERNBASE (to provide
trampolines which allow a copy of traced kernel instruction to be
executed), and its allocations could potentially trigger this scenario.
This change modifies pmap_growkernel() to manage the two regions
separately, allowing them to grow independently. The end of the
KERNBASE region is tracked by modifying "nkpt".
Mark Johnston [Sat, 24 Sep 2022 13:18:04 +0000 (09:18 -0400)]
smr: Fix synchronization in smr_enter()
smr_enter() must publish its observed read sequence number before
issuing any subsequent memory operations. The ordering provided by
atomic_add_acq_int() is insufficient on some platforms, at least on
arm64, because it permits reordering of subsequent loads with the store
to c_seq.
Thus, use atomic_thread_fence_seq_cst() to issue a store-load barrier
after publishing the read sequence number.
On x86, take advantage of the fact that memory operations are not
reordered with locked instructions to improve code density: we can store
the observed read sequence and provide a store-load barrier with a
single operation.
Based on a patch from Pierre Habouzit <pierre@habouzit.net>.
Mark Johnston [Fri, 23 Sep 2022 23:41:30 +0000 (19:41 -0400)]
sched_4bsd: Fix a racy thread state modification
When a thread switching off-CPU is migrating to a remote CPU,
sched_switch() may trigger a rescheduling of the thread currently
running on that CPU. When doing so, it must ensure that that thread is
locked before modifying thread state. If the thread's lock is not the
scheduler lock, then the thread is in the process of switching off-CPU
and no extra effort is needed, and the initiator does not hold the
thread's lock and thus should not modify any thread state.
Reported and tested by: Steve Kargl
MFC after: 1 week
Warner Losh [Fri, 23 Sep 2022 21:02:53 +0000 (15:02 -0600)]
arm64: don't loop forever if first option in kern.cfg.order not available
strchr returns a pointer to the ',', so if the first option in the list
isn't available, we need to step over the , to look at the next
option. So if kern.cfg.order="acpi,fdt" and we have no acpi, we'd loop
forever with order=',fdt'.
Brooks Davis [Fri, 23 Sep 2022 20:20:52 +0000 (21:20 +0100)]
cpuset(9): Refer to CPU_SETSIZE not MAXCPU
The maximum CPU number of a cpuset_t is determined by CPU_SETSIZE. In
the kernel this is MAXCPU, but in userspace it is CPU_MAXSIZE unless
CPU_SETSIZE is defined before including sys/_cpuset.h. CPU_MAXSIZE is
256 and in userspace MAXCPU is generally 1 because it being set to a
larger MD value is gated on SMP being defined (not generally the case in
userspace).
Warner Losh [Fri, 23 Sep 2022 15:48:26 +0000 (09:48 -0600)]
release: Use standard mount points for arm MBR boot images
Traditionally, we've used /boot/msdos for the MBR mount point for the SD
images that we produced. For GPT and bsdinstall, we've used
/boot/efi. Migrate to using /boot/efi for MBR as well and add a
/boot/msdos -> /boot/efi symlink for compatibility (which may disappear
before 14.0, but will remain on the stable branches).
When we first created the arm images, there was no EFI booting and the
FAT partion on an MBR image was used to hold the firmware, uboot.bin,
SoC config files and ubldr. When we transitioned to uboot with EFI, we
put the loader files in the same partition. Later we standardized on
/boot/efi at about the same time we added GPT support to the RE produced
images. We left the MRB case as /boot/msdos for legacy reasons and since
it wasn't always EFI. Later, we dropped support of non-EFI booting on
the RE produced images, so the duality of /boot/msdos diminished even
more. Since so little secondary meaning remains, putting it all in
/boot/efi standardizes the location and reflects the RE images
better as using efi-only booting.
In addition, always label the msdosfs partion 'efi'. While a small
misnomer on some systems that store other files in the ESP, it was
requested in review for more consistency for similar reasons to the
mountpoint rename. There was no way to have an 'alias' or 'second label'
here, so this breaks compatibility. Since the images are self-contained,
this was judged to be an acceptable change.
Andrew Turner [Thu, 22 Sep 2022 12:09:02 +0000 (13:09 +0100)]
Teach the GICv3 driver to translate memory ranges
As with the GICv1/2 driver teach the GICv3 driver to translate memory
ranges of children. This allows us to create a common
bus_alloc_resource implementation for bot hACPI and FDT attachments.