Alexander Motin [Mon, 2 Aug 2021 02:42:01 +0000 (22:42 -0400)]
sched_ule(4): Use trylock when stealing load.
On some load patterns it is possible for several CPUs to try to steal
a thread from the same CPU despite the randomization introduced. This
may cause significant lock contention when an idle thread holding one
queue lock tries to acquire another one. Using trylock on the remote
queue both reduces the contention and makes lock ordering easier to
handle. If we can't get the lock inside tdq_trysteal(), we just
return, allowing tdq_idled() to handle it. If it happens in
tdq_idled(), we repeat the search for load, skipping this CPU.
On a 2-socket 80-thread Xeon system I am observing a dramatic reduction
of the lock spinning time when doing random uncached 4KB reads from
12 ZVOLs, while IOPS increase from 327K to 403K.
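The trylock approach described in this commit can be illustrated in
userland. The following is a minimal sketch using POSIX mutexes, not the
sched_ule(4) code; struct runq_sketch and runq_trysteal() are
hypothetical names, and the "leave at least one thread" policy is an
assumption made for the example.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical per-CPU run queue; a stand-in for the real struct tdq. */
struct runq_sketch {
	pthread_mutex_t lock;
	int load;		/* runnable threads queued here */
};

/*
 * Try to move one thread's worth of load from victim to self.
 * The caller already holds self->lock.  Returns false without blocking
 * if the victim's lock is busy or the victim has nothing to spare, so
 * the caller can search again while skipping this victim.
 */
bool
runq_trysteal(struct runq_sketch *self, struct runq_sketch *victim)
{
	bool stolen = false;

	assert(self != victim);
	if (pthread_mutex_trylock(&victim->lock) != 0)
		return (false);		/* contended: give up, do not spin */
	if (victim->load > 1) {		/* assumed policy: leave one behind */
		victim->load--;
		self->load++;
		stolen = true;
	}
	pthread_mutex_unlock(&victim->lock);
	return (stolen);
}
```

Because the second lock is never waited for, the order in which the two
queue locks are taken cannot produce a deadlock in this sketch.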
Alexander Motin [Mon, 2 Aug 2021 02:07:51 +0000 (22:07 -0400)]
sched_ule(4): Reduce duplicate search for load.
When sched_highest(), called for some CPU group, returns nothing, the
idle thread calls it for the parent CPU group. But the parent CPU group
also includes the CPU group we've just searched, and unless there is
a race going on, it is unlikely that we will find anything new this time.
Avoid the double search when the parent group has only two subgroups
(the most prominent case). Instead of escalating to the parent group,
run the next search over the sibling subgroup, and escalate two levels
up if that fails too. With more than two siblings the difference is
less significant, while searching the parent group can result in a
better decision if we find several candidate CPUs.
On a 2-socket 40-core Xeon system I am measuring a ~25% reduction of CPU
time spent inside cpu_search_highest() in both SMT (2x20x2) and
non-SMT (2x20) cases.
Alexander Motin [Thu, 29 Jul 2021 01:18:50 +0000 (21:18 -0400)]
Refactor/optimize cpu_search_*().
Remove cpu_search_both(), unused for many years. Without it there is
little sense in the trick of compiling a common cpu_search() into separate
cpu_search_lowest() and cpu_search_highest(), so split them completely,
making the code more readable. While there, split the iteration over child
groups and CPUs, since combining them complicated the code for very
little deduplication.
Stop passing cpuset_t arguments by value and avoid some manipulations.
Since the MAXCPU bump from 64 to 256, what was a single register has
turned into a 32-byte memory array, requiring memory allocation and
accesses. Splitting struct cpu_search into parameter and result parts
reduces stack usage even further, since the former can be passed
through on recursion.
Remove CPU_FFS() from the hot paths, precalculating the first and last
CPU for each CPU group in advance during initialization. Again, it was
not a problem for 64 CPUs before, but for 256 CPUs FFS needs much more
code.
With these changes, on an 80-thread system doing ~260K uncached ZFS reads
per second I observe a ~30% reduction of time spent in cpu_search_*().
Ka Ho Ng [Mon, 19 Apr 2021 08:07:03 +0000 (16:07 +0800)]
AMD-vi: Fortify IVHD device_identify process
- Use malloc(9) to allocate the ivhd_hdrs list. The previous assumption
that there are at most 10 IVHDs in a system is not true. A
counterexample would be a system with 4 IOMMUs, where each IOMMU is
related to IVHDs of type 10h, 11h and 40h in the ACPI IVRS table.
- Always scan through the whole ivhd_hdrs list to find IVHDs that have
the same DeviceId but a lower-priority IVHD type.
Sponsored by: The FreeBSD Foundation
MFC with: 74ada297e897
Reviewed by: grehan
Approved by: lwhsu (mentor)
Differential Revision: https://reviews.freebsd.org/D29525
Ka Ho Ng [Mon, 2 Aug 2021 09:54:40 +0000 (17:54 +0800)]
vmm: Bump vmname buffer in struct vm to VM_MAX_NAMELEN + 1
In the hw.vmm.create sysctl handler the maximum length of a VM name is
VM_MAX_NAMELEN. However, in vm_create() the maximum length allowed is
only VM_MAX_NAMELEN - 1 chars. Bump the length of the internal buffer to
allow VM names of length VM_MAX_NAMELEN.
Reviewed by: grehan
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D31372
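The off-by-one addressed by the vmname change above comes down to
reserving one extra byte for the terminating NUL. A minimal sketch,
assuming a hypothetical SKETCH_MAX_NAMELEN in place of VM_MAX_NAMELEN
and a hypothetical struct vm_sketch rather than the real struct vm:

```c
#include <assert.h>
#include <string.h>

#define SKETCH_MAX_NAMELEN 16	/* hypothetical stand-in for VM_MAX_NAMELEN */

struct vm_sketch {
	char name[SKETCH_MAX_NAMELEN + 1];	/* +1 leaves room for the NUL */
};

/* Copy name into vm; returns 0 on success, -1 if the name is too long. */
int
vm_sketch_set_name(struct vm_sketch *vm, const char *name)
{
	size_t len;

	assert(vm != NULL && name != NULL);
	len = strlen(name);
	if (len > SKETCH_MAX_NAMELEN)
		return (-1);
	memcpy(vm->name, name, len + 1);	/* copies the terminator too */
	return (0);
}
```

With the buffer sized this way, a name of exactly SKETCH_MAX_NAMELEN
characters is accepted by both the length check and the copy.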
Goran Mekić [Wed, 4 Aug 2021 10:04:54 +0000 (18:04 +0800)]
sound: Add an example of basic sound application
This is an example demonstrating the usage of the OSS-compatible APIs
provided by the sound(4) subsystem. It reads frames from a dsp node and
writes them to the same dsp node.
Rick Macklem [Thu, 12 Aug 2021 23:48:28 +0000 (16:48 -0700)]
nfsd: Fix sanity check for NFSv4.2 Allocate operations
The NFSv4.2 Allocate operation sanity checks the aa_offset
and aa_length arguments. Since they are assigned to variables
of type off_t (signed) it was possible for them to be negative.
It was also possible for aa_offset+aa_length to exceed OFF_MAX
when stored in lo_end, which is uint64_t.
This patch adds checks for these cases to the sanity check.
Dimitry Andric [Sat, 21 Aug 2021 21:03:37 +0000 (23:03 +0200)]
Apply clang fix for assertion failure compiling multimedia/minitube
Merge commit 79f9cfbc21e0 from llvm git (by Yaxun (Sam) Liu):
Do not merge LocalInstantiationScope for template specialization
A lambda in a function template may be recursively instantiated. The recursive
lambda will cause a lambda function to be instantiated multiple times, one inside
another. The inner LocalInstantiationScope should not be marked as
MergeWithParentScope, since it already has references to locals properly
substituted; otherwise it causes an assertion failure due to the check for
duplicate locals in the merged LocalInstantiationScope.
Kyle Evans [Thu, 18 Feb 2021 04:10:46 +0000 (22:10 -0600)]
pkg: use specific CONFSNAME_${file} for FreeBSD.conf
Setting CONFSNAME directly is a little more complicated for downstream
consumers, as any additional CONFS that are added here will inherit the
group name by default. This is perhaps arguably a design flaw in CONFS
because inheriting NAME will never give a good result when additional
files are added, but this is a low-effort change.
While we're here, pull FreeBSD.conf.${branch} out into a PKGCONF
variable so one can just drop a new repo config in entirely with a new
naming scheme. CONFSNAME gets set based on chopping anything off after
".conf", so that, e.g.:
Kyle Evans [Thu, 18 Feb 2021 03:41:53 +0000 (21:41 -0600)]
pkg: allow multiple add arguments again
While pkg(7) add only handles a single 'add' argument, pkg-add(8) fully
handles multiple arguments.
Stop rejecting it, just turn off local-bootstrap mode and proceed to
remote bootstrap if we need it.
While we're here, check if the first argument to pkg add is even a pkg
package. If it's not, also do remote bootstrap instead. Future work
could improve this altogether by picking out a pkg package out of many
and local bootstrap then pass the rest through to the newly installed
pkg.
Adam Fenn [Mon, 2 Aug 2021 16:27:17 +0000 (11:27 -0500)]
devclass_alloc_unit: move "at" hint test to after device-in-use test
Only perform this expensive operation when the unit number is a
potential candidate (i.e. not already in use), thereby reducing device
scan time on systems with many devices, unit numbers, and drivers.
Sponsored by: NetApp, Inc.
Sponsored by: Klara, Inc.
X-NetApp-PR: #61
Toomas Soome [Sat, 31 Jul 2021 08:09:48 +0000 (11:09 +0300)]
loader: open file list should be dynamic
Summary:
The open file list is currently created as a statically allocated array
(64 items). Once this array is filled up, the loader will not be able
to operate on files. In most cases this mechanism is good enough,
but a problem appears when we have many disks with zfs pool(s).
In the current loader implementation, all discovered zfs pool
configurations are kept in memory and their disk devices kept open,
consuming the open file array. Rewrite the open file mechanism to use
a dynamically allocated list.
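One way to lift a fixed limit like this is a table that grows on demand.
The sketch below is an assumption-laden illustration, not the loader's
implementation: struct file_table, file_table_alloc() and
file_table_release() are hypothetical names, and a growable array is used
here where the commit speaks of a dynamically allocated list.

```c
#include <assert.h>
#include <stdlib.h>

struct open_file_sketch {
	int in_use;
};

/* Hypothetical descriptor table; starts empty and grows when full. */
struct file_table {
	struct open_file_sketch *slots;
	size_t nslots;
};

/* Return the lowest free descriptor, growing the table if needed; -1 on ENOMEM. */
int
file_table_alloc(struct file_table *t)
{
	struct open_file_sketch *p;
	size_t i, newn;

	assert(t != NULL);
	for (i = 0; i < t->nslots; i++) {
		if (!t->slots[i].in_use) {
			t->slots[i].in_use = 1;
			return ((int)i);
		}
	}
	/* No free slot: double the table (or start with four entries). */
	newn = t->nslots != 0 ? t->nslots * 2 : 4;
	p = realloc(t->slots, newn * sizeof(*p));
	if (p == NULL)
		return (-1);
	for (i = t->nslots; i < newn; i++)
		p[i].in_use = 0;
	i = t->nslots;		/* first slot of the newly added region */
	t->slots = p;
	t->nslots = newn;
	p[i].in_use = 1;
	return ((int)i);
}

/* Mark a descriptor as free again; out-of-range values are ignored. */
void
file_table_release(struct file_table *t, int fd)
{
	assert(t != NULL);
	if (fd >= 0 && (size_t)fd < t->nslots)
		t->slots[fd].in_use = 0;
}
```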
makesyscalls was rewritten in Lua and introduced in d3276301ab. In the
time since, no objections have arisen and a warning was introduced long
ago on invocation of makesyscalls.sh that it would be removed before
FreeBSD 13. Belatedly follow through on that.
Kyle Evans [Sun, 20 Jun 2021 19:36:10 +0000 (14:36 -0500)]
kenv: allow listing of static kernel environments
The early environment is typically cleared, so these new options
need the PRESERVE_EARLY_KENV kernel config(8) option. These environments
are reported as missing by kenv(1) if the option is not present in the
running kernel.
Kyle Evans [Sun, 20 Jun 2021 19:29:31 +0000 (14:29 -0500)]
kern: add an option for preserving the early kenv
Some downstream configurations do not store secrets in the
early (loader/static) environments and desire a way to preserve these
for diagnostic reasons. Provide an option to do so.
Patrick Kelsey [Mon, 26 Apr 2021 04:25:59 +0000 (00:25 -0400)]
iflib: Improve mapping of TX/RX queues to CPUs
iflib now supports mapping each (TX,RX) queue pair to the same CPU
(default), to separate CPUs, or to a pair of physical and logical CPUs
that share the same L2 cache. The mapping mechanism supports unequal
numbers of TX and RX queues, with the excess queues always being
mapped to consecutive physical CPUs. When the platform cannot
distinguish between physical and logical CPUs, all are treated as
physical CPUs. See the comment on get_cpuid_for_queue() for the
entire matrix.
The following device-specific tunables influence the mapping process:
dev.<device>.<unit>.iflib.core_offset (existing)
dev.<device>.<unit>.iflib.separate_txrx (existing)
dev.<device>.<unit>.iflib.use_logical_cores (new)
The following new, read-only sysctls provide visibility of the mapping
results:
dev.<device>.<unit>.iflib.{t,r}xq<n>.cpu
When an iflib driver allocates TX softirqs without providing reference
RX IRQs, iflib now binds those TX softirqs to CPUs using the above
mapping mechanism (that is, treats them as if they were TX IRQs).
Previously, such bindings were left up to the grouptaskqueue code and
thus fell outside of the iflib CPU mapping strategy.
ipfw: fix possible data race between jump cache reading and updating.
The jump cache is used to reduce the cost of rule lookup for O_SKIPTO and
O_CALLRETURN actions. It uses the rules chain id to check the correctness
of the cached value. But due to a possible race, there is a chance that
one thread can read an invalid value. In some cases this can lead to an
out-of-bounds access and a panic.
Use thread fence operations to constrain the reordering of accesses.
Also rename jump_fast and jump_linear functions to jump_cached and
jump_lookup_pos respectively.
John Baldwin [Tue, 17 Aug 2021 21:39:58 +0000 (14:39 -0700)]
OpenSSL: Refactor KTLS tests to better support TLS 1.3.
Most of this upstream commit touched tests not included in the
vendor import. The one change merged in is to remove a constant
only present in an internal header to appease the older tests.
John Baldwin [Tue, 17 Aug 2021 21:39:32 +0000 (14:39 -0700)]
OpenSSL: Update KTLS documentation
KTLS support has been changed to be off by default, and configuration is
via a single "option" rather than two "modes". Documentation is updated
accordingly.
John Baldwin [Tue, 17 Aug 2021 21:39:03 +0000 (14:39 -0700)]
OpenSSL: Only enable KTLS if it is explicitly configured
It has always been the case that KTLS is not compiled by default. However
if it is compiled then it was automatically used unless specifically
configured not to. This is problematic because it bypasses any crypto
implementations from providers. A user who configures all crypto to use
the FIPS provider may unexpectedly find that TLS related crypto is actually
being performed outside of the FIPS boundary.
Instead we change KTLS so that it is disabled by default.
We also swap to using a single "option" (i.e. SSL_OP_ENABLE_KTLS) rather
than two separate "modes", (i.e. SSL_MODE_NO_KTLS_RX and
SSL_MODE_NO_KTLS_TX).
John Baldwin [Tue, 17 Aug 2021 21:37:47 +0000 (14:37 -0700)]
OpenSSL: Correct the return value of BIO_get_ktls_*().
BIO_get_ktls_send() and BIO_get_ktls_recv() are documented as
returning either 0 or 1. However, they were actually returning the
internal value of the associated BIO flag for the true case instead of
1.
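The difference between returning a raw flag value and returning 0 or 1
can be shown with a small sketch. The flag value and the struct below
are hypothetical and are not taken from OpenSSL's headers.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_FLAG_KTLS_SEND	0x800	/* hypothetical flag bit */

struct bio_sketch {
	unsigned long flags;
};

/* Behaviour being corrected: leaks the raw flag value (0 or 0x800). */
int
sketch_get_ktls_send_raw(const struct bio_sketch *b)
{
	assert(b != NULL);
	return ((int)(b->flags & SKETCH_FLAG_KTLS_SEND));
}

/* Documented behaviour: strictly 0 or 1. */
int
sketch_get_ktls_send(const struct bio_sketch *b)
{
	assert(b != NULL);
	return ((b->flags & SKETCH_FLAG_KTLS_SEND) != 0);
}
```

A caller that compares the result against 1 works only with the second
form, which is why the normalization matters.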
John Baldwin [Tue, 10 Aug 2021 21:18:43 +0000 (14:18 -0700)]
nfs tls: Update for SSL_OP_ENABLE_KTLS.
Upstream OpenSSL (and the KTLS backport) have switched to an opt-in
option (SSL_OP_ENABLE_KTLS) in place of opt-out modes
(SSL_MODE_NO_KTLS_TX and SSL_MODE_NO_KTLS_RX) for controlling kernel
TLS.
Kevin Bowling [Mon, 23 Aug 2021 16:21:39 +0000 (09:21 -0700)]
ixgbe: Avoid sbuf_trim(9) in sysctl handler
This was an error: we cannot use sbuf_trim(9) in the
ixgbe_sbuf_fw_version function because it also gets called in
the context of sbuf_new_for_sysctl(9). sbuf(9) explains the interaction
with drain functions as used by sbuf_new_for_sysctl(9).
The macro bit_foreach() traverses all set bits in the bitstring in the
forward direction, assigning each location in turn to variable.
The macro bit_foreach_at() traverses all set bits in the bitstring in
the forward direction at or after the zero-based bit index, assigning
each location in turn to variable.
The bit_foreach_unset() and bit_foreach_unset_at() macros, which
traverse unset bits, are implemented for completeness.
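The traversal these macros perform can be imitated in portable C. The
sketch below does not use bitstring(3); sketch_ffs_at(),
SKETCH_FOREACH_AT() and sketch_count_set_from() are hypothetical names,
and the bit layout (least significant bit first within each byte) is an
assumption of the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Index of the first set bit at or after start, or -1 if there is none. */
int
sketch_ffs_at(const uint8_t *bits, int nbits, int start)
{
	assert(bits != NULL && start >= 0);
	for (int i = start; i < nbits; i++)
		if (bits[i / 8] & (1u << (i % 8)))
			return (i);
	return (-1);
}

/* Visit every set bit at or after start, assigning its index to var. */
#define SKETCH_FOREACH_AT(bits, nbits, start, var)			\
	for ((var) = sketch_ffs_at((bits), (nbits), (start));		\
	    (var) != -1;						\
	    (var) = sketch_ffs_at((bits), (nbits), (var) + 1))

/* Count the set bits at or after start by walking them with the macro. */
int
sketch_count_set_from(const uint8_t *bits, int nbits, int start)
{
	int i, n = 0;

	SKETCH_FOREACH_AT(bits, nbits, start, i)
		n++;
	return (n);
}
```

An "unset" variant would follow the same shape with the bit test
inverted.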
Kyle Evans [Wed, 18 Aug 2021 17:31:45 +0000 (12:31 -0500)]
uipc: avoid circular pr_{slow,fast}timos
domain_init() gets reinvoked for each vnet on a system, so we must not
alter global state. Practically speaking, we were creating circular
lists and tying up a softclock thread into an infinite loop.
The breakage here was most easily observed by simply creating a jail
in a new vnet and watching the system suddenly become erratic.
Reported by: markj
Fixes: e0a17c3f063f ("uipc: create dedicated lists for fast ...")
Pointy hat: kevans
Alexander Motin [Mon, 9 Aug 2021 01:34:33 +0000 (21:34 -0400)]
Optimize res_find().
When the device name is provided, we can simply run strncmp() on each
line to quickly skip unrelated ones, which is much faster than running
sscanf() and only then strcmp().
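The idea of a cheap prefix test ahead of a full parse can be sketched as
below. This is not res_find() itself: hint_matches() is a hypothetical
helper, and the "hint.<name>.<unit>.<key>=<value>" line layout is assumed
for the example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/*
 * Return true if line looks like "hint.<name>.<unit>.<key>=<value>"
 * for the given device name.
 */
bool
hint_matches(const char *line, const char *name)
{
	char key[33], value[128];
	size_t namelen;
	unsigned int unit;

	assert(line != NULL && name != NULL);
	namelen = strlen(name);
	/* Cheap rejection: most lines fail one of these comparisons. */
	if (strncmp(line, "hint.", 5) != 0)
		return (false);
	if (strncmp(line + 5, name, namelen) != 0 ||
	    line[5 + namelen] != '.')
		return (false);
	/* Only candidate lines pay for the full sscanf() parse. */
	return (sscanf(line + 5 + namelen + 1, "%u.%32[^=]=%127s",
	    &unit, key, value) == 3);
}
```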
Mark Johnston [Mon, 16 Aug 2021 17:15:25 +0000 (13:15 -0400)]
sigtimedwait: Use a unique wait channel for sleeping
When a sigtimedwait(2) caller goes to sleep, it uses a wait channel of
p->p_sigacts with the proc lock as the interlock. However, p_sigacts
can be shared between processes if a child is created with
rfork(RFSIGSHARE | RFPROC). Thus we can end up with two threads
sleeping on the same wait channel using different locks, which is not
permitted.
Fix the problem simply by using a process-unique wait channel, following
the example of sigsuspend. The actual wait channel value is irrelevant
here; sleeping threads are awoken using sleepq_abort().
Reported by: syzbot+8c417afabadb50bb8827@syzkaller.appspotmail.com
Reported by: syzbot+1d89fc2a9ef92ef64fa8@syzkaller.appspotmail.com
Reviewed by: kib
Sponsored by: The FreeBSD Foundation