Alexander Motin [Sat, 2 Oct 2021 03:47:18 +0000 (23:47 -0400)]
sched_ule(4): Fix possible significance loss.
Before this change, a kern.sched.interact sysctl setting above 32 gave
all interactive threads the identical priority of PRI_MIN_INTERACT due to
((PRI_MAX_INTERACT - PRI_MIN_INTERACT + 1) / sched_interact) turning
zero. Setting the sysctl lower reduced the range of used priority
levels by up to half, which is not great either.
Changing the order of operations should fix the issue, always using the full
range of priorities, while overflow is impossible there since both
score and priority values are small. While there, make the variables
unsigned, as they really are.
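The effect of the operation order can be reproduced in isolation. The sketch below is not the kernel code: the constants and function names are hypothetical, the 32-level range is chosen only to be consistent with the threshold of 32 mentioned above, and the score is assumed to stay below the sysctl value.

```c
#include <assert.h>

/* Illustrative values, not copied from the kernel headers. */
#define TOY_PRI_MIN	88u
#define TOY_PRI_MAX	119u
#define TOY_PRI_RANGE	(TOY_PRI_MAX - TOY_PRI_MIN + 1)	/* 32 levels */

/* Divide first: the scale factor truncates to 0 once interact > range. */
static unsigned
prio_divide_first(unsigned score, unsigned interact)
{
	assert(interact > 0 && score < interact);
	return (TOY_PRI_MIN + (TOY_PRI_RANGE / interact) * score);
}

/* Multiply first: the full range is used; the product stays small. */
static unsigned
prio_multiply_first(unsigned score, unsigned interact)
{
	assert(interact > 0 && score < interact);
	return (TOY_PRI_MIN + TOY_PRI_RANGE * score / interact);
}
```

With a setting of 40 and a score of 39, the divide-first form returns 88 (the minimum) while the multiply-first form returns 119 (the maximum). With a setting of 20 and a score of 19, the divide-first form only reaches 107.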
Alexander Motin [Sun, 26 Sep 2021 16:03:05 +0000 (12:03 -0400)]
sched_ule(4): Fix hang with steal_thresh < 2.
e745d729be60 caused an infinite loop with interrupts disabled in the load
stealing code if steal_thresh was set below 2. Such a configuration should
not generally be used, but it appears some people are using it to
work around some problems.
To fix the problem, explicitly pass to sched_highest() the minimum number
of transferable threads supported by the caller, instead of guessing.
Alexander Motin [Thu, 23 Sep 2021 17:41:02 +0000 (13:41 -0400)]
x86: Add NUMA nodes into CPU topology.
Depending on hardware, NUMA nodes may match last level caches, or
they may be above them (AMD Zen 2/3) or below (Intel Xeon w/ SNC).
This information is provided by ACPI instead of CPUID, and it is
provided for each CPU individually instead of mask widths, but
this code should be able to properly handle all the above cases.
This change should immediately allow idle stealing in sched_ule(4)
to prefer load from NUMA-local CPUs to remote ones when the node
does not match LLC. Later we may think of how to better handle it
on sched_pickcpu() side.
Alexander Motin [Tue, 21 Sep 2021 22:14:22 +0000 (18:14 -0400)]
sched_ule(4): Improve long-term load balancer.
Before this change the long-term load balancer was unable to migrate
running threads, only ones waiting on run queues. But with the growing
number of CPU cores it is now quite typical for a system not to have
many waiting threads. At the same time, if due to some coincidence two
long-running CPU-bound threads ended up sharing the same physical CPU
core, they could suffer from the SMT penalty indefinitely, and the
load balancer couldn't help.
Improve that by teaching the load balancer to hint running threads
to migrate by marking them with TDF_NEEDRESCHED and the new TDF_PICKCPU
flag, making sched_pickcpu() search for a better CPU later, when
it is convenient.
Fix the CPU search logic when balancing to limit round-robin migrations
in case of almost equal load to the group of physical cores. The
previous code bounced threads across the whole system, which should be
pretty bad for caches and NUMA affinity, while the additional fairness
was almost invisible, diminishing with the number of cores in the group.
John Baldwin [Wed, 6 Oct 2021 21:08:49 +0000 (14:08 -0700)]
crypto: Support Chacha20-Poly1305 with a nonce size of 8 bytes.
This is useful for WireGuard which uses a nonce of 8 bytes rather
than the 12 bytes used for IPsec and TLS.
Note that this also fixes a (should be) harmless bug in ossl(4) where
the counter was incorrectly treated as a 64-bit counter instead of a
32-bit counter in terms of wrapping when using a 12 byte nonce.
However, this required a single message (TLS record) longer than 64 *
(2^32 - 1) bytes (about 256 GB) to trigger.
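The size quoted above follows from simple arithmetic, sketched here with a hypothetical function name. It relies on the RFC 8439 construction, where block counter 0 is consumed to derive the Poly1305 key, so data blocks use counters 1 through 2^32 - 1.

```c
#include <assert.h>
#include <stdint.h>

#define CHACHA_BLOCK_BYTES	64u	/* ChaCha20 block size */

/* Bytes that can be processed before a 32-bit block counter wraps. */
static uint64_t
chacha20_poly1305_ctr32_limit(void)
{
	uint64_t data_blocks = UINT32_MAX;	/* counters 1 .. 2^32 - 1 */

	/* The product fits easily in 64 bits. */
	assert(data_blocks <= UINT64_MAX / CHACHA_BLOCK_BYTES);
	return (data_blocks * CHACHA_BLOCK_BYTES);
}
```

The result is 274877906880 bytes, which is 64 bytes short of 256 GiB.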
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D32122
John Baldwin [Wed, 6 Oct 2021 21:08:48 +0000 (14:08 -0700)]
crypto: Test all of the AES-CCM KAT vectors.
Previously, only test vectors which used the default nonce and tag
sizes (12 and 16, respectively) were tested. This now tests all of
the vectors. This exposed some additional issues around requests with
an empty payload (which wasn't supported) and an empty AAD (which
falls back to CIOCCRYPT instead of CIOCCRYPTAEAD).
- Make use of the 'ivlen' and 'maclen' fields for CIOGSESSION2 to
test AES-CCM vectors with non-default nonce and tag lengths.
- Permit requests with an empty payload.
- Permit an input MAC for requests without AAD.
Reviewed by: markj
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D32121
John Baldwin [Wed, 6 Oct 2021 21:08:48 +0000 (14:08 -0700)]
cryptosoft: Fix support for variable tag lengths in AES-CCM.
The tag length is included as one of the values in the flags byte of
block 0 passed to CBC_MAC, so merely copying the first N bytes is
insufficient.
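To see why truncating a full-length tag cannot work, consider how the flags byte of block 0 is formed in RFC 3610. The sketch below uses a hypothetical function name and is not the cryptosoft code; it only shows that the tag length is mixed into the first block fed to CBC-MAC.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Flags byte of CCM block 0 per RFC 3610:
 *   bit 6      Adata (set when AAD is present)
 *   bits 5..3  (tag_len - 2) / 2
 *   bits 2..0  L - 1, where L = 15 - nonce_len
 */
static uint8_t
ccm_b0_flags(int have_aad, unsigned tag_len, unsigned nonce_len)
{
	unsigned L;

	assert(tag_len >= 4 && tag_len <= 16 && tag_len % 2 == 0);
	assert(nonce_len >= 7 && nonce_len <= 13);
	L = 15 - nonce_len;
	return ((uint8_t)((have_aad ? 0x40 : 0) |
	    (((tag_len - 2) / 2) << 3) | (L - 1)));
}
```

For a 12-byte nonce with AAD, a 16-byte tag gives 0x7a while an 8-byte tag gives 0x5a, so the MAC input differs from the very first block.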
To avoid adding more sideband data to the CBC MAC software context,
pull the generation of block 0, the AAD length, and AAD padding out of
cbc_mac.c and into cryptosoft.c. This matches how GCM/GMAC are
handled where the length block is constructed in cryptosoft.c and
passed as an input to the Update callback. As a result, the CBC MAC
Update() routine is now much simpler and simply performs the
XOR-and-encrypt step on each input block.
While here, avoid a copy to the staging block in the Update routine
when one or more full blocks are passed as input to the Update
callback.
Reviewed by: sef
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D32120
John Baldwin [Wed, 6 Oct 2021 21:08:47 +0000 (14:08 -0700)]
crypto: Support multiple nonce lengths for AES-CCM.
Permit nonces of lengths 7 through 13 in the OCF framework and the
cryptosoft driver. A helper function (ccm_max_payload_length) can be
used in OCF drivers to reject CCM requests which are too large for the
specified nonce length.
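The relationship between nonce length and maximum payload comes from the CCM block format: a nonce of N bytes leaves L = 15 - N bytes to encode the payload length. The function below is a minimal sketch with a hypothetical name and signature; it is not the kernel's ccm_max_payload_length, whose prototype is not shown here.

```c
#include <assert.h>
#include <stdint.h>

/* Largest payload length representable for a given CCM nonce length. */
static uint64_t
ccm_max_payload_sketch(unsigned nonce_len)
{
	unsigned L;

	assert(nonce_len >= 7 && nonce_len <= 13);
	L = 15 - nonce_len;		/* bytes in the length field */
	if (L == 8)
		return (UINT64_MAX);	/* avoid an undefined 64-bit shift */
	return ((UINT64_C(1) << (8 * L)) - 1);
}
```

A 13-byte nonce limits a request to 65535 bytes, while a 12-byte nonce allows 16777215 bytes.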
Reviewed by: sef
Sponsored by: Chelsio Communications, The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D32111
John Baldwin [Wed, 6 Oct 2021 21:08:47 +0000 (14:08 -0700)]
cryptocheck: Support multiple IV sizes for AES-CCM.
By default, the "normal" IV size (12) is used, but it can be overridden
via -I. If -I is not specified and -z is specified, issue requests
for all possible IV sizes.
Reviewed by: markj
Sponsored by: Chelsio Communications, The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D32110
John Baldwin [Wed, 6 Oct 2021 21:08:47 +0000 (14:08 -0700)]
cryptodev: Allow some CIOCCRYPT operations with an empty payload.
If an operation would generate a MAC output (e.g. for a digest operation
or for an AEAD or EtA operation), then an empty payload buffer is
valid. Only reject requests with an empty buffer for "plain" cipher
sessions.
Some of the AES-CCM NIST KAT vectors use an empty payload.
While here, don't advance crp_payload_start for requests that use an
empty payload with an inline IV. (*)
Reported by: syzbot+d4b94fbd9a44b032f428@syzkaller.appspotmail.com (*)
Reviewed by: markj
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D32109
John Baldwin [Wed, 6 Oct 2021 21:08:46 +0000 (14:08 -0700)]
cryptodev: Permit explicit IV/nonce and MAC/tag lengths.
Add 'ivlen' and 'maclen' fields to the structure used for CIOGSESSION2
to specify the explicit IV/nonce and MAC/tag lengths for crypto
sessions. If these fields are zero, the default lengths are used.
This permits selecting an alternate nonce length for AEAD ciphers such
as AES-CCM which support multiple nonce lengths. It also supports
truncated MACs as input to AEAD or EtA requests.
Reviewed by: markj
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D32107
John Baldwin [Wed, 6 Oct 2021 21:08:46 +0000 (14:08 -0700)]
crypto: Permit variable-sized IVs for ciphers with a reinit hook.
Add a 'len' argument to the reinit hook in 'struct enc_xform' to
permit support for AEAD ciphers such as AES-CCM and Chacha20-Poly1305
which support different nonce lengths.
Reviewed by: markj
Sponsored by: Chelsio Communications, The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D32105
John Baldwin [Tue, 25 May 2021 23:59:19 +0000 (16:59 -0700)]
crypto: Add crypto_cursor_segment() to fetch both base and length.
This function combines crypto_cursor_segbase() and
crypto_cursor_seglen() into a single function. This is mostly
beneficial in the unmapped mbuf case where back to back calls of these
two functions have to iterate over the sub-components of unmapped
mbufs twice.
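The idea of returning both values from one lookup can be sketched with a toy cursor. All names below are hypothetical and the segment lookup here is trivial; in the unmapped mbuf case described above, locating the current segment is the expensive step that the combined call performs only once.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical cursor over a buffer made of discontiguous segments. */
struct toy_seg {
	void	*base;
	size_t	 len;
};

struct toy_cursor {
	const struct toy_seg *segs;
	size_t	nsegs;
	size_t	idx;	/* current segment */
	size_t	off;	/* offset within the current segment */
};

/*
 * Return a pointer to the current position and store the number of
 * contiguous bytes remaining in the segment in *len.
 */
static void *
toy_cursor_segment(const struct toy_cursor *c, size_t *len)
{
	assert(c != NULL && len != NULL);
	if (c->idx >= c->nsegs) {
		*len = 0;
		return (NULL);
	}
	*len = c->segs[c->idx].len - c->off;
	return ((char *)c->segs[c->idx].base + c->off);
}
```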
Bump __FreeBSD_version for crypto drivers in ports.
John Baldwin [Tue, 25 May 2021 23:59:18 +0000 (16:59 -0700)]
crypto: Add a new type of crypto buffer for a single mbuf.
This is intended for use in KTLS transmit where each TLS record is
described by a single mbuf that is itself queued in the socket buffer.
Using the existing CRYPTO_BUF_MBUF would result in
bus_dmamap_load_crp() walking additional mbufs in the socket buffer
that are not relevant, but generating a S/G list that potentially
exceeds the limit of the tag (while also wasting CPU cycles).
John Baldwin [Wed, 6 Oct 2021 21:08:46 +0000 (14:08 -0700)]
cryptodev: Use 'csp' in the handlers for requests.
- Retire cse->mode and use csp->csp_mode instead.
- Use csp->csp_cipher_algorithm instead of the ivsize when checking
for the fixup for the IV length for AES-XTS.
Reviewed by: markj
Sponsored by: Chelsio Communications, The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D32103
John Baldwin [Thu, 1 Apr 2021 22:42:30 +0000 (15:42 -0700)]
cryptocheck: Expand the set of sizes tested by -z.
Test individual sizes up to the max encryption block length as well as
a few sizes that include 1 full block and a partial block before
doubling the size.
John Baldwin [Thu, 1 Apr 2021 22:42:18 +0000 (15:42 -0700)]
ossl: Don't encrypt/decrypt too much data for chacha20.
The loops for Chacha20 and Chacha20+Poly1305 which encrypted/decrypted
full blocks of data used the minimum of the input and output segment
lengths to determine the size of the next chunk ('todo') to pass to
Chacha20_ctr32(). However, the input and output segments could extend
past the end of the ciphertext region into the tag (e.g. if a "plain"
single mbuf contained an entire TLS record). If the length of the tag
plus the length of the last partial block together were at least as
large as a full Chacha20 block (64 bytes), then an extra block was
encrypted/decrypted overlapping with the tag. Fix this by also
capping the amount of data to encrypt/decrypt by the amount of
remaining data in the ciphertext region ('resid').
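The chunk sizing before and after the fix can be sketched as follows. The function names are hypothetical and this is not the ossl(4) code; it only models how 'todo' is bounded and rounded down to whole 64-byte blocks.

```c
#include <assert.h>
#include <stddef.h>

#define CHACHA_BLK	64u	/* ChaCha20 block size in bytes */

static size_t
min_size(size_t a, size_t b)
{
	return (a < b ? a : b);
}

/* Before the fix: only the segment lengths bound the chunk. */
static size_t
chunk_old(size_t inlen, size_t outlen)
{
	size_t todo = min_size(inlen, outlen);

	return (todo - todo % CHACHA_BLK);
}

/* After the fix: also bounded by the remaining ciphertext ('resid'). */
static size_t
chunk_new(size_t inlen, size_t outlen, size_t resid)
{
	size_t todo = min_size(resid, min_size(inlen, outlen));

	assert(todo <= resid);	/* never reaches into the tag */
	return (todo - todo % CHACHA_BLK);
}
```

With 50 bytes of ciphertext left and a 16-byte tag in the same 66-byte segment, chunk_old(66, 66) returns 64 and would overlap the tag, while chunk_new(66, 66, 50) returns 0 and leaves the final 50 bytes to the partial-block path.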
John Baldwin [Fri, 5 Mar 2021 17:47:58 +0000 (09:47 -0800)]
poly1305: Don't export generic Poly1305_* symbols from xform_poly1305.c.
There currently isn't a need to provide a public interface to a
software Poly1305 implementation beyond what is already available via
libsodium's APIs and these symbols conflict with symbols shared within
the ossl.ko module between ossl_poly1305.c and ossl_chacha20.c.
Reported by: se, kp
Fixes: 78991a93eb9d
Sponsored by: Netflix
John Baldwin [Thu, 18 Feb 2021 17:22:18 +0000 (09:22 -0800)]
Add an implementation of CHACHA20_POLY1305 to cryptosoft.
This uses the chacha20 IETF and poly1305 implementations from
libsodium. A separate auth_hash is created for the auth side whose
Setkey method derives the poly1305 key from the AEAD key and nonce as
described in RFC 8439.
Philip Paeps [Mon, 18 Oct 2021 06:19:42 +0000 (14:19 +0800)]
contrib/tzdata: correct DST in Fiji
Direct commit to stable/13.
Unfortunately, there is still no clear consensus on the tz mailing list
about some of the changes introduced by tzdata 2021b and later releases.
Pending consensus, only merge the recently announced DST transition date
for Fiji and corrections to commentary from tzdata 2021d. This corrects
future timestamps in Fiji.
Navdeep Parhar [Thu, 24 Jun 2021 20:05:57 +0000 (13:05 -0700)]
cxgbe(4): Do not configure traffic classes automatically on attach.
The driver used to configure all available classes with some default
parameters on attach and the rest of t4_sched.c was written with the
assumption that all traffic classes are always valid in the hardware.
But this resulted in a lot of informational messages being logged in the
firmware's circular log, crowding out other more useful messages.
This change leaves the tx scheduler alone during attach to reduce the
spam in the devlog. The state of every class is now tracked separately
from its flags and there is support for an 'uninitialized' state.
Navdeep Parhar [Tue, 22 Jun 2021 05:07:56 +0000 (22:07 -0700)]
cxgbe(4): Get the number of usable traffic classes from the firmware.
Recent firmwares are able to utilize the traffic classes of tx channels
that were previously unused. This effectively doubles the number of
traffic classes available per port for 2 port cards. Stop using the raw
per-channel value in the driver and ask the firmware for the number of
usable traffic classes instead.
Navdeep Parhar [Tue, 25 May 2021 20:47:06 +0000 (13:47 -0700)]
cxgbe(4): Update firmwares to 1.25.6.0.
Changes since 1.25.0.0 are listed here. This list comes from the
Release Notes for the "Chelsio Unified Wire v3.14.0.3 for Linux"
release dated 2021-05-21.
Fixes
-----
BASE:
- Fixed Back to back T6 100G-CR4 link coming up with NO FEC sometimes.
- [T5] Try to bring up link in 1G speed if link doesn't come up on 10G.
- Fixed a bug to not allow BaseR fec in 100G speed.
- Fixed linkup issues on BT adapter in 1G and 100M speed.
- Fixed an issue to allow driver to send VI_ENABLE multiple times (once
with rx disable and then later rx enable).
- Fixed rate limiting not working on class number 16 to 30.
- Fixed backward compatibility issue in port type interpretation with vpd
version 0x80.
ETH:
- Fixed a case when firmware failed to deliver NIC WR completion to host.
- No rate limit support for WR ETH_TX_PKTS2 due to performance reasons.
OFLD:
- Fixed a connection hang in SO adapters when tp_plen_max (set by driver)
is more than the window size.
- Added fw_filter_vnic_mode to firmware API file (t4fw_interface.h)
- Use correct rx channel in coprocessor crypto completion (CPL_FW6_PLD). This
was causing out of order completion to host.
FOiSCSI:
- Fixed a crash due to unaligned access of ipv6 address.
- Fixed a crash during lun reset.
Enhancements
------------
ETH:
- Rate limiting support added for encapsulated (vxlan, nvgre, geneve) NIC TCP
packets.
OFLD:
- More than 128 SGLs supported in FW_RI_FR_NSMR_WR. Now, more than 16GB
(up to 64GB) of PBLs can be written with a single FW_RI_FR_NSMR_WR.
Navdeep Parhar [Thu, 27 May 2021 02:18:42 +0000 (19:18 -0700)]
cxgbe(4): Fix an incorrect assert.
CTRL and OFLD tx queues do not have automatic tx credit flush enabled so
it is okay for the cidx not to be the same as the pidx when the queue is
destroyed.
Navdeep Parhar [Sun, 23 May 2021 21:58:29 +0000 (14:58 -0700)]
cxgbe(4): Overhaul CLIP (Compressed Local IPv6) table management.
- Process the list of local IPs once instead of once per adapter. Add
addresses from all VNETs to the driver's list but leave hardware
updates for later when the global VNET/IFADDR list locks have been
released.
- Add address to the hardware table synchronously when a CLIP entry is
requested for an address that's not already in there.
- Provide ioctls that allow userspace tools to manage addresses in the
CLIP table.
- Add a knob (hw.cxgbe.clip_db_auto) that controls whether local IPs are
automatically added to the CLIP table or not.
cxgbe(4): Add support for NIC suspend/resume and live reset.
Add suspend/resume callbacks to the driver and a live reset built around
them. This commit covers the basic NIC and future commits will expand
this functionality to other stateful parts of the chip. Suspend and
resume operate on the chip (the t?nex nexus device) and affect all its
ports. It is not possible to suspend/resume or reset individual ports.
All these operations can be performed on a running NIC. A reset will
look like a link bounce to the networking stack.
Here are some ways to exercise this functionality:
/* Manual reset with driver sysctl. */
# sysctl dev.t6nex.0.reset=1
/* Automatic adapter reset on any fatal error. */
# hw.cxgbe.reset_on_fatal_err=1
Suspend disables the adapter (DMA, interrupts, and the port PHYs) and
marks the hardware as unavailable to the driver. All ifnets associated
with the adapter are still visible to the kernel but operations that
require hardware interaction will fail with ENXIO. All ifnets report
link-down while the adapter is suspended.
Resume will reattach to the card, reconfigure it as before, and recreate
the queues servicing the existing ifnets. The ifnets are able to send
and receive traffic as soon as the link comes back up.
Reset is roughly the same as a suspend and a resume with at least one of
these events in between: D0->D3Hot->D0, FLR, PCIe link retrain.
cxgbe(4): Separate the sw- and hw-specific parts of resource allocations
The driver uses both software resources (locks, callouts, memory for
descriptors and for bookkeeping, sysctls, etc.) and hardware resources
(VIs, DMA queues, TCAM entries, etc.) to operate the NIC. This commit
splits the single *_ALLOCATED flag used to track all these resources
into separate *_SW_ALLOCATED and *_HW_ALLOCATED flags.
This is the simplified pseudocode that now applies to most queues (foo
can be ctrlq/txq/rxq/ofld_txq/ofld_rxq):
/* Idempotent */
alloc_foo
{
    if (!SW_ALLOCATED)
        init_iq/init_eq/init_fl          no-fail sw init
        alloc_iq_fl/alloc_eq/alloc_wrq   may-fail sw alloc
        add_foo_sysctls, etc.            no-fail post-alloc items
    if (!HW_ALLOCATED)
        alloc_iq_fl_hwq/alloc_eq_hwq     hw resource allocation
}
/* Idempotent */
free_foo
{
    if (HW_ALLOCATED)
        free_iq_fl_hwq/free_eq_hwq       release hw resources
    if (SW_ALLOCATED)
        free_iq_fl/free_eq/free_wrq      release sw resources
}
The routines that take the driver to FULL_INIT_DONE and VI_INIT_DONE and
back are now all idempotent. The quiesce routines pay attention to the
HW_ALLOCATED flag and will not wait on the hardware for pidx/cidx
updates and other completions if this flag is not set.
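The two-flag scheme can be modeled in a few lines of C. Everything below is a hypothetical sketch, not driver code: the struct, the counters, and the function names exist only to show that repeated calls are harmless and that hardware state can be dropped and recreated without touching software state. The free path here releases a resource only when its flag is set.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_SW_ALLOCATED	0x1
#define TOY_HW_ALLOCATED	0x2

struct toy_queue {
	int flags;
	int sw_inits;	/* times software state was set up */
	int hw_allocs;	/* times hardware resources were allocated */
};

/* Idempotent: sets up only the parts that are missing. */
static void
toy_alloc_queue(struct toy_queue *q)
{
	assert(q != NULL);
	if (!(q->flags & TOY_SW_ALLOCATED)) {
		q->sw_inits++;		/* locks, memory, sysctls, ... */
		q->flags |= TOY_SW_ALLOCATED;
	}
	if (!(q->flags & TOY_HW_ALLOCATED)) {
		q->hw_allocs++;		/* hardware queue, ... */
		q->flags |= TOY_HW_ALLOCATED;
	}
}

/* Release hardware resources only, as a suspend-like step would. */
static void
toy_free_queue_hw(struct toy_queue *q)
{
	assert(q != NULL);
	if (q->flags & TOY_HW_ALLOCATED)
		q->flags &= ~TOY_HW_ALLOCATED;
}

/* Idempotent: releases hardware first, then software. */
static void
toy_free_queue(struct toy_queue *q)
{
	toy_free_queue_hw(q);
	if (q->flags & TOY_SW_ALLOCATED)
		q->flags &= ~TOY_SW_ALLOCATED;
}
```

Calling toy_alloc_queue() twice sets up each part once. Calling toy_free_queue_hw() and then toy_alloc_queue() again reallocates only the hardware part, which is the pattern a resume after suspend needs.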
Rick Macklem [Wed, 13 Oct 2021 00:21:01 +0000 (17:21 -0700)]
nfscl: Fix another deadlock related to the NFSv4 clientID lock
Without this patch, it is possible to hang the NFSv4 client,
when a rename/remove is being done on a file where the client
holds a delegation, if pNFS is being used. For a delegation
to be returned, dirty data blocks must be flushed to the NFSv4
server. When pNFS is in use, a shared lock on the clientID
must be acquired while doing a write to the DS(s).
However, if rename/remove is doing the delegation return
an exclusive lock will be acquired on the clientID, preventing
the write to the DS(s) from acquiring a shared lock on the clientID.
This patch stops rename/remove from doing a delegation return
if pNFS is enabled. Since doing delegation return in the same
compound as rename/remove is only an optimization, not doing
so should not cause problems.
This problem was detected during a recent NFSv4 interoperability
testing event held by the IETF working group.
Rick Macklem [Tue, 12 Oct 2021 04:58:24 +0000 (21:58 -0700)]
nfscl: Fix a deadlock related to the NFSv4 clientID lock
Without this patch, it is possible for a process doing an NFSv4
Open/create of a file to block to allow another process
to acquire the exclusive lock on the clientID when holding
a shared lock on the clientID. As such, both processes
deadlock, with one wanting the exclusive lock, while the
other holds the shared lock. This deadlock is unlikely to occur
unless delegations are in use on the NFSv4 mount.
This patch fixes the problem by not deferring to the process
waiting for the exclusive lock when a shared lock (reference cnt)
is already held by the process.
This problem was detected during a recent NFSv4 interoperability
testing event held by the IETF working group.
Mark Johnston [Wed, 13 Oct 2021 00:11:02 +0000 (20:11 -0400)]
mount: Check for !VDIR mount points before handling -o emptydir
To implement -o emptydir, vfs_emptydir() checks that the passed
directory is empty. This should be done after checking whether the
vnode is of type VDIR, though, or vfs_emptydir() may end up calling
VOP_READDIR on a non-directory.
Reported by: syzbot+4006732c69fb0f792b2c@syzkaller.appspotmail.com
Reviewed by: kib, imp
Sponsored by: The FreeBSD Foundation
Mark Johnston [Fri, 17 Sep 2021 14:44:23 +0000 (10:44 -0400)]
libc/locale: Fix races between localeconv(3) and setlocale(3)
Each locale embeds a lazily initialized lconv which is populated by
localeconv(3) and localeconv_l(3). When setlocale(3) updates the global
locale, the lconv needs to be (lazily) reinitialized. To signal this,
we set flag variables in the locale structure. There are two problems:
- The flags are set before the locale is fully updated, so a concurrent
localeconv() call can observe partially initialized locale data.
- No barriers ensure that localeconv() observes a fully initialized
locale if a flag is set.
So, move the flag update appropriately, and use acq/rel barriers to
provide some synchronization. Note that this is inadequate in the face
of multiple concurrent calls to setlocale(3), but this is not expected
to work regardless.
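The ordering described above can be sketched with C11 atomics. The structure and function names are hypothetical stand-ins, not the libc implementation, and the sketch shares the stated limitation: it orders one writer against readers but does not make concurrent writers, or concurrent refreshes of the cache, safe.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical stand-in for a locale with a lazily refreshed cache. */
struct toy_locale {
	int		data;		/* written by the setlocale-like path */
	int		cached;		/* derived lazily by the reader */
	atomic_int	changed;	/* nonzero when the cache is stale */
};

/* Writer: finish the update first, then publish the flag (release). */
static void
toy_setlocale(struct toy_locale *l, int value)
{
	assert(l != NULL);
	l->data = value;
	atomic_store_explicit(&l->changed, 1, memory_order_release);
}

/* Reader: an acquire load of the flag makes the new data visible. */
static int
toy_localeconv(struct toy_locale *l)
{
	assert(l != NULL);
	if (atomic_load_explicit(&l->changed, memory_order_acquire) != 0) {
		l->cached = l->data;
		atomic_store_explicit(&l->changed, 0, memory_order_relaxed);
	}
	return (l->cached);
}
```

Setting the flag before writing the data, or loading it without acquire semantics, would allow a reader to see the flag set while still observing the old or partially written data.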
Thanks to Henry Hu <henry.hu.sh@gmail.com> for providing a test case
demonstrating the race.
John Baldwin [Fri, 11 Jun 2021 21:56:28 +0000 (14:56 -0700)]
Remove 'make update'.
In the CVS days this used to be a wrapper around either CVS or CVSup and
used to support updating src, doc, and ports checkouts. With the move
to subversion this only supported updating src and was itself a
wrapper around 'svn update'. With Git, users are probably better off
using appropriate Git commands directly to update without needing an
explicit make target as a wrapper.
Rick Macklem [Mon, 27 Sep 2021 01:37:25 +0000 (18:37 -0700)]
nfscl: Add a check for "has acquired a delegation" to nfscl_removedeleg()
Commit 5e5ca4c8fc53 added a flag to an NFSv4 mount point that is set when
the first delegation is acquired from the NFSv4 server.
For a common case where delegations are not being issued by the
NFSv4 server, the nfscl_removedeleg() code acquires the mutex lock for
open/lock state, finds the delegation list empty, then just unlocks the
mutex and returns. This patch adds a check of the flag to avoid the
need to acquire the mutex for this common case.
This change appears to be performance neutral for a small number
of opens, but should reduce lock contention for a large number of opens
for the common case where the server is not issuing delegations.
This commit should not affect the high level semantics of delegation
handling.
Alexander Motin [Tue, 5 Oct 2021 19:01:16 +0000 (15:01 -0400)]
cam(4): Limit search for disks in SES enclosure by single bus
At least for SAS, the only transport we currently support, disks are
typically connected to the same bus as the enclosure. Limiting the search
scope makes it much faster on systems with multiple buses and
thousands of disks.
Alexander Motin [Tue, 5 Oct 2021 18:54:03 +0000 (14:54 -0400)]
cam(4): Improve XPT_DEV_MATCH
Remove the *_MATCH_NONE enums, which made no sense and so were never used.
Make the *_MATCH_ANY enums 0 (no match flags set), the value previously
used by *_MATCH_NONE. Bump CAM_VERSION to 0x1a to reflect those changes and
add compat shims.
When traversing buses and devices, do not descend if we can
already see that the requested pattern does not match the bus or device.
This saves a significant amount of time on systems with thousands
of disks when doing limited searches.
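The pruning can be illustrated with a toy two-level tree. The types and names below are hypothetical and far simpler than the CAM structures; the point is only that a bus excluded by the pattern is skipped without visiting its devices.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_ANY_BUS	(-1)

struct toy_bus {
	int		 path_id;
	const int	*target_ids;
	size_t		 ntargets;
};

/*
 * Count devices with the given target id on bus path_id (or on any
 * bus), recording in *examined how many devices were visited.
 */
static size_t
toy_match_devices(const struct toy_bus *buses, size_t nbuses, int path_id,
    int target, size_t *examined)
{
	size_t b, d, found = 0;

	assert(examined != NULL);
	*examined = 0;
	for (b = 0; b < nbuses; b++) {
		/* Prune: do not descend into a bus the pattern excludes. */
		if (path_id != TOY_ANY_BUS && buses[b].path_id != path_id)
			continue;
		for (d = 0; d < buses[b].ntargets; d++) {
			(*examined)++;
			if (buses[b].target_ids[d] == target)
				found++;
		}
	}
	return (found);
}
```

Restricting the search to one bus makes the number of devices examined proportional to that bus alone rather than to the whole system.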
Mike Karels [Sun, 5 Sep 2021 18:14:04 +0000 (13:14 -0500)]
Change lowest address on subnet (host 0) not to broadcast by default.
The address with a host part of all zeros was used as a broadcast long
ago, but the default has been all ones since 4.3BSD and RFC1122. Until
now, we would broadcast the host zero address as well as the configured
address. Change to not broadcasting that address by default, but add a
sysctl (net.inet.ip.broadcast_lowest) to re-enable it. Note that the
correct way to use the zero address for broadcast would be to configure
it as the broadcast address for the network.
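The new default can be expressed as a small predicate. This is a hypothetical sketch, not the kernel's code: it assumes the configured broadcast address is the all-ones host address, takes addresses in host byte order, and ignores special cases such as point-to-point links.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Return 1 if dst is treated as a broadcast on the subnet of addr/mask.
 * broadcast_lowest models the net.inet.ip.broadcast_lowest knob.
 */
static int
is_subnet_broadcast(uint32_t dst, uint32_t addr, uint32_t mask,
    int broadcast_lowest)
{
	uint32_t net = addr & mask;

	assert(mask != 0xffffffffu);	/* a /32 has no host part */
	if (dst == (net | ~mask))
		return (1);		/* all-ones host part */
	if (broadcast_lowest && dst == net)
		return (1);		/* host 0, only when re-enabled */
	return (0);
}
```

For an interface at 192.0.2.10/24, 192.0.2.255 is always a broadcast, while 192.0.2.0 is one only when the knob is set.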
See https://datatracker.ietf.org/doc/draft-schoen-intarea-lowest-address/
and the discussion in https://reviews.freebsd.org/D19316. Note, Linux
now implements this.