John Baldwin [Fri, 6 Aug 2021 21:03:00 +0000 (14:03 -0700)]
iSCSI: Add support for segmentation offload for hardware offloads.
Similar to TSO, iSCSI segmentation offload permits the upper layers to
submit a "large" virtual PDU which is split up into multiple segments
(PDUs) on the wire. Similar to how the TCP/IP headers are used as
templates for TSO, the BHS at the start of a large PDU is used as a
template to construct the specific BHS at the start of each PDU. In
particular, the DataSN is incremented for each subsequent PDU, and the
'F' flag is only set on the last PDU.
struct icl_conn has a new 'ic_hw_isomax' field which defaults to 0,
but can be set to the largest virtual PDU a backend supports. If this
value is non-zero, the iSCSI target and initiator use this size
instead of 'ic_max_send_data_segment_length' to determine the maximum
size for SCSI Data-In and SCSI Data-Out PDUs. Note that since PDUs
can be constructed from multiple buffers before being dispatched, the
target and initiator must wait for the PDU to be fully constructed
before determining how many DataSN values were consumed (and thus
updating the per-transfer DataSN value used for the start of the next
PDU).
The target generates large PDUs for SCSI Data-In PDUs in
cfiscsi_datamove_in(). The initiator generates large PDUs for SCSI
Data-Out PDUs generated in response to an R2T.
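The splitting described above can be sketched in plain C. This is only an
illustration under stated assumptions: struct sketch_bhs and
sketch_iso_split() are hypothetical names, the header is reduced to the four
fields the message mentions, and a real backend performs this in hardware.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, reduced stand-in for a SCSI Data-In/Data-Out BHS. */
struct sketch_bhs {
    bool     final;          /* the 'F' flag */
    uint32_t data_sn;        /* DataSN */
    uint32_t buffer_offset;  /* offset of this segment in the transfer */
    uint32_t data_len;       /* data segment length in bytes */
};

/*
 * Split one "large" virtual PDU described by tmpl into wire PDUs of at
 * most max_seg bytes each, using tmpl as the header template.
 * Returns the number of DataSN values consumed, or 0 if the arguments
 * are invalid or the out array (nout entries) is too small.
 */
size_t
sketch_iso_split(const struct sketch_bhs *tmpl, uint32_t max_seg,
    struct sketch_bhs *out, size_t nout)
{
    uint32_t remaining = tmpl->data_len;
    uint32_t off = 0;
    size_t n = 0;

    if (max_seg == 0 || remaining == 0)
        return (0);
    while (remaining > 0) {
        uint32_t len = remaining < max_seg ? remaining : max_seg;

        if (n == nout)
            return (0);
        out[n] = *tmpl;                               /* copy the template */
        out[n].data_sn = tmpl->data_sn + (uint32_t)n; /* one DataSN per PDU */
        out[n].buffer_offset = tmpl->buffer_offset + off;
        out[n].data_len = len;
        /* 'F' survives only on the last PDU, and only if the template had it. */
        out[n].final = tmpl->final && remaining == len;
        off += len;
        remaining -= len;
        n++;
    }
    return (n);
}
```

The return value is what the caller would add to its per-transfer DataSN
before building the next large PDU.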
Kyle Evans [Thu, 18 Feb 2021 04:10:46 +0000 (22:10 -0600)]
pkg: use specific CONFSNAME_${file} for FreeBSD.conf
Setting CONFSNAME directly is a little more complicated for downstream
consumers, as any additional CONFS that are added here will inherit the
group name by default. This is perhaps arguably a design flaw in CONFS
because inheriting NAME will never give a good result when additional
files are added, but this is a low-effort change.
While we're here, pull FreeBSD.conf.${branch} out into a PKGCONF
variable so one can just drop a new repo config in entirely with a new
naming scheme. CONFSNAME gets set based on chopping anything off after
".conf", so that, e.g.:
Kyle Evans [Thu, 18 Feb 2021 03:41:53 +0000 (21:41 -0600)]
pkg: allow multiple add arguments again
While pkg(7) add only handles a single 'add' argument, pkg-add(8) fully
handles multiple arguments.
Stop rejecting it, just turn off local-bootstrap mode and proceed to
remote bootstrap if we need it.
While we're here, check if the first argument to pkg add is even a pkg
package. If it's not, also do remote bootstrap instead. Future work
could improve this altogether by picking a pkg package out of many,
bootstrapping locally, and then passing the rest through to the newly
installed pkg.
Reviewed by: bapt, manu (earlier version)
MFC after: 3 days
Differential Revision: https://reviews.freebsd.org/D28766
Dimitry Andric [Thu, 5 Aug 2021 18:57:22 +0000 (20:57 +0200)]
Add ElfW() macro for compatibility with Linux
Some Linux software using ELF headers assumes the existence of an
ElfW(type) macro, which concatenates 'Elf', the default ELF word size,
and the given type. This is identical to our __ElfN(x) macro in
<sys/elf_generic.h>. Add the macro for compatibility, with a comment
that we prefer the __ElfN() macro for FreeBSD.
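The token pasting such a macro relies on can be sketched without any ELF
headers. All names below (SketchW, SKETCH_WORD_SIZE, the Sketch64_Ehdr and
Sketch32_Ehdr types) are hypothetical stand-ins, not the definitions added by
this commit.

```c
#include <stdint.h>

/* Hypothetical stand-ins for the 32-bit and 64-bit header types. */
typedef struct { uint32_t e_entry; } Sketch32_Ehdr;
typedef struct { uint64_t e_entry; } Sketch64_Ehdr;

/* Assumed default word size for this sketch. */
#define SKETCH_WORD_SIZE 64

/*
 * Two levels of macros are needed so that SKETCH_WORD_SIZE is expanded
 * to 64 before the ## operator pastes the tokens together.
 */
#define SKETCH_CONCAT1(prefix, size, type) prefix##size##_##type
#define SKETCH_CONCAT(prefix, size, type)  SKETCH_CONCAT1(prefix, size, type)
#define SketchW(type) SKETCH_CONCAT(Sketch, SKETCH_WORD_SIZE, type)

/* Expands to: Sketch64_Ehdr sketch_header; */
SketchW(Ehdr) sketch_header;
```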
Emmanuel Vadot [Fri, 6 Aug 2021 13:20:06 +0000 (15:20 +0200)]
modules: felix: Add needed dependencies
Modules should list all needed _if dependencies in their makefile; otherwise,
if one builds a kernel that didn't compile those files, the module won't build.
Emmanuel Vadot [Fri, 6 Aug 2021 12:36:06 +0000 (14:36 +0200)]
pkgbase: locales: Also tag the files dir
Otherwise bsd.dirs.mk will create the directory with the default
package (utilities) and we end up with a bunch of empty dirs managed
by this package while it shouldn't be the case.
Add check that ifp supports IPv6 multicasts in in6_getmulti.
This fixes a panic when a user application tries to join a multicast
group on an interface that doesn't support IPv6 multicasts, like
IFT_PFLOG interfaces.
Update the TCP LRO code to handle both encrypted and un-encrypted traffic.
Encrypted and un-encrypted traffic needs to be coalesced separately.
Split the 16-bit lro_type field in the address information into two
8-bit fields, and then use the last 8-bit field for flags, which among
other things indicate whether the received mbuf is encrypted or un-encrypted.
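As a rough illustration of why a flags byte keeps the two kinds of traffic
apart, consider the sketch below. The struct layout, the flag value and the
function name are all assumptions made for this example, not the actual LRO
definitions.

```c
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_LRO_FLAG_DECRYPTED 0x01  /* hypothetical flag bit */

/* Hypothetical, reduced form of the per-flow address information. */
struct sketch_lro_addr {
    uint8_t lro_type;   /* protocol type, now 8 bits wide */
    uint8_t lro_flags;  /* per-packet flags, e.g. encryption state */
};

/* Two packets may share a coalescing entry only if type and flags match. */
bool
sketch_lro_same_bucket(const struct sketch_lro_addr *a,
    const struct sketch_lro_addr *b)
{
    return (a->lro_type == b->lro_type && a->lro_flags == b->lro_flags);
}
```

Because the flags take part in the comparison, an encrypted and an
un-encrypted packet of the same flow never land in the same entry.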
Kyle Evans [Thu, 5 Aug 2021 18:39:36 +0000 (13:39 -0500)]
ctypedef: fix installation of C.UTF-8
The appropriate directory and name were assigned to the FILESDIR
grouping, but not the ALWAYS grouping where C.UTF-8 is actually
assigned. Add the appropriate bits for ALWAYSDIR, and remove an
obsolete *PACKAGE= assignment since C.UTF-8 is explicitly not included
in FILES.
Prior to this change, C.UTF-8 was being installed as
/usr/share/C.UTF-8.LC_CTYPE.
Reviewed by: manu
Fixes: 0fa5403d493b ("pkgbase: move locales into their own package")
Differential Revision: https://reviews.freebsd.org/D31429
Kyle Evans [Thu, 5 Aug 2021 18:37:18 +0000 (13:37 -0500)]
pkgbase: fix locale packages
Most places spelled it -locales, but numericdef spelled it as -locale
in just this one place. Pluralize it.
Reviewed by: emaste, manu
Fixes: 0fa5403d493b ("pkgbase: move locales into their own package")
Differential Revision: https://reviews.freebsd.org/D31428
Andrew Gallatin [Thu, 5 Aug 2021 21:19:12 +0000 (17:19 -0400)]
ktls: Use the new PNOLOCK flag
Use the new PNOLOCK flag to tsleep() to indicate that
we are managing potential races, and don't need to
sleep with a lock, or have a backstop timeout.
Andrew Gallatin [Thu, 5 Aug 2021 21:16:30 +0000 (17:16 -0400)]
tsleep: Add a PNOLOCK flag
Add a PNOLOCK flag so that, in the circumstance where
wakeup races are externally mitigated, tsleep() can be
called with a sleep time of 0 without triggering an
assertion.
Ka Ho Ng [Thu, 5 Aug 2021 15:20:42 +0000 (23:20 +0800)]
Add fspacectl(2), vn_deallocate(9) and VOP_DEALLOCATE(9).
fspacectl(2) is a system call to provide space management support to
userspace applications. VOP_DEALLOCATE(9) is a VOP call to perform the
deallocation. vn_deallocate(9) is a public KPI for kmods' use.
The purpose of proposing a new system call, a KPI and a VOP call is to
allow bhyve or other hypervisor monitors to emulate the behavior of SCSI
UNMAP/NVMe DEALLOCATE on a plain file.
fspacectl(2) takes cmd and flags parameters to specify the
space management operation to be performed. Currently cmd has to be
SPACECTL_DEALLOC, and flags has to be 0.
fo_fspacectl is added to fileops.
VOP_DEALLOCATE(9) is added as a new VOP call. A trivial implementation
of VOP_DEALLOCATE(9) is provided.
Sponsored by: The FreeBSD Foundation
Reviewed by: kib
Differential Revision: https://reviews.freebsd.org/D28347
Ka Ho Ng [Wed, 4 Aug 2021 19:20:37 +0000 (03:20 +0800)]
Add vnode_pager_purge_range(9) KPI
This KPI is created in addition to the existing vnode_pager_setsize(9)
KPI. The KPI is intended for file systems that are able to turn a range
of a file into a sparse range, also known as hole-punching.
Sponsored by: The FreeBSD Foundation
Reviewed by: kib
Differential Revision: https://reviews.freebsd.org/D27194
Andrew Gallatin [Thu, 5 Aug 2021 14:15:09 +0000 (10:15 -0400)]
ktls: start a thread to keep the 16k ktls buffer zone populated
Ktls recently received an optimization where we allocate 16k
physically contiguous crypto destination buffers. This provides a
large (more than 5%) reduction in CPU use in our
workload. However, after several days of uptime, the performance
benefit disappears because we have frequent allocation failures
from the ktls buffer zone.
It turns out that when load drops off, the ktls buffer zone is
trimmed, and some 16k buffers are freed back to the OS. When load
picks back up again, re-allocating those 16k buffers fails after
some number of days of uptime because physical memory has become
fragmented. This causes allocations to fail, because they are
intentionally done with M_NORECLAIM, so as to avoid pausing
the ktls crypto work thread while the VM system defragments
memory.
To work around this, this change starts one thread per VM domain
to allocate ktls buffers without M_NORECLAIM, as we don't care if
this thread is paused while memory is defragged. The thread then
frees the buffers back into the ktls buffer zone, thus allowing
future allocations to succeed.
Note that waking up the thread is intentionally racy, but neither
of the races really matter. In the worst case, we could have
either spurious wakeups or we could have to wait 1 second until
the next rate-limited allocation failure to wake up the thread.
This patch has been in use at Netflix on a handful of servers,
and seems to fix the issue.
Use newly-created llentry_request_feedback(),
llentry_mark_used() and llentry_get_hittime() to
request datapath usage check and fetch the results
in the same fashion both in IPv4 and IPv6.
While here, simplify llentry_provide_feedback() wrapper
by eliminating 1 condition check.
Ed Maste [Mon, 1 Mar 2021 17:25:22 +0000 (12:25 -0500)]
Use compressed debug in standalone userland debug files by default
The compiler supports CFLAGS=-gz=zlib to compress .debug sections in
object files, libraries, and binaries. Enable it to reduce disk usage
for standalone debug files (and /usr/obj).
Reviewed by: dim, kevans
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D29002
Mitchell Horne [Wed, 4 Aug 2021 18:18:18 +0000 (15:18 -0300)]
arm: enable stack-smashing protection
With current generation clang/llvm it can pass all of our tests in
libc/ssp.
While here, remove the extra MACHINE_CPUARCH check for mips. SSP is
included in BROKEN_OPTIONS for this architecture in src.opts.mk, which
is enough to ensure normal builds won't set SSP_CFLAGS.
Mitchell Horne [Wed, 4 Aug 2021 17:37:05 +0000 (14:37 -0300)]
hwpmc_intel: assert for correct nclasses value
This variable is set based on the exact CPU model detected. If this
value is set too small, it could lead to a NULL-dereference from an
improperly initialized pmc_rowindex_to_classdep array.
Though it has been fixed, this was previously the case for Broadwell.
Add two asserts to catch this in DEBUG kernels, as it represents a
configuration error that may be hard to uncover otherwise.
PR: 253687
Reported by: Zhenlei Huang <zlei.huang@gmail.com>
Sponsored by: The FreeBSD Foundation
Mitchell Horne [Wed, 4 Aug 2021 17:31:36 +0000 (14:31 -0300)]
hwpmc: disable uncore class on Sandy Bridge and newer
It was written for Nehalem and Westmere, with minor but incomplete
updates for Sandy Bridge in 78d763a29b15. The uncore architecture
changed significantly with this generation, bringing new layouts and
locations for some MSRs.
Misprogramming these MSRs in ucp_start_pmc() may panic the system, and
this is trivially reproducible via pmcstat(8) on at least Broadwell and
Haswell. Disable the class on these CPUs until it can be updated more
completely and leave a TODO comment detailing some of the work required.
Note that the nclasses value for Broadwell was already incorrect and
doesn't need changing.
The result is that any uncore events listed by pmcstat -L will no longer
be allocatable, but this is already the case for newer generations of
Intel CPUs.
Goran Mekić [Wed, 4 Aug 2021 10:04:54 +0000 (18:04 +0800)]
sound: Add an example of basic sound application
This is an example demonstrating the usage of the OSS-compatible APIs
provided by the sound(4) subsystem. It reads frames from a dsp node and
writes them to the same dsp node.
The current POSIX standard requires fork() to be async-signal safe. Neither
our implementation nor implementations in other operating systems are,
and practically it is impossible to make fork() async-signal safe without
too much effort. Also, that would put an undue requirement that all atfork
handlers be async-signal safe as well, which contradicts their main
use.
As a result, the Austin Group dropped the requirement and added a new function,
_Fork(), that should be async-signal safe but does not call atfork
handlers. Basically, _Fork() can be implemented as a raw syscall.
The glibc 2.34 release added _Fork(); do the same for FreeBSD.
Clarify threading behavior for fork() in the manpage.
Reviewed by: markj
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
Differential revision: https://reviews.freebsd.org/D31378
The way SAMEDIRS was defined was an abuse of bsd.dirs.mk, resulting in
all the directories being created in one single command, but DESTDIR is
only prepended once, to the first element of the list.
Andrew Turner [Fri, 16 Jul 2021 13:49:33 +0000 (13:49 +0000)]
Move setting arm64 HWCAP values to the ID tables
The HWCAPS values are based on the ID registers. Move setting these
to the existing ID register parsing code.
Previously we would need to handle all possible ID field values where
a HWCAP is set. However, most ID fields follow a scheme where
incrementing the field only adds new features, so we only
need to check whether the field has reached the value at which the
HWCAP feature was added.
While here, stop setting HWCAP values that need kernel support when
this support is missing.
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D31201
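The "field has reached a minimum value" check lends itself to a table-driven
loop. The sketch below is not the kernel's ID table code: the struct, the
function name and the bit values in the usage are hypothetical, and it
assumes unsigned 4-bit ID fields that only ever gain features as they
increase.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical table entry: report hwcap once the field reaches min_value. */
struct sketch_field_hwcap {
    unsigned shift;      /* bit position of the 4-bit ID field */
    unsigned min_value;  /* field value at which the feature first appeared */
    uint64_t hwcap;      /* capability bit to report to userland */
};

/* Derive the HWCAP mask for one ID register from a table of entries. */
uint64_t
sketch_parse_hwcap(uint64_t id_reg, const struct sketch_field_hwcap *tbl,
    size_t n)
{
    uint64_t caps = 0;

    for (size_t i = 0; i < n; i++) {
        unsigned val = (unsigned)((id_reg >> tbl[i].shift) & 0xf);

        /* One comparison covers this value and every later one. */
        if (val >= tbl[i].min_value)
            caps |= tbl[i].hwcap;
    }
    return (caps);
}
```

A field that grows two features over time simply gets two table entries with
different min_value settings, so no per-value switch is needed.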
locales: stop hardcoding the directories in the mtree
The framework knows how to create directories and tag them properly
for the creation of an mtree; there is no need to hardcode all the locales
entries in bsd.usr.mk
This simplifies the addition of new locales and also allows people building
with WITHOUT_LOCALES to avoid ending up with a directory full of empty files
enetc: Support building the driver as a loadable module.
Function level reset has to be done in attach in order to put the
hardware in a known state before configuring it.
The order of DRIVER_MODULEs was changed to ensure that the miibus driver
is loaded when mii_attach is called.
Obtained from: Semihalf
Sponsored by: Alstom Group
Marcin Wojtas [Wed, 9 Jun 2021 11:23:26 +0000 (13:23 +0200)]
Introduce driver for Freescale Felix switch
It is found on boards equipped with LS1028A SoC.
802.1q VLAN grouping is supported.
An external MDIO device is used for communicating with PHYs.
The driver is built as a module by default, it is not included
in GENERIC kernel config.
Kornel Duleba [Wed, 23 Jun 2021 11:13:05 +0000 (13:13 +0200)]
etherswitch: Add a new striptagingress port flag
The Felix switch found in the LS1028A supports stripping the VLAN tag on
ingress, instead of egress. The striptag flag expects the latter
behaviour.
Add a new flag to support the feature.
Obtained from: Semihalf
Sponsored by: Alstom Group
Differential Revision: https://reviews.freebsd.org/D30922
Alex Richardson [Tue, 3 Aug 2021 09:37:28 +0000 (10:37 +0100)]
Use .sinclude for bsd.sanitizer.mk
We don't install this file since MK_ASAN/MK_UBSAN is only supported for
src builds. However, some ports also use bsd.lib.mk/bsd.prog.mk so we
should not fail the build if it can't be included.
Reported by: jkim
Fixes: 7bc797e3f380 ("Add build system support for ASAN+UBSAN instrumentation")
Warner Losh [Mon, 2 Aug 2021 21:49:47 +0000 (15:49 -0600)]
clock_id: These symbols weren't in 4.4BSD, adjust copyright
Peter Wemm added the first CLOCK_* symbols in 0f5ed9f420528 in 1997
after obtaining them from NetBSD. In NetBSD, jtc@netbsd.org committed
them in sys/sys/time.h rev 1.19 dated 1996/11/15, along with all the
system calls associated with 1003.1b. FreeBSD's values are, however,
different than NetBSD's today. The USL/UCB lawsuit was settled in 1994,
so these couldn't have been derived from material provided to University
of California covered in that settlement. This file does not need the
settlement disclaimer.
Furthermore, I rewrote most of the code (except the symbols and their
values) when merging it from time.h and sys/time.h. Most of the creative
content of the file is new, so update copyright to reflect that.
John Baldwin [Mon, 2 Aug 2021 16:41:27 +0000 (09:41 -0700)]
cxgbe tom: Permit rcv_nxt mismatches on FIN for iSCSI connections on T6.
The remote peer might send a FIN in the middle of a burst of data
PDUs. In the case of T6 with data PDU completion moderation, the
driver would not have seen these PDUs since the final PDU in the burst
was never received resulting in a stale rcv_nxt when the FIN is
received.
While here, invert the logic in the condition to be more readable and
always set tp->rcv_nxt from the sequence number in the CPL. This sets
the proper value of rcv_nxt for FINs on connections with data received
but not reported via a CPL (e.g. a partial iSCSI PDU burst interrupted
by a FIN).
Kristof Provost [Mon, 2 Aug 2021 07:46:33 +0000 (09:46 +0200)]
pf: bound DIOCGETSTATES memory use
Similar to what we did earlier for DIOCGETSTATESV2 we only allocate
enough memory for a handful of states and copy those out, bit by bit,
rather than allocating memory for all states in one go.
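The bounded-memory pattern can be sketched in userland C as follows. This is
an assumption-laden illustration, not the pf code: the records are plain
ints, SKETCH_CHUNK is an arbitrary chunk size, and the second memcpy() merely
stands in for copying out to the caller's buffer.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

#define SKETCH_CHUNK 16  /* hypothetical number of records per scratch buffer */

/*
 * Export count records from src to dst through a small fixed-size
 * scratch buffer, so memory use does not grow with count.
 * Returns the number of records exported, or 0 if allocation fails.
 */
size_t
sketch_export_bounded(const int *src, size_t count, int *dst)
{
    int *scratch = malloc(SKETCH_CHUNK * sizeof(*scratch));
    size_t done = 0;

    if (scratch == NULL)
        return (0);
    while (done < count) {
        size_t n = count - done;

        if (n > SKETCH_CHUNK)
            n = SKETCH_CHUNK;
        /* Snapshot a handful of records into the bounded buffer... */
        memcpy(scratch, src + done, n * sizeof(*scratch));
        /* ...then hand them to the caller before reusing the buffer. */
        memcpy(dst + done, scratch, n * sizeof(*scratch));
        done += n;
    }
    free(scratch);
    return (done);
}
```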
Adam Fenn [Mon, 2 Aug 2021 16:27:17 +0000 (11:27 -0500)]
devclass_alloc_unit: move "at" hint test to after device-in-use test
Only perform this expensive operation when the unit number is a
potential candidate (i.e. not already in use), thereby reducing device
scan time on systems with many devices, unit numbers, and drivers.
Sponsored by: NetApp, Inc.
Sponsored by: Klara, Inc.
X-NetApp-PR: #61
Differential Revision: https://reviews.freebsd.org/D31381
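The effect of the reordering can be shown with a toy allocator. Everything
here is hypothetical (the function names, the counter, and the pretend rule
that unit 2 is reserved by a hint); it only demonstrates that the expensive
test runs solely for units that pass the cheap in-use test.

```c
#include <stdbool.h>

static unsigned sketch_hint_lookups;  /* counts calls to the expensive test */

/* Hypothetical stand-in for the costly "at" hint scan. */
static bool
sketch_unit_reserved_by_hint(int unit)
{
    sketch_hint_lookups++;
    return (unit == 2);  /* pretend unit 2 is wired elsewhere by a hint */
}

/*
 * Return the first unit that is neither in use nor reserved by a hint,
 * or -1 if there is none.
 */
int
sketch_alloc_unit(const bool *in_use, int nunits)
{
    for (int unit = 0; unit < nunits; unit++) {
        if (in_use[unit])
            continue;  /* cheap test first: skip without any hint scan */
        if (sketch_unit_reserved_by_hint(unit))
            continue;  /* expensive test only for real candidates */
        return (unit);
    }
    return (-1);
}
```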
Alexander Motin [Mon, 2 Aug 2021 14:50:34 +0000 (10:50 -0400)]
sched_ule(4): Pre-seed sched_random().
I don't think it changes anything, but why not.
While there, make cpu_search_highest() use all 8 lower load bits for
noise, since it does not use cs_prefer and the code is not shared
with cpu_search_lowest() any more.
Alex Richardson [Mon, 2 Aug 2021 13:36:03 +0000 (14:36 +0100)]
Allow bootstrapping llvm-tblgen on macOS and Linux
This is needed in order to build various LLVM binutils (e.g. addr2line)
as well as clang/lld/lldb.
Co-authored-by: Jessica Clarke <jrtc27@FreeBSD.org>
Test Plan: Compiles on ubuntu 18.04 and macOS 11.4
Reviewed By: dim
Differential Revision: https://reviews.freebsd.org/D31057
Alex Richardson [Mon, 2 Aug 2021 08:51:34 +0000 (09:51 +0100)]
libc: Disable ASAN for certain string functions
They deliberately read out-of-bounds values to avoid byte-by-byte
loads and check multiple bytes at once. While this will work on x86,
it is flagged as an out-of-bounds read with ASAN, so we have to
disable instrumentation here. This also causes bounds errors for CHERI,
so in CheriBSD we use implementations that avoid OOB reads.
Alex Richardson [Mon, 2 Aug 2021 08:50:16 +0000 (09:50 +0100)]
Fix build of stand/ when building world with ASAN
The userboot/test program links against the default userspace libraries
(e.g. shared libgcc_s.so) that will be instrumented if WITH_ASAN is set.
All other programs link against libsa instead of libc and therefore can't
use the sanitizer runtime library. To fix the stand/ build with
sanitizers, we disable MK_ASAN/MK_UBSAN if -nostdlib is found in the
LDFLAGS (i.e. we are using libsa instead of libc).
Alex Richardson [Mon, 2 Aug 2021 08:49:21 +0000 (09:49 +0100)]
libthr: work around an ASAN false-positive
I got the following error with an ASAN-instrumented libthr:
==803==ERROR: AddressSanitizer: stack-buffer-overflow on address 0x7fffffffcdb0 at pc 0x000801863396 bp 0x7ff8
READ of size 4 at 0x7fffffffcdb0 thread T0
#0 0x801863395 in handle_signal /local/scratch/alr48/cheri/freebsd/lib/libthr/thread/thr_sig.c:262:2
#1 0x801860da2 in thr_sighandler /local/scratch/alr48/cheri/freebsd/lib/libthr/thread/thr_sig.c:246:2
Address 0x7fffffffcdb0 is located in stack of thread T0 at offset 208 in frame
#0 0x80186080f in thr_sighandler /local/scratch/alr48/cheri/freebsd/lib/libthr/thread/thr_sig.c:213
This frame has 1 object(s):
[32, 64) 'act' (line 216) <== Memory access at offset 208 overflows this variable
HINT: this may be a false positive if your program uses some custom stack
This seems like a false-positive since the line in question is
`SIGSETOR(actp->sa_mask, ucp->uc_sigmask);` and it complains about a read
operation (from the ucontext_t argument) so this indicates to me that ASAN
does not understand that thr_sighandler() is a signal handler.
Alex Richardson [Mon, 2 Aug 2021 08:48:21 +0000 (09:48 +0100)]
Add build system support for ASAN+UBSAN instrumentation
This adds two new options WITH_ASAN/WITH_UBSAN that can be set to
enable instrumentation of all binaries with AddressSanitizer and/or
UndefinedBehaviourSanitizer. This current patch is almost sufficient
to get a complete buildworld with sanitizer instrumentation but in
order to actually build and boot a system it depends on a few more
follow-up commits.
Alex Richardson [Mon, 2 Aug 2021 08:45:05 +0000 (09:45 +0100)]
tools/build: Don't redefine open() for the linux bootstrap
This is needed to bootstrap llvm-tblgen on Linux since LLVM calls
`::open(...)` which does not work if open is a statement macro.
Also stop defining O_SHLOCK/O_EXLOCK and update the only bootstrap tools
user of those flags to deal with missing definitions.
Ka Ho Ng [Mon, 2 Aug 2021 09:54:40 +0000 (17:54 +0800)]
vmm: Bump vmname buffer in struct vm to VM_MAX_NAMELEN + 1
In hw.vmm.create sysctl handler the maximum length of vm name is
VM_MAX_NAMELEN. However in vm_create() the maximum length allowed is
only VM_MAX_NAMELEN - 1 chars. Bump the length of the internal buffer to
allow the length of VM_MAX_NAMELEN for vm name.
MFC after: 3 days
Reviewed by: grehan
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D31372
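The off-by-one is easiest to see in a small model. SKETCH_MAX_NAMELEN,
struct sketch_vm and sketch_vm_set_name() are hypothetical stand-ins for
this sketch; the point is only that a buffer of MAX + 1 bytes holds a name
of exactly MAX characters plus its terminating NUL.

```c
#include <stdbool.h>
#include <string.h>

#define SKETCH_MAX_NAMELEN 8  /* hypothetical stand-in for VM_MAX_NAMELEN */

struct sketch_vm {
    char name[SKETCH_MAX_NAMELEN + 1];  /* room for the terminating NUL */
};

/* Accept names of 1..SKETCH_MAX_NAMELEN characters; reject anything else. */
bool
sketch_vm_set_name(struct sketch_vm *vm, const char *name)
{
    size_t len = strlen(name);

    if (len == 0 || len > SKETCH_MAX_NAMELEN)
        return (false);
    memcpy(vm->name, name, len + 1);  /* copies the NUL as well */
    return (true);
}
```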
Roger Pau Monné [Mon, 2 Aug 2021 08:22:22 +0000 (10:22 +0200)]
xen/timer: fix amd64 LINT kernel build
On amd64 XENHVM depends on the xentimer device for PVH early startup,
so both should be added or removed together (like the current
dependency with xenpci). Fix this by adding xentimer to NOTES and
updating the comments on the config files. Note that on i386 there's
no such dependency between xentimer and XENHVM, since there's no PVH
support.
While there also fix the MINIMAL i386 build to include the xentimer,
so it keeps the same functionality as before xentimer was split from
XENHVM.
Alexander Motin [Mon, 2 Aug 2021 02:42:01 +0000 (22:42 -0400)]
sched_ule(4): Use trylock when stealing load.
On some load patterns it is possible for several CPUs to try to steal
a thread from the same CPU despite the randomization introduced. It may
cause significant lock contention when an idle thread holding one queue
lock tries to acquire another one. Use of trylock on the remote
queue both reduces the contention and makes lock ordering easier
to handle. If we can't get the lock inside tdq_trysteal() we just return,
allowing tdq_idled() to handle it. If it happens in tdq_idled(), then
we repeat the search for load, skipping this CPU.
On a 2-socket 80-thread Xeon system I am observing a dramatic reduction
of the lock spinning time when doing random uncached 4KB reads from
12 ZVOLs, while IOPS increase from 327K to 403K.
Alexander Motin [Mon, 2 Aug 2021 02:07:51 +0000 (22:07 -0400)]
sched_ule(4): Reduce duplicate search for load.
When sched_highest() called for some CPU group returns nothing, the idle
thread calls it for the parent CPU group. But the parent CPU group
also includes the CPU group we've just searched, and unless there is
a race going on, it is unlikely we find anything new this time.
Avoid the double search when the parent group has only two subgroups
(the most prominent case). Instead of escalating to the parent
group, run the next search over the sibling subgroup and escalate two
levels up if that fails too. In case of more than two siblings
the difference is less significant, while searching the parent group
can result in a better decision if we find several candidate CPUs.
On a 2-socket 40-core Xeon system I am measuring a ~25% reduction of CPU
time spent inside cpu_search_highest() in both SMT (2x20x2) and non-SMT
(2x20) cases.
amd64 pmap_vm_page_alloc_check(): loosen the assert
The current expression checks that vm_page_alloc(9) never returns a page
belonging to the preload area. This is not true if something was freed
from there, for instance if a preloaded module was unloaded or a ucode
update was freed.
Only check that we never allow allocation of a page belonging to the kernel
proper, by checking against _end.
Reported and tested by: dhw
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
bhyve: net_backends, automatically IFF_UP tap devices
If you want communications with the outside world and tell bhyve to
create an interface, then it should be usable as well.
Rather than relying on the sysctl net.link.tap.up_on_open, automatically
try to IFF_UP the opened tap device.