CyberLeo.Net >> Repos - FreeBSD/FreeBSD.git/log

Use the vm_radix_init() helper when initializing pmaps

No functional change intended.

Reviewed by: alc, kib
Sponsored by: The FreeBSD Foundation

(cherry picked from commit ff93447d8ed61081adfe00a23a1e4c7bee479e53)

amd64: Add comments to pmap_pinit_type()

... explaining why we don't pass the pmap pointer to
pmap_alloc_pt_page().

Reported by: alc
Reviewed by: alc, kib
Sponsored by: The FreeBSD Foundation

(cherry picked from commit 34fac29e98313fb0bfba0503e2e19e352b452516)

Convert consumers to vm_page_alloc_noobj_contig()

Remove now-unneeded page zeroing. No functional change intended.

Reviewed by: alc, hselasky, kib
Sponsored by: The FreeBSD Foundation

(cherry picked from commit 84c3922243a7b7fd510dcfb100aec59c878c57d0)

Introduce vm_page_alloc_noobj_contig()

This is the same as vm_page_alloc_noobj(), but allocates physically
contiguous runs of memory. For now it is implemented in terms of
vm_page_alloc_contig(), with the difference that
vm_page_alloc_noobj_contig() implements VM_ALLOC_ZERO by zeroing the
page.

Reviewed by: alc, kib
Sponsored by: The FreeBSD Foundation

(cherry picked from commit 92db9f3bb7623883231214e74ec38788c3dffc6a)

Convert vm_page_alloc() callers to use vm_page_alloc_noobj().

Remove page zeroing code from consumers and stop specifying
VM_ALLOC_NOOBJ. In a few places, also convert an allocation loop to
simply use VM_ALLOC_WAITOK.

Similarly, convert vm_page_alloc_domain() callers.

Note that callers are now responsible for assigning the pindex.

Reviewed by: alc, hselasky, kib
Sponsored by: The FreeBSD Foundation

(cherry picked from commit a4667e09e6520dc2c4b0b988051f060fed695a91)

vm_page: Add a new page allocator interface for unnamed pages

The diff adds vm_page_alloc_noobj() and vm_page_alloc_noobj_domain().
These mostly correspond to vm_page_alloc() and vm_page_alloc_domain()
when no VM object is specified, with the exception that they handle
VM_ALLOC_ZERO by zeroing the page, rather than by preserving PG_ZERO.

This simplifies callers and will permit simplification of the
vm_page_alloc_domain() definition.

Since the new allocator variant is similar to vm_page_alloc_freelist(),
implement both of them using a common backend allocator function. No
functional change intended.

Reviewed by: alc, kib
Sponsored by: The FreeBSD Foundation

(cherry picked from commit b498f71bc56af0069d9a4685b8385ee613a00727)

Add a VM flag to prevent reclaim on a failed contig allocation

If an M_WAITOK contig alloc fails, the VM subsystem will try to
reclaim contiguous memory twice before actually failing the
request. On a system with 64GB of RAM I've observed this take
400-500ms before it finally gives up, and I believe that this
will only be worse on systems with even more memory.

In certain contexts this delay is extremely harmful, so add a flag
that will skip reclaim for allocation requests to allow those
paths to opt-out of doing an expensive reclaim.

Sponsored by: Dell Inc
Differential Revision: https://reviews.freebsd.org/D28422
Reviewed by: markj, kib

(cherry picked from commit 660344ca44c63bfe4a16c3e57d0f6dbcbb5e083e)

vlapic: Schedule callouts on the local CPU

The virtual LAPIC driver uses callouts to implement the LAPIC timer.
Callouts are armed using callout_reset_sbt(), which currently puts
everything on CPU 0. On systems running many bhyve VMs this results in
a large amount of contention for CPU 0's callout lock.

Modify vlapic to schedule callouts on the local CPU instead. This
allows timer interrupts to be scheduled more evenly among CPUs where
bhyve is running.

Reviewed by: grehan, jhb
Sponsored by: The FreeBSD Foundation

(cherry picked from commit 4c812fe61b7ce2f297a381950ff7bd87fd51f698)

rmslock: Update td_locks during lock and unlock operations

Reviewed by: mjg
Sponsored by: The FreeBSD Foundation

(cherry picked from commit 71f31d784e1816a155cafbccf4b28291200097aa)

amd64: Define KVA regions for KMSAN shadow maps

KMSAN requires two shadow maps, each one-to-one with the kernel map.
Allocate regions of the kernel's PML4 page for them. Add functions to
create mappings in the shadow map regions; these will be used by the
KMSAN runtime.

Reviewed by: alc, kib
Sponsored by: The FreeBSD Foundation

(cherry picked from commit f95f780ea4e163ce9a0295a699f41f0a7e1591d4)

conf: Add a KMSAN kernel option

Sponsored by: The FreeBSD Foundation

(cherry picked from commit 30d00832d7733e60f5e030d335c129bfa77dd77a)

kasan: Use vm_offset_t for the first parameter to kasan_shadow_map()

No functional change intended.

Sponsored by: The FreeBSD Foundation

(cherry picked from commit 20e3b9d8bd778445bb80b2be28d2fdedf7bae37e)

amd64 pmap: Pre-set PG_M on 2MB KASAN shadow map entries

Also remove a redundant assertion in pmap_kasan_enter().

Reviewed by: alc, kib
Sponsored by: The FreeBSD Foundation

(cherry picked from commit 4fd450a87df015fe85cadfac0e22c73e3c878d24)

usb(4): Fix for use after free in combination with EVDEV_SUPPORT.

When EVDEV_SUPPORT was introduced, it became possible for USB transfers
to keep running after the main FIFO is closed. This opens a race which
can lead to use-after-free scenarios. Fix this for all FIFO consumers
by initializing and resetting the FIFO queues under the lock used by
the client. The client driver will then see an empty queue in all cases
where a race may appear.

Found by: pho@
Sponsored by: NVIDIA Networking

(cherry picked from commit aad0c65d6b37364d8ba92ecb8c85e004398a5194)

sinpi[fl] etc: Fix the ld128 implementations

PR: 218514

(cherry picked from commit 4f889260c33c163ab28e0e082b4d7e7562d9c647)

sinpi,cospi,tanpi: float.h needed for weak reference

PR: 218514

(cherry picked from commit 3bfc837685b8128067b946b31dfe2120dae0d003)

lib/msun: Move the files to appropriate locations in the Makefile

(cherry picked from commit ca3d8cb087cd5b40369478b1693f3e4038b5fa23)

lib/msun/ld128/s_tanpil.c: make it compile.

(cherry picked from commit 6312d144613f97bf59703c442ee4871be1450c46)

[LIBM] implementations of sinpi[fl], cospi[fl], and tanpi[fl]

PR: 218514

(cherry picked from commit dce5f3abed7181cc533ca5ed3de44517775e78dd)

sleepqueue(9): Remove sbinuptime() from sleepq_timeout().

Callout c_time is always greater than or equal to the scheduled time.
It is also smaller than sbinuptime() and can't change while the callback
is running. So we can reliably use it instead of sbinuptime() here.
In case there was a race and the callout was rescheduled to a later
time, the callback will be called again.

According to profiles it saves ~5% of the timer interrupt time even
with a fast TSC timecounter.

MFC after: 1 month

(cherry picked from commit 6df1359e5542f69179c142be1ea099d447e273d1)

Generalize sanitizer interceptors for memory and string routines

Similar to commit 3ead60236f ("Generalize bus_space(9) and atomic(9)
sanitizer interceptors"), use a more generic scheme for interposing
sanitizer implementations of routines like memcpy().

No functional change intended.

Sponsored by: The FreeBSD Foundation

(cherry picked from commit ec8f1ea8d536e91ad37e03e45a688c4e255b9cb0)

Generalize bus_space(9) and atomic(9) sanitizer interceptors

Make it easy to define interceptors for new sanitizer runtimes, rather
than assuming KCSAN. Lay a bit of groundwork for KASAN and KMSAN.

When a sanitizer is compiled in, atomic(9) and bus_space(9) definitions
in atomic_san.h are used by default instead of the inline
implementations in the platform's atomic.h. These definitions are
implemented in the sanitizer runtime, which includes
machine/{atomic,bus}.h with SAN_RUNTIME defined to pull in the actual
implementations.

No functional change intended.

Sponsored by: The FreeBSD Foundation

(cherry picked from commit 3ead60236fd25ce64fece7ae4a453318ca18c119)

KASAN: Disable checking before triggering a panic

KASAN hooks will not generate reports if panicstr != NULL, but then
there is a window after the initial panic() call where another report
may be raised. This can happen if a false positive occurs; to simplify
debugging of such problems, avoid recursing.

Sponsored by: The FreeBSD Foundation

(cherry picked from commit ea3fbe0707f9a02a29875966668b6f15284f335a)

redzone: Raise a compile error if KASAN is configured

redzone(9) does some munging of the allocation to insert redzones before
and after a valid memory buffer, but KASAN does not know about this and
will raise false positives if both are configured. Until this is fixed,
do not allow both to be configured. Note that KASAN provides similar
checking on its own but currently does not force the creation of
redzones for all UMA allocations; this should be addressed as well.

Sponsored by: The FreeBSD Foundation

(cherry picked from commit 4e8e26a00471f1a5e7a2af322265c45b1529c5b8)

KASAN: Implement __asan_unregister_globals()

It will be called during KLD unload to unpoison the redzones following
global variables. Otherwise, virtual address ranges previously used for
a KLD may be left tainted, triggering false positives when they are
recycled.

Reported by: pho
Sponsored by: The FreeBSD Foundation

(cherry picked from commit 588c7a06dffbc74b281dacbdd854437b0815e501)

uma: Fix a few problems with KASAN integration

- Ensure that all items returned by UMA are aligned to
  KASAN_SHADOW_SCALE (8).  This was true in practice since smaller
  alignments are not used by any consumers, but we should enforce it
  anyway.
- Use a non-zero code for marking redzones that appear naturally in
  items that are not a multiple of the scale factor in size.  Currently
  we do not modify keg layouts to force the creation of redzones.
- Use a non-zero code for marking freed per-CPU items, otherwise
  accesses of freed per-CPU items are not detected by the runtime.

Sponsored by: The FreeBSD Foundation

(cherry picked from commit b0dfc48684780024a3d736c5a5449284dad97f4e)

x86: Mark the trapframe as initialized in ipi_bitmap_handler()

Otherwise KASAN may generate false positives if the trapframe was
written into a poisoned region of the stack.

Reported by: pho
Reported by: syzbot+ee60455cd58e6eed20c9@syzkaller.appspotmail.com
Reported by: syzbot+be5f9df26426ace3a00c@syzkaller.appspotmail.com
Sponsored by: The FreeBSD Foundation

(cherry picked from commit 36226163fa48ee2c5f73bd2e870ce2e5a057f42e)

hwpmc: Disable KASAN in pmc_save_kernel_callchain()

As in commit 831850d8b087, this routine can trigger false positives, so
exclude it from instrumentation.

Reported by: pho
Sponsored by: The FreeBSD Foundation

(cherry picked from commit 5d243d41b1206044cb5eddd5d48c1c711b731478)

amd64: Mark the trapframe as initialized in trap()

Otherwise KASAN may generate false positives if the trapframe was
written into a poisoned region of the stack.

Reported by: pho
Sponsored by: The FreeBSD Foundation

(cherry picked from commit f08f0ae5247ab31de58bda0817e74ccc1a3a5e95)

stack(9): Disable KASAN in stack_capture()

When unwinding the stack, we may encounter a stack frame in a poisoned
region of the stack, triggering a false positive.

Reviewed by: andrew, kib
Sponsored by: The FreeBSD Foundation

(cherry picked from commit 831850d8b0870c75c21d2e01527af1e55fe2fec8)

cdefs: Make __nosanitizeaddress work for KASAN as well

Add __nosanitizememory while I'm here.

Reviewed by: andrew, kib
Sponsored by: The FreeBSD Foundation

(cherry picked from commit cfad8bd24f038e4779e937f48b05511f2dd4a5a8)

linker_set: Disable ASAN only in userspace

KASAN does not insert redzones around global variables and so is not
susceptible to the problem that led to us disabling ASAN for linker set
elements in the first place (see commit fe3d8086fb6f).

Reviewed by: andrew, kib
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D30126

(cherry picked from commit 2d499d505262c9c965fc5f4fd36afdd2bb7cad3d)

realloc: Fix KASAN(9) shadow map updates

When copying from the old buffer to the new buffer, we don't know the
requested size of the old allocation, but only the size of the
allocation provided by UMA. This value is "alloc". Because the copy
may access bytes in the old allocation's red zone, we must mark the full
allocation valid in the shadow map. Do so using the correct size.

Reported by: kp
Tested by: kp
Sponsored by: The FreeBSD Foundation

(cherry picked from commit 9a7c2de36460cdb916734a6969aac666707a639b)

malloc: Add state transitions for KASAN

- Reuse some REDZONE bits to keep track of the requested and allocated
  sizes, and use that to provide red zones.
- As in UMA, disable memory trashing to avoid unnecessary CPU overhead.

Sponsored by: The FreeBSD Foundation

(cherry picked from commit 06a53ecf24005b3a74b85ecc4b504a401ac26cd0)

execve: Mark exec argument buffers

We cache mapped execve argument buffers to avoid the overhead of TLB
shootdowns. Mark them invalid when they are freed to the cache.

Sponsored by: The FreeBSD Foundation

(cherry picked from commit f1c3adefd95d35115bd4597293e0b904ae401245)

vfs: Add KASAN state transitions for vnodes

vnodes are a bit special in that they may exist on per-CPU lists even
while free. Add a KASAN-only destructor that poisons regions of each
vnode that are not expected to be accessed after a free.

Sponsored by: The FreeBSD Foundation

(cherry picked from commit b261bb4057f4abbc1366e4af8e9e4081d039be4a)

kmem: Add KASAN state transitions

Memory allocated with kmem_* is unmapped upon free, so KASAN doesn't
provide a lot of benefit, but since allocations are always a multiple of
the page size we can create a redzone when the allocation request size
is not a multiple of the page size.
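
The arithmetic behind that redzone can be sketched in userspace C. This is only an illustration: the 4096-byte page size and the names sketch_round_page() and sketch_kmem_redzone() are assumptions made for the example, not the kernel's own definitions.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed page size for this sketch; the real value is platform-specific. */
#define SKETCH_PAGE_SIZE ((size_t)4096)

/* Round a request up to the next multiple of the page size. */
static size_t
sketch_round_page(size_t size)
{
    return ((size + SKETCH_PAGE_SIZE - 1) & ~(SKETCH_PAGE_SIZE - 1));
}

/*
 * Number of trailing bytes that are allocated but were never requested.
 * These are the bytes a redzone could cover (hypothetical helper).
 */
static size_t
sketch_kmem_redzone(size_t requested)
{
    assert(requested > 0);
    return (sketch_round_page(requested) - requested);
}
```

A request that is already a multiple of the page size leaves no room for a redzone, which matches the condition stated above.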

Sponsored by: The FreeBSD Foundation

(cherry picked from commit 2b914b85ddf4c25d112b2639bbbb7618641872b4)

kstack: Add KASAN state transitions

We allocate kernel stacks using a UMA cache zone. Cache zones have
KASAN disabled by default, but in this case it makes sense to enable it.

Reviewed by: andrew

(cherry picked from commit 244f3ec642ed99a371c97b946b93b877d8be1756)

uma: Add KASAN state transitions

- Add a UMA_ZONE_NOKASAN flag to indicate that items from a particular
  zone should not be sanitized.  This is applied implicitly for NOFREE
  and cache zones.
- Add KASAN callbacks which get invoked:
  1) when a slab is imported into a keg
  2) when an item is allocated from a zone
  3) when an item is freed to a zone
  4) when a slab is freed back to the VM

  In state transitions 1 and 3, memory is poisoned so that accesses will
  trigger a panic.  In state transitions 2 and 4, memory is marked
  valid.
- Disable trashing if KASAN is enabled.  It just adds extra CPU overhead
  to catch problems that are detected by KASAN.

Sponsored by: The FreeBSD Foundation

(cherry picked from commit 09c8cb717d214d03e51b3e4f8e9997b9f4e1624d)

amd64: Add MD bits for KASAN

- Initialize KASAN before executing SYSINITs.
- Add a GENERIC-KASAN kernel config, akin to GENERIC-KCSAN.
- Increase the kernel stack size if KASAN is enabled.  Some of the
  ASAN instrumentation increases stack usage and it's enough to
  trigger stack overflows in ZFS.
- Mark the trapframe as valid in interrupt handlers if it is
  assigned to td_intr_frame.  Otherwise, an interrupt in a function
  which creates a poisoned alloca region can trigger false positives.

Sponsored by: The FreeBSD Foundation

(cherry picked from commit f115c0612131d8f939f6f357f57bdd85bd6a59de)

amd64: Implement a KASAN shadow map

The idea behind KASAN is to use a region of memory to track the validity
of buffers in the kernel map.  This region is the shadow map.  The
compiler inserts calls to the KASAN runtime for every emitted load
and store, and the runtime uses the shadow map to decide whether the
access is valid.  Various kernel allocators call kasan_mark() to update
the shadow map.

Since the shadow map tracks only accesses to the kernel map, accesses to
other kernel maps are not validated by KASAN.  UMA_MD_SMALL_ALLOC is
disabled when KASAN is configured to reduce usage of the direct map.
Currently we have no mechanism to completely eliminate uses of the
direct map, so KASAN's coverage is not comprehensive.

The shadow map uses one byte per eight bytes in the kernel map.  In
pmap_bootstrap() we create an initial set of page tables for the kernel
and preloaded data.
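
The one-byte-per-eight-bytes ratio implies a simple address translation, sketched below. The two base addresses and the helper names are hypothetical values chosen for the example; they are not the constants used by the amd64 pmap.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_SCALE_SHIFT 3    /* one shadow byte covers 2^3 = 8 bytes */
#define SKETCH_SCALE       ((uint64_t)1 << SKETCH_SCALE_SHIFT)

/* Hypothetical layout constants, for illustration only. */
#define SKETCH_KERNEL_BASE UINT64_C(0xfffffe0000000000)
#define SKETCH_SHADOW_BASE UINT64_C(0xfffff78000000000)

/* Translate a kernel map address to the address of its shadow byte. */
static uint64_t
sketch_addr_to_shadow(uint64_t addr)
{
    assert(addr >= SKETCH_KERNEL_BASE);
    return (SKETCH_SHADOW_BASE +
        ((addr - SKETCH_KERNEL_BASE) >> SKETCH_SCALE_SHIFT));
}

/* Number of shadow bytes needed to cover len bytes of the kernel map. */
static uint64_t
sketch_shadow_size(uint64_t len)
{
    return ((len + SKETCH_SCALE - 1) >> SKETCH_SCALE_SHIFT);
}
```

Under these assumptions, growing the kernel map by one 4096-byte page requires 512 additional bytes of shadow memory.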

When pmap_growkernel() is called, we call kasan_shadow_map() to extend
the shadow map.  kasan_shadow_map() uses pmap_kasan_enter() to allocate
memory for the shadow region and map it.

Reviewed by: kib
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D29417

(cherry picked from commit 6faf45b34b14da5f138774b43ec14fb5567ac584)

Add the KASAN runtime

KASAN enables the use of LLVM's AddressSanitizer in the kernel.  This
feature makes use of compiler instrumentation to validate memory
accesses in the kernel and detect several types of bugs, including
use-after-frees and out-of-bounds accesses.  It is particularly
effective when combined with test suites or syzkaller.  KASAN has high
CPU and memory usage overhead and so is not suited for production
environments.

The runtime and pmap maintain a shadow of the kernel map to store
information about the validity of memory mapped at a given kernel
address.

The runtime implements a number of functions defined by the compiler
ABI.  These are prefixed by __asan.  The compiler emits calls to
__asan_load*() and __asan_store*() around memory accesses, and the
runtime consults the shadow map to determine whether a given access is
valid.
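
A minimal userspace sketch of such a check follows. It assumes the conventional AddressSanitizer shadow encoding (0 means all eight bytes are valid, 1 through 7 means only that many leading bytes are valid, a negative value means poisoned). The shadow is modelled as a plain array indexed by offset, and the function names are hypothetical rather than the runtime's own.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_SCALE 8    /* bytes of memory described by one shadow byte */

/* Check one byte at offset off against the shadow array. */
static bool
sketch_byte_is_valid(const int8_t *shadow, size_t off)
{
    int8_t code = shadow[off / SKETCH_SCALE];

    if (code == 0)
        return (true);      /* whole 8-byte granule is valid */
    if (code < 0)
        return (false);     /* granule is poisoned */
    /* Partially valid granule: only the first 'code' bytes are usable. */
    return ((off % SKETCH_SCALE) < (size_t)code);
}

/* Check every byte of the access [off, off + len). */
static bool
sketch_access_is_valid(const int8_t *shadow, size_t off, size_t len)
{
    for (size_t i = 0; i < len; i++) {
        if (!sketch_byte_is_valid(shadow, off + i))
            return (false);
    }
    return (true);
}
```

A real runtime would check fixed-size accesses with far fewer branches; the byte-at-a-time loop is used here only to keep the logic easy to follow.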

kasan_mark() is called by various kernel allocators to update state in
the shadow map.  Updates to those allocators will come in subsequent
commits.
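
To illustrate what an allocator-driven shadow update might look like, here is a userspace sketch. The function sketch_shadow_mark() is a hypothetical stand-in, not the kernel's kasan_mark(); it assumes the conventional AddressSanitizer encoding and a buffer that starts on an 8-byte boundary.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_SCALE 8    /* bytes of memory described by one shadow byte */

/*
 * Mark the first 'size' bytes of a buffer valid and the remainder, up to
 * 'redzsize' bytes in total, as poisoned with 'code'.  The shadow array
 * must hold redzsize / SKETCH_SCALE entries.
 */
static void
sketch_shadow_mark(int8_t *shadow, size_t size, size_t redzsize, int8_t code)
{
    size_t i;
    size_t full = size / SKETCH_SCALE;
    size_t total = redzsize / SKETCH_SCALE;

    assert(size <= redzsize && redzsize % SKETCH_SCALE == 0);
    assert(code < 0);

    /* Granules that lie entirely within the requested size. */
    for (i = 0; i < full; i++)
        shadow[i] = 0;
    /* A trailing partial granule records how many bytes are usable. */
    if (size % SKETCH_SCALE != 0)
        shadow[i++] = (int8_t)(size % SKETCH_SCALE);
    /* Everything after that is redzone. */
    for (; i < total; i++)
        shadow[i] = code;
}
```

For a 13-byte request placed in a 32-byte item, this yields the shadow bytes 0, 5, code, code: the first granule is fully valid, the second is valid for 5 bytes, and the rest is redzone.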

The runtime also defines various interceptors.  Some low-level routines
are implemented in assembly and are thus not amenable to compiler
instrumentation.  To handle this, the runtime implements these routines
on behalf of the rest of the kernel.  The sanitizer implementation
validates memory accesses manually before handing off to the real
implementation.

The sanitizer in a KASAN-configured kernel can be disabled by setting
the loader tunable debug.kasan.disable=1.

Obtained from: NetBSD
Sponsored by: The FreeBSD Foundation

(cherry picked from commit 38da497a4dfcf1979c8c2b0e9f3fa0564035c147)

Add a KASAN option to the kernel build

LLVM support for enabling KASAN has not yet landed so the option is not
yet usable, but hopefully this will change soon.

Reviewed by: imp, andrew
Sponsored by: The FreeBSD Foundation

(cherry picked from commit 01028c736cbcdba079967c787bee1551fc8439aa)

timecounter: Lock the timecounter list

Timecounter registration is dynamic, i.e., there is no requirement that
timecounters must be registered during single-threaded boot.  Loadable
drivers may in principle register timecounters (which can be switched to
automatically).  Timecounters cannot be unregistered, though this could
be implemented.

Registered timecounters belong to a global linked list.  Add a mutex to
synchronize insertions and the traversals done by (mpsafe) sysctl
handlers.  No functional change intended.

Reviewed by: imp, kib
Sponsored by: The FreeBSD Foundation

(cherry picked from commit 621fd9dcb2d83daab477c130bc99b905f6fc27dc)

cpuset(9): Add CPU_FOREACH_IS(SET|CLR) and modify consumers to use it

This implementation is faster and doesn't modify the cpuset, so it lets
us avoid some unnecessary copying as well. No functional change
intended.

This is a re-application of commit
9068f6ea697b1b28ad1326a4c7a9ba86f08b985e.

Reviewed by: cem, kib, jhb
Sponsored by: The FreeBSD Foundation

(cherry picked from commit de8554295b47475e758a573ab7418265f21fee7e)

bitset: Reimplement BIT_FOREACH_IS(SET|CLR)

Eliminate the nested loops and re-implement following a suggestion from
rlibby.

Add some simple regression tests.

Reviewed by: rlibby, kib
Sponsored by: The FreeBSD Foundation

(cherry picked from commit 51425cb2107c07ff379639edfbad65c77b55c3b8)

clang-format: Add bitset loop macros

Sponsored by: The FreeBSD Foundation

(cherry picked from commit a3e3d90863f3af81bca485468814a787206a235d)

bitset(9): Introduce BIT_FOREACH_ISSET and BIT_FOREACH_ISCLR

These allow one to non-destructively iterate over the set or clear bits
in a bitset.  The motivation is that we have several code fragments
which iterate over a CPU set like this:

    while ((cpu = CPU_FFS(&cpus)) != 0) {
        cpu--;
        CPU_CLR(cpu, &cpus);
        <do something>;
    }

This is slow since CPU_FFS begins the search at the beginning of the
bitset each time.  On amd64 and arm64, CPU sets have size 256, so there
are four limbs in the bitset and we do a lot of unnecessary scanning.

A second problem is that this is destructive, so code which needs to
preserve the original set has to make a copy.  In particular, we have
quite a few functions which take a cpuset_t parameter by value, meaning
that each call has to copy the 32 byte cpuset_t.

The new macros address both problems.
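
The idea can be sketched in userspace C as follows. This is not the bitset(9) implementation: struct sketch_bitset and its helpers are hypothetical, and __builtin_ctzll() is a GCC/Clang builtin assumed to be available. The search resumes from the limb containing the starting position and never writes to the set.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_NLIMBS    4     /* 4 x 64 = 256 bits, like a 256-CPU set */
#define SKETCH_LIMB_BITS 64

struct sketch_bitset {
    uint64_t limbs[SKETCH_NLIMBS];
};

/* Set bit n (hypothetical helper). */
static void
sketch_bit_set(struct sketch_bitset *s, unsigned n)
{
    assert(n < SKETCH_NLIMBS * SKETCH_LIMB_BITS);
    s->limbs[n / SKETCH_LIMB_BITS] |= (uint64_t)1 << (n % SKETCH_LIMB_BITS);
}

/*
 * Return the index of the first set bit at position >= from, or -1 if
 * there is none.  The set is only read, so no copy is needed.
 */
static int
sketch_bit_next(const struct sketch_bitset *s, unsigned from)
{
    for (unsigned i = from / SKETCH_LIMB_BITS; i < SKETCH_NLIMBS; i++) {
        uint64_t limb = s->limbs[i];

        /* In the starting limb, ignore bits below 'from'. */
        if (i == from / SKETCH_LIMB_BITS)
            limb &= ~(uint64_t)0 << (from % SKETCH_LIMB_BITS);
        if (limb != 0)
            return ((int)(i * SKETCH_LIMB_BITS +
                (unsigned)__builtin_ctzll(limb)));
    }
    return (-1);
}
```

A caller would loop with something like `for (b = sketch_bit_next(&s, 0); b >= 0; b = sketch_bit_next(&s, (unsigned)b + 1))`, leaving the original set intact.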

Reviewed by: cem, kib
Sponsored by: The FreeBSD Foundation

(cherry picked from commit dfd3bde5775ecf88851d5dffd6a8ed6076b53566)

signal: Add SIG_FOREACH and refactor issignal()

Add a SIG_FOREACH macro that can be used to iterate over a signal set.
This is a bit cleaner and more efficient than calling sig_ffs() in a
loop. The implementation is based on BIT_FOREACH_ISSET(), except
that the bitset limbs are always 32 bits wide, and signal sets are
1-indexed rather than 0-indexed like bitset(9) sets.

issignal() cannot really be modified to use SIG_FOREACH() directly.
Take this opportunity to split the function into two explicit loops.
I've always found this function hard to read and think that this change
is an improvement.

Remove sig_ffs(), nothing uses it now.

Reviewed by: kib
Sponsored by: The FreeBSD Foundation

(cherry picked from commit 81f2e9063d64cc976b47e7ee1e9c35692cda7cb4)

sort: Fix random sort

bwsrawdata() is supposed to return the string buffer.

PR: 259451
Reported by: sigsys@gmail.com
Fixes: d053fb22f6d3 ("usr.bin/sort: Avoid UBSan errors")
Sponsored by: The FreeBSD Foundation

(cherry picked from commit e9bfb50d5e7aa5d673a5a35318820320c4190d33)

hyperv: Register hyperv_timecounter later during boot

Previously the MSR-based timecounter was registered during
SI_SUB_HYPERVISOR, i.e., very early during boot, and before SI_SUB_LOCK.
After commit 621fd9dcb2d8 this triggers a panic since the timecounter
list lock is not yet initialized.

The hyperv timecounter does not need to be registered so early, so defer
that to SI_SUB_DRIVERS, at the same time the hyperv TSC timecounter is
registered.

Reported by: whu
Approved by: whu
Fixes: 621fd9dcb2d8 ("timecounter: Lock the timecounter list")
Sponsored by: The FreeBSD Foundation

(cherry picked from commit 9ef7df022a467776aa616b92fe5783e4261e84c6)

nfscl: Handle NFSv4.1/4.2 Close RPC NFSERR_DELAY replies better

Without this patch, if an NFSv4.1/4.2 server replies NFSERR_DELAY to
a Close operation, the client loops retrying the Close while holding
a shared lock on the clientID. This shared lock blocks returns of
delegations, even though the server has issued a CB_RECALL to request
the delegation return.

This patch delays doing a retry of a Close that received a reply of
NFSERR_DELAY until after the shared lock on the clientID is released,
for NFSv4.1/4.2. To fix this for NFSv4.0 would be very difficult and
since the only known NFSv4 server to reply NFSERR_DELAY to Close only
does NFSv4.1/4.2, this fix is hoped to be sufficient.

This problem was detected during a recent IETF working group NFSv4
testing event.

(cherry picked from commit 52dee2bc035545f7ae2b838d8a0449f65043cd8a)

nfscl: Modify Close RPC so that it does not use "owner" for NFSv4.1/4.2

This patch modifies the function that does the Close RPC (nfsrpc_closerpc)
so that it does not use the open_owner (nfso_own) for NFSv4.1/4.2.
Use of the seqid in the open_owner structure is only needed for NFSv4.0.
Same applies to an NFSERR_STALESTATEID reply, which should only happen
for NFSv4.0. This allows nfsrpc_closerpc() to be called when nfso_own
is no longer valid. This, in turn, allows nfsrpc_closerpc() to be called
after the shared lock on the clientID is released, for NFSv4.1/4.2.

This is being done to prepare the code for a future patch that fixes
the case where an NFSv4.1/4.2 server replies NFSERR_DELAY to a Close
operation.

(cherry picked from commit d95c0a12a2dd58b4b13cbc2d1a9fccd848f8ac5e)

systat: Handle SIGWINCH to properly handle window resizing and adjust -swap disk stat based on new size.

(cherry picked from commit 66483838039b21a20d748448f8916a73ec419691)

Augment systat(1) -swap to display large swap space processes

(cherry picked from commit 57e5da2c98003e5ab77a337e9fbe22ab7e512ba7)

libutil: add kinfo_getswapvmobject(3)

(cherry picked from commit f2069331e5821f4c2b65d82af2809946a34158d2)

sysctl vm.objects: yield if hog

(cherry picked from commit 350fc36b4cf896cbfce657a6dab600b26367a34a)

vm.objects_swap: disable reporting some information

(cherry picked from commit 7738118e9a298a205b37c256245fd8449acccb0c)

Add vm.swap_objects sysctl

(cherry picked from commit 42812ccc969f174b3e5827c1c320b1738a1e0985)

vm_object_list: split sysctl handler in separate function

(cherry picked from commit 1b610624fdc851f54871f7ee4d67642f5879096f)

Makefile.inc1: Remove mentions of removed target "update"

This is follow-up to commits e290182bcf38 and 1f7d11e636ab.

(cherry picked from commit eab5358b90804669681b639f76ff7e5707e27138)

config(5): Update upper limit for maxusers on 64-bit systems

The limit of 384 maxusers for auto configuration was only imposed on
32-bit systems. Document that maxusers scales above 384 based on memory
for 64-bit systems.

PR: 204938
Reported by: David Höppner <0xffea@gmail.com>

(cherry picked from commit 191c624d9519a2767801de390b192ee7a96b41cd)

Revert "bhyve: Map the MSI-X table unconditionally for passthrough"

This reverts commit 382eec24c0284bd7dc5997b85abc9ee70ea704a1.

This change causes a regression where a VM using passthrough no longer
starts. Until this is resolved, revert the commit.

Reported by: Raúl Muñoz <raul.munoz@custos.es>

Revert "bhyve: Fix the WITH_BHYVE_SNAPSHOT build"

This reverts commit 000b70f038f4fd6893d69bd3dce75a416cd13dfe.

sh: Set PATH envvar after setting HOME in dotfile

In single-user mode, all env vars are absent, so exptilde() would not be
able to expand ~ correctly.
Place the lines setting PATH below HOME, so exptilde() would work as
expected.

Sponsored by: The FreeBSD Foundation
Reviewed by: jilles, emaste
Differential Revision: https://reviews.freebsd.org/D27003

(cherry picked from commit fcfa64801a4fe836ff481465ea068e791aa4ce6a)

bhyve: Fix the WITH_BHYVE_SNAPSHOT build

Note, this breaks compatibility with snapshots generated by older builds
of bhyve(8).

Fixes: 7fa233534736 ("bhyve: Map the MSI-X table unconditionally for passthrough")
Reported by: Greg V <greg@unrelenting.technology>
Reviewed by: grehan, bz
Sponsored by: The FreeBSD Foundation

(cherry picked from commit 77bc75c7abd29de69d3ef35b66c23c7baba95094)

bhyve: Map the MSI-X table unconditionally for passthrough

It is possible for the PBA to reside in the same page as the MSI-X
table.  And, while devices are not supposed to do this, at least some
Intel wifi devices place registers in a page shared with the MSI-X
table.  To handle the first case we currently map the PBA page using
/dev/mem, and the second case is not handled.

Kill two birds with one stone: map the MSI-X table BAR using the
PCIOCBARMMAP ioctl instead of /dev/mem, and map the entire table so that
accesses beyond the bounds of the table can be emulated.  Regions of the
BAR not containing the table are left unmapped.

Reviewed by: bz, grehan, jhb
Sponsored by: The FreeBSD Foundation

(cherry picked from commit 7fa2335347362378322a4d27cb40f6e6cd5dd0fb)

bxe(4): Fix a few common typos in source code comments

- s/controled/controlled/
- s/allignment/alignment/

(cherry picked from commit 80abcfbdfe1af72318c2c0b1690013f43e875267)

jail(8): Fix a few common typos in source code comments

- s/phyiscal/physical/

(cherry picked from commit 70de1003da6f6e78e32f92bd98c9f18f965e6663)

nfscl: Move release of the clientID lock into nfscl_doclose()

This patch moves release of the shared clientID lock from nfsrpc_close(),
just after the nfscl_doclose() call, to the end of nfscl_doclose().
This does make the code cleaner, since the shared lock is acquired at
the beginning of nfscl_doclose(). The only semantics change is that
the code no longer drops and reacquires the NFSCLSTATELOCK() mutex,
which I do not believe will have a negative effect on the NFSv4 client.

This is being done to prepare the code for a future patch that fixes
the case where an NFSv4.1/4.2 server replies NFSERR_DELAY to a Close
operation.

(cherry picked from commit e2aab5e2d73486aa76bb861d583bbce021661601)

iscsi: Abort data-out tasks queued on a terminating session.

cfiscsi_datamove_out() can race with cfiscsi_session_terminate_tasks()
and enqueue a new task after the latter function has aborted existing
tasks. This could result in a deadlock as
cfiscsi_session_terminate_tasks() waited forever for this task to
complete.

Reviewed by: mav
Sponsored by: Chelsio Communications
Differential Revision: https://reviews.freebsd.org/D31892

(cherry picked from commit 0cd6e85e242bb07a33df9a6314e90bcb0ba99576)

iscsi: Add a helper routine to abort a data-out task.

Reviewed by: mav
Sponsored by: Chelsio Communications
Differential Revision: https://reviews.freebsd.org/D31891

(cherry picked from commit 529364b032d774bff4dc818ff23d20be482f9d99)

ctld: Disable TCP DDP for connection sockets.

cxgbei is not able to offload PDU processing for a socket using TCP
DDP offload.

Sponsored by: Chelsio Communications

(cherry picked from commit 3b5f95d7bd20e366d720a47a79c451ae037a3ae1)

iscsid: Disable TCP DDP for connection sockets.

cxgbei is not able to offload PDU processing for a socket using TCP
DDP offload.

Sponsored by: Chelsio Communications

(cherry picked from commit 91c62d626d0e9995da9dc424120a4f1b0b987eea)

cxgbei: Only convert "plain" TCP connections to ISCSI.

Reject attempts to convert a connection using a different ULP
mode (e.g. DDP or TLS) to ISCSI.

Reported by: Jithesh Arakkan @ Chelsio
Sponsored by: Chelsio Communications

(cherry picked from commit f63ddf465fe09d3547deaf80fbdb91bc7b816dfb)

cxgbei: Return early for EBUSY error in icl_cxgbei_conn_handoff.

This permits unindenting almost half of the function.

Sponsored by: Chelsio Communications

(cherry picked from commit b7caa8157602f4eb9acd2729b48ba3a0c0cdc045)

cxgbei: Disable ISO for -SO cards without external memory.

Reported by: Jithesh Arakkan @ Chelsio
Sponsored by: Chelsio Communications

(cherry picked from commit 9b1bb0aee697352b39b3efa1843f581ca29068ba)

cxgbei: Handle errors in PDUs.

When a PDU with an error (bad padding, header digest, or data digest)
is received, log the error via ICL_WARN() and then reset the
connection via the ic_error callback.

While here, add per-rxq counters for errors.

Sponsored by: Chelsio Communications

(cherry picked from commit 4d4cf62e29b06a763dfa8b218de38c8d2cf051bb)

cxgbei: Add sysctls to report the maximum data segment lengths.

These sysctls report the maximum data segment lengths supported by an
adapter. These are the values advertised to the remote end during the
login phase.

Sponsored by: Chelsio Communications

(cherry picked from commit d39e65b5bdc04cac4521ad8e071015cd751c2302)

cxgbei: Limit T5 transmit data segments to 15k.

This avoids exceeding a limit in the firmware when using ISO with
jumbo frames.

Reported by: Jithesh Arakkan @ Chelsio
Sponsored by: Chelsio Communications

(cherry picked from commit 64f09f2346650f02b6deccbe05bb02b88fce4a5e)

iscsi: Teach the iSCSI stack about "large" received PDUs.

When using iSCSI PDU offload (cxgbei) on T6 adapters, a burst of
received PDUs can be reported via a single message to the driver.

Previously the driver passed these multi-PDU bursts up to the iSCSI
stack as a single "large" PDU by rewriting the buffer offset, data
segment length, and DataSN fields in the iSCSI header.  The DataSN
field in particular was rewritten so that each of the "large" PDUs
used consecutively increasing values.  While this worked, the forged
DataSN values did not match the ExpDataSN value in the subsequent SCSI
Response PDU.  The initiator does not currently verify this value, but
the forged DataSN values prevent adding a check.

To avoid this, allow a logical iSCSI PDU (struct icl_pdu) to describe
a burst of PDUs via a new 'ip_additional_pdus' field.  Normally this
field is set to zero when 'struct icl_pdu' represents a single PDU.
If a logical PDU represents a burst of on-the-wire PDUs, then
'ip_additional_pdus' contains the count of additional on-the-wire
PDUs.  The header of this
"large" PDU is still modified, but the DataSN field now contains the
DataSN value of the first on-the-wire PDU in the burst.
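
As a rough illustration of the accounting this enables, the sketch below
computes the DataSN a receiver would expect after a logical PDU.  The
struct and function names are hypothetical stand-ins, not the actual
'struct icl_pdu' layout; only the idea of a count of additional PDUs
comes from the description above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, trimmed-down stand-in for 'struct icl_pdu'. */
struct pdu_sketch {
	uint32_t datasn;		/* DataSN of the first on-the-wire PDU */
	uint32_t additional_pdus;	/* 0 when this is a single PDU */
};

/* DataSN expected on the PDU that follows this logical PDU. */
static uint32_t
next_expected_datasn(const struct pdu_sketch *p)
{
	assert(p != NULL);
	/* One value for the first PDU plus one per additional PDU. */
	return (p->datasn + 1 + p->additional_pdus);
}
```

With this accounting, a burst of four on-the-wire PDUs starting at
DataSN 7 advances the expected value to 11, the same as four separately
reported PDUs would.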

Reviewed by: mav
Sponsored by: Chelsio Communications
Differential Revision: https://reviews.freebsd.org/D31577

(cherry picked from commit c261b6ea4e2ef1fc6a446443ee594ad76f392350)

cxgbei: Restrict received PDUs to 4 DDP pages in length.

Reviewed by: np
Sponsored by: Chelsio Communications
Differential Revision: https://reviews.freebsd.org/D31576

(cherry picked from commit d75b0870e542613e63d9f4ac8ec9fb22817e34fa)

cxgbei: Only round PDU data segment lengths down by 512 on T5.

Reviewed by: np
Sponsored by: Chelsio Communications
Differential Revision: https://reviews.freebsd.org/D31575

(cherry picked from commit f28715fdc1f7e801b260369787e7bcd633a481bb)

cxgbei: Restructure how PDU limits are managed.

- Compute data segment limits in read_pdu_limits() rather than PDU
length limits.

- Add back connection-specific PDU overhead lengths to compute PDU
length limits in icl_cxgbei_conn_handoff().
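
A minimal sketch of the second step, assuming the per-connection
overhead consists of the 48-byte BHS plus optional 4-byte CRC32C header
and data digests (AHS is ignored here).  The function name and
parameters are illustrative and are not taken from the driver.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define BHS_SIZE	48u	/* iSCSI Basic Header Segment */
#define DIGEST_SIZE	4u	/* CRC32C header or data digest */

/*
 * Derive a PDU length limit from a data segment limit by adding back
 * the overhead that applies to this particular connection.
 */
static uint32_t
pdu_length_limit(uint32_t max_data_segment, bool header_digest,
    bool data_digest)
{
	uint32_t overhead = BHS_SIZE;

	if (header_digest)
		overhead += DIGEST_SIZE;
	if (data_digest)
		overhead += DIGEST_SIZE;
	assert(max_data_segment <= UINT32_MAX - overhead);
	return (max_data_segment + overhead);
}
```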

Reviewed by: np
Sponsored by: Chelsio Communications
Differential Revision: https://reviews.freebsd.org/D31574

(cherry picked from commit cbc186360c658eda884ed97f37cdc2d1b6512b91)

cxgbei: Wait for the final CPL to be received in icl_cxgbei_conn_close.

A socket in the FIN_WAIT_1 state is marked disconnected by
do_close_con_rpl() even though there might still be receive data pending.
This is because the socket at that point has set SBS_CANTRCVMORE which
causes the protocol layer to discard any data received before the FIN.
However, icl_cxgbei_conn_close needs to wait until all the data has
been discarded. Replace the wait for SS_ISDISCONNECTED with instead
waiting for final_cpl_received() to be called.

Reported by: Jithesh Arakkan @ Chelsio
Sponsored by: Chelsio Communications

(cherry picked from commit 2eb0e53a6b5ec1a72be70e966d4e562e1a8d4e88)

cxgbei: Support for ISO (iSCSI segmentation offload).

ISO can be disabled before establishing a connection by setting
dev.tNnex.N.toe.iso to 0.

Sponsored by: Chelsio Communications
Differential Revision: https://reviews.freebsd.org/D31223

(cherry picked from commit 5b27e4b27caae840bd79ccc5cb7811a0c9acc656)

iSCSI: Add support for segmentation offload for hardware offloads.

Similar to TSO, iSCSI segmentation offload permits the upper layers to
submit a "large" virtual PDU which is split up into multiple segments
(PDUs) on the wire.  Similar to how the TCP/IP headers are used as
templates for TSO, the BHS at the start of a large PDU is used as a
template to construct the specific BHS at the start of each PDU.  In
particular, the DataSN is incremented for each subsequent PDU, and the
'F' flag is only set on the last PDU.
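
The template idea can be sketched in plain C as follows.  This is an
illustration only: 'struct bhs_sketch' and iso_segment() are
hypothetical names, real hardware does this work on the adapter, and
advancing the buffer offset per segment is an assumption added to keep
the example complete.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical subset of the BHS fields that vary per wire PDU. */
struct bhs_sketch {
	uint32_t datasn;
	uint32_t buffer_offset;
	uint32_t data_len;
	bool	 final;		/* the 'F' flag */
};

/*
 * Split the large virtual PDU described by 'tmpl' into wire PDUs that
 * carry at most 'max_seg' bytes each.  Returns the number of headers
 * written to 'out', or 0 if 'out' is too small.
 */
static size_t
iso_segment(const struct bhs_sketch *tmpl, uint32_t max_seg,
    struct bhs_sketch *out, size_t out_cap)
{
	uint32_t remaining = tmpl->data_len;
	size_t n = 0;

	assert(max_seg > 0);
	do {
		uint32_t len = remaining < max_seg ? remaining : max_seg;

		if (n == out_cap)
			return (0);
		out[n] = *tmpl;		/* start from the template */
		out[n].datasn = tmpl->datasn + (uint32_t)n;
		out[n].buffer_offset = tmpl->buffer_offset +
		    (tmpl->data_len - remaining);
		out[n].data_len = len;
		remaining -= len;
		/* 'F' survives only on the last wire PDU. */
		out[n].final = tmpl->final && remaining == 0;
		n++;
	} while (remaining > 0);
	return (n);
}
```

The return value is also the number of DataSN values consumed, which is
what the upper layers need in order to pick the DataSN for the next
large PDU.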

struct icl_conn has a new 'ic_hw_isomax' field which defaults to 0,
but can be set to the largest virtual PDU a backend supports.  If this
value is non-zero, the iSCSI target and initiator use this size
instead of 'ic_max_send_data_segment_length' to determine the maximum
size for SCSI Data-In and SCSI Data-Out PDUs.  Note that since PDUs
can be constructed from multiple buffers before being dispatched, the
target and initiator must wait for the PDU to be fully constructed
before determining how many DataSN values were consumed (and thus
updating the per-transfer DataSN value used for the start of the next
PDU).

The target generates large PDUs for SCSI Data-In PDUs in
cfiscsi_datamove_in().  The initiator generates large PDUs for SCSI
Data-Out PDUs generated in response to an R2T.

Reviewed by: mav
Sponsored by: Chelsio Communications
Differential Revision: https://reviews.freebsd.org/D31222

(cherry picked from commit f0594f52f6fdabecee134dd5700bf936283959ad)

iscsi: Remove icl_soft-only fields from struct icl_conn.

Create a struct icl_soft_conn which extends struct icl_conn and
move fields only used by icl_soft from struct icl_conn to
struct icl_soft_conn.

Reviewed by: mav
Sponsored by: Chelsio Communications
Differential Revision: https://reviews.freebsd.org/D31414

(cherry picked from commit 87322a907545fa76fbaf7949f80e85b1377a53ad)

cxgbe tom: Permit rcv_nxt mismatches on FIN for iSCSI connections on T6.

The remote peer might send a FIN in the middle of a burst of data
PDUs.  In the case of T6 with data PDU completion moderation, the
driver would not have seen these PDUs since the final PDU in the burst
was never received resulting in a stale rcv_nxt when the FIN is
received.

While here, invert the logic in the condition to be more readable and
always set tp->rcv_nxt from the sequence number in the CPL.  This sets
the proper value of rcv_nxt for FINs on connections with data received
but not reported via a CPL (e.g. a partial iSCSI PDU burst interrupted
by a FIN).

Reported by: Jithesh Arakkan @ Chelsio
Reviewed by: np
Sponsored by: Chelsio Communications
Differential Revision: https://reviews.freebsd.org/D30871

(cherry picked from commit d59f1c49e26ba29e7583019bb5d6aa029466fdb6)

cxgbe tom: Update rcv_nxt for a FIN after handle_ddp_close().

For TCP DDP, handle_ddp_close() needs to see the pre-FIN rcv_nxt to
determine how much data was placed in the local buffer before the FIN
was received.  The changes in d59f1c49e26b broke this by updating
rcv_nxt before calling handle_ddp_close().

Fixes: d59f1c49e26b cxgbe tom: Permit rcv_nxt mismatches on FIN for iSCSI connections on T6.
Sponsored by: Chelsio Communications

(cherry picked from commit 5dbf8c1588da167c17c45bdf78de51fcb4929504)

cxgbei: Round up the maximum PDU data length by the MSS for TXDATAPLEN_MAX.

Recent firmware versions round down the value passed here by the MSS
and subsequently mishandle transmitted PDUs larger than the rounded
down value.
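
The arithmetic behind the workaround can be sketched as below.  The
helper name is hypothetical; the point is that rounding up by the MSS
first means the firmware's subsequent round-down cannot land below the
intended maximum.

```c
#include <assert.h>
#include <stdint.h>

/* Round 'len' up to the next multiple of 'mss'. */
static uint32_t
roundup_to_mss(uint32_t len, uint32_t mss)
{
	assert(mss > 0);
	assert(len <= UINT32_MAX - (mss - 1));
	return (((len + mss - 1) / mss) * mss);
}
```

For example, with an MSS of 1448 a maximum of 8240 bytes becomes 8688;
rounding 8688 back down by the MSS still yields 8688, which is no
smaller than the original 8240.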

Reported by: Jithesh Arakkan @ Chelsio
Sponsored by: Chelsio Communications

(cherry picked from commit d0d631d5f4437223664f7bbdfdb421ec05cf9657)

cxgbei: Wait for socket to close in icl_cxgbei_conn_close.

This ensures the TOE has finished processing any in-flight received
data before returning to the caller. The caller assumes it is safe to
free any open tasks or transfers (and associated buffers) after this
function returns.

Previously, data placed directly via DDP could be written to buffers
after the caller had freed the buffers.

Reported by: Jithesh Arakkan @ Chelsio
Sponsored by: Chelsio Communications

(cherry picked from commit 67495c13d0bc25c57ebf0103e9d2af7c4a3088c9)

cxgbei: Don't assert F for data completion PDUs.

If a data PDU encounters an error such as a digest error, the firmware
will report that data PDU when completion moderation is active even if
it is not the final data PDU in a burst.

Sponsored by: Chelsio Communications

(cherry picked from commit b5e73dd952f9d5224e9e076bb9719f7bcec871b0)

cxgbei: Remove invalid assertion.

A non-placed PDU can be delivered by CPL_RX_ISCSI_CMP in the middle of
a burst of placed PDUs (received via DDP) in which case the rcv_nxt
will not match the start of the non-placed PDU.

Reported by: Jithesh Arakkan @ Chelsio
Sponsored by: Chelsio Communications

(cherry picked from commit 4a7d15ebb6afe1b662afd2fde0ed2725790a1ba1)

cxgbei: Better handle new tasks and transfers when disconnecting.

If the connection is in the process of disconnecting, ic_socket can be
NULL. For icl_cxgbei_conn_transfer_setup(), lock the connection and
check ic_socket before using it. For icl_cxgbei_conn_task_setup(),
the caller already holds the connection lock, so assert it and bail
early with ECONNRESET if the connection is disconnecting.

Reported by: Jithesh Arakkan @ Chelsio
Fixes: f949967c8eb3 cxgbei: Fix a race between transfer setup and a peer reset.

(cherry picked from commit abc273a2901b116cc98a1fb506c75ac1b0a14cd3)

cxgbe tom: Free pending iSCSI mbufs on connection shutdown.

If an iSCSI connection is shut down abruptly (e.g. by a RST from the
peer), pending iSCSI PDUs and page pod work requests can be in the
ulp_pduq when the final CPL is received indicating the death of the
connection.

Reported by: Jithesh Arakkan @ Chelsio

(cherry picked from commit 677cb9722a64d3f944d3e374e0ef1bb0e45644b5)

cxgbei: Fix a race between transfer setup and a peer reset.

In 4427ac3675f9, the TOM driver stopped sending work requests to
program iSCSI page pods directly and instead queued them to be written
asynchronously with iSCSI PDUs.  The queue of mbufs to send is
protected by the inp lock.  However, the inp cannot be safely obtained
from the toep since a RST from the remote peer might have cleared
toep->inp asynchronously in an ithread.  To fix, obtain the inp from
the socket as is already done in icl_cxgbei_conn_pdu_queue_cb() and
fail the new transfer setup with ECONNRESET if the connection has been
reset.

To avoid passing sockets or inps into the page pod routines, pull the
mbufq out of the two relevant page pod routines such that the routines
queue new work request mbufs to a caller-supplied mbufq.

Reported by: Jithesh Arakkan @ Chelsio
Fixes: 4427ac3675f91df039d54a23518132e0e0fede86

(cherry picked from commit f949967c8eb3ab5e5a965e3cf07a726dfdc81263)

cxgbei: Support iSCSI offload on T6.

T6 makes several changes relative to T5 for receive of iSCSI PDUs.

First, earlier adapters issue either 2 or 3 messages to the host for
each PDU received: CPL_ISCSI_HDR contains the BHS of the PDU,
CPL_ISCSI_DATA (when DDP is not used for zero-copy receive) contains
the PDU data as buffers on the freelist, and CPL_RX_ISCSI_DDP with
status of the PDU such as result of CRC checks.  In T6, a new
CPL_RX_ISCSI_CMP combines CPL_ISCSI_HDR and CPL_RX_ISCSI_DDP.  Data
PDUs which are directly placed via DDP only report a single
CPL_RX_ISCSI_CMP message.  Data PDUs received on the free lists are
reported as CPL_ISCSI_DATA followed by CPL_RX_ISCSI_CMP.  Control PDUs
such as R2T are still reported via CPL_ISCSI_HDR and CPL_RX_ISCSI_DDP.

Supporting this requires changing the CPL_ISCSI_DATA handler to
allocate a PDU structure if it is not preceded by a CPL_ISCSI_HDR as
well as support for the new CPL_RX_ISCSI_CMP.

Second, when using DDP for zero-copy receive, T6 will only issue a
CPL_RX_ISCSI_CMP after a burst of PDUs has been received (indicated
by the F flag in the BHS).  In this case, the CPL_RX_ISCSI_CMP can
reflect the completion of multiple PDUs and the BHS and TCP sequence
number included in the message are from the last PDU received in the
burst.  Notably, the message does not include any information about
earlier PDUs received as part of the burst.  Instead, the driver must
track the amount of data already received for a given transfer and use
this to compute the amount of data received in a burst.  In addition,
the iSCSI layer currently has no way to permit receiving a logical PDU
which spans multiple PDUs.  Instead, the driver presents each burst as
a single, "large" PDU to the iSCSI target and initiators.  This is
done by rewriting the buffer offset and data length fields in the BHS
of the final PDU as well as rewriting the DataSN so that the received
PDUs appear to be in order.

To track all this, cxgbei maintains a hash table of 'cxgbei_cmp'
structures indexed by transfer tags for each offloaded iSCSI
connection.  When a SCSI_DATA_IN message is received, the ITT from the
received BHS is used to find the necessary state in the hash table,
whereas SCSI_DATA_OUT replies use the TTT as the key.  The structure
tracks the expected starting offset and DataSN of the next burst as
well as the rewritten DataSN value used for the previously received
PDU.
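
A simplified sketch of this bookkeeping is shown below.  It is not the
driver's code: the struct layout, bucket count, and function names are
assumptions, and only the offset tracking is modeled (DataSN handling is
left out).  The burst length is derived from the last PDU's buffer
offset and data length together with the offset at which the burst was
expected to start.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CMP_BUCKETS	16u	/* arbitrary size for the sketch */

/* Hypothetical per-transfer state, keyed by ITT or TTT. */
struct cmp_sketch {
	struct cmp_sketch *next;
	uint32_t tag;
	uint32_t next_buffer_offset;	/* expected start of next burst */
};

struct cmp_table {
	struct cmp_sketch *buckets[CMP_BUCKETS];
};

static void
cmp_insert(struct cmp_table *t, struct cmp_sketch *c)
{
	struct cmp_sketch **head = &t->buckets[c->tag % CMP_BUCKETS];

	c->next = *head;
	*head = c;
}

static struct cmp_sketch *
cmp_find(struct cmp_table *t, uint32_t tag)
{
	struct cmp_sketch *c;

	for (c = t->buckets[tag % CMP_BUCKETS]; c != NULL; c = c->next)
		if (c->tag == tag)
			return (c);
	return (NULL);
}

/*
 * Given the buffer offset and data length of the last PDU in a burst,
 * return the total bytes received in the burst and advance the
 * expected offset for the next one.
 */
static uint32_t
cmp_burst_length(struct cmp_sketch *c, uint32_t last_offset,
    uint32_t last_len)
{
	uint32_t end = last_offset + last_len;
	uint32_t burst;

	assert(end >= c->next_buffer_offset);
	burst = end - c->next_buffer_offset;
	c->next_buffer_offset = end;
	return (burst);
}
```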

Discussed with: np
Sponsored by: Chelsio Communications
Differential Revision: https://reviews.freebsd.org/D30458

(cherry picked from commit 67360f7bb0bb575d823c21420abaf165ecf62066)

iscsi: Move the maximum data segment limits into 'struct icl_conn'.

This fixes a few bugs in iSCSI backends where the backends were using
the limits they advertised initially during the login phase as the
final values instead of the values negotiated with the other end.

Reported by: Jithesh Arakkan @ Chelsio
Reviewed by: mav
Differential Revision: https://reviews.freebsd.org/D30271

(cherry picked from commit 0cc7d64a2a37533afe03d2b640dc107be41b5f56)

iscsi: Always free a cdw before its associated ctl_io.

cxgbei stores state about a target transfer in the ctl_private[] array
of a ctl_io that is freed when a target transfer (represented by the
cdw) is freed.  As such, freeing a ctl_io before a cdw that references
it can result in a use after free in cxgbei.  Two of the four places
freed the cdw first, and the other two freed the ctl_io first.  Fix
the latter two places to free the cdw first.
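
The ordering constraint can be illustrated with the following sketch.
All names here are hypothetical stand-ins for the cdw and ctl_io
handling; the assertions simply make the "cdw first" rule explicit.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Stand-in for a ctl_io carrying backend-private transfer state. */
struct io_sketch {
	bool freed;
	int  backend_state;
};

/* Stand-in for a cdw, which references its ctl_io. */
struct cdw_sketch {
	struct io_sketch *io;
	bool freed;
};

static void
io_free(struct io_sketch *io)
{
	assert(!io->freed);
	io->freed = true;
}

static void
cdw_free(struct cdw_sketch *cdw)
{
	/* Tearing down the transfer touches state stored in the I/O. */
	assert(!cdw->io->freed);
	cdw->io->backend_state = 0;
	cdw->freed = true;
}

/* Release both objects in the safe order: cdw first, then the I/O. */
static void
transfer_release(struct cdw_sketch *cdw)
{
	struct io_sketch *io = cdw->io;

	cdw_free(cdw);
	io_free(io);
}
```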

Reported by: Jithesh Arakkan @ Chelsio
Reviewed by: mav
Differential Revision: https://reviews.freebsd.org/D30270

(cherry picked from commit 71e3d1b3a0ee4080c53615167bde4d93efe103fe)

cxgbei: Add tunable sysctls for the FirstBurstLength and MaxBurstLength.

Reviewed by: np
Sponsored by: Chelsio Communications
Differential Revision: https://reviews.freebsd.org/D30269

(cherry picked from commit 3bede2908acc6cbc8e809d63d7c9b5fd95932dfb)