mpr(4): Fix unmatched devq release.

Before this change the devq was frozen only if some command was sent to
the target after the reset started, but the release was always called.
This change freezes the devq immediately, leaving the check in
mprsas_action_scsiio() only to cover the race condition caused by the
devq using a different lock.

This should also avoid unnecessary requeueing of commands, which creates
additional log noise and confuses some broken apps.

MFC after: 2 weeks
Sponsored by: iXsystems, Inc.

(cherry picked from commit 9781c28c6d63cfa8438d1aa31f512a6b217a6b2b)

mpr(4): Handle mprsas_alloc_tm() errors on device removal.

The SAS9305-16e with firmware 16.00.01.00 reports a HighPriorityCredit
of only 8, while for comparison some other combinations I have report
100 or even 128. When a large JBOD is detached, the requirement to send
a target reset command to every target at the same time overflows the
limit and, without adequate handling, leaves devices stuck in a
half-detached state, preventing later re-attach.

To handle that, on allocation error mark the target with the new
MPRSAS_TARGET_TOREMOVE flag, and retry the removal the next time
something else frees a high priority command. With this patch I can
successfully detach/attach a 102 disk JBOD from/to the SAS9305-16e.

MFC after: 2 weeks
Sponsored by: iXsystems, Inc.

(cherry picked from commit e3c5965c259f7029afe01612b248c3acf9f5b3e0)
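
The deferred-removal idea in the commit above can be sketched roughly as
follows. This is a simplified model, not the driver code: the ex_* names,
the flag value and the single-retry policy are all assumptions made for
illustration, and the credit counter stands in for the pool of high
priority commands.

```c
#include <stdbool.h>
#include <stddef.h>

#define EX_TARGET_TOREMOVE	0x1	/* hypothetical flag value */

struct ex_target {
	unsigned	flags;
	bool		reset_sent;	/* a target reset has been issued */
};

struct ex_softc {
	int			hp_credits;	/* free high priority commands */
	struct ex_target	*targets;
	size_t			ntargets;
};

/* Start removing a target, or mark it for a later retry if no credit is left. */
void
ex_prepare_remove(struct ex_softc *sc, struct ex_target *t)
{
	if (sc->hp_credits == 0) {
		/* Allocation failed: remember that this target still needs removal. */
		t->flags |= EX_TARGET_TOREMOVE;
		return;
	}
	sc->hp_credits--;
	t->flags &= ~EX_TARGET_TOREMOVE;
	t->reset_sent = true;
}

/* A high priority command was freed: return the credit and retry one target. */
void
ex_free_high_priority(struct ex_softc *sc)
{
	size_t i;

	sc->hp_credits++;
	for (i = 0; i < sc->ntargets; i++) {
		if ((sc->targets[i].flags & EX_TARGET_TOREMOVE) != 0) {
			ex_prepare_remove(sc, &sc->targets[i]);
			break;
		}
	}
}
```

With two credits and three targets, the third target is only flagged at
first, and its reset goes out once one of the earlier commands is freed.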

mpr/mps: Minor state machine fix

When a DMA chain can't be loaded, set the state to STATE_INQUEUE so that
the mp[rs]_complete_command can properly fail the command.

Sponsored by: Netflix

(cherry picked from commit 33755dbb207878c10fd99de39dadf89fad713bc7)

Fix mpr(4) and mps(4) state transitions and a use-after-free panic.

When the mpr(4) and mps(4) drivers probe a SATA device, they issue an
ATA Identify command (via mp{s,r}sas_get_sata_identify()) before the
target is fully set up in the driver.  The drivers wait for completion of
the identify command, and have a 5 second timeout.  If the timeout
fires, the command is marked with the SATA_ID_TIMEOUT flag so it can be
freed later.

That is where the use-after-free problem comes in.  Once the ATA
Identify times out, the driver sends a target reset, and then frees any
identify commands that have timed out.  But, once the target reset
completes, commands that were queued to the drive are returned to the
driver by the controller.

At that point, the driver (in mp{s,r}_intr_locked()) looks up the
command descriptor for that particular SMID, marks it CM_STATE_BUSY and
sends it on for completion handling.

The problem at this stage is that the command has already been freed,
and put on the free queue, so its state is CM_STATE_FREE.  If INVARIANTS
are turned on, we get a panic as soon as this command is allocated,
because its state is no longer CM_STATE_FREE, but rather CM_STATE_BUSY.

So, the solution is to not free ATA Identify commands that get stuck
until they actually return from the controller.  Hopefully this works
correctly on older firmware versions.  If not, it could result in
commands hanging around indefinitely.  But, the alternative is a
use-after-free panic or assertion (in the INVARIANTS case).

This also tightens up the state transitions between CM_STATE_FREE,
CM_STATE_BUSY and CM_STATE_INQUEUE, so that the state transitions happen
once, and we have assertions to make sure that commands are in the
correct state before transitioning to the next state.  Also, for each
state assertion, we print out the current state of the command if it is
incorrect.

mp{s,r}.c:      Add a new sysctl variable, dump_reqs_alltypes,
                that controls the behavior of the dump_reqs sysctl.
                If dump_reqs_alltypes is non-zero, it will dump
                all commands, not just the commands that are in the
                CM_STATE_INQUEUE state.  (You can see the commands
                that are in the queue by using mp{s,r}util debug
                dumpreqs.)

                Make sure that the INQUEUE -> BUSY state transition
                happens in one place, the mp{s,r}_complete_command
                routine.

mp{s,r}_sas.c:  Make sure we print the current command type in
                command state assertions.

mp{s,r}_sas_lsi.c:
                Add a new completion handler,
                mp{s,r}sas_ata_id_complete.  This completion
                handler will free data allocated for an ATA
                Identify command and free the command structure.

                In mp{s,r}_ata_id_timeout, do not set the command
                state to CM_STATE_BUSY.  The command is still in
                queue in the controller.  Since we were blocking
                waiting for this command to complete, there was
                no completion handler previously.  Set the
                completion handler, so that whenever the command
                does come back, it will get freed properly.

                Do not free ATA Identify commands that have timed
                out in mp{s,r}sas_add_device().  Wait for them
                to actually come back from the controller.

mp{s,r}var.h:   Add a dump_reqs_alltypes variable for the new
                dump_reqs_alltypes sysctl.

                Make sure we print the current state for state
                transition asserts.

This was tested in the Spectra Logic test bed (as described in the
review), as well as in Netflix's Open Connect fleet (where panics
dropped from a dozen or two a month to zero).

Reviewed by: imp@ (who is handling the commit with ken's OK)
Sponsored by: Spectra Logic
Differential Revision: https://reviews.freebsd.org/D25476

(cherry picked from commit 175ad3d00318a345790eecf2f5a33cd16b119e55)
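
A minimal sketch of the state handling described in the commit above,
assuming a three-state command with one assertion per transition. The
ex_* names are hypothetical and the model ignores locking, the free queue
and the real completion handlers; it only shows that a timeout marks the
command while the controller still owns it, and that the command is freed
only once it actually comes back.

```c
#include <assert.h>
#include <stdbool.h>

enum ex_cm_state { EX_CM_FREE, EX_CM_BUSY, EX_CM_INQUEUE };

struct ex_command {
	enum ex_cm_state	state;
	bool			timed_out;	/* waiter gave up on this command */
};

/* FREE -> BUSY: the command is handed to a caller. */
void
ex_alloc(struct ex_command *cm)
{
	assert(cm->state == EX_CM_FREE);
	cm->state = EX_CM_BUSY;
	cm->timed_out = false;
}

/* BUSY -> INQUEUE: the command is handed to the controller. */
void
ex_enqueue(struct ex_command *cm)
{
	assert(cm->state == EX_CM_BUSY);
	cm->state = EX_CM_INQUEUE;
}

/* Timeout: only mark the command; the controller still owns it. */
void
ex_timeout(struct ex_command *cm)
{
	assert(cm->state == EX_CM_INQUEUE);
	cm->timed_out = true;
}

/* The controller returned the command: INQUEUE -> BUSY, then free it. */
void
ex_complete(struct ex_command *cm)
{
	assert(cm->state == EX_CM_INQUEUE);
	cm->state = EX_CM_BUSY;
	/* A completion handler would release any attached data here. */
	cm->state = EX_CM_FREE;
}
```

Freeing the command in ex_timeout() instead would let a later
ex_complete() act on a command that is already back on the free list,
which is the use-after-free the commit describes.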

Fix acpica macros that subtract null pointers

Clang 13.0.0 produces a new -Werror warning about the ACPI_TO_INTEGER(p)
and ACPI_OFFSET(d, f) macros in acpica's actypes.h:

    sys/contrib/dev/acpica/components/dispatcher/dsopcode.c:708:31: error: performing pointer subtraction with a null pointer has undefined behavior [-Werror,-Wnull-pointer-subtraction]
        ObjDesc->Region.Address = ACPI_PTR_TO_PHYSADDR (Table);
                                  ^~~~~~~~~~~~~~~~~~~~~~~~~~~~
    sys/contrib/dev/acpica/include/actypes.h:664:41: note: expanded from macro 'ACPI_PTR_TO_PHYSADDR'
    #define ACPI_PTR_TO_PHYSADDR(i)         ACPI_TO_INTEGER(i)
                                            ^~~~~~~~~~~~~~~~~~
    sys/contrib/dev/acpica/include/actypes.h:661:41: note: expanded from macro 'ACPI_TO_INTEGER'
    #define ACPI_TO_INTEGER(p)              ACPI_PTR_DIFF (p, (void *) 0)
                                            ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    sys/contrib/dev/acpica/include/actypes.h:656:82: note: expanded from macro 'ACPI_PTR_DIFF'
    #define ACPI_PTR_DIFF(a, b)             ((ACPI_SIZE) (ACPI_CAST_PTR (UINT8, (a)) - ACPI_CAST_PTR (UINT8, (b))))
                                                                                     ^ ~~~~~~~~~~~~~~~~~~~~~~~~~~
    1 error generated.

This problem of undefined behavior was also reported to acpica by @cem
in 2018: https://github.com/acpica/acpica/issues/407, but it seems there
was never any fix committed for it upstream.

Instead fix these locally, for ACPI_TO_INTEGER by simply casting the
incoming pointer to ACPI_SIZE (which corresponds roughly to uintptr_t
and size_t), and for ACPI_OFFSET by reusing our __offsetof definition
from sys/cdefs.h.

Reviewed by: emaste, kib, imp
Differential Revision: https://reviews.freebsd.org/D31710

(cherry picked from commit 130a690ae16e1b845629e586203b508eff699f38)
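
The two replacements described above can be illustrated with a small
standalone sketch. The EX_* macros are hypothetical stand-ins for the
acpica ones, uintptr_t is used here in place of ACPI_SIZE, and the
standard offsetof() stands in for __offsetof; the old null-pointer
subtraction form appears only in a comment.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Old pattern, undefined behavior (not compiled here):
 *   #define EX_TO_INTEGER(p) ((size_t)((uint8_t *)(p) - (uint8_t *)0))
 */

/* Convert a pointer to an integer with a plain cast. */
#define EX_TO_INTEGER(p)	((uintptr_t)(p))

/* Let the compiler compute the member offset. */
#define EX_OFFSET(d, f)		offsetof(d, f)

/* Hypothetical structure used only to exercise the macros. */
struct ex_region {
	uint32_t	flags;
	uint64_t	address;
};

/* Byte offset of the address member within struct ex_region. */
size_t
ex_region_address_offset(void)
{
	return (EX_OFFSET(struct ex_region, address));
}
```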

Remove -simplifycfg-dup-ret from CLANG_OPT_SMALL flags for clang 13

After llvm/clang 13.0.0, the -simplifycfg-dup-ret backend flag is no
longer supported. This was part of CLANG_OPT_SMALL, which is only still
used for stand/i386/boot2 and stand/i386/isoboot, to achieve the very
small binary size required. Luckily clang 13.0.0 does not need any
additional flags for this (I get 240 bytes available when building
boot2).

(cherry picked from commit 22b8ab15c41a9efac201691b40e961b83698aa9c)

xen: Fix warning by adding KERNBASE to modlist_paddr before casting

Clang 13 produces the following warning for hammer_time_xen():

sys/x86/xen/pv.c:183:19: error: the pointer incremented by -2147483648 refers past the last possible element for an array in 64-bit address space containing 256-bit (32-byte) elements (max possible 576460752303423488 elements) [-Werror,-Warray-bounds]
                    (vm_paddr_t)start_info->modlist_paddr + KERNBASE;
                                ^                               ~~~~~~~~
sys/xen/interface/arch-x86/hvm/start_info.h:131:5: note: array 'modlist_paddr' declared here
    uint64_t modlist_paddr;         /* Physical address of an array of           */
    ^

This is because the expression first casts start_info->modlist_paddr to
struct hvm_modlist_entry * (via vm_paddr_t), and *then* adds KERNBASE,
which is then interpreted as KERNBASE * sizeof(struct
hvm_modlist_entry).

Instead, parenthesize the addition to get the intended result, and cast
it to struct hvm_modlist_entry * afterwards. Also remove the cast to
vm_paddr_t since it is not necessary.

Reviewed by: royger
Differential Revision: https://reviews.freebsd.org/D31711

(cherry picked from commit 8e3c56d6b676a175e974bad4c20797fb35017db8)
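
The difference between the two orderings can be shown with a small
sketch. struct ex_entry is a hypothetical 32-byte stand-in for struct
hvm_modlist_entry, and a small in-bounds byte offset is used in place of
KERNBASE so that both results stay inside one array.

```c
#include <stdint.h>

/* Hypothetical 32-byte element type. */
struct ex_entry {
	uint64_t	words[4];
};

/* Cast first, then add: the offset is scaled by sizeof(struct ex_entry). */
struct ex_entry *
ex_cast_then_add(uintptr_t addr, uintptr_t off)
{
	return ((struct ex_entry *)addr + off);
}

/* Add first, then cast: the offset is applied in bytes, as intended. */
struct ex_entry *
ex_add_then_cast(uintptr_t addr, uintptr_t off)
{
	return ((struct ex_entry *)(addr + off));
}
```

Given a table of 64 entries and an offset of sizeof(struct ex_entry),
the second form yields the address of element 1, while the first form
lands on element 32.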

Don't error out on unused but set variables with clang 13

Clang 13.0.0 now has a -Wunused-but-set-variable warning similar to the
one gcc has had for quite a while. Since this triggers *very* often for
our kernel builds, don't make it a hard error, but leave the warning
visible so there is some incentive to fix the instances.

(cherry picked from commit 395d46caaed73228b84dfaeb37c702304a46ba8f)

Cirrus-CI: reduce VM memory to 8G

Cirrus-CI now provides a task memory use graph, and it is clear we do
not need to provision the VM with 24GB.

Sponsored by: The FreeBSD Foundation

(cherry picked from commit bbf70270551d8defb427316e5d0e0f368b9aac96)

Revert "arm: Bump KSTACK_PAGES default to match i386/amd64"

This reverts commit b684d812fcb04b2997fd755405a92c36b9f6e30e.

It causes an issue on a pfsense routing workload where memory
fragmentation prevents the necessary consecutive pages from being
readily available.

Reported by: pfsense (mjg, scottl)
Approved by: ian
MFC after: 1 day
Differential Revision: https://reviews.freebsd.org/D31244

(cherry picked from commit 5647f85ade3ae1db042560a3354b6a9945d619a4)

Fix a common typo in man pages and src comments

- s/desciptor/descriptor/

(cherry picked from commit b1603638e38b3d8c23da6599e389db9a9218c240)

amd64: remove lfence after swapgs on syscall entry

(cherry picked from commit 7aa47cace14948a7b8277a4b24a0ca9e0308990a)

Style

(cherry picked from commit 31607861e2ea3adae26a4ce3f6fbd61dfbc37894)

pcib(4): Write window registers after resource adjustment

When adjusting resources we should write updated window base/limit into
the registers. Without this newly added address range won't be routed
through the bridge properly.

Use MIN()/MAX() against current window base/limit to not shrink it on
the other side if the window is shared by several resources.

Align the passed resource start/end to the set window granularity to
keep it properly aligned. Currently this is mostly called by other
bridges having the same window alignment, but that may change one day.

Reviewed by: jrtc27, jhb
MFC after: 1 week
Sponsored by: iXsystems, Inc.
Differential Revision: https://reviews.freebsd.org/D31693
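
The MIN()/MAX() and alignment steps described above could look roughly
like the sketch below. struct ex_window and its fields are assumptions
made for illustration (the granularity is stored as a log2 step), and
the register write that would follow is left out.

```c
#include <stdint.h>

#define EX_MIN(a, b)	((a) < (b) ? (a) : (b))
#define EX_MAX(a, b)	((a) > (b) ? (a) : (b))

/* Hypothetical bridge window description. */
struct ex_window {
	uint64_t	base;	/* first address routed through the bridge */
	uint64_t	limit;	/* last address routed through the bridge */
	unsigned	step;	/* log2 of the window granularity */
};

/*
 * Make the window cover [start, end], aligned to the window granularity,
 * without shrinking a side that other resources may still be using.
 */
void
ex_window_cover(struct ex_window *w, uint64_t start, uint64_t end)
{
	uint64_t mask = ((uint64_t)1 << w->step) - 1;

	start &= ~mask;		/* round the start down */
	end |= mask;		/* round the end up to the last byte of a granule */
	w->base = EX_MIN(w->base, start);
	w->limit = EX_MAX(w->limit, end);
	/* The updated base/limit would be written to the bridge registers here. */
}
```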

pci_pci: Support growing windows in bus_adjust_resource for NEW_PCIB

If we allocate a new window for a bridge rather than reusing an existing
one set up by firmware to cover all the devices then the new window only
includes the range needed for the first device to allocate the resource.
If a request comes in to adjust this resource in order to extend a
downstream window for another device then this will fail as the rman
doesn't have any space, so we must first grow the bridge's own window.

This is needed to support successfully attaching more than one PCI
device on SiFive's HiFive Unmatched, which has the following topology:

  Root Port <---> Bridge <---> Bridge <-+-> Bridge <---> (Unused)
   (pcib0)        (pcib1)      (pcib2)  |   (pcib3)
                                        +-> Bridge <---> xHCI
                                        |   (pcib4)
                                        +-> Bridge <---> M.2 E-key
                                        |   (pcib5)
                                        +-> Bridge <---> M.2 M-key
                                        |   (pcib6)
                                        +-> Bridge <---> x16 slot
                                            (pcib7)

Without this, the xHCI endpoint successfully attaches but the NVMe M.2 M-key
endpoint fails to attach as, when its adjacent bridge (pcib6) attempts
to allocate a window from its parent (pcib2) on the other side of the
switch, its parent attempts to grow its own window by calling
bus_adjust_resource on its own parent (pcib1) which fails to call the
root port device (pcib0) to request more memory to grow its own window.
Had the root port been directly connected to the switch without the
bridge in the middle then the existing code would have worked, but the
extra hop broke it.

Reviewed by: jhb
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D31035

pmap: Micro-optimize pmap_remove_pages() on amd64 and arm64

Reduce the live ranges for three variables so that they do not span the
call to PHYS_TO_VM_PAGE(). This enables the compiler to generate
slightly smaller machine code.

Reviewed by: kib, markj

(cherry picked from commit d411b285bc293a37e062d8fb15b85212ce16abab)

Clear the accessed bit when copying a managed superpage mapping

pmap_copy() is used to speculatively create mappings, so those mappings
should not have their accessed bit preset.

Reviewed by: kib, markj

(cherry picked from commit 325ff9327459bc7307130675fa19367ff8b02310)

factor out PT page allocation/freeing

As follow-on work to e4b8deb222278b2a, move page table page
allocation and freeing into their own functions. Use these
functions to provide separate kernel vs. user page table page
accounting, and to wrap common tasks such as management of
zero-filled page state.

Requested by: markj, kib
Reviewed by: kib

(cherry picked from commit c2460d7cfe9fab30459ce495f08544a237a5baa3)

amd64 pmap: convert to counter(9), add PV and pagetable page counts

This change converts most of the counters in the amd64 pmap from
global atomics to scalable counter(9) counters.  Per discussion
with kib@, it also removes the handrolled per-CPU PCID save count
as it isn't considered generally useful.

The bulk of these counters remain guarded by PV_STATS, as it seems
unlikely that they will be useful outside of very specific debugging
scenarios.  However, this change does add two new counters that
are available without PV_STATS.  pv_page_count and pt_page_count
track the number of active physical-to-virtual list pages and page
table pages, respectively.  These will be useful in evaluating
the memory footprint of pmap structures under various workloads,
which will help to guide future changes in this area.

Reviewed by: kib

(cherry picked from commit e4b8deb222278b2a12c9c67021b406625f5be301)

pf tests: Test ALTQ on top of if_bridge

Reviewed by: donner
MFC after: 1 week
Sponsored by: Rubicon Communications, LLC ("Netgate")
Differential Revision: https://reviews.freebsd.org/D31676

(cherry picked from commit 062463698eeafc7f75ce22541a244238f37ef2e2)

if_bridge: add ALTQ support

Similar to the recent addition of ALTQ support to if_vlan.

Reviewed by: donner
Obtained from: pfsense
MFC after: 1 week
Sponsored by: Rubicon Communications, LLC ("Netgate")
Differential Revision: https://reviews.freebsd.org/D31675

(cherry picked from commit eb680a63de1dbf5c974f483975dcb2c60ec6fa08)

pf tests: ALTQ priority test

Test that ALTQ can prioritise one type of traffic over another. Do this
by establishing a slow link and saturating it with ICMP echos.
When prioritised, TCP connections reliably go through. When not
prioritised, TCP connections reliably fail.

MFC after: 1 week
Sponsored by: Rubicon Communications, LLC ("Netgate")

(cherry picked from commit cd46399b9ccf04f6ec00a532e52c8b1edb007af7)

pf tests: test ALTQ CBQ on top of if_vlan

The main purpose of this test is to verify that we can use ALTQ on top
of if_vlan, but while we're here we also exercise the CBQ code. There's
already a basic test for HFSC, so it makes sense to test another
algorithm while we test if_vlan.

Reviewed by: donner
MFC after: 1 week
Sponsored by: Rubicon Communications, LLC ("Netgate")
Differential Revision: https://reviews.freebsd.org/D31649

(cherry picked from commit e62175df4ec2c8fe2aa2e372f683ddb933768e62)

if_vlan: add the ALTQ support to if_vlan.

Inspired by the iflib implementation, allow ALTQ to be used with if_vlan
interfaces.

Reviewed by: donner
Obtained from: pfsense
MFC after: 1 week
Sponsored by: Rubicon Communications, LLC ("Netgate")
Differential Revision: https://reviews.freebsd.org/D31647

(cherry picked from commit 2e5ff01d0a1fabc757252f9c28ad5cddc98a652d)

vmm: Add credential to cdev object

Add a credential to the cdev object in sysctl_vmm_create(), then check
that we have the correct credentials in sysctl_vmm_destroy(). This
prevents a process in one jail from opening or destroying the /dev/vmm
file corresponding to a VM in a sibling jail.

Add regression tests.

Reviewed by: jhb, markj
Sponsored by: The FreeBSD Foundation

(cherry picked from commit a85404906bc8f402318524b4ccd196712fc09fbd)

fsetown: Avoid process group lock recursion

Restore the pre-1d874ba4f8ba behaviour of disassociating the current
SIGIO recipient before looking up the specified process or process
group. This avoids a lock recursion in the scenario where a process
group is configured to receive SIGIO for an fd when it has already been
so configured.

Reported by: pho
Tested by: pho
Reviewed by: kib

(cherry picked from commit 7326e8589cc21431d62f25802eac7c5dd6f74122)

fsetown: Simplify error handling

No functional change intended.

Suggested by: kib
Reviewed by: kib
Sponsored by: The FreeBSD Foundation

(cherry picked from commit a507a40f3b587bde7ab391f8f1400a25f33e65c1)

fsetown: Fix process lookup bugs

- pget()/pfind() will acquire the PID hash bucket locks, which are
  sleepable sx locks, so the sigio mutex cannot be held while calling
  these functions.  Instead, use pget() to hold the process, after
  which we lock the sigio and proc locks, respectively.
- funsetownlst() assumes that processes cannot be registered for SIGIO
  once they have P_WEXIT set.  However, pfind() will happily return
  exiting processes, breaking the invariant.  Add an explicit check for
  P_WEXIT in fsetown() to fix this. [1]

Fixes: f52979098d3c ("Fix a pair of races in SIGIO registration")
Reported by: syzkaller [1]
Reviewed by: kib
Sponsored by: The FreeBSD Foundation

(cherry picked from commit 1d874ba4f8ba58296cd9df611f5346dad8e91664)

libsa: Fix a typo in source code comments

- s/mininum/minimum/

(cherry picked from commit 005fe24f2a4c873a96f446604e0453cf99e9bcd7)

isci(4): Fix a common typo in src comments

- s/exlusive/exclusive/

(cherry picked from commit 2dfcc3a91dd4d21c16269b7add3141c99dfa48ab)

Fix a common typo in source code comments

- s/concurently/concurrently/

(cherry picked from commit 5d785ad65e000f9ff636a777599bfa414b88d970)

amd64: Don't repeat unnecessary tests when cmpset fails

When a cmpset for removing the PG_RW bit in pmap_promote_pde() fails,
there is no need to repeat the alignment, PG_A, and PG_V tests just to
reload the PTE's value. The only bit that we need be concerned with at
this point is PG_M. Use fcmpset instead.

(cherry picked from commit 3687797618b6c978ad733bd206a623e5df47dbe3)
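
The benefit of fcmpset over cmpset can be sketched with C11 atomics,
whose atomic_compare_exchange_weak() likewise writes the current value
back into the expected operand on failure. This is only an analogy for
the kernel primitive: the function name is hypothetical and the EX_PG_*
bit values are illustrative.

```c
#include <stdatomic.h>
#include <stdint.h>

#define EX_PG_RW	UINT64_C(0x002)	/* illustrative writable bit */
#define EX_PG_M		UINT64_C(0x040)	/* illustrative modified bit */

/*
 * Clear the writable bit of an entry that is writable but not yet
 * modified, and return the resulting value.  Tests on bits that cannot
 * change while the lock is held would be done once, before the loop.
 */
uint64_t
ex_clear_rw_if_clean(_Atomic uint64_t *pte)
{
	uint64_t old = atomic_load(pte);

	while ((old & (EX_PG_M | EX_PG_RW)) == EX_PG_RW) {
		/* On failure, old is refreshed; only the M/RW test is repeated. */
		if (atomic_compare_exchange_weak(pte, &old, old & ~EX_PG_RW))
			return (old & ~EX_PG_RW);
	}
	return (old);
}
```

If another agent sets the modified bit between the load and the
exchange, the loop condition fails on the refreshed value and the entry
is left writable.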

amd64: Eliminate a redundant test from pmap_enter_object()

The call to pmap_allow_2m_x_page() in pmap_enter_object() is redundant.
Specifically, even without the call to pmap_allow_2m_x_page() in
pmap_enter_object(), pmap_allow_2m_x_page() is eventually called by
pmap_enter_pde(), so the outcome will be the same. Essentially,
calling pmap_allow_2m_x_page() in pmap_enter_object() amounts to
"optimizing" for the unexpected case.

Reviewed by: kib

(cherry picked from commit b7de535288362b072cf2801007e4d7e0e903d467)

arm64: Sync icache when creating executable superpage mappings

Reviewed by: andrew, kib, markj

(cherry picked from commit 7fb152d22935e014afcad4ddc0b3a7e3c2795762)

arm64: Simplify fcmpset failure in pmap_promote_l2()

When the initial fcmpset in pmap_promote_l2() fails, there is no need
to repeat the check for the physical address being 2MB aligned or for
the accessed bit being set. While the pmap is locked the hardware can
only transition the accessed bit from 0 to 1, and we have already
determined that it is 1 when the fcmpset fails.

(cherry picked from commit 0add3c9945c85c7f766f9225866e99e2a805819b)

On a failed fcmpset don't pointlessly repeat tests

In a few places, on a failed compare-and-set, both the amd64 pmap and
the arm64 pmap repeat tests on bits that won't change state while the
pmap is locked. Eliminate some of these unnecessary tests.

Reviewed by: andrew, kib, markj

(cherry picked from commit e41fde3ed71c1e4fce81eac002c9f5b0926e6c49)

amd64: a simplification to pmap_remove_{all,write}

Eliminate some unnecessary unlocking and relocking when we have to retry
the operation to avoid deadlock. (All of the other pmap functions that
iterate over a PV list already implemented retries without these same
unlocking and relocking operations.)

Reviewed by: kib, markj

(cherry picked from commit 1a8bcf30f97e6153def2af781db2fe54f5c0d106)

arm64: a few simplifications to pmap_remove_{all,write}

Eliminate some unnecessary unlocking and relocking when we have to retry
the operation to avoid deadlock. (All of the other pmap functions that
iterate over a PV list already implemented retries without these same
unlocking and relocking operations.)

Avoid a pointer dereference by using an existing local variable that
already holds the desired value.

Eliminate some unnecessary repetition of code on a failed fcmpset.
Specifically, there is no point in retesting the DBM bit because it
cannot change state while the pmap lock is held.

Reviewed by: kib, markj

(cherry picked from commit 26a357245f2197eea4dbbae0956d5c71ef8ba4f1)

arm64: eliminate a duplicated #define

(cherry picked from commit 19c288b3a6640742ab45200031661fe5be710d7f)

arm64: fix a potential KVA leak in pmap_demote_l1()

In the unlikely event that the 1 GB page mapping being demoted is used
to access the L1 page table page containing the 1 GB page mapping and
the vm_page_alloc() to allocate a new L2 page table page fails, we
would leak a page of kernel virtual address space. Fix this leak.

(cherry picked from commit 5dd84e315a9f777772017f9f628aa67f08a6493a)

arm64: make it possible to define PV_STATS

Remove an #if 0 that results in a compilation error if PV_STATS is
defined. Aside from this #if 0, there is nothing wrong with the
PV_STATS code.

(cherry picked from commit c94249decd16de71a00d837ee132954d9f259e49)

arm64: replace pa_to_pvh() with page_to_pvh() in pmap_remove_l2()

Revise pmap_remove_l2() to use the constant-time function page_to_pvh()
instead of the linear-time function pa_to_pvh().

Reviewed by: kib, markj

(cherry picked from commit 0c188c06c627b5de30eeeeb7cde00d071a80ecfa)

arm64: remove an unneeded test from pmap_clear_modify()

The page table entry for a 4KB page mapping must be valid if a PV entry
for the mapping exists, so there is no point in testing each page table
entry's validity when iterating over a PV list.

Reviewed by: kib, markj

(cherry picked from commit 62ea198e95f139e6b8041ec44f75d65aa26970d0)

arm64: Use page_to_pvh() when the vm_page_t is known

When support for a sparse pv_table was added, the implementation of
pa_to_pvh() changed from a simple constant-time calculation to iterating
over the array vm_phys_segs[].  To mitigate this issue, an alternative
function, page_to_pvh(), was introduced that still runs in constant time
but requires the vm_page_t to be known.  However, three cases where the
vm_page_t is known were not converted to page_to_pvh().  This change
converts those three cases.

Reviewed by: kib, markj

(cherry picked from commit 6f6a166eaf5e59dedb761ea6152417433a841e3b)

Explicitly link zfsd with libspl to avoid undefined references

Because lld 13.0.0 is more strict about undefined references when
linking to shared libraries, it produces the following errors for zfsd:

ld: error: /home/dim/obj/home/dim/src/llvm-13-update/amd64.amd64/tmp/usr/lib/libzfs_core.so: undefined reference to libspl_assertf [--no-allow-shlib-undefined]
ld: error: /home/dim/obj/home/dim/src/llvm-13-update/amd64.amd64/tmp/usr/lib/libnvpair.so: undefined reference to libspl_assertf [--no-allow-shlib-undefined]
ld: error: /home/dim/obj/home/dim/src/llvm-13-update/amd64.amd64/tmp/usr/lib/libavl.so: undefined reference to libspl_assertf [--no-allow-shlib-undefined]
*** [zfsd.full] Error code 1

Fix this by adding libspl (where libspl_assertf lives) to zfsd's LIBADD.

(cherry picked from commit 9fae476669574792d75706a5401bbdc927ab2b9a)

Silence more gtest warnings, now in fusefs tests

Follow up d396c67f26b0 by also silencing warnings about deprecated
implicit copy constructors in the fusefs tests, which use googletest.

Fixes: d396c67f26b0

(cherry picked from commit 5a3a8cb01ab8ef4aa16a1950b1ef804070ce1ac6)

Fix null pointer subtraction in mergesort()

Clang 13 produces the following warning for this function:

lib/libc/stdlib/merge.c:137:41: error: performing pointer subtraction with a null pointer has undefined behavior [-Werror,-Wnull-pointer-subtraction]
        if (!(size % ISIZE) && !(((char *)base - (char *)0) % ISIZE))
                                               ^ ~~~~~~~~~

This is meant to check whether the size and base parameters are aligned
to the size of an int, so use our __is_aligned() macro instead.

Also remove the comment that indicated this "stupid subtraction" was
done to pacify some ancient and unknown Cray compiler, and which has
been there since the BSD 4.4 Lite Lib Sources were imported.

(cherry picked from commit 4e5d32a445f90d37966cd6de571978551654e3f3)
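
An alignment test of the kind described above can be written without
any pointer subtraction. The helper below is a sketch that assumes a
power-of-two alignment and a flat address space; ex_is_aligned() is a
hypothetical stand-in for the __is_aligned() macro, not its definition.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* True if p is a multiple of align, where align is a power of two. */
bool
ex_is_aligned(const void *p, size_t align)
{
	return (((uintptr_t)p & (align - 1)) == 0);
}

/* True when both the element size and the base address are int-aligned. */
bool
ex_can_copy_by_int(const void *base, size_t size)
{
	return (size % sizeof(int) == 0 && ex_is_aligned(base, sizeof(int)));
}
```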

pfctl: build fix

Fix the build issue introduced in e59eff9ad328 (pfctl: fix killing states by ID)

MFC after: 1 week
Sponsored by: Rubicon Communications, LLC ("Netgate")

(cherry picked from commit 9ce320820e6d760df11a88de11fbae024c18d23c)

pf tests: test killing states by ID

MFC after: 1 week
Sponsored by: Rubicon Communications, LLC ("Netgate")

(cherry picked from commit bbf832f34479d19bff0fa8dc43b48ab5553cc85e)

pfctl: fix killing states by ID

Since the conversion to the new DIOCKILLSTATESNV the kernel no longer
expects the id and creatorid to be big-endian.
As a result killing states by id (e.g. `pfctl -k id -k 12345`) no longer
worked.

Reported by: Özkan KIRIK
MFC after: 1 week
Sponsored by: Rubicon Communications, LLC ("Netgate")

(cherry picked from commit e59eff9ad3285838730acf48f6d066cec0e53114)

pmap_extract.9: Fix pmap_extract_and_hold()'s function type

pmap_extract_and_hold() returns a vm_page_t instead of a physical page
address.

Sponsored by: The FreeBSD Foundation
Reviewed by: alc
Differential Revision: https://reviews.freebsd.org/D31691

(cherry picked from commit 6e1df1d14c6dfcc209c1416ec0832e4d08191c72)

Fix some common typos in source code comments

- s/priviledged/privileged/
- s/funtion/function/
- s/doens't/doesn't/
- s/sychronization/synchronization/

(cherry picked from commit 5bdf58e196096993758b3e50291db17104025b65)

sound(4): Fix some common typos in comments

- s/doens't/doesn't/
- s/apropriate/appropriate/
- s/intepretation/interpretation/

(cherry picked from commit 58d868c88d21b46d3d6d40a2920e7ba8996723b8)

inet(3): Fix a few common typos in source code comments

- s/funtion/function/

(cherry picked from commit 586c9dc37470a2b862b50c041d70229026dd530a)

inet6(4): Fix a few common typos in source code comments

- s/reshedule/reschedule/

(cherry picked from commit 10e0082fff4ec9392db2763ce3b095bc010526df)

ktls: Fix accounting for TLS 1.0 empty fragments.

TLS 1.0 empty fragment mbufs have no payload and thus m_epg_npgs is
zero.  However, these mbufs need to occupy a "unit" of space for the
purposes of M_NOTREADY tracking similar to regular mbufs.  Previously
this was done for the page count returned from ktls_frame() and passed
to ktls_enqueue() as well as the page count passed to pru_ready().

However, sbready() and mb_free_notready() only use m_epg_nrdy to
determine the number of "units" of space in an M_EXT mbuf, so when a
TLS 1.0 fragment was marked ready it would mark one unit of the next
mbuf in the socket buffer as ready as well.  To fix, set m_epg_nrdy to
1 for empty fragments.  This actually simplifies the code as now only
ktls_frame() has to handle TLS 1.0 fragments explicitly and the rest
of the KTLS functions can just use m_epg_nrdy.

Reviewed by: gallatin
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D31536

(cherry picked from commit d16cb228c1a62a9641ffb2f0bfcacc3bffec5db1)

ktls: Init reset tag task for cloned sessions

When cloning a ktls session (which is needed when we need to
switch output NICs for a NIC TLS session), we need to also
init the reset task, like we do when creating a new tls session.

Reviewed by: jhb
Sponsored by: Netflix

(cherry picked from commit 95c51fafa40d56d0a32aff857261097acc65ec92)

ktls: Don't mark existing received mbufs notready for TOE TLS.

The TOE driver might receive decrypted TLS records that are enqueued
to the socket buffer after ktls_try_toe() returns and before
ktls_enable_rx() locks the receive buffer to call sb_mark_notready().
In that case, sb_mark_notready() would incorrectly treat the decrypted
TLS record as an encrypted record and schedule it for decryption.
This always resulted in the connection being dropped as the data in
the control message did not look like a valid TLS header.

To fix, don't try to handle software decryption of existing buffers in
the socket buffer for TOE TLS in ktls_enable_rx(). If a TOE TLS
driver needs to decrypt existing data in the socket buffer, the driver
will need to manage that in its tod_alloc_tls_session method.

Sponsored by: Chelsio Communications

(cherry picked from commit faf0224ff27b93b743d50b3830bf5ce345b67e94)

kdb: Handle process enumeration before procinit()

Make kdb_thr_first() and kdb_thr_next() return sane values if the
allproc list and pidhashtbl haven't been initialized yet. This can
happen if the debugger is entered very early on, for example with the
'-d' boot flag.

This allows remote gdb to attach at such a time, and fixes some ddb
commands like 'show threads'.

Be explicit about the static initialization of these variables. This
part has no functional change.

Reviewed by: markj, imp (previous version)
MFC after: 3 days
Differential Revision: https://reviews.freebsd.org/D31495

(cherry picked from commit 4ccaa87f695c8b9eb31f2ba9ce4913a045755fe0)

pmc(3): remove Pentium-related man pages and references

Support for Pentium events was removed completely in e92a1350b50e.

Don't bump .Dd where we are just removing xrefs.

Reviewed by: emaste
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D31423

(cherry picked from commit d78896e46f1d7744155919476c012400321d0dab)

arm: enable stack-smashing protection

With current generation clang/llvm it can pass all of our tests in
libc/ssp.

While here, remove the extra MACHINE_CPUARCH check for mips. SSP is
included in BROKEN_OPTIONS for this architecture in src.opts.mk, which
is enough to ensure normal builds won't set SSP_CFLAGS.

Reviewed by: kevans, imp, emaste
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D31400

(cherry picked from commit 1b8db4b4e3614ef6334ce776dcdd46fe7f2c5a78)

llvm/powerpc64*: fix broken binaries generated by clang12

Amends LLVM commit 2518433f861fcb877d0a7bdd9aec1aec1f77505a that
was identified as the source of the regression in LLVM 12.

This affects powerpc64*, making binaries crash with segmentation fault
due to bad code generation around "__stack_chk_guard"

Root cause and/or proper fix is under investigation by:
    https://bugs.llvm.org/show_bug.cgi?id=51590

Reviewed by:    dim
MFC after:      2 days
Sponsored by:   Instituto de Pesquisas Eldorado (eldorado.org.br)
Differential Revision:  https://reviews.freebsd.org/D31698

(cherry picked from commit 9a4d48a645a7a3ebee05fae25afd154a132b638a)

virtio-modern: fix PCI common read/write functions on big endian targets

Virtio modern has the common data organized in little endian, but
on powerpc64 BE it was reading and writing in the wrong endian.

Submitted by: Leonardo Bianconi <leonardo.bianconi@eldorado.org.br>
Reviewed by: bryanv, alfredo
Sponsored by: Eldorado Research Institute (eldorado.org.br)
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D28947

(cherry picked from commit fb53b42e36a9745d0a33821175a962c7a15eeeaa)
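
The endianness issue in the virtio-modern entry above can be
illustrated with a small userland sketch. It is not the driver code:
the helper names le32_load() and le32_store() are hypothetical, and the
driver presumably uses the kernel's own byte-order helpers. The point
is only that a little-endian field must be decoded byte by byte (or
swapped) rather than read in host order, so that big-endian hosts see
the same value.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Decode a 32-bit little-endian field from a byte buffer.  The result
 * is the same on little-endian and big-endian hosts because the bytes
 * are combined explicitly instead of being read in host order.
 */
uint32_t
le32_load(const uint8_t *p)
{
	assert(p != NULL);
	return ((uint32_t)p[0] | ((uint32_t)p[1] << 8) |
	    ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24));
}

/* Encode a host-order value as a 32-bit little-endian field. */
void
le32_store(uint8_t *p, uint32_t v)
{
	assert(p != NULL);
	p[0] = (uint8_t)(v & 0xff);
	p[1] = (uint8_t)((v >> 8) & 0xff);
	p[2] = (uint8_t)((v >> 16) & 0xff);
	p[3] = (uint8_t)((v >> 24) & 0xff);
}
```

A plain host-order load of the same four bytes would return the
byte-swapped value on a big-endian machine, which matches the symptom
described in the commit message.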

Add more arm64 external abort sources

These will be used when we support the Arm Reliability, Availability,
and Serviceability extension.

Sponsored by: The FreeBSD Foundation

(cherry picked from commit dcfd60587102b6854cda04a7c59c8de51ecf89b3)

Only store the arm64 ID registers in the cpu_desc

There is no need to store a pointer to the CPU implementer and part
strings. Switch to load them directly into the sbuf used to print them
on boot.

While here print the machine ID register when we fail to determine the
implementer or part we are booting on.

Reviewed by: markj, kib
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D31346

(cherry picked from commit 8b3bd5a2b571e2681c96dbe5b4a04529ef0d7f53)

Move setting arm64 HWCAP values to the ID tables

The HWCAPS values are based on the ID registers. Move setting these
to the existing ID register parsing code.

Previously we needed to handle every possible ID field value for which
a HWCAP is set. However, most ID fields follow a scheme in which each
increment of the field only adds new features, so we only need to
check that the field is at least the value at which the HWCAP feature
was added.

While here, stop setting HWCAP values that need kernel support when
that support is missing.

Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D31201

(cherry picked from commit 1a78f44cd205b6f9ca11ef5cdb6e1f32c0134193)
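
The "field is at least the value where the feature was added" rule
from the HWCAP entry above can be sketched as a table-driven check.
This is an illustration only: struct hwcap_rule, hwcaps_from_id() and
the HWCAP bit values are hypothetical and are not taken from the
kernel sources. It also assumes unsigned 4-bit fields, which the
commit message says holds for most, not all, ID fields.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical rule: set 'hwcap' once the ID field reaches 'min_value'. */
struct hwcap_rule {
	unsigned	shift;		/* bit offset of the 4-bit ID field */
	unsigned	min_value;	/* field value that introduced the feature */
	unsigned long	hwcap;		/* capability bit(s) to report */
};

/*
 * Walk the rule table and collect every capability whose ID field is
 * greater than or equal to the value at which it was introduced.
 */
unsigned long
hwcaps_from_id(uint64_t reg, const struct hwcap_rule *rules, size_t nrules)
{
	unsigned long caps = 0;

	assert(rules != NULL || nrules == 0);
	for (size_t i = 0; i < nrules; i++) {
		unsigned field = (unsigned)((reg >> rules[i].shift) & 0xf);

		if (field >= rules[i].min_value)
			caps |= rules[i].hwcap;
	}
	return (caps);
}
```

With two rules on the same field (one added at value 1, one at value
2), a register whose field reads 2 reports both capabilities without
the table having to list each field value separately.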

Add missing arm64 ID registers

These may contain values we export to userspace.

Sponsored by: The FreeBSD Foundation

(cherry picked from commit 2d6d5f88d16fc43b6e7ce2b71136ec6b04d10e6e)

Sort the arm64 ID_AA64* user registers

Sponsored by: The FreeBSD Foundation

(cherry picked from commit c3f2fcf5b90991c0155ed64bbf3d9468c3033ebc)

Add macros for arm64 special reg op and CR values

Use these to simplify the definition of the user_regs array.

Reviewed by: imp, markj
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D31333

(cherry picked from commit 10f6680faae0177cb5ab18754fb27dbb8a0cf226)

Teach the arm64 kernel to identify the Arm AEM

The Arm Architecture Envelope Model is a simulator that models the
architecture rather than any specific implementation. Add its part ID
macro and add it to the list of Arm CPUs we can decode.

Sponsored by: The FreeBSD Foundation

(cherry picked from commit 2531f067ea0e9c77b081445800de8e9584d7d4ab)

Switch to an ifunc in the kernel for crc32c

There is no need to read the same variable to check if the CPU supports
crc32c instructions.

Reviewed by: arichardson, kib, markj
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D31274

(cherry picked from commit a93941b439fce7047dffad6bc380cc9454b967cd)

Start to clean up arm64 address space selection

On arm64 we should use bit 55 of the address to decide if an address
is a user or kernel address. Add a new macro with this check and a
second to ensure the address is in the canonical form, i.e.
the top bits are all zero or all one.

This will help with supporting future cpu features, including Top
Byte Ignore, Pointer Authentication, and Memory Tagging.

Reviewed by: kib
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D31179

(cherry picked from commit b7a78d573a566c77117af5fe0b259620934a5c7f)
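
The two checks described in the address space entry above might look
roughly like the following. The commit message does not show the
macros it adds, so the names here are hypothetical, and the mask
assumes a 48-bit virtual address space (top 16 bits must match); a
kernel built for a different address size would use a different mask.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical names; assumes 48-bit virtual addresses. */
#define	SKETCH_ADDR_BIT55	(UINT64_C(1) << 55)
#define	SKETCH_ADDR_TOP_MASK	UINT64_C(0xffff000000000000)

/* Bit 55 must be one of the bits covered by the canonical-form mask. */
static_assert((SKETCH_ADDR_TOP_MASK & SKETCH_ADDR_BIT55) != 0,
    "bit 55 must be among the top bits");

/* Bit 55 selects the half of the address space: set means kernel. */
bool
sketch_addr_is_kernel(uint64_t va)
{
	return ((va & SKETCH_ADDR_BIT55) != 0);
}

/* Canonical form: the top bits are all zero or all one. */
bool
sketch_addr_is_canonical(uint64_t va)
{
	uint64_t top = va & SKETCH_ADDR_TOP_MASK;

	return (top == 0 || top == SKETCH_ADDR_TOP_MASK);
}
```

Keeping the two checks separate is what lets features that reuse the
upper bits coexist with the user/kernel test: an address whose top
byte carries a tag is still classified by bit 55, while the canonical
check reports that its top bits are not uniform.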

Split out the arm64 ID field comparison function

This will be useful in an update for finding which HWCAPS to set.

Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D31200

(cherry picked from commit 04f6015706f73c90ba78699953d0d4d0b0237298)

altq: Fix panics on rmc_restart()

rmc_restart() is called from a timer, but can trigger traffic. This
means the curvnet context will not be set.
Use the vnet associated with the interface we're currently processing to
set it. We also have to enter net_epoch here, for the same reason.

Reviewed by: mjg
MFC after: 1 week
Sponsored by: Rubicon Communications, LLC ("Netgate")
Differential Revision: https://reviews.freebsd.org/D31642

(cherry picked from commit 159258afb50ad57f7ed27fe86ded83a7b3a26f90)

nfsstat(1): Fix a typo in an error message

- s/priviledged/privileged/

(cherry picked from commit 72a92f91f466fdb73071ec28982b9f4d4cf9b672)

kern: mountroot: avoid fd leak in .md parsing

parse_dir_md() opens /dev/mdctl but only closes the resulting fd on
success, not upon failure of the ioctl or when we exceed the md unit
max.

Sponsored by: NetApp, Inc.
Sponsored by: Klara, Inc.
X-NetApp-PR: #62

(cherry picked from commit 23ecfa9d5bc4f04eb58e26018c2d15f032d5d742)
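
The class of leak fixed in the mountroot entry above is commonly
avoided with a single exit path that always closes the descriptor. The
sketch below is not parse_dir_md(): with_device() is a hypothetical
helper, and the callback 'op' merely stands in for the ioctl, since
the real request is not shown in the commit message.

```c
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <unistd.h>

/*
 * Open 'path', validate 'unit', run the optional operation and close
 * the descriptor on every path, including the two failure cases.
 * Returns 0 on success and -1 on any failure.
 */
int
with_device(const char *path, int unit, int max_unit, int (*op)(int fd))
{
	int error, fd;

	assert(path != NULL);
	fd = open(path, O_RDONLY);
	if (fd == -1)
		return (-1);

	error = -1;
	if (unit > max_unit)		/* unit limit exceeded */
		goto out;
	if (op != NULL && op(fd) != 0)	/* stand-in for a failing ioctl */
		goto out;
	error = 0;
out:
	close(fd);			/* reached on success and failure */
	return (error);
}
```

Routing both error cases through the same label means a later change
that adds another failure check cannot forget the close().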

freebsd-update: create a ZFS boot environment on install

Updated freebsd-update to allow it to create boot environments using
bectl should the system support it. The bectl utility was updated in
r352211 (490e13c1403f) to support a 'check' to determine if the system
supports boot environments. If UFS is used, the bectl check will fail
and no attempt will be made to create the boot environment.

If freebsd-update is run inside a jail, no attempt will be made to
create a boot environment.

The boot environment function will create a new environment using the
format: current FreeBSD kernel version and date/timestamp, example:

12.0-RELEASE-p10_2019-10-03_185233

This functionality can be disabled by setting 'CreateBootEnv' in
freebsd-update.conf to 'no'.

(cherry picked from commit f28f138905416c45ebaa6429f44a0b88a72f54b1)
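
The naming format from the freebsd-update entry above (version string,
underscore, date/timestamp) can be reproduced as follows. freebsd-update
itself is a shell script, so this C function only sketches the format;
be_name() is a hypothetical helper and the strftime pattern is inferred
from the single example name given in the commit message.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>
#include <time.h>

/*
 * Build "<version>_<YYYY-MM-DD>_<HHMMSS>" into 'buf'.
 * Returns 0 on success, -1 if the version is empty or 'buf' is too small.
 */
int
be_name(char *buf, size_t len, const char *version, const struct tm *when)
{
	char stamp[32];
	int n;

	assert(buf != NULL && version != NULL && when != NULL);
	if (strlen(version) == 0)
		return (-1);
	/* Pattern inferred from the example in the commit message. */
	if (strftime(stamp, sizeof(stamp), "%Y-%m-%d_%H%M%S", when) == 0)
		return (-1);
	n = snprintf(buf, len, "%s_%s", version, stamp);
	return ((n < 0 || (size_t)n >= len) ? -1 : 0);
}
```

Because the timestamp has second resolution, two runs on the same
patch level get distinct names unless they start within the same
second.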

bhyve: Use pci(4) to access I/O port BARs

This removes the dependency on /dev/io.

PR: 251046
Reviewed by: jhb
Sponsored by: The FreeBSD Foundation

(cherry picked from commit 42375556e5b2e68746d999b43d124040b6affb91)

pci: Add an ioctl to perform I/O to BARs

This is useful for bhyve, which otherwise has to use /dev/io to handle
accesses to I/O port BARs when PCI passthrough is in use.

Reviewed by: imp, kib
Discussed with: jhb
Sponsored by: The FreeBSD Foundation

(cherry picked from commit 7e14be0b0717105f4b3b8c62df82a1e883d8ebb6)

Cleanup compiler warning flags in lib/libefivar/Makefile

There is no need to set -Wno-unused-parameter twice, and instead of
appending to CFLAGS, append to CWARNFLAGS instead. While here, add
-Wno-unused-but-set-variable for the sake of clang 13.0.0.

(cherry picked from commit f643997a1761689ef9108d69b86d42881ea0ab1c)

googletest: Silence warnings about deprecated implicit copy constructors

Our copy of googletest is rather stale, and causes a number of -Werror
warnings about implicit copy constructor definitions being deprecated,
because several classes have user-declared copy assignment operators.
Silence the warnings until we either upgrade or remove googletest.

(cherry picked from commit d396c67f26b079c2808002c07212d9df9818a11b)

Enable rc.d/jail within jails

Jails within jails are supported. This change allows the script to run
upon startup within a jail. Without this, jails are not automatically
started within jails.

(cherry picked from commit 35cf9fecbd80f56e39524f480240acfd953c93e1)

ext2fs(5): Correct a typo in an error message

- s/talbes/tables/

(cherry picked from commit 47f880ebeb3092b1b7bbc6d75e82532e43bbf010)

loader: loader_lua can run command_more twice

When we quit the pager, the value 1 is returned and command_more()
interprets it as an error.

When the lua loader gets an error from a command, it will try to
interpret it once more, so we get the same file shown once more.

There is no reason why we should return error from command_more().

(cherry picked from commit 7b0d05d56dfaad4e1d5a19727e34252072913d17)

ls: prevent no-color build from complaining when COLORTERM is non-empty

(cherry picked from commit ced2dcadccfcff8f7991b3cb5f6f70d6710eadfb)

sh: fix NO_HISTORY build

(cherry picked from commit 35b253d9d238c46a710a0cd7ba027ec87ba7c8ba)

mount.h: improve a comment about flags

(cherry picked from commit c66e9307ea9520f3d6e4d38dc842a99a31ae80bf)

style.9: remove an outdated comment about indent(1)

(cherry picked from commit f49931c1423e1c9454214f82bbb3ec30d0fee57d)

fstyp: add BeFS support

Simple support for detecting the BeFS (BeOS) filesystem.

Submitted by: Miguel Gocobachi
Differential Revision: https://reviews.freebsd.org/D29917

(cherry picked from commit 0e92585cde5101506720ca1b904372317b7d84b6)

loader: FB console does leave garbage on screen while scrolling

Scrolling screen will leave "trail" of chars from first column.
Apparently caused by cursor location mismanagement.
Make sure we do not [attempt to] set cursor out of the screen.

(cherry picked from commit e5a50b03297fa09652b3cf319bc6e863392554e0)

sched_ule(4): Pre-seed sched_random().

I don't think it changes anything, but why not.

While there, make cpu_search_highest() use all 8 lower load bits for
noise, since it does not use cs_prefer and the code is not shared
with cpu_search_lowest() any more.

MFC after: 1 month

(cherry picked from commit ca34553b6f631ec4ec5ae9f3825e3196e172c35d)

sched_ule(4): Use trylock when stealing load.

On some load patterns it is possible for several CPUs to try to steal
a thread from the same CPU despite the randomization introduced.  It
may cause significant lock contention when an idle thread holding one
queue lock tries to acquire another one.  Using trylock on the remote
queue both reduces the contention and makes lock ordering easier to
handle.  If we can't get the lock inside tdq_trysteal() we just return,
allowing tdq_idled() to handle it.  If it happens in tdq_idled(), then
we repeat the search for load, skipping this CPU.

On 2-socket 80-thread Xeon system I am observing dramatic reduction
of the lock spinning time when doing random uncached 4KB reads from
12 ZVOLs, while IOPS increase from 327K to 403K.

MFC after: 1 month

(cherry picked from commit 8bb173fb5bc33a02d5a4670c9a60bba0ece07bac)

sched_ule(4): Reduce duplicate search for load.

When sched_highest() called for some CPU group returns nothing, idle
thread calls it for the parent CPU group. But the parent CPU group
also includes the CPU group we've just searched, and unless there is
a race going on, it is unlikely we find anything new this time.

Avoid the double search when the parent group has only two subgroups
(the most prominent case). Instead of escalating to the parent group,
run the next search over the sibling subgroup and escalate two levels
up if that fails too. With more than two siblings the difference is
less significant, while searching the parent group can result in a
better decision if we find several candidate CPUs.

On 2-socket 40-core Xeon system I am measuring ~25% reduction of CPU
time spent inside cpu_search_highest() in both SMT (2x20x2) and non-
SMT (2x20) cases.

MFC after: 1 month

(cherry picked from commit 2668bb2add8d11c56524ce9014b510412f8f6aa9)

Refactor/optimize cpu_search_*().

Remove cpu_search_both(), unused for many years.  Without it there is
less sense for the trick of compiling common cpu_search() into separate
cpu_search_lowest() and cpu_search_highest(), so split them completely,
making code more readable.  While there, split iteration over children
groups and CPUs, complicating code for very small deduplication.

Stop passing cpuset_t arguments by value and avoid some manipulations.
Since the MAXCPU bump from 64 to 256, what was a single register turned
into a 32-byte memory array, requiring memory allocation and accesses.
Splitting struct cpu_search into parameter and result parts reduces
stack usage even further, since the first can be passed through on
recursion.

Remove CPU_FFS() from the hot paths, precalculating first and last CPU
for each CPU group in advance during initialization.  Again, it was
not a problem for 64 CPUs before, but for 256 FFS needs much more code.

With these changes on 80-thread system doing ~260K uncached ZFS reads
per second I observe ~30% reduction of time spent in cpu_search_*().

MFC after: 1 month

(cherry picked from commit aefe0a8c32d370f2fdd0d0771eb59f8845beda17)

AMD-vi: Fortify IVHD device_identify process

- Use malloc(9) to allocate ivhd_hdrs list. The previous assumption
  that there are at most 10 IVHDs in a system is not true. A counter
  example would be a system with 4 IOMMUs, and each IOMMU is related
  to IVHDs type 10h, 11h and 40h in the ACPI IVRS table.
- Always scan through the whole ivhd_hdrs list to find IVHDs that have
  the same DeviceId but a lower-priority IVHD type.

Sponsored by: The FreeBSD Foundation
MFC with: 74ada297e897
Reviewed by: grehan
Approved by: lwhsu (mentor)
Differential Revision: https://reviews.freebsd.org/D29525

(cherry picked from commit 6fe60f1d5c39c94fc87534e9dd4e9630594e0bec)

vmm: Bump vmname buffer in struct vm to VM_MAX_NAMELEN + 1

In hw.vmm.create sysctl handler the maximum length of vm name is
VM_MAX_NAMELEN. However in vm_create() the maximum length allowed is
only VM_MAX_NAMELEN - 1 chars. Bump the length of the internal buffer to
allow the length of VM_MAX_NAMELEN for vm name.

Reviewed by: grehan
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D31372

(cherry picked from commit df95cc76affbbf114c9ff2e4ee011b6f162aa8bd)
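
The off-by-one described in the vmm entry above comes down to whether
the buffer has room for a maximum-length name plus its terminating
NUL. A minimal sketch follows; the struct, the function and the small
limit of 8 are hypothetical stand-ins, not the vmm definitions.

```c
#include <assert.h>
#include <string.h>

#define	SKETCH_MAX_NAMELEN	8	/* stand-in for VM_MAX_NAMELEN */

struct sketch_vm {
	/* +1 so a name of exactly SKETCH_MAX_NAMELEN chars still fits. */
	char	name[SKETCH_MAX_NAMELEN + 1];
};

/* Store 'name' if it is non-empty and at most SKETCH_MAX_NAMELEN long. */
int
sketch_vm_set_name(struct sketch_vm *vm, const char *name)
{
	size_t n;

	assert(vm != NULL && name != NULL);
	n = strlen(name);
	if (n == 0 || n > SKETCH_MAX_NAMELEN)
		return (-1);
	memcpy(vm->name, name, n + 1);	/* copies the terminating NUL too */
	return (0);
}
```

Had the buffer been declared with SKETCH_MAX_NAMELEN bytes instead,
the longest name accepted by the length check would not fit together
with its terminator, which is the mismatch between the sysctl handler
and vm_create() that the commit removes.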

tmpfs: Fix error being cleared after commit c12118f6cec0

In tmpfs_link() error was erroneously cleared in commit c12118f6cec0.

Sponsored by: The FreeBSD Foundation
MFC with: c12118f6cec0

(cherry picked from commit a48416f844e3007b4e9f6df125e25cf3a1daad62)

tmpfs: Fix styles

A lot of return statements were in the wrong style before this commit.

Sponsored by: The FreeBSD Foundation

(cherry picked from commit c12118f6cec0ca5f720be6c06d6c59d551461e5a)

sound: Add an example of basic sound application

This is an example demonstrating the usage of the OSS-compatible APIs
provided by the sound(4) subsystem. It reads frames from a dsp node and
writes them to the same dsp node.

Reviewed by: hselasky, bcr
Differential revision: https://reviews.freebsd.org/D30149

(cherry picked from commit 21d854658801f6ddb91de3a3c3384e90f5d920f2)

vfs: Add get_write_ioflag helper to calculate ioflag

Converted vn_write to use this helper.

Sponsored by: The FreeBSD Foundation
Reviewed by: kib
Differential Revision: https://reviews.freebsd.org/D31513

(cherry picked from commit c15384f8963191a238cb4a33382b4d394f1ac0b4)