CyberLeo.Net >> Repos - FreeBSD/FreeBSD.git/log

procctl(PROC_ASLR_STATUS): fix vmspace leak

(cherry picked from commit 0bdb2cbf9d7c4366a0668b4563c8630538a50086)

Use sleepq_signal(SLEEPQ_DROP) in cv_signal().

As with the preceding wakeup_one()/wakeup_any() commit, this reduces the
lock hold time and thus contention.

MFC after: 1 week

(cherry picked from commit 63ca9ea4f34d887b66c7b9f1710f5e4be543ebed)

x86: use ANSI C definition style for trap_fatal

PR: 257062

(cherry picked from commit 55e63ed307fb099722cf6d30a18c9badab9b5d03)

amd64 pmap: unexpand the NBPDR macro definition

(cherry picked from commit fdc71fa112d66c7c0aba9ff80adc7b8bb22ea6ca)

amd64 locore.S: trim .globl list from symbols gone for long time

(cherry picked from commit 9dc715230ccab1c3ad17f076379d29a017059030)

amd64 mpboot.S: fix typo in comment

(cherry picked from commit 71463a34ab3f65ff109b529f2fae93b694b73fdd)

amd64 locore.S: add FF copyright for LA57 work

(cherry picked from commit 63664df72036dc8ee99bd83fecc91faf167fa232)

loader: Don't reserve space for symbols twice.

The current code bumps lastaddr twice for the symbol table
location. However, the first bump is bogus and results in wasted
space. Remove it.

PR: 110995
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D31017

(cherry picked from commit 297e9f364b5aa243572ee52b1faef9b3542c1c9e)

loader: update autoboot description and move to loader.conf.5

Document "NO" special value for the autoboot_delay and move the
description to loader.conf.5.

imp reworked some of the wording from danger's patch.

Reviewed by: imp
PR: 85128
Differential Revision: https://reviews.freebsd.org/D11887

(cherry picked from commit 71f6aea4150c66784cbad42c1e1ff908d909c2ec)

MINIMAL: remove debugging and some loadable network modules

Remove debugging stuff, since it's arguably not needed in a minimal
setup. Also remove vlan, tuntap and gif, since they can be loaded.

imp didn't include the part of the patch that removed xen guest support.
Xen guest is relatively small and has no way of being loaded.

Reviewed by: imp
PR: 229564
MFC After: 3 days

(cherry picked from commit b21f19c9e0b7f3c923d845e9e31c0552f0162a62)

nanobsd: enhance fill_pkg.sh

NanoBSD has a helper script, "fill_pkg.sh", which links all packages and
their dependencies from a "package dump" (like /usr/ports/packages/All) to
a specified directory. fill_pkg.sh has some limitations:

1) It needs a ports tree, which must have exactly the same versions as the
   "package dump".
2) It requires full paths to the needed ports, including the "/usr/ports" part.
3) It makes assumptions about the Nano Package Dir (it assumes that it is
   specified relative to the current directory).
4) It has almost no diagnostics.

This PR enhances "fill_pkg.sh" script in several ways:

1) Nano package dir can be an absolute path.
2) The script understands four ways to specify "root" ports/packages:
   (a) Absolute directory with port (old one)
   (b) Relative directory with port, relative to ${PORTSDIR} or /usr/ports
   (c) Absolute path to file with package (with .tbz suffix)
   (d) Name of package in dump dir, with or without .tbz suffix

   These ways can be mixed in one call. Dependencies for
   packages are obtained with 'pkg_info -r' call, and are searched for
   in same directory as "parent" package. Dependencies for ports are
   obtained in old way from port's Makefile.
3) Three levels of diagnostics are added (the -v option can be repeated).
4) All path variables are enclosed in quotes, so the script works with paths
   containing spaces.

Note: imp merged in the changes to fill_pkg.sh since this has been a PR.

PR: 151695
Reviewed by: imp@
MFC After: 3 days
Differential Revision: https://reviews.freebsd.org/D31101

(cherry picked from commit 36cfb5d50f8e8856695780a6792fb7e81816e9ee)

loader: support.4th resets the read buffer incorrectly

Large nextboot.conf files (over 80 bytes) are not read correctly by the
Forth loader, causing file parsing to abort, and nextboot configuration
fails to apply.

Simple repro:

nextboot -e foo=xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
shutdown -r now

That will trigger the bug and cause a parse failure, but shouldn't otherwise
affect the boot.  Depending on your loader configuration, you may also
have to set beastie_disable and/or reduce the number of modules loaded
to see the error on a small console screen.  12.0 or CURRENT users will
also have to explicitly use the Forth loader instead of the Lua loader.
The error will look something like:

Warning: syntax error on file /boot/loader.conf.local
foo="xxxxxxxxxxxxxxnextboot_enable="YES"
                                    ^

/boot/support.4th has crude file I/O buffering, which uses a buffer
'read_buffer', defined to be 80 bytes by the 'read_buffer_size'
constant.  The loader first tastes nextboot.conf, reading and parsing
the first line in it for nextboot_enable="YES".  If this is true, then
it reopens the file and parses it like other loader .conf files.

Unfortunately, the file I/O buffering code does not fully reset the
buffer state in the reset_line_reading word.  If the last file was read
to the end, that doesn't matter; the file buffer is treated as empty
anyway.  But in the nextboot.conf case, the loader will not read to the
end of file if it is over 80 bytes, and the file buffer may be reused
when reading the next file.  When the file is reread, the corrupt text
may cause file parsing to abort on bad syntax (if the corrupt line has
<>2 quotes in it), the wrong variable to be set, no variable to be set
at all, or (if the splice happens to land at a line ending) something
approximating normal operation.
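
The splice described above can be modeled outside the loader. The following is only a C sketch of the failure mode, not the Forth code from support.4th: the struct and function names are hypothetical, and only the 80-byte buffer size is taken from the text. A reader that keeps its buffer position across a "reopen" continues from stale bytes, while a full reset starts clean.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define	READ_BUFFER_SIZE	80	/* matches the size quoted above */

/* Hypothetical in-memory "file" and buffered reader state. */
struct file {
	const char *data;
	size_t len, off;
};
struct reader {
	char buf[READ_BUFFER_SIZE];
	size_t pos, end;	/* unread bytes are buf[pos..end) */
};

/* Full reset: forget any buffered bytes. */
void
reset_full(struct reader *r)
{
	r->pos = r->end = 0;
}

static void
refill(struct reader *r, struct file *f)
{
	size_t n = f->len - f->off;

	if (n > READ_BUFFER_SIZE)
		n = READ_BUFFER_SIZE;
	memcpy(r->buf, f->data + f->off, n);
	f->off += n;
	r->pos = 0;
	r->end = n;
}

/* Read one line (newline stripped) into out; returns its length. */
size_t
read_line(struct reader *r, struct file *f, char *out, size_t cap)
{
	size_t n = 0;

	assert(cap > 0);
	for (;;) {
		char c;

		if (r->pos == r->end) {
			refill(r, f);
			if (r->end == 0)
				break;		/* end of file */
		}
		c = r->buf[r->pos++];
		if (c == '\n')
			break;
		if (n + 1 < cap)
			out[n++] = c;
	}
	out[n] = '\0';
	return (n);
}

/*
 * "Taste" the first line, rewind the file as if it were reopened, then
 * read the first line again.  With full_reset == 0 the stale buffer
 * contents are consumed first, which models the bug.
 */
size_t
reread_first_line(const char *text, int full_reset, char *out, size_t cap)
{
	struct file f = { text, strlen(text), 0 };
	struct reader r;

	reset_full(&r);
	read_line(&r, &f, out, cap);
	f.off = 0;
	if (full_reset)
		reset_full(&r);
	return (read_line(&r, &f, out, cap));
}
```

In this model, a file whose second line pushes the total past 80 bytes yields a reread "first line" that begins with the tail of the old buffer and ends with nextboot_enable="YES", the same shape as the syntax error shown earlier.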

The bug is very old, dating back to at least 2000 if not before, and is
still present in 12.0 and CURRENT r345863 (though it is now hidden by
the Lua loader by default).

Suggested one-line fix attached.  This does change the behavior of the
reset_line_reading word, which is exported in the line-reading
dictionary (though the export is not documented in loader man pages).
But repo history shows it was probably exported for the PNP support
code, which was never included in the loader build, and was removed 5
months ago.

One thing that puzzles me: how has this bug gone unnoticed/unfixed for
nearly 2 decades?  I find it hard to believe that nobody's tried to do
something interesting with nextboot, like load a kernel and filesystem,
which is what I'm doing.

PR: 239315
Reviewed by: imp

(cherry picked from commit 9c1c02093b90ae49745a174eb26ea85dd1990eec)

mk: LZMA_SUPPORT is unused

Retire LZMA_SUPPORT. It's unused since r332995.

Reviewed by: delphij
PR: 244302
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D31088

(cherry picked from commit d8514fa6f1b0b9824b169c5ab66f37713b303c57)

arm: remove fslsdma from GENERIC

The fslsdma device requires sdma_fw, but that's not included in
GENERIC. That firmware is not in the FreeBSD tree at the moment, but
could easily be.

The license for the firmware can be found in the linux firmware repo:
https://git.kernel.org/pub/scm/linux/kernel/git/firmware/linux-firmware.git/commit/?id=3123d78e09d2f815de4d94aa35c07b3c0469c80e
and looks to be a BSD license + no reverse engineer.

We can add this back after the firmware is imported, made into a port,
or can be loaded automatically.

Reviewed by: imp (with ian finding the license)
PR: 237466
MFC after: 1 week

(cherry picked from commit 9e3761d126c5c019d6c935e83989928eb1a0e76e)

devmatch: don't announce autoloading so much

devmatch rc script would announce it was loading a module multiple
times. It used kldload -n so it really wasn't loading it that many
times, but the message is confusing. Use kldstat to see if we need to
load the module before saying we do. This fixes the vast majority of the
problems. It may be possible to race devmatch with a user invocation and
devd, though quite hard. In that case we'll announce things twice, but
still only load it once. No attempt is made to fix this.

PR: 232782
MFC After: 2 weeks
Sponsored by: Netflix

(cherry picked from commit 5549c6a62f0f4fc5d7e80973b28ebcf7f556edf8)

devmatch: Be tolerant of .ko being present.

We document that we did not need .ko on the module names in
devmatch_blocklist, but we really needed them. Keep the documentation
the same, but strip the .ko when we need to use the names so you can
specify either.

PR: 256240
MFC After: 2 weeks
Sponsored by: Netflix

(cherry picked from commit b29ebb9c65b350e78aedfc790bfcaf9717059b70)

devmatch: defer until after kld

devmatch loads a number of things automatically. Let the list of
things to load be processed first, in case those drivers affect what
would be loaded. Normally, this will produce the same results, but there
are some special cases where it may not, such as when the loaded drivers
report other drivers missing, like virtio_pci.

PR: 253287
Reviewed by: imp
MFC After: 2 weeks

(cherry picked from commit f68e3ea831b76a8927eed7f7abfea55ee5a193c4)

pf: bound DIOCGETSTATESV2 memory use

Rather than allocating however much memory userspace asks for we only
allocate enough for a handful of states, and copy to userspace for each
completed row.
We start out with enough space for 16 states (per row), but grow that as
required. In most configurations we expect at most a handful of states
per row (more than that would have other negative effects on packet
processing performance).
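
The row-at-a-time approach can be illustrated in userland. This is a minimal sketch under stated assumptions, not pf's code: struct state, struct row and export_states() are hypothetical names, and a memcpy() into a caller buffer stands in for the copy to userspace. Only the starting size of 16 states and the grow-as-required behavior come from the text above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-ins; these are not pf's real structures. */
struct state {
	int id;
};
struct row {
	const struct state *states;
	size_t count;
};

#define	INITIAL_SLOTS	16	/* initial per-row scratch capacity */

/*
 * Copy all states to "out" one row at a time through a small scratch
 * buffer.  Returns the number of states copied, or (size_t)-1 if the
 * scratch buffer cannot be allocated.  *peak_slots reports how large
 * the scratch buffer had to become.
 */
size_t
export_states(const struct row *rows, size_t nrows, struct state *out,
    size_t out_cap, size_t *peak_slots)
{
	size_t slots = INITIAL_SLOTS, copied = 0;
	struct state *scratch;

	assert(out != NULL && peak_slots != NULL);
	scratch = malloc(slots * sizeof(*scratch));
	if (scratch == NULL)
		return ((size_t)-1);
	for (size_t i = 0; i < nrows; i++) {
		size_t n = rows[i].count;

		if (n > out_cap - copied)
			n = out_cap - copied;	/* respect caller capacity */
		if (n == 0)
			continue;
		if (n > slots) {
			/* Grow only when a row does not fit. */
			struct state *bigger;

			while (slots < n)
				slots *= 2;
			bigger = realloc(scratch, slots * sizeof(*scratch));
			if (bigger == NULL) {
				free(scratch);
				return ((size_t)-1);
			}
			scratch = bigger;
		}
		memcpy(scratch, rows[i].states, n * sizeof(*scratch));
		/* Stand-in for copying one completed row to userspace. */
		memcpy(out + copied, scratch, n * sizeof(*scratch));
		copied += n;
	}
	*peak_slots = slots;
	free(scratch);
	return (copied);
}
```

The point of the design is that the scratch allocation tracks the largest row seen, not the size the caller asked for.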

Reviewed by: mjg
MFC after: 1 week
Sponsored by: Rubicon Communications, LLC ("Netgate")
Differential Revision: https://reviews.freebsd.org/D31111

(cherry picked from commit 3fc12ae042040192aa43984106a75663aaa9e0f5)

libpfctl: migrate to DIOCGETSTATESV2

Stop using the *NV version to retrieve states, as its performance is
unacceptably bad.

For 1,000,000 states the nvlist version needed ~100 seconds to retrieve
the states, the new version needs ~3 seconds.

Reviewed by: mjg
MFC after: 1 week
Sponsored by: Rubicon Communications, LLC ("Netgate")
Differential Revision: https://reviews.freebsd.org/D31098

(cherry picked from commit be70c7a50d324dd56f3a0d37d8372e7855dd580b)

pf: add DIOCGETSTATESV2

Add a new version of the DIOCGETSTATES call, which extends the struct to
include the original interface information.

MFC after: 1 week
Sponsored by: Rubicon Communications, LLC ("Netgate")
Differential Revision: https://reviews.freebsd.org/D31097

(cherry picked from commit c6bf20a2a46dc36bf881ac594454f71379828a9a)

dummynet: reduce console spam

Only print this warning when boot verbose is enabled.
This can get pretty annoying (and useless) in some systems.

Reviewed by: kp
MFC after: 1 week
Sponsored by: Rubicon Communications, LLC ("Netgate")

(cherry picked from commit c5dd8bac0b96e11da02181bd1dbee677e270842d)

pf: pf_killstates() never fails, so remove the return value

Suggested by: mjg
MFC after: 1 week
Sponsored by: Rubicon Communications, LLC ("Netgate")

(cherry picked from commit 34641052826c718566b994b75cd2bddb53a21583)

pf: Handle errors returned by pf_killstates()

Happily this wasn't a real bug, because pf_killstates() never fails, but
we should check the return value anyway, in case it does ever start
returning errors.

Reported by: clang --analyze
MFC after: 1 week
Sponsored by: Rubicon Communications, LLC ("Netgate")

(cherry picked from commit fa96701c8abbc29aad7f8f8d6b823bd7f89c6c15)

pf: Remove unneeded NULL check

pidx is never NULL, and is used unconditionally later on in the
function.
Add an assertion, as documentation for the requirement to provide an idx
pointer.

Reported by: clang --analyze
MFC after: 1 week
Sponsored by: Rubicon Communications, LLC ("Netgate")

(cherry picked from commit 8cceacc0f1ee6a77c5f0566b8e6b0f054160fb20)

rc.d: connect sysctl_lastload

Add recently added sysctl_lastload.

(cherry picked from commit 20eb969793921dce9e524d19fc02b84cabd98f74)

ipfw: reload sysctl.conf variables if needed

Currently ipfw has multiple components, such as dummynet, that are not
part of the GENERIC kernel. They can bring in important sysctls if
enabled with rc.conf(5) and loaded by the ipfw startup script by means
of "required_modules", after /etc/sysctl.conf has already been
consulted at boot time. Here is an example of one, increasing the
limit for dummynet hold queues, which defaults to 100:

net.inet.ip.dummynet.pipe_slot_limit=1000

This makes it possible to use ipfw/dummynet rules such as:

ipfw pipe 1 config bw 50Mbit/s queue 1000

Such a rule is rejected unless the above sysctl is applied.
Another example is the group of net.inet.ip.alias.* sysctls
created after libalias.ko is loaded as a dependency of ipfw_nat.

This is not a problem if the corresponding code is compiled into a custom
kernel, so the sysctls exist when sysctl.conf is read early, or if the
kernel modules are loaded by the loader. This change makes it work also
for GENERIC and modules loaded by means of rc.conf(5) settings.

(cherry picked from commit f5b5de1a3210234f3a6864c88a2d3e11ac2dbf04)

rc.d: unbreak sysctl lastload

/etc/rc.d/securelevel is supposed to run /etc/rc.d/sysctl lastload
late at boot time to apply /etc/sysctl.conf settings that fail
to apply early. However, this does not work in the default configuration
because kern_securelevel_enable is "NO" by default.

Add new script /etc/rc.d/sysctl_lastload that starts unconditionally.

Reported by: Marek Zarychta

(cherry picked from commit f4b38c360e63a6e66245efedbd6c070f9c0aee55)

UPDATING: Not unusual side effect of the awk bug fixed in 3e804463521

You might not be able to build the kernel if you have an awk between Jul
10th and today. It does not affect all platforms due to the nature of
the bug (so amd64 is unaffected in stable/13 or current, but is affected
in stable/12. i386 seems to be affected everywhere).

Sponsored by: Netflix

awk: revert upstream's attempt to disallow hex strings

Upstream one-true-awk decided to disallow hex strings as numbers. This
is in line with awk's behavior prior to C99, and allowed by the POSIX
standard. The standard, however, allows them to be treated as numbers
because that's what the standard said in the 2001 through 2004 editions.
Since 2001, the nawk in FreeBSD has treated them as numbers, so restore
that behavior, allowed by the standard.

A number of scripts in the FreeBSD tree depend on this interpretation,
including scripts to build the kernel which had mysteriously started
failing for some people and not others. By re-allowing 0x hex numbers,
this fixes those scripts and restores POLA.
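
For context on what "hex strings as numbers" means in practice: C99's strtod() accepts a 0x prefix. The sketch below contrasts the two interpretations using a hypothetical helper, to_number(); it is not code from awk, and the allow_hex switch exists only for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/*
 * Hypothetical helper: numeric value of a string with or without hex
 * support.  With allow_hex == 0, a "0x" prefix stops the conversion
 * after the leading zero, so the result is 0.
 */
double
to_number(const char *s, int allow_hex)
{
	assert(s != NULL);
	if (!allow_hex) {
		const char *p = s;

		while (*p == ' ' || *p == '\t')
			p++;
		if (*p == '+' || *p == '-')
			p++;
		if (p[0] == '0' && (p[1] == 'x' || p[1] == 'X'))
			return (0.0);
	}
	/* C99 strtod() parses hexadecimal input such as "0x1A". */
	return (strtod(s, NULL));
}
```

A script that compares or adds values written as 0x... gets very different answers under the two modes, which is consistent with build scripts failing only where such values appear.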

Upstream issue: https://github.com/onetrueawk/awk/issues/126
Sponsored by: Netflix
Reviewed by: kevans
MFC After: asap due to regression already merged to stable
Differential Revision: https://reviews.freebsd.org/D31199

(cherry picked from commit d4d252c49976de33d0a2926df733744d0b8d95fa)

nfscl: Improve "Consider increasing kern.ipc.maxsockbuf" message

When the setting of kern.ipc.maxsockbuf is less than what is
desired for I/O based on vfs.maxbcachebuf and vfs.nfs.bufpackets,
a console message of "Consider increasing kern.ipc.maxsockbuf"
is printed.

This patch modifies the message to provide a suggested value
for kern.ipc.maxsockbuf.
Note that the setting is only needed when the NFS rsize/wsize
is set to vfs.maxbcachebuf.

While here, make nfs_bufpackets global, so that it can be used
by a future patch that adds a sysctl to set the NFS server's
maximum I/O size. Also, remove "sizeof(u_int32_t)" from the maximum
packet length, since NFS_MAXXDR is already an "overestimate"
of the actual length.

(cherry picked from commit c5f4772c66d2eb31b84a84a89c8a284043f03452)

pf: add pf_find_state_all_exists

Reviewed by: kp
Sponsored by: Rubicon Communications, LLC ("Netgate")

(cherry picked from commit 19d6e29b872232c47190344f3dfded2f73edd8ae)

pf: padalign global locks found in pf.c

Reviewed by: kp
Sponsored by: Rubicon Communications, LLC ("Netgate")

(cherry picked from commit f649cff58721f493f218a4d1fb88a12255945472)

pf: allow table stats clearing and reading with ruleset rlock

Instead serialize against these operations with a dedicated lock.

Prior to the change, when pushing 17 mln pps of traffic, calling
DIOCRGETTSTATS in a loop would restrict throughput to about 7 mln. With
the change there is no slowdown.

Reviewed by: kp (previous version)
Sponsored by: Rubicon Communications, LLC ("Netgate")

(cherry picked from commit dc1ab04e4c9ede3606985e0cce1200e3060ac166)

pf: depessimize table handling

Creating tables and zeroing their counters induces excessive IPIs (14
per table), which in turn kills single- and multi-threaded performance.

Work around the problem by extending per-CPU counters with a general
counter populated on "zeroing" requests -- it stores the currently found
sum. Requests to report the current value then return the sum of the
per-CPU counters minus the saved value.
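
The idea can be sketched in a few lines of C. This is an illustration under assumptions, not the pf implementation: struct lazy_counter, the NCPU constant and the function names are hypothetical, and a plain array stands in for real per-CPU storage.

```c
#include <assert.h>
#include <stdint.h>

#define	NCPU	4	/* hypothetical CPU count for the sketch */

struct lazy_counter {
	uint64_t percpu[NCPU];	/* stand-in for per-CPU slots */
	uint64_t zeroed_at;	/* sum observed at the last "zero" request */
};

static uint64_t
raw_sum(const struct lazy_counter *c)
{
	uint64_t sum = 0;

	for (int i = 0; i < NCPU; i++)
		sum += c->percpu[i];
	return (sum);
}

void
counter_add(struct lazy_counter *c, int cpu, uint64_t n)
{
	assert(cpu >= 0 && cpu < NCPU);
	c->percpu[cpu] += n;
}

/* "Zeroing" only records the current sum; no per-CPU slot is touched. */
void
counter_zero(struct lazy_counter *c)
{
	c->zeroed_at = raw_sum(c);
}

/* The reported value is the raw sum minus the saved baseline. */
uint64_t
counter_fetch(const struct lazy_counter *c)
{
	return (raw_sum(c) - c->zeroed_at);
}
```

Because zeroing never writes to another CPU's slot, it needs no cross-CPU coordination in this model, which is the property the commit relies on to avoid the IPIs.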

Sample timings when loading a config with 100k tables on a 104-way box:

stock:

pfctl -f tables100000.conf  0.39s user 69.37s system 99% cpu 1:09.76 total
pfctl -f tables100000.conf  0.40s user 68.14s system 99% cpu 1:08.54 total

patched:

pfctl -f tables100000.conf  0.35s user 6.41s system 99% cpu 6.771 total
pfctl -f tables100000.conf  0.48s user 6.47s system 99% cpu 6.949 total

Reviewed by: kp (previous version)
Sponsored by: Rubicon Communications, LLC ("Netgate")

(cherry picked from commit f92c21a28cd856834249a008771b2f002e477a39)

libalias: fix divide by zero causing panic

The packet_limit can fall to 0, leading to a divide by zero abort in
the "packets % packet_limit".

A possible solution would be to apply a lower limit of 1 after the
calculation of packet_limit, but since any number modulo 1 gives 0,
the more efficient solution is to skip the modulo operation for
packet_limit <= 1.
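
A minimal sketch of the guard, assuming a hypothetical helper named limit_reached(); the real libalias code is structured differently, and only the "skip the modulo for packet_limit <= 1" rule is taken from the text.

```c
#include <assert.h>
#include <limits.h>

/*
 * Hypothetical helper: returns 1 when "packets" is a multiple of
 * "packet_limit".  For packet_limit <= 1 the modulo is skipped: any
 * number modulo 1 is 0, and a limit of 0 would divide by zero.
 */
int
limit_reached(unsigned int packets, unsigned int packet_limit)
{
	assert(packets <= UINT_MAX);	/* trivially true; documents the domain */
	if (packet_limit <= 1)
		return (1);
	return (packets % packet_limit == 0);
}
```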

Reported by: Karl Denninger <karl@denninger.net>

(cherry picked from commit 58080fbca09fda6d5f011d37059edbca8ceb4c58)

pf: rename pf_state to pf_kstate

Indicate that this is a kernel-only structure, and make it easier to
distinguish from others used to communicate with userspace.

Reviewed by: mjg
MFC after: 1 week
Sponsored by: Rubicon Communications, LLC ("Netgate")
Differential Revision: https://reviews.freebsd.org/D31096

(cherry picked from commit 211cddf9e3a1bc0d4b1b94bea7d16a47b5a17f49)

tcp: fix alternate stack build with LINT-NO{INET,INET6,IP}

When fixing another bug, I noticed that the alternate
TCP stacks do not build when various combinations of
ipv4 and ipv6 are disabled.

Reviewed by: rrs, tuexen
Differential Revision: https://reviews.freebsd.org/D31094
Sponsored by: Netflix

(cherry picked from commit b1e806c0ed960e1eb9ee889c7d0df3c168290c4f)

tcp: Fix 32 bit platform breakage

This fixes the incorrect use of a sysctl add to u64. It
was for a useconds time, but on 32 bit platforms it's
not a u64. Instead use the long directive.

Reviewed by: tuexen
Sponsored by: Netflix Inc.
Differential Revision: https://reviews.freebsd.org/D31107

(cherry picked from commit 7312e4e5cfc8e48597acf17f4faa8159f0b5fa06)

tcp: HPTS performance enhancements

HPTS drives both rack and bbr, and yet there have been many complaints
about performance. This bit of work restructures hpts to help reduce CPU
overhead. Instead of relying on the timer/callout to drive it, hpts is now
driven by user return from a system call as well as by lro flushes. The
timer becomes a backstop that dynamically adjusts based on how "late" we
are.

Reviewed by: tuexen, glebius
Sponsored by: Netflix Inc.
Differential Revision: https://reviews.freebsd.org/D31083

(cherry picked from commit d7955cc0ffdf9fb58013245a6f181c757574ea0a)

tcp: Address goodput and TLP edge cases.

There are several cases where we make a goodput measurement and we are running
out of data when we decide to make the measurement. In reality we should not make
such a measurement if there is no chance we can have "enough" data. There are also
some corner case TLPs that end up not registering as a TLP like they should; we
fix this by pushing the doing_tlp setup to the actual timeout that knows it did
a TLP. This makes it so we always have the appropriate flag on the sendmap
indicating a TLP being done, and count correctly so that we make no more
than two TLPs.

In addressing the goodput, let's also add a "quality" metric that can be viewed via
blackbox logs so that a casual observer does not have to figure out how good
of a measurement it is. This is needed due to the fact that we may still make
a measurement that is of a poorer quality as we run out of data but still have
a minimal amount of data to make a measurement.

Reviewed by: tuexen
Sponsored by: Netflix Inc.
Differential Revision: https://reviews.freebsd.org/D31076

(cherry picked from commit e834f9a44acc577e658f40023d9465e887c94920)

tcp: Preparation for allowing hardware TLS to be able to kick a tcp connection that is retransmitting too much out of hardware and back to software.

Hardware TLS is now supported in some interface cards and it works well, except that
when we have connections that retransmit a lot we get into trouble with all the retransmits.
This prep step makes way for a change that Drew will be making so that we can "kick out" a
session from hardware TLS.

Reviewed by: tuexen, gallatin
Sponsored by: Netflix Inc
Differential Revision: https://reviews.freebsd.org/D30895

(cherry picked from commit 9e4d9e4c4d79db58740a05d67645351e640aa32c)

tcp: enter network epoch when calling tfb_tcp_fb_fini

We need to enter the network epoch when calling into
tfb_tcp_fb_fini. I noticed this when I hit an assert
running the latest rack.

Differential Revision: https://reviews.freebsd.org/D30407
Reviewed by: rrs, tuexen
Sponsored by: Netflix

(cherry picked from commit 086a35562f47917a516d30acc8b78a4884e31a4f)

tcp: Rack not being very friendly with V6:4 socket and having a connection from V4

There were two bugs that prevented V4 sockets from connecting to
a rack server running a V4/V6 socket, as well as a bug that stops the
mapped v4 in V6 address from working.

Reviewed by: tuexen
Sponsored by: Netflix Inc
Differential Revision: https://reviews.freebsd.org/D30885
PR: 256657

(cherry picked from commit 66aec14a5391bda1e9a20f5e4381626797c3e0fb)

sctp: Fix errno in case of association setup failures

Do not always report ETIMEDOUT, but only when appropriate. In
other cases report ECONNABORTED.

(cherry picked from commit 105b68b42dd11bce5c554b1ef0ddf73aa069d7da)

sctp: provide consistent stream information in case of early errors

While there, make sure the function is called correctly.

(cherry picked from commit ce64352a702db1fef8c0e33f3a6f13a3e5d92736)

sctp: provide sac_error also for ABORT chunk being sent

Thanks to Florent Castelli for bringing this issue up for the
userland stack and providing an initial patch.

(cherry picked from commit 84992a3251d56df3bc36e0ac33ba383f41107864)

sctp: initialize sequence numbers for ECN correctly

Reported by: Junseok Yang (for the userland stack)

(cherry picked from commit c7f048ab3532a9f081addd6da0adf96f25271de8)

sctp: Fix length check for ECNE chunks

(cherry picked from commit 6587a2bd1e88b5b99aea114e3d20b0d4c48c95df)

tcp: tolerate missing timestamps

Some TCP stacks negotiate TS support, but do not send TS at all
or not for keep-alive segments. Since this includes modern widely
deployed stacks, tolerate the violation of RFC 7323 per default.

Reviewed by: rgrimes, rrs, rscheff
Differential Revision: https://reviews.freebsd.org/D30740
Sponsored by: Netflix, Inc.

(cherry picked from commit 870af3f4dc57a6bbfc03f6a49ca0d5b7ff1b975a)

vmm: Fix ivrs_drv device_printf usage

The original %b description string is wrong.

Sponsored by: The FreeBSD Foundation
Reviewed by: imp, jhb
Differential Revision: https://reviews.freebsd.org/D30805

(cherry picked from commit 210e6aec4f83ee0efef348ed9dd86be7592596a1)

zfsd: Check for error from zpool_vdev_online

Onlining a vdev can fail. Log the error if it does.

Reviewed by: mav, asomers
MFC after: 1 week
Sponsored by: iXsystems, Inc.
Differential Revision: https://reviews.freebsd.org/D30882

(cherry picked from commit 53b438b2425c374f6147ac80b3330a9ec08432bb)

nvme(4): Report NPWA before NPWG as stripesize.

New Samsung 980 SSDs report Namespace Preferred Write Alignment of
8 (4KB) and Namespace Preferred Write Granularity of 32 (16KB).
My quick tests show that 16KB is a minimal sequential write size
when the SSD reaches peak IOPS, so writing much less is very slow.
But writing slightly less or slightly more does not change much,
so it seems not so much a size granularity as minimum I/O size.

Thinking about different stripesize consumers:
- Partition alignment should be based on NPWA by definition.
- ZFS ashift in part of forcing alignment of all I/Os should also
be based on NPWA. In part of forcing size granularity, if really
needed, it may be set to NPWG, but too big value can make ZFS too
space-inefficient, and the 16KB is actually the biggest supported
value there now.
- ZFS recordsize/volblocksize could potentially be tuned up toward
NPWG to work as I/O size granularity, but enabled compression makes
it too fuzzy. And those are normally user-configurable things.
- ZFS I/O aggregation code could definitely use Optimal Write Size
value and may be NPWG, but we don't have fields in GEOM now to report
the minimal and optimal I/O sizes, and even maximal is not reported
outside GEOM DISK to be used by ZFS.

MFC after: 1 week

(cherry picked from commit e3bcd07d834def94dcf570ac7350ca2c454ebf10)

Skip netgraph tests when WITHOUT_NETGRAPH is set

PR: 256986
Reported by: John Marshall
MFC after: 1 week
Sponsored by: The FreeBSD Foundation

(cherry picked from commit c9144ec14d2a5a53cfe91ada1b3b9c06b78dc999)

nvmecontrol: fix typo (s/Managment/Management/)

Reported By: pstef

(cherry picked from commit 95a74ab4fb0879da270342bc98719b0e735694f3)

nvmecontrol: update copyright on passthru command

I wrote this code, not Intel, so put my copyright on this. I mistakenly
copied it for the initial commit.

Sponsored by: Netflix

(cherry picked from commit 6d6cca363392943689204f920fa2da9226e42056)

nvmecontrol: Report status from passthru commands

Report status from dword0 for passthru commands. Many commands report
some status or information here, so reporting it can help know what's
going on.

Sponsored by: Netflix

(cherry picked from commit 510a3da1477a917aa2aaf6b9e3cd6fd50dd13206)

nvmecontrol: document power command

The description of the power command is missing. While the synopsis is
present, there's no explanation. Add one.

Reviewed by: mav, chuck
PR: 237866
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D31122

(cherry picked from commit 445b5554bf97254a0ead3d70f801871d62dcfb62)

stand/kmem_zalloc: panic when a M_WAITOK allocation fails

Malloc() might return NULL, in which case we will panic with a NULL
pointer deref. Make it panic when the allocation fails to preserve the
postcondition that we never return a NULL value.
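
A userland sketch of that contract follows. It is not the loader's kmem_zalloc(): zalloc_checked(), panic_stub() and WAITOK_FLAG are hypothetical names, and calloc()/abort() stand in for the loader's allocator and panic.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <stdlib.h>

#define	WAITOK_FLAG	0x1	/* hypothetical "may not fail" flag */

/* Stand-in for panic(): report and stop. */
static void
panic_stub(const char *msg)
{
	fprintf(stderr, "panic: %s\n", msg);
	abort();
}

/*
 * Zeroed allocation.  With WAITOK_FLAG set, the caller is promised a
 * non-NULL result, so an allocation failure stops here with a clear
 * message instead of surfacing later as a NULL dereference.
 */
void *
zalloc_checked(size_t size, int flags)
{
	void *p;

	assert(size > 0);
	p = calloc(1, size);
	if (p == NULL && (flags & WAITOK_FLAG) != 0)
		panic_stub("zalloc_checked: allocation failed");
	return (p);
}
```

Callers that pass the flag can then use the result without a NULL check, which is the postcondition the commit message describes.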

Reviewed by: tsoome
PR: 249859
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D31106

(cherry picked from commit 72821668b039c276914569e9caa1cdfa4e4cb674)

nanobsd: remove sparc64 embedded example

Remove the qemu sparc64 example. It was only ever compile tested since
qemu had issues booting FreeBSD/sparc64. Also remove obsolete info about
armv5 configs removed long ago.

Sponsored by: Netflix

(cherry picked from commit 25a66f1fb177e0f3628f92f73b6ecf48a305d230)

nvme: coherently read status of completion records

Coherently read the phase bit of the status completion record. We loop
over the completion record array, looking for all the transactions in
the same phase that have been completed. In doing that, we have to be
careful to read the status field first, and if it indicates a complete
record, we need to read and process that record. Otherwise, the host
might be overtaken by device when reading this completion record,
leading to a mistaken belief that the record is in phase. This leads to
the code using old values and looking at an already completed entry, which
has no current tracker.

To work around this problem, we read the status and make sure it is in
phase; we then re-read the entire completion record, guaranteeing it's
complete, valid, and consistent. In addition, we resync the dmatag to
reflect changes since the prior loop for the bouncing dma case.
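
The ordering described above, status first and the full record only afterwards, can be sketched as follows. This is a simplified model, not the driver's code: struct cpl and fetch_completion() are hypothetical, the phase tag is taken to be bit 0 of the 16-bit status field, and the memory barriers and DMA syncs a real driver needs are left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified stand-in for a 16-byte completion record. */
struct cpl {
	uint32_t cdw0;
	uint32_t rsvd1;
	uint16_t sqhd;
	uint16_t sqid;
	uint16_t cid;
	uint16_t status;	/* bit 0 is the phase tag */
};

#define	CPL_PHASE(status)	((status) & 1u)

/*
 * Returns 1 and fills *out if the slot holds a record in the expected
 * phase, 0 otherwise.  The status is checked first; only when it is in
 * phase is the whole record read again, so every field in *out was
 * read after the device marked the record complete.
 */
int
fetch_completion(const volatile struct cpl *slot, unsigned int phase,
    struct cpl *out)
{
	uint16_t status;

	assert(slot != NULL && out != NULL);
	status = slot->status;
	if (CPL_PHASE(status) != phase)
		return (0);
	/* In phase: re-read the entire record, status included. */
	out->cdw0 = slot->cdw0;
	out->rsvd1 = slot->rsvd1;
	out->sqhd = slot->sqhd;
	out->sqid = slot->sqid;
	out->cid = slot->cid;
	out->status = slot->status;
	return (1);
}
```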

Reviewed by: jrtc27@, chuck@
Found by: jrtc27 (this fix is based in part on her D30995 fix)
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D31002

(cherry picked from commit aa0ab681ae755e01cd69435fab50f6852f248c42)

nvme: Fix alignment on nvme structures

Remove __packed from nvme_command, nvme_completion and
nvme_dsm_trim. Add super-alignment to nvme_completion since it's always
at least that aligned in hardware (and in our existing uses of it
embedded in structures). It generates better code in
nvme_qpair_process_completions on riscv64 because otherwise the ABI
assumes a 4-byte alignment, and the same on all other platforms.
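
The effect of the attributes on alignment can be checked at compile time. The structs below are simplified stand-ins rather than the definitions from the nvme headers, the 16-byte alignment is an assumption chosen to match the record size, and the attribute syntax is specific to GCC and Clang.

```c
#include <assert.h>
#include <stdint.h>

/* Packed: the compiler must assume byte alignment for every access. */
struct cpl_packed {
	uint32_t cdw0, rsvd1;
	uint16_t sqhd, sqid, cid, status;
} __attribute__((packed));

/* Natural: aligned to the widest member, 4 bytes here. */
struct cpl_natural {
	uint32_t cdw0, rsvd1;
	uint16_t sqhd, sqid, cid, status;
};

/* Over-aligned to the record size, as assumed for this sketch. */
struct cpl_aligned {
	uint32_t cdw0, rsvd1;
	uint16_t sqhd, sqid, cid, status;
} __attribute__((aligned(16)));

/* The layout is identical in all three; only the alignment differs. */
static_assert(sizeof(struct cpl_packed) == 16, "packed size");
static_assert(sizeof(struct cpl_natural) == 16, "natural size");
static_assert(sizeof(struct cpl_aligned) == 16, "aligned size");
```

With a larger guaranteed alignment the compiler is free to use wider loads and stores when copying a record, which is one plausible source of the better code mentioned above.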

Reviewed by: jrtc27@, mav@, chuck@
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D31001

(cherry picked from commit fea3cf1d6da0acf40bc1d3dadeeea7eeccbc10dd)

nvme: style nit

Put the { on the same line as the struct nvme_foo when we define these
structures. It's FreeBSD standard and these were inconsistent.

Sponsored by: Netflix

(cherry picked from commit 80a75155e1601bddc2c595c06ab6ea916c603071)

nvme: fix a race between failing the controller and failing requests

Part of the nvme recovery process for errors is to reset the
card. Sometimes, this results in failing the entire controller. When nda
is in use, we free the sim, which will sleep until all the I/O has
completed. However, with only one thread, the request fail task never
runs once the reset thread sleeps here. Create two threads to allow I/O
to fail until it's all processed and the reset task can proceed.

This is a temporary kludge until I can work out questions that arose
during the review, not least of which is what race queueing to a
failure task solved. The original commit is vague and other error paths
in the same context do a direct failure. I'll investigate that more
completely before committing a change to a direct failure. mav@
raised this issue during the review, but didn't otherwise object.

Multiple threads, though, solve the problem in the meantime until other
such means can be perfected.

Reviewed by: jhb@
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D30366

(cherry picked from commit f0f47121653e88197d8537572294b90f5aef7f17)

nvme: use config_intrhook_drain to avoid removable card races

nvme drives are configured early in boot. However, a number of the configuration
steps take a while, so we defer those to a config intrhook that runs
before the root filesystem is mounted. At the same time, the PCI hot plug wakes
up and tests the status of the card. It may decide that the card has gone away
and deletes the child. As part of that process nvme_detach is called. If this
call happens after the config_intrhook starts to run, but before it is finished,
there's a race where we can tear down the device's soft state while the
config_intrhook is still using it. Use the new config_intrhook_drain to
disestablish the hook. Either it will be removed w/o running, or the routine
will wait for it to finish. This closes the race and allows safe hotplug at any
time, even very early in boot.

Sponsored by: Netflix, Inc
Reviewed by: jhb, mav
Differential Revision: https://reviews.freebsd.org/D29006

(cherry picked from commit 8423f5d4c127f18e7500bc455bc7b6b1691385ef)

config_intrhook: provide config_intrhook_drain

config_intrhook_drain will remove the hook from the list as
config_intrhook_disestablish does if the hook hasn't been called. If it has,
config_intrhook_drain will wait for the hook to be disestablished in the normal
course (or expedited, it's up to the driver to decide how and when
to call config_intrhook_disestablish).

This is intended for removable devices that use config_intrhook and might be
attached early in boot, but that may be removed before the kernel can call the
config_intrhook or before it ends. To prevent all races, the detach routine will
need to call config_intrhook_drain.
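The drain semantics described above can be modelled in a few lines of Python. This is a user-space sketch, not the kernel implementation: the class `ConfigHook`, its state names, and the return strings are all hypothetical, chosen only to show the two outcomes (removed before running, or waited for until finished).

```python
import threading


class ConfigHook:
    """Toy model of a config intrhook; all names here are hypothetical."""

    def __init__(self, func):
        self.func = func
        self.state = "pending"  # pending -> running -> done, or pending -> removed
        self._lock = threading.Lock()
        self._finished = threading.Event()

    def run(self):
        """Called by the 'boot' side: run the hook unless it was removed."""
        with self._lock:
            if self.state != "pending":
                return
            self.state = "running"
        self.func()
        self.disestablish()

    def disestablish(self):
        """Mark the hook finished and wake anyone blocked in drain()."""
        with self._lock:
            self.state = "done"
        self._finished.set()

    def drain(self):
        """Called by the 'detach' side before tearing down driver state."""
        with self._lock:
            if self.state == "pending":
                # Hook has not started: remove it so it never runs.
                self.state = "removed"
                self._finished.set()
                return "removed"
        # Hook is running or already finished: wait until it is disestablished.
        self._finished.wait()
        return "waited"
```

In this model a detach routine would call `drain()` first and free its own state only after it returns, which is the ordering the commit message asks drivers to follow.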

Sponsored by: Netflix, Inc
Reviewed by: jhb, mav, gde (in D29006 for man page)
Differential Revision: https://reviews.freebsd.org/D29005

(cherry picked from commit e52368365db3c0a696b37bfc09d08b7093b41b57)

math(3): Use the .Fa macro for function arguments

.Fa is the suitable macro for functions in comparison to the
.Ar macro, which should be used for commandline arguments.

While here, fix some mandoc warnings.

Reviewed by: imp (earlier version)
Obtained from: OpenBSD (in partial)
Differential Revision: https://reviews.freebsd.org/D31090

(cherry picked from commit c5cbef2f85e6020ef8357b7d3af3ca228a262309)

clang: stop linking _p libs for -pg as of FreeBSD 14

In FreeBSD 14 we will stop providing _p libraries (compiled with -pg).

[Note this is controlled by the target version. There is no change for
FreeBSD <= 13.]

Reviewed by: dim (upstream)
Obtained from: LLVM 699d47472c3f
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D30861

(cherry picked from commit b762974cf4b9ea77f1decf4a6d829372f0a97f75)

RELNOTES: Add an entry for commit 8a04edfdcbd2

This is a direct commit.

UPDATING: Add an entry for commit 8a04edfdcbd2

This is a direct commit.

mount_nfs.8: Update the man page for commit a145cf3f73c7

The NFSv4 client now uses the highest minor version of NFSv4
by default instead of minor version 0, for NFSv4 mounts.
The "minorversion" mount option may be used to override this default.

This patch updates the man page to reflect this change. While here,
fix nfsstat(8) to be nfsstat(1).

(cherry picked from commit b413b03597db10dbee514141b10f7c7ef236abe6)

nfscl: Change the default minor version for NFSv4 mounts

When NFSv4.1 support was added to the client, the implementation was
still experimental and, as such, the default minor version was set to 0.
Since the NFSv4.1 client implementation is now believed to be solid
and the NFSv4.1/4.2 protocol is significantly better than NFSv4.0,
I believe that NFSv4.1/4.2 should be used where possible.

This patch changes the default minor version for NFSv4 to be the highest
minor version supported by the NFSv4 server. If a specific minor version
is desired, the "minorversion" mount option can be used to override
this default. This is compatible with the Linux NFSv4 client behaviour.

This was discussed on freebsd-current@ in mid-May 2021 under
the subject "changing the default NFSv4 minor version" and
the consensus seemed to be support for this change.
It also appeared that changing this for FreeBSD 13.1 was
not considered a POLA violation, so long as UPDATING
and RELNOTES entries were made for it.

(cherry picked from commit a145cf3f73c7d0f6071a6bddbe8a50a280285900)

Narrow down the probe range for if_ure(4) compatible devices
to only match the first vendor specific interface, if any.

PR: 253374
Sponsored by: Mellanox Technologies // NVIDIA Networking

(cherry picked from commit dab84426a68d43efaede62ccf86ca3ef852f8ae3)

Add support for RTL8153B, RTL8156 and RTL8156B to if_ure(4).

Submitted by: fbbz@synack.eu
PR: 253374
Sponsored by: Mellanox Technologies // NVIDIA Networking

(cherry picked from commit d4cf41a99b405c73288aea81e3c4580d1de18435)

ibstat: Include prototype for sysctlbyname().

Fixes the following compile warning:
implicit declaration of function 'sysctlbyname' is invalid in C99
[-Wimplicit-function-declaration]

Found by: J87
Differential Revision: https://reviews.freebsd.org/D30484
Sponsored by: Mellanox Technologies // NVIDIA Networking

(cherry picked from commit 16fa3dcba027d13dcda9ee78e6057e3e5a79f80c)

Improve handling of USB device re-open in the LibUSB v1.x API.

Make sure the "device_is_gone" flag is cleared after every successful open,
so that the "device_is_gone" flag doesn't persist forever.

Found by: sergii.dmytruk@3mdeb.com
PR: 256296
Sponsored by: Mellanox Technologies // NVIDIA Networking

(cherry picked from commit 6847ea50196f1a685be408a24f01cb8d407da19c)

one-true-awk: import 20210221 (1e4bc42c53a1) which fixes a number of bugs

Import the latest bsd-features branch of the one-true-awk upstream:

o Move to bison for $YACC
o Set close-on-exec flag for file and pipe redirects that aren't std*
o lots of little fixes to modernize code base
o free sval member before setting it
o fix a bug where a{0,3} could match aaaa
o pull in systime and strftime from NetBSD awk
o pull in fixes from {Net,Free,Open}BSD (normalized our code with them)
o add BSD extensions and, or, xor, compl, lshift, rshift (mostly a nop)
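
The `a{0,3}` item in the list above concerns bounded repetition. The expected semantics can be checked with Python's `re` module, used here only as a stand-in: it is not awk's regex engine, and the helper names are made up for this illustration.

```python
import re

# Zero to three 'a' characters, and no more.
BOUNDED = re.compile(r"a{0,3}")


def matches_whole(text):
    """True if the entire string is zero to three 'a' characters."""
    return BOUNDED.fullmatch(text) is not None


def longest_prefix(text):
    """Longest leading run that the bounded pattern is allowed to consume."""
    return BOUNDED.match(text).group()
```

With a correct implementation, `matches_whole("aaaa")` is false and `longest_prefix("aaaa")` stops after three characters.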

Also revert a few of the trivial FreeBSD changes that were done slightly
differently in the upstreaming process. Also, our PR database may have
been mined by upstream for these fixes, and Mikolaj Golub may deserve
credit for some of the fixes in this update.

Suggested by: Mikolaj Golub <to.my.trociny@gmail.com>
PR: 143363, 143365, 143368, 143369, 143373, 143375, 214783
Sponsored by: Netflix

(cherry picked from commit f39dd6a9784467f0db5886012b3f4b13899be6b8)

zfs: merge openzfs/zfs@4f92fe0f5 (zfs-2.1-release) into stable/13

OpenZFS release 2.1.0

Version bump only, no changes in code.

Obtained from: OpenZFS
OpenZFS commit: 4f92fe0f5c822f6802c6ec675809d7c112a46f2e
OpenZFS tag: zfs-2.1.0
Relnotes: yes

zfs: update zfs_config.h to match current OpenZFS version (4f92fe0f5)

TBD: fetch(3) support for keylocation=http(s)://

(direct commit)

zfs: attach zpool_influxdb to build

From the zpool_influxdb.8 manual page:
  zpool_influxdb produces InfluxDB-line-protocol-compatible metrics from
  zpools.  Like the zpool command, zpool_influxdb reads the current pool
  status and statistics.  Unlike the zpool command which is intended for
  humans, zpool_influxdb formats the output in the InfluxDB line protocol.
  The expected use is as a plugin to a metrics collector or aggregator,
  such as Telegraf.

zpool_influxdb is installed into /usr/libexec/zfs/

Differential revision: https://reviews.freebsd.org/D31094

(cherry picked from commit 48b4fe0503282f03d25e23f44109c5cb6d450f7c)

bhyve: Fix NVMe iovec construction for large IOs

The UEFI driver included with Rocky Linux 8.4 uncovered an existing bug
in the NVMe emulation's construction of iovecs.

By default, NVMe data transfer operations use a scatter-gather list in
which all entries point to a fixed size memory region. For example, if
the Memory Page Size is 4KiB, a 2MiB IO requires 512 entries. Lists
themselves are also fixed size (default is 512 entries).

Because the list size is fixed, the last entry is special. If the IO
requires more than 512 entries, the last entry in the list contains the
address of the next list of entries. But if the IO requires exactly 512
entries, the last entry points to data.

The NVMe emulation missed this logic and unconditionally treated the
last entry as a pointer to the next list. Fix is to check if the
remaining data is greater than the page size before using the last entry
as a pointer to the next list.
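
The last-entry rule can be sketched as a small Python model. It follows the simplified description in this message only (fixed-size pages, fixed-size lists) and ignores details such as unaligned offsets; the function name and its return format are assumptions made for illustration, not bhyve code.

```python
def prp_list_layout(io_bytes, page_size=4096, entries_per_list=512):
    """Return one (data_entries, last_is_pointer) tuple per list."""
    pages = -(-io_bytes // page_size)  # ceiling division
    lists = []
    while pages > 0:
        if pages > entries_per_list:
            # More data than this list can describe: the last slot chains
            # to the next list, so only entries_per_list - 1 slots hold data.
            data_entries, last_is_pointer = entries_per_list - 1, True
        else:
            # Everything left fits, so the last slot (if used) points to data.
            data_entries, last_is_pointer = pages, False
        lists.append((data_entries, last_is_pointer))
        pages -= data_entries
    return lists
```

A 2MiB IO with 4KiB pages needs exactly 512 entries and yields a single list whose last entry is data; one byte more spills into a second list, and only then does the last entry of the first list become a pointer.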

PR: 256422
Reported by: dave@syix.com
Tested by: jason@tubnor.net
Relnotes: yes

(cherry picked from commit 91064841d72b285a146a3f1c32cb447251e062ea)

netpfil tests: Basic dummynet pipe test

Test dummynet pipes (i.e. bandwidth limitation) with ipfw. This is put
in the common tests because we hope to add dummynet support to pf in the
near future.

MFC after: 2 weeks
Sponsored by: Rubicon Communications, LLC ("Netgate")
Differential Revision: https://reviews.freebsd.org/D30380

(cherry picked from commit ea3eca5cb6dbcb4deb7c7277a65c48911f0475d1)

libpfctl: memory leak fix

We must remember to free the nvlist we create from the kernel's response
to DIOCGETSTATESNV, on every iteration.

Reviewed by: donner
MFC after: 1 week
Sponsored by: Rubicon Communications, LLC ("Netgate")
Differential Revision: https://reviews.freebsd.org/D30957

(cherry picked from commit 0e9f1892ec739d7fbd854af699507167a0a5dde2)

pf: getstates: avoid taking the hashrow lock if the row is empty

Reviewed by: mjg
MFC after: 1 week
Sponsored by: Rubicon Communications, LLC ("Netgate")
Differential Revision: https://reviews.freebsd.org/D30946

(cherry picked from commit a19ff8ce9b58548a5f965db2c46eb03c38b15edb)

pf: Reduce the data returned in DIOCGETSTATESNV

This call is particularly slow due to the large amount of data it
returns. Remove all fields pfctl does not use. There is no functional
impact to pfctl, but it somewhat speeds up the call.

It might affect other (i.e. non-FreeBSD) code that uses the new
interface, but this call is very new, so there's unlikely to be any. No
releases contained the previous version, so we choose to live with the
ABI modification.

Reviewed by: donner
MFC after: 1 week
Sponsored by: Rubicon Communications, LLC ("Netgate")
Differential Revision: https://reviews.freebsd.org/D30944

(cherry picked from commit 34285eefddc99c994c3e5374ba7836cc7cfc8e2e)

pf tests: Stress state retrieval

Create and retrieve 20,000 states. There have been issues with nvlists
causing very slow state retrieval. We don't impose a specific limit on
the time required to retrieve the states, but do log it. In excessive
cases the Kyua timeout will fail this test.

Reviewed by: donner
MFC after: 1 week
Sponsored by: Rubicon Communications, LLC ("Netgate")
Differential Revision: https://reviews.freebsd.org/D30943

(cherry picked from commit d8d43b2de1fa679179f7089cb3c31e6780ec82af)

Allow sleepq_signal() to drop the lock.

Introduce SLEEPQ_DROP sleepq_signal() flag, allowing one to drop the
sleep queue chain lock before returning. The reduced lock scope
significantly reduces lock contention inside taskqueue_enqueue() for
ZFS worker threads doing ~350K disk reads/s on a 40-thread system.

MFC after: 2 weeks
Sponsored by: iXsystems, Inc.

(cherry picked from commit 6df35af4d85c6311d8e762521580e7176b69394e)

devmatch: improve naming of devmatch config variable

Accept the old rc.conf variable if the new one is not present for
compatibility.

Approved by: imp
Differential Revision: https://reviews.freebsd.org/D30806

(cherry picked from commit c43b0081faab742eb93c3d064b552b65f926b86e)

pf tests: ftp-proxy test

Basic test case for ftp-proxy

PR: 256917
MFC after: 1 week
Sponsored by: Rubicon Communications, LLC ("Netgate")

(cherry picked from commit dd82fd3543022017b84007ac1a0d45fc683f9850)

ftp-proxy: Revert incorrect migration to libpfctl

libpfctl supports creating rules, but not (yet) adding addresses to a
pool. Adding addresses certainly does not work through adding a rule.

PR: 256917
MFC after: 1 week
Sponsored by: Rubicon Communications, LLC ("Netgate")

(cherry picked from commit 8923ea6c867fd75b08b76883ec122c429a4018f9)

dummynet: fix sysctls

The sysctl nodes which use V_dn_cfg must be marked as CTLFLAG_VNET so
that we use the correct per-vnet offset.

PR: 256819
Reviewed by: donner
MFC after: 1 week
Sponsored by: Rubicon Communications, LLC ("Netgate")
Differential Revision: https://reviews.freebsd.org/D30974

(cherry picked from commit 8f76eebce424de064f65fec5cdd105446a2de3bd)

arm: Make sure we can handle a thumb entry point.

Similarly to what's been done on arm64 with commit
712c060c94fd447c91b0e6218c12a431206b487a, when executing a binary, if the
entry point is a thumb symbol, then make sure we set the PSL_T flag, otherwise
the CPU will interpret it in ARM mode, and that will likely lead to an
undefined instruction.
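
The check described above amounts to testing the low bit of the entry address. A minimal Python sketch follows, assuming the usual ARM convention that bit 0 of a code address marks a Thumb target and that the Thumb state bit is bit 5 of the status register; the function names are hypothetical and this is not the kernel's code.

```python
PSR_T = 0x00000020  # Thumb state bit (bit 5) of the ARM program status register


def is_thumb_entry(entry_addr):
    """By convention, bit 0 of a code address set means a Thumb target."""
    return (entry_addr & 1) == 1


def initial_status(entry_addr, status=0):
    """Return the status word to start with, adding the T bit for Thumb entries."""
    if is_thumb_entry(entry_addr):
        status |= PSR_T
    return status
```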

PR: 256899
MFC after: 1 week

(cherry picked from commit 8c3bd133dd52824e427e350c65eae1fd9eb5a3cd)
Signed-off-by: Olivier Houchard <cognet@FreeBSD.org>

arm64: Make sure COMPAT_FREEBSD32 handles thumb entry point.

If the entry point for the binary executed is a thumb 2 entry point, make
sure we set the PSR_T bit, or the CPU will interpret it as arm32 code and
bad things will happen.

PR: 256899
MFC after: 1 week

(cherry picked from commit 712c060c94fd447c91b0e6218c12a431206b487a)
Signed-off-by: Olivier Houchard <cognet@FreeBSD.org>

t_getgroups: No longer expected to fail

Sponsored by: Netflix

(cherry picked from commit bf26ea77553931c22e72ddf1f9df6fb51fcbadfe)

kern: fail getgroup and setgroup with negative int

Found using
https://github.com/NetBSD/src/blob/trunk/tests/lib/libc/sys/t_getgroups.c

getgroups/setgroups want an int and therefore casting it to u_int
resulted in `getgroups(-1, ...)` not returning -1 / errno = EINVAL.
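
Why the cast hides the error can be shown with a simplified Python model of the size check. This is not the kernel code: the function names and the exact shape of the check are assumptions, and the 32-bit mask stands in for a C cast from int to u_int.

```python
import errno


def as_u_int(value):
    """Reinterpret a 32-bit C int as unsigned, as a cast to u_int would."""
    return value & 0xFFFFFFFF


def size_check(gidsetsize, ngroups, treat_as_unsigned):
    """Return an errno value, or None if the size argument is accepted."""
    size = as_u_int(gidsetsize) if treat_as_unsigned else gidsetsize
    if size == 0:
        return None  # a size of 0 just asks how many groups there are
    if size < ngroups:
        return errno.EINVAL  # buffer too small, or (signed) negative
    return None
```

Treated as unsigned, -1 becomes 4294967295 and passes the "large enough" test; kept as a signed int, it is smaller than any group count and is rejected with EINVAL.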

imp@ updated syscall.master and made changes markj@ suggested

PR: 189941
Tested by: imp@
Reviewed by: markj@
Pull Request: https://github.com/freebsd/freebsd-src/pull/407
Differential Revision: https://reviews.freebsd.org/D30617

(cherry picked from commit 4bc2174a1b489c36195ccc8cfc15e0775b817c69)

Add deprecation notice for WITH_PROFILE option

As discussed on freebsd-current [1] and freebsd-arch [2] and review
D30833, FreeBSD 14 will ship without the _p.a libraries built with -pg.
Both upstream and base system (in commit b762974cf4b9) Clang have been
modified to remove the special case for linking against these libraries.

Clang's -pg support and mcount() remain, so building with -pg can still
be used on code that the user builds; we just do not provide prebuilt
libraries compiled with -pg. A similar change is still needed for GCC.

[1] https://lists.freebsd.org/pipermail/freebsd-current/2020-January/075105.html
[2] https://lists.freebsd.org/archives/freebsd-arch/2021-June/000016.html

MFC after: 1 week
Sponsored by: The FreeBSD Foundation

(cherry picked from commit 175841285e289edebb6603da39f02549521ce950)

Clarify notice for profiled libraries in FreeBSD 14

Reported by: kevans
Fixes: 175841285e28 ("Add deprecation notice for...")
Sponsored by: The FreeBSD Foundation

(cherry picked from commit f94360971e649fa684ef3b7e72839b59c7242bdb)

ffs_softdep: force sync if journal is low in journal_check_space

(cherry picked from commit 50acaaef54b4d7811393eb8c05a398d7a1882418)

ffs_softdep.c: add journal_check_space() helper

(cherry picked from commit 2126f103e0434db6ca34f0a5167bf5f03d4f02ad)

softdep_prelink(): only do sync if other thread changed the vnode metadata since previous prelink

FreeBSD_version is bumped due to struct namei size change

(cherry picked from commit 64b494a1050ae2cf2412edc19b57dc80f49eeda1)

ufs_rename(): only do softdep_prerename() when other thread changed a vnode

(cherry picked from commit f7565466622a411a50522f23528faeb1e57d4571)

ffs: mark block (re-)allocations as seqc writes

(cherry picked from commit d4d289cd51078de9e82c9d83977cfa614032cd06)