Mark Johnston [Sat, 10 Jul 2021 00:38:21 +0000 (20:38 -0400)]
uma: Fix a few problems with KASAN integration
- Ensure that all items returned by UMA are aligned to
KASAN_SHADOW_SCALE (8). This was true in practice since smaller
alignments are not used by any consumers, but we should enforce it
anyway.
- Use a non-zero code for marking redzones that appear naturally in
items that are not a multiple of the scale factor in size. Currently
we do not modify keg layouts to force the creation of redzones.
- Use a non-zero code for marking freed per-CPU items, otherwise
accesses of freed per-CPU items are not detected by the runtime.
iichid(4): Perform bus_teardown_intr/bus_setup_intr to disable interrupts
during suspend/resume cycle. Previously used bus_generic_suspend_intr and
bus_generic_resume_intr may cause interrupt storm because of missed
interrupt acknowledges caused by blocking of intr handler.
The fslsdma device requires sdma_fw, but that's not included in
GENERIC. That firmware is not in the FreeBSD tree at the moment, but
could easily be.
The license for the firmware can be found in the linux firmware repo:
https://git.kernel.org/pub/scm/linux/kernel/git/firmware/linux-firmware.git/commit/?id=3123d78e09d2f815de4d94aa35c07b3c0469c80e
and looks to be a BSD license + no reverse engineer.
We can add this back after the firmware is imported, made a port, or
whose automatic loading can be made to happen.
Reviewed by: imp (with ian finding the license)
PR: 237466
MFC after: 1 week
Warner Losh [Fri, 9 Jul 2021 17:21:18 +0000 (11:21 -0600)]
stand/kmem_zalloc: panic when a M_WAITOK allocation fails
Malloc() might return NULL, in which case we will panic with a NULL
pointer deref. Make it panic when the allocation fails to preserve the
postcondtion that we never return a non-NULL value.
Rather than allocating however much memory userspace asks for we only
allocate enough for a handful of states, and copy to userspace for each
completed row.
We start out with enough space for 16 states (per row), but grow that as
required. In most configurations we expect at most a handful of states
per row (more than that would have other negative effects on packet
processing performance).
Warner Losh [Fri, 9 Jul 2021 03:51:24 +0000 (21:51 -0600)]
awk: Reduce diffs with upstream to almost nothing.
In the merge of 20210215, I left two merge conflicts #if 0'd by mistake
to check later rather than resolve them as part of the merge. This code
turns out to be from the original one-true-awk import and not FreeBSD
specific, so remove them.
Remove a extra definition of HAT.
Remove a stylistic change that also appears to be a mismerge along the
way.
Remove FREEBSD-upgrade. Nobody has updated it since the original 2007
cvs import. It talks about old CVS branches that never made it into svn,
let alone git. New imports will follow the standard practices now, so
there's nothing left to document.
Move README to README.md and copy the README.md from upstream over.
This leaves just the $FreeBSD$ lines (which remain for the stable/12
merge) and the strcoll part of ru@'s r201989/d98dd8e5f94c as the only
diffs with upstream. FreeBSD also still has its own man page, which I
don't plan on changing. Once this commit is merged to stable/12, I plan
no further merges to stable/12. Sometime after that I'll remove the
$FreeBSD$ lines to reduce the diffs even more (though i want to make
sure plans won't change first). I also plan to talk to upstream about
this change...
Rick Macklem [Fri, 9 Jul 2021 00:39:04 +0000 (17:39 -0700)]
nfscl: Add a Linux compatible "nconnect" mount option
Linux has had an "nconnect" NFS mount option for some time.
It specifies that N (up to 16) TCP connections are to created for a mount,
instead of just one TCP connection.
A discussion on freebsd-net@ indicated that this could improve
client<-->server network bandwidth, if either the client or server
have one of the following:
- multiple network ports aggregated to-gether with lagg/lacp.
- a fast NIC that is using multiple queues
It does result in using more IP port#s and might increase server
peak load for a client.
One difference from the Linux implementation is that this implementation
uses the first TCP connection for all RPCs composed of small messages
and uses the additional TCP connections for RPCs that normally have
large messages (Read/Readdir/Write). The Linux implementation spreads
all RPCs across all TCP connections in a round robin fashion, whereas
this implementation spreads Read/Readdir/Write across the additional
TCP connections in a round robin fashion.
Warner Losh [Thu, 8 Jul 2021 23:55:20 +0000 (17:55 -0600)]
nanobsd: remove sparc64 embedded example
Remove the qemu sparc64 example. It was only ever compile tested since
qemu had issues booting FreeBSD/sparc64. Also remove obsolete info about
armv5 configs removed long ago.
Warner Losh [Thu, 8 Jul 2021 19:53:18 +0000 (13:53 -0600)]
devmatch: don't announce autoloading so much
devmatch rc script would announce it was loading a module multiple
times. It used kldload -n so it really wasn't loading it that many
times, but the message is confusing. Use kldstat to see if we need to
load the module before saying we do. This fixes the vast majority of the
problems. It may be possible to race devmatch with a user invocation and
devd, though quite hard. In that case we'll announce things twice, but
still only load it once. No attempt is made to fix this.
Warner Losh [Thu, 8 Jul 2021 19:44:21 +0000 (13:44 -0600)]
devmatch: Be tolerant of .ko being present.
We document that we did not need .ko on the module names in
devmatch_blocklist, but we really needed them. Keep the documentation
the same, but strip the .ko when we need to use the names so you can
specify either.
Rather than extending syr827 for syr828 (as initially done in D31103)
switch to the Fairchild Semiconductor Corporation fan53555 implementation
which is in-tree but was not attached to the build. The fan53555
implementation also supports syr827/syr8278 already. [1]
Update NOTES and the arm64 GENERIC configuration for the switch.
syr827 for now stays in the tree but is not used by any
kernel configuration.
Support loading a default pf ruleset in case of invalid pf.conf.
If no pf rules are loaded pf will pass/allow all traffic, assuming the
kernel is compiled without PF_DEFAULT_TO_DROP, as is the case in
GENERIC.
In other words: if there's a typo in the main pf_rules we would allow
all traffic. The new default rules minimise the impact of this.
If $pf_program (i.e. pfctl) fails to set $pf_fules and
$pf_fallback_rules_enable is YES we will load $pf_fallback_rules_file if
set, or $pf_fallback_rules.
$pf_fallback_rules can include multiple rules, for example to permit
traffic on a management interface.
$pf_fallback_rules_enable defaults to "NO", preserving historic behaviour.
Randall Stewart [Thu, 8 Jul 2021 11:06:58 +0000 (07:06 -0400)]
tcp: Fix 32 bit platform breakage
This fixes the incorrect use of a sysctl add to u64. It
was for a useconds time, but on 32 bit platforms its
not a u64. Instead use the long directive.
Reviewed by: tuexen
Sponsored by: Netflix Inc.
Differential Revision: https://reviews.freebsd.org/D31107
Happily this wasn't a real bug, because pf_killstates() never fails, but
we should check the return value anyway, in case it does ever start
returning errors.
pidx is never NULL, and is used unconditionally later on in the
function.
Add an assertion, as documentation for the requirement to provide an idx
pointer.
Michal Meloun [Fri, 2 Jul 2021 18:17:36 +0000 (20:17 +0200)]
intrng: Releasing interrupt source should clear interrupt table full state.
The first release of an interrupt in a situation where the interrupt table
is full should schedule a full table check the next time an interrupt is
allocated. A full check is necessary to ensure maximum separation between
the order of allocation and the order of release.
Import the latest bsd-features branch of the one-true-awk upstream:
o Move to bison for $YACC
o Set close-on-exec flag for file and pipe redirects that aren't std*
o lots of little fixes to modernize ocde base
o free sval member before setting it
o fix a bug where a{0,3} could match aaaa
o pull in systime and strftime from NetBSD awk
o pull in fixes from {Net,Free,Open}BSD (normalized our code with them)
o add BSD extensions and, or, xor, compl, lsheift, rshift (mostly a nop)
Also revert a few of the trivial FreeBSD changes that were done slightly
differently in the upstreaming process. Also, our PR database may have
been mined by upstream for these fixes, and Mikolaj Golub may deserve
credit for some of the fixes in this update.
Import the latest bsd-features branch of the one-true-awk upstream:
o Move to bison for $YACC
o Set close-on-exec flag for file and pipe redirects that aren't std*
o lots of little fixes to modernize ocde base
o free sval member before setting it
o fix a bug where a{0,3} could match aaaa
o pull in systime and strftime from NetBSD awk
o pull in fixes from {Net,Free,Open}BSD
o add BSD extensions and, or, xor, compl, lsheift, rshift
devmatch loads a number of things automatically. Allow the list of
things to load to happen first in case those drivers affect what would
be loaded. Normally, this will produce the same results, but there's
some special cases that may not when drivers are loaded that report
other drivers missing, like virtio_pci.
Andrew Gallatin [Wed, 7 Jul 2021 19:05:49 +0000 (15:05 -0400)]
ktls: make ktls_disable_ifnet() shim static
A user reported that when compiling without KERN_TLS, and
with -O0, the kernel failed to link due to ktls_disable_ifnet()
being undefined. Making the shim static works around this issue.
Alan Cox [Wed, 7 Jul 2021 18:16:03 +0000 (13:16 -0500)]
arm64: Simplify fcmpset failure in pmap_promote_l2()
When the initial fcmpset in pmap_promote_l2() fails, there is no need
to repeat the check for the physical address being 2MB aligned or for
the accessed bit being set. While the pmap is locked the hardware can
only transition the accessed bit from 0 to 1, and we have already
determined that it is 1 when the fcmpset fails.
Martin Matuska [Wed, 7 Jul 2021 17:45:52 +0000 (19:45 +0200)]
zfs: attach zpool_influxdb to build
From the zpool_influxdb.8 manual page:
zpool_influxdb produces InfluxDB-line-protocol-compatible metrics from
zpools. Like the zpool command, zpool_influxdb reads the current pool
status and statistics. Unlike the zpool command which is intended for
humans, zpool_influxdb formats the output in the InfluxDB line protocol.
The expected use is as a plugin to a metrics collector or aggregator,
such as Telegraf.
zpool_influxdb is installed into /usr/libexec/zfs/
Differential revision: https://reviews.freebsd.org/D31094
MFC after: 3 days
Randall Stewart [Tue, 6 Jul 2021 19:23:22 +0000 (15:23 -0400)]
tcp: HPTS performance enhancements
HPTS drives both rack and bbr, and yet there have been many complaints
about performance. This bit of work restructures hpts to help reduce CPU
overhead. It does this by now instead of relying on the timer/callout to
drive it instead use user return from a system call as well as lro flushes
to drive hpts. The timer becomes a backstop that dynamically adjusts
based on how "late" we are.
Do not call FreeBSD-ABI specific code for all ABIs
Use sysentvec hooks to only call umtx_thread_exit/umtx_exec, which handle
robust mutexes, for native FreeBSD ABI. Similarly, there is no sense
in calling sigfastblock_clear() for non-native ABIs.
Alexander Motin [Wed, 7 Jul 2021 00:39:23 +0000 (20:39 -0400)]
FreeBSD: Hardcode abd_chunk_size to PAGE_SIZE
It makes no sense to set it below PAGE_SIZE, since it increases all
overheads and makes returning memory to OS problematic. It makes no
sense to set it above PAGE_SIZE, since such allocations and especially
frees are too expensive and cause KVA fragmentation to benefit from
fewer chunks. After that it makes no sense to keep more complicated
math here.
What may have sense though is just a tunable border between linear and
scatter ABDs, previously also controlled by this tunable. Retain that
functionality by taking abd_scatter_min_size tunable from Linux, just
with different default value.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Brian Atkinson <batkinson@lanl.gov> Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Closes #12328
Alexander Motin [Tue, 6 Jul 2021 21:38:00 +0000 (17:38 -0400)]
Move gethrtime() calls out of vdev queue lock
This dramatically reduces the lock contention on systems with slower
(non-TSC) timecounters. With TSC the difference is minimal, but since
this lock is pretty congested, any improvement counts. Plus I don't
see any reason to do it under the lock other than the latency of the
lock itself, which this change actually reduces.
Reviewed-by: Matthew Ahrens <mahrens@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored-By: iXsystems, Inc.
Closes #12281
MMCCAM: fix a panic after cam_sim_alloc_dev() removal in sdhci.c
During the removal of cam_sim_alloc_dev() in aeb04e88f51a706ef4b6a380bf5e82d15203fb6a for sdhci.c and the
follow-up build-fix in a72af82e3169fcacfedf9047120679300a4296f8
slot->dev and slot->bus got mixed up for MMCCAM; slot->dev is
only used in the !MMCCAM case so is uninitialised here leading to
a panic; switch back to slot->bus to return to the status quo.
Reviewed by: imp (ack on arm@)
X-Differential Revision: https://reviews.freebsd.org/D30857
Udev rules: remove zvol compat symlinks (without the leading zvol/)
This is a potentially arguable change, because it removes some
compatibility cruft that certain systems or people may have come to rely
on (either a very long time ago, or unwisely in recent times).
On the other hand, it's been literally over a decade since OpenZFS
switched to the strategy of using opaque numbered /dev/zd* device nodes,
with the canonical zvol access path being a directory tree of symlinks
created by udev rules inside /dev/zvol/*. (See #102.) Even at the time,
the /dev/* scheme was labeled as being for "compatibility".
This commit removes the second tree of symlinks located directly at
/dev/*, under the assumption that anybody with any sense has been using
the intended /dev/zvol/* path for a very very long time now.
(The more I think about this, the more I anticipate that some large
fraction of people will have been blissfully unaware that the intention
has been for them to use the /dev/zvol/* tree all along, and they will
have come to rely upon the /dev/* tree simply because it's been there
this whole time despite being a compat thing.)
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Pavel Zakharov <pavel.zakharov@delphix.com> Reviewed-by: Neal Gompa <ngompa@datto.com> Signed-off-by: Justin Gottula <justin@jgottula.com>
Closes #12303
Randall Stewart [Tue, 6 Jul 2021 14:36:14 +0000 (10:36 -0400)]
tcp: Address goodput and TLP edge cases.
There are several cases where we make a goodput measurement and we are running
out of data when we decide to make the measurement. In reality we should not make
such a measurement if there is no chance we can have "enough" data. There is also
some corner case TLP's that end up not registering as a TLP like they should, we
fix this by pushing the doing_tlp setup to the actual timeout that knows it did
a TLP. This makes it so we always have the appropriate flag on the sendmap
indicating a TLP being done as well as count correctly so we make no more
that two TLP's.
In addressing the goodput lets also add a "quality" metric that can be viewed via
blackbox logs so that a casual observer does not have to figure out how good
of a measurement it is. This is needed due to the fact that we may still make
a measurement that is of a poorer quality as we run out of data but still have
a minimal amount of data to make a measurement.
Reviewed by: tuexen
Sponsored by: Netflix Inc.
Differential Revision: https://reviews.freebsd.org/D31076
Elliott Mitchell [Thu, 13 May 2021 06:58:00 +0000 (03:58 -0300)]
etc/ttys: merge ttys file down to single file
The tty lists were already pretty similar and there hadn't been any real
need for them to remain distinct for some time. As such, merge to a
single file.
The RISC-V console is preserved. For systems where it doesn't exist, its
presence in /etc/ttys is harmless. The uncommented version of the
ttyv8/XDM line from ttys.amd64 was the one chosen.
Andrew Gallatin [Tue, 6 Jul 2021 14:17:33 +0000 (10:17 -0400)]
ktls: auto-disable ifnet (inline hw) kTLS
Ifnet (inline) hw kTLS NICs typically keep state within
a TLS record, so that when transmitting in-order,
they can continue encryption on each segment sent without
DMA'ing extra state from the host.
This breaks down when transmits are out of order (eg,
TCP retransmits). In this case, the NIC must re-DMA
the entire TLS record up to and including the segment
being retransmitted. This means that when re-transmitting
the last 1448 byte segment of a TLS record, the NIC will
have to re-DMA the entire 16KB TLS record. This can lead
to the NIC running out of PCIe bus bandwidth well before
it saturates the network link if a lot of TCP connections have
a high retransmoit rate.
This change introduces a new sysctl (kern.ipc.tls.ifnet_max_rexmit_pct),
where TCP connections with higher retransmit rate will be
switched to SW kTLS so as to conserve PCIe bandwidth.
Alex Richardson [Tue, 6 Jul 2021 11:18:29 +0000 (12:18 +0100)]
Fix building rescue/rescue when sanitizers are enabled
We have to ensure that we don't link any instrumented object files
into rescue as it is a static executable and static binaries can't
use the sanitizer runtime.
Alex Richardson [Tue, 6 Jul 2021 11:16:40 +0000 (12:16 +0100)]
usr.bin/diff: fix UBSan error in readhash
UBSan complains about the `sum = sum * 127 + chrtran(t);` line below since
that can overflow an `int`. Use `unsigned int` instead to ensure that
overflow is well-defined.
Alex Richardson [Tue, 6 Jul 2021 10:02:44 +0000 (11:02 +0100)]
Import Arm Optimized Routines v21.02
This is the new replacement for the existing cortex-strings code which will
be replaced in a follow-up commit.
We should also be able to use some of the math functions to allow the
tests to pass on AArch64 (and other architectures) instead of just x86.
We might also be able to reuse some of the tests for the kyua testsuite.
Imported using
```
curl -L https://github.com/ARM-software/optimized-routines/tarball/e823e3abf5f89ecba58a10fc0fd82c13d9984b6b | tar --strip-components=1 -xvzf -
git add .
```
Alex Richardson [Tue, 6 Jul 2021 09:50:05 +0000 (10:50 +0100)]
usr.bin/login: send errors to console if syslog isn't running
I was debugging why login(1) wasn't working as expected on a minimal
MFS_ROOT disk image. This image doesn't have syslogd running so the
warnings were lost and I had to use GDB to find out why login(1) was
failing (missing PAM libraries) instead of being able to see it in
the console output.
Alex Richardson [Mon, 5 Jul 2021 13:32:48 +0000 (14:32 +0100)]
usr.bin/sort: Avoid UBSan errors
UBSan complains about out-of-bounds accesses for zero-length arrays. To
avoid this we can use flexible array members. However, the C standard does
not allow for structures that only contain flexible array members, so we
move the length parameters into that structure too.
Before UMA CCBs, all CCBs were of the same size, and could
be trivially copied using bcopy(9). Now we have to preserve
alloc_flags, otherwise we might end up attempting to free
stack-allocated CCB to UMA; we also need to take CCB size
into account.
This fixes kernel panic which would occur when trying to access
a stopped (as in, SCSI START STOP, also "ctladm stop") SCSI device.
Reported By: Gary Jennejohn <gljennjohn@gmail.com>
Tested By: Gary Jennejohn <gljennjohn@gmail.com>
Reviewed By: imp
Sponsored by: NetApp, Inc.
Sponsored by: Klara, Inc.
Differential Revision: https://reviews.freebsd.org/D31054
Alexander Motin [Tue, 6 Jul 2021 02:19:48 +0000 (22:19 -0400)]
nvme(4): Report NPWA before NPWG as stripesize.
New Samsung 980 SSDs report Namespace Preferred Write Alignment of
8 (4KB) and Namespace Preferred Write Granularity of 32 (16KB).
My quick tests show that 16KB is a minimal sequential write size
when the SSD reaches peak IOPS, so writing much less is very slow.
But writing slightly less or slightly more does not change much,
so it seems not so much a size granularity as minimum I/O size.
Thinking about different stripesize consumers:
- Partition alignment should be based on NPWA by definition.
- ZFS ashift in part of forcing alignment of all I/Os should also
be based on NPWA. In part of forcing size granularity, if really
needed, it may be set to NPWG, but too big value can make ZFS too
space-inefficient, and the 16KB is actually the biggest supported
value there now.
- ZFS recordsize/volblocksize could potentially be tuned up toward
NPWG to work as I/O size granularity, but enabled compression makes
it too fuzzy. And those are normally user-configurable things.
- ZFS I/O aggregation code could definitely use Optimal Write Size
value and may be NPWG, but we don't have fields in GEOM now to report
the minimal and optimal I/O sizes, and even maximal is not reported
outside GEOM DISK to be used by ZFS.
Alan Cox [Sun, 4 Jul 2021 05:20:42 +0000 (00:20 -0500)]
On a failed fcmpset don't pointlessly repeat tests
In a few places, on a failed compare-and-set, both the amd64 pmap and
the arm64 pmap repeat tests on bits that won't change state while the
pmap is locked. Eliminate some of these unnecessary tests.
geom_label: Remove an old sysinstall(8) workaround
We removed sysinstall(8) back in 2011, so this workaround should be long
since unnecessary. This workaround can end up breaking cases that are
hit in the real world, such as dd'ing a small pre-built disk image to a
large partition that you intend to grow on first boot and uses a UFS
disk label for / in its /etc/fstab (as the only reliable thing a raw UFS
image can reference).
rman: Remove an outdated comment that no longer applies
Since commit 2dd1bdf1834c in 2016 the r_start and r_end fields have been
rman_res_t, which was briefly unsigned long, but commit da1b038af9f9
changed the typedef to be uintmax_t instead. C99 is also something we
assume these days.
When calling file_findfile with only a type it returns
the first file matching the type. But in fdt_apply_overlays we
then iterate on the next files and try loading them as dtb overlays.
Fix this by checking the type one more time.
Sponsored by: Diablotin Systems
Reported by: Mark Millard <marklmi@yahoo.com>
pf: allow table stats clearing and reading with ruleset rlock
Instead serialize against these operations with a dedicated lock.
Prior to the change, When pushing 17 mln pps of traffic, calling
DIOCRGETTSTATS in a loop would restrict throughput to about 7 mln. With
the change there is no slowdown.
Creating tables and zeroing their counters induces excessive IPIs (14
per table), which in turns kills single- and multi-threaded performance.
Work around the problem by extending per-CPU counters with a general
counter populated on "zeroing" requests -- it stores the currently found
sum. Then requests to report the current value are the sum of per-CPU
counters subtracted by the saved value.
Sample timings when loading a config with 100k tables on a 104-way box:
stock:
pfctl -f tables100000.conf 0.39s user 69.37s system 99% cpu 1:09.76 total
pfctl -f tables100000.conf 0.40s user 68.14s system 99% cpu 1:08.54 total
patched:
pfctl -f tables100000.conf 0.35s user 6.41s system 99% cpu 6.771 total
pfctl -f tables100000.conf 0.48s user 6.47s system 99% cpu 6.949 total
strscpy copies the src string, or as much of it as fits, into the dst
buffer. The dst buffer is always NUL terminated, unless it's zero-sized.
strscpy returns the number of characters copied (not including the
trailing NUL) or -E2BIG if len is 0 or src was truncated.
Currently drm-kmod replaces strscpy with strncpy that is not quite
correct as strncpy does not NUL-terminate truncated strings and returns
different values on exit.