Gleb Smirnoff [Thu, 26 Oct 2023 09:59:21 +0000 (02:59 -0700)]
bhyve: fix arguments to ioctl(VMIO_SIOCSIFFLAGS)
ioctl(2) calls with an integer argument shall pass the argument by value,
not by pointer. The ioctl(2) manual page is not very clear about that.
See sys/kern/sys_generic.c:sys_ioctl() near IOC_VOID.
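
As a rough illustration of why the calling convention matters, the sketch below models an integer ioctl argument being decoded straight from the argument slot. It is not the kernel code: decode_int_arg(), pass_by_value() and pass_by_pointer() are hypothetical names invented for this example, and the decoding is a simplified assumption about what happens near IOC_VOID.

```c
#include <stdint.h>

/*
 * Hypothetical model: for an integer-argument ioctl, the value is taken
 * from the argument slot itself, not from memory the slot points to.
 */
int
decode_int_arg(void *data)
{
    return ((int)(intptr_t)data);
}

/* Correct caller: the integer travels by value in the argument slot. */
int
pass_by_value(int flags)
{
    return (decode_int_arg((void *)(intptr_t)flags));
}

/* Buggy caller: the address of the integer is what gets decoded. */
int
pass_by_pointer(int *flags)
{
    return (decode_int_arg(flags));
}
```

With this model, pass_by_value(flags) yields flags back, while pass_by_pointer(&flags) yields a truncated address, which is the kind of mistake the commit corrects.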
Kristof Provost [Fri, 6 Oct 2023 12:20:17 +0000 (14:20 +0200)]
pfctl: fix incorrect mask on dynamic address
A PF rule using an IPv4 address followed by an IPv6 address and then a
dynamic address, e.g. "pass from {192.0.2.1 2001:db8::1} to (pppoe0)",
will have an incorrect /32 mask applied to the dynamic address.
MFC after: 3 weeks
Obtained from: OpenBSD
See also: https://ftp.openbsd.org/pub/OpenBSD/patches/5.6/common/007_pfctl.patch.sig
Sponsored by: Rubicon Communications, LLC ("Netgate")
Event: Oslo Hackathon at Modirum
Brooks Davis [Thu, 26 Oct 2023 20:38:41 +0000 (21:38 +0100)]
libprocstat: improve conditional for 32-bit compat
Include support for translating 32-bit auxv vectors on non-64-bit
platforms that aren't riscv (which has no 32-bit ABI support and
probably never will).
Brooks Davis [Thu, 26 Oct 2023 20:38:41 +0000 (21:38 +0100)]
libprocstat: make sv_name not static
Making this variable static makes is_elf32_sysctl() and callers thread
unsafe.
Use a less absurd length for sv_name. The longest name in the system is
"FreeBSD ELF64 V2" which tips the scales at 16+1 bytes. We'll almost
certainly have other problems if we exceed 32 characters.
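
A minimal sketch of the idea, assuming a hypothetical helper is_elf32_name() that receives the ABI name as a string instead of querying it via sysctl: the buffer is an automatic variable of 32 bytes, so every call (and therefore every thread) works on its own copy. The two ELF32 names compared against are assumptions made for the example.

```c
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

#define SV_NAME_LEN 32  /* assumed size, following the 32-character remark */

/*
 * Hypothetical helper: report whether an ABI name denotes a 32-bit ELF ABI.
 * The buffer lives on the stack, so concurrent callers do not share it.
 */
bool
is_elf32_name(const char *abi)
{
    char sv_name[SV_NAME_LEN];

    /* Copy with truncation; snprintf always NUL-terminates. */
    snprintf(sv_name, sizeof(sv_name), "%s", abi);
    return (strcmp(sv_name, "FreeBSD ELF32") == 0 ||
        strcmp(sv_name, "Linux ELF32") == 0);
}
```

With a static buffer, two threads calling the function at the same time could overwrite each other's name between the copy and the comparison.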
John Baldwin [Fri, 20 Oct 2023 21:52:38 +0000 (14:52 -0700)]
x86 msi: Enable/disable IDT vectors for MSI groups all at once
Unlike MSI-X, when a device uses multiple MSI interrupts, the entire
group of interrupts is enabled/disabled at once in the relevant PCI
config register. Currently, the interrupt code enables the IDT vector
for each MSI interrupt when a handler is first registered. If the PCI
device triggers an MSI interrupt which doesn't yet have a handler,
this can trigger a panic when the Xrsvd ISR executes rather than
treating it as a stray device interrupt.
To fix, enable all the IDT vectors for an MSI group when the first
interrupt handler is configured, and don't disable the IDT vectors
until the last interrupt handler for the group is torn down.
When migrating an MSI group between CPUs, enable/disable the entire
group of IDT vectors if at least one interrupt handler is configured
for the group.
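
The enable-on-first, disable-on-last rule can be sketched with a per-group handler count. This is only an illustration under stated assumptions: struct msi_group, its fields and both functions are hypothetical and do not mirror the actual x86 MSI code.

```c
#include <stdbool.h>

/* Hypothetical per-group bookkeeping. */
struct msi_group {
    int  handlers;         /* handlers currently configured in the group */
    bool vectors_enabled;  /* stands in for "all IDT vectors enabled" */
};

void
msi_group_add_handler(struct msi_group *g)
{
    /* First handler in the group: enable every vector of the group. */
    if (g->handlers++ == 0)
        g->vectors_enabled = true;
}

void
msi_group_remove_handler(struct msi_group *g)
{
    /* Last handler torn down: only now disable the group's vectors. */
    if (--g->handlers == 0)
        g->vectors_enabled = false;
}
```

Under this scheme an interrupt from a group member that has no handler yet still arrives on an enabled vector and can be treated as a stray.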
John Baldwin [Mon, 16 Oct 2023 22:19:07 +0000 (15:19 -0700)]
acpi_pcib: Trust decoded bus range from _CRS over _BBN
Currently if _BBN doesn't match the first bus in the decoded bus range
from _CRS for a Host to PCI bridge, the driver fails to attach as a
defensive measure.
There is now firmware in the field where these do not match, and the
_BBN values are clearly wrong, so rather than failing attach, trust
the range from _CRS over _BBN.
John Baldwin [Mon, 16 Oct 2023 22:17:48 +0000 (15:17 -0700)]
bhyve: Replace many fprintf(stderr, ...) calls with EPRINTLN
EPRINTLN handles newlines appropriately when stdout/stderr have been
reused as the backend for a serial port.
For bhyverun.c itself, the rule this attempts to follow is to use
regular fprintf/perror/warn/err prior to init_pci() (which is when
serial ports are configured) and to switch to EPRINTLN afterwards.
John Baldwin [Fri, 13 Oct 2023 19:26:22 +0000 (12:26 -0700)]
bhyve: Some fwctl simplifications.
- Collapse IDENT_SEND/IDENT_WAIT states down to a single state.
- Remove unused 'len' argument to op_data callback. The value passed
in (total amount of remaining data to receive) didn't seem very useful
and no op_data implementations used it.
John Baldwin [Wed, 11 Oct 2023 21:21:12 +0000 (14:21 -0700)]
riscv: Tidy panic messages for exceptions
- Remove trailing newlines
- Be consistent about the format used to print pointer values
- Print the trap value for access faults (it is the faulting address
if non-zero) and illegal instructions (it is the first N bytes of
the decoded instruction if non-zero)
Jan Bramkamp [Mon, 4 Sep 2023 08:38:25 +0000 (10:38 +0200)]
bhyve: Use VMIO_SIOCSIFFLAGS instead of SIOCGIFFLAGS
Creating an IP socket to invoke the SIOCGIFFLAGS ioctl on is the only
thing preventing bhyve from working inside a bhyve jail with IPv4 and
IPv6 disabled, which restricts the jailed bhyve process to accessing the
host network only via a tap/vmnet device node.
Mark Johnston [Mon, 16 Oct 2023 21:35:07 +0000 (17:35 -0400)]
socket tests: Clean up the MSG_TRUNC regression tests a bit
- Fix style.
- Move test case-specific code out of the shared function and into the
individual test cases.
- Remove unneeded setting of SO_REUSEPORT.
- Avoid unnecessary copying.
- Use ATF_REQUIRE* instead of ATF_CHECK*. The former cause test
execution to stop after a failed assertion, which is what we want.
- Add a test case for AF_LOCAL/SOCK_SEQPACKET sockets.
The current xo_format string is incorrect. This restores the display
format prior to libxo-ification work while also explicitly marking
tv_sec and tv_usec as encoded output only.
Zhenlei Huang [Tue, 17 Oct 2023 07:05:25 +0000 (15:05 +0800)]
x86: Prefer consistent naming for loader tunables
The following loader tunables do have corresponding sysctl MIBs but
with inconsistent naming. That may be for historical reasons. Let's prefer
consistent naming for them so that it will be easier to maintain.
Zhenlei Huang [Fri, 20 Oct 2023 07:31:44 +0000 (15:31 +0800)]
amd64 pmap: Prefer consistent naming for loader tunable
The sysctl knob 'vm.pmap.allow_2m_x_ept' is a loader tunable and has a
public documentation entry in security(7), but it is fetched from the
kernel environment as 'hw.allow_2m_x_ept'. That is inconsistent and
obscure.
As there is a public security advisory, FreeBSD-SA-19:25.mcepsc [1],
people may refer to it and use 'hw.allow_2m_x_ept', so keep the old
name for compatibility.
Zhenlei Huang [Thu, 19 Oct 2023 17:18:25 +0000 (01:18 +0800)]
vmx: Prefer consistent naming for loader tunables
The following loader tunables do have corresponding sysctl MIBs but
with different names. That may be for historical reasons. Let's prefer
consistent naming for them so that it will be easier to read and
maintain.
Kyle Evans [Thu, 12 Oct 2023 02:51:07 +0000 (21:51 -0500)]
freebsd-update: create deep BEs by default
The -r flag to bectl needs to go away, and we need to just do the right
thing. In the meantime, we can apply an -r in freebsd-update as a
minimal fix to stop creating partial backups in these (non-default) deep
BE setups.
Zhenlei Huang [Thu, 19 Oct 2023 17:00:31 +0000 (01:00 +0800)]
pmap: Prefer consistent naming for loader tunable
The sysctl knob 'vm.pmap.pv_entry_max' became a loader tunable in
7ff48af7040f (Allow a specific setting for pv entries) but is fetched
from the system environment as 'vm.pmap.pv_entries'. That is
inconsistent and obscure.
This reverts 36e1b9702e21 (Correct the tunable name in the message).
Zhenlei Huang [Thu, 19 Oct 2023 15:23:33 +0000 (23:23 +0800)]
amd64: Fix two typos of loader tunables
To match the sysctl MIBs and document entries in security(7).
Fixes: 2dec2b4a34b4 amd64: flush L1 data cache on syscall return with an error
Fixes: 17edf152e556 Control for Special Register Buffer Data Sampling mitigation
Reviewed by: kib
MFC after: 1 day
Differential Revision: https://reviews.freebsd.org/D42249
Bojan Novković [Fri, 13 Oct 2023 05:14:36 +0000 (08:14 +0300)]
tty/teken: fix UTF8 sequence validation logic
This patch fixes UTF-8 sequence validation logic in
teken_utf8_bytes_to_codepoint() and fixes fallback behaviour in
ttydisc_rubchar() when an invalid UTF8 sequence is encountered. The code
previously used __bitcount() to extract sequence length information from
the leading byte. However, this assumption breaks for certain code
points that have additional bits set in the first half of the leading
byte (e.g. Cyrillic characters). This led to incorrect behaviour when
deleting those characters using backspaces. The code now checks the
number of consecutive set bits in the leading byte starting from the
MSB, as per RFC 3629.
The use of bitcount() triggered a build error because it couldn't be
located. __bitcount() on the other hand is defined in sys/types.h, which
is included in teken/teken.h.
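
To make the difference concrete, here is a small sketch of deriving the sequence length from the run of set bits at the top of the leading byte. utf8_seq_len() is a hypothetical name for this example and is not the code in teken; it only covers the lead-byte patterns from RFC 3629.

```c
#include <stdint.h>

/*
 * Return the total length in bytes of the UTF-8 sequence introduced by
 * `lead`, or 0 if `lead` cannot start a sequence.
 */
int
utf8_seq_len(uint8_t lead)
{
    int ones = 0;

    /* Count consecutive 1 bits, starting at the most significant bit. */
    while (ones < 8 && (lead & (0x80 >> ones)) != 0)
        ones++;

    switch (ones) {
    case 0:
        return (1);     /* 0xxxxxxx: single-byte (ASCII) */
    case 2:
    case 3:
    case 4:
        return (ones);  /* 110xxxxx, 1110xxxx, 11110xxx */
    default:
        return (0);     /* 10xxxxxx continuation byte or invalid lead */
    }
}
```

For the Cyrillic lead byte 0xD0 (binary 11010000) this returns 2, whereas a plain population count of the same byte gives 3, which is the kind of mismatch described above.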
Bojan Novković [Sat, 7 Oct 2023 18:00:11 +0000 (21:00 +0300)]
tty: fix improper backspace behaviour for UTF8 characters when in canonical mode
This patch adds additional logic in ttydisc_rubchar() to properly handle
backspace behaviour for UTF-8 characters.
Currently, typing in a backspace after a UTF8 character will delete only
one byte from the byte sequence, leaving garbled output in the tty's
output queue. With this change all of the character's bytes are deleted.
This change is only active when the IUTF8 flag is set (see
19054eb6053189144aa962b2ecc1bf5087758a3e "(s)tty: add support for IUTF8
input flag").
The code uses the teken_wcwidth() function to properly handle character
column widths for different code points, and adds the
teken_utf8_bytes_to_codepoint() function that converts a UTF-8 byte
sequence to a codepoint, as specified in RFC3629.
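
The core of deleting a whole character rather than a single byte can be sketched as walking back over continuation bytes. This is a simplified illustration, not the ttydisc_rubchar() logic: utf8_rub_last() is a hypothetical helper, it does no validation, and it ignores column widths.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Return the new length of `buf` after removing its last UTF-8 character.
 * Continuation bytes have the form 10xxxxxx, so bytes are dropped until
 * the byte just removed is not one of those.
 */
size_t
utf8_rub_last(const uint8_t *buf, size_t len)
{
    if (len == 0)
        return (0);
    len--;  /* drop the final byte */
    while (len > 0 && (buf[len] & 0xC0) == 0x80)
        len--;  /* the dropped byte was a continuation byte; keep going */
    return (len);
}
```

For the bytes 'a', 0xD0, 0xB1 the helper returns 1, so both bytes of the Cyrillic character are removed and only 'a' remains.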
Bojan Novković [Sat, 7 Oct 2023 17:59:57 +0000 (20:59 +0300)]
(s)tty: add support for IUTF8 input flag
This patch adds the necessary kernel and stty code to support setting
the IUTF8 flag for ttys. It is the first of two patches that fix
backspace behaviour for UTF-8 encoded characters when in canonical mode.
Mark Johnston [Mon, 16 Oct 2023 20:11:55 +0000 (16:11 -0400)]
ktrace: Handle uio_resid underflow via MSG_TRUNC
When recvmsg(2) is used with MSG_TRUNC on an atomic socket type (DGRAM
or SEQPACKET), soreceive_generic() and uipc_peek_dgram() may
intentionally underflow uio_resid so that userspace can find out how
many bytes it should have asked for.
If this happens, and KTR_GENIO is enabled, ktrgenio() will attempt to
copy in beyond the end of the output buffer's iovec. In general this
will silently cause the ktrace operation to fail since it'll result in
EFAULT from uiomove(). Let's be more careful and make sure not to try
and copy more bytes than we have.
Fixes: be1f485d7d6b ("sockets: add MSG_TRUNC flag handling for recvfrom()/recvmsg().")
Reported by: syzbot+30b4bb0c0bc0f53ac198@syzkaller.appspotmail.com
Reviewed by: kib
MFC after: 3 days
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D42099
Wei Hu [Fri, 20 Oct 2023 08:58:20 +0000 (08:58 +0000)]
Hyper-V: vmbus: check if signaling host is needed in vmbus_rxbr_read
It has been observed that netvsc's send rings can stall on the latest
Azure Boost platforms. This is because the vmbus_rxbr_read() routine
doesn't check whether the host is waiting for more room to put data,
which leads to the host side sleeping forever on this vmbus channel.
The problem was only observed on the latest platform because the host
requests that more buffer ring room be available, which makes the
issue much more likely to occur.
Fix this by adding a check in vmbus_rxbr_read() and signaling the host
in the callers if the check is positive.
Reported by: NetApp
Tested by: whu
Sponsored by: Microsoft
Zhenlei Huang [Thu, 12 Oct 2023 10:14:49 +0000 (18:14 +0800)]
vm_phys: Add corresponding sysctl knob for loader tunable
The loader tunable 'vm.numa.disabled' does not have a corresponding sysctl
MIB entry. Add it so that it can be retrieved, and `sysctl -T` will also
report it correctly.
Zhenlei Huang [Thu, 12 Oct 2023 10:14:49 +0000 (18:14 +0800)]
vm_page: Add corresponding sysctl knob for loader tunable
The loader tunable 'vm.pgcache_zone_max_pcpu' does not have a corresponding
sysctl MIB entry. Add it so that it can be retrieved, and `sysctl -T`
will also report it correctly.
Zhenlei Huang [Thu, 12 Oct 2023 10:14:48 +0000 (18:14 +0800)]
kasan: Add corresponding sysctl knob for loader tunable
The loader tunable 'debug.kasan.disabled' does not have a corresponding
sysctl MIB entry. Add it so that it can be retrieved, and `sysctl -T`
will also report it correctly.
MFC: Remove confDH_PARAMETERS settings in favor of using sendmail's
built-in default which was added in sendmail 8.15.2 (the config
line predates that 8.15.2 feature). This also alleviates the need
for admins to create the DH parameters file if they opt to use
Diffie-Hellman.
POSIX has accepted a proposal[1] to add glibc-compatible ptsname_r. It
indicates an error by returning the error number, rather than returning
-1 and setting errno. Update RETURN VALUES in ptsname_r's man page now
to encourage folks to test that the return value != 0 rather than == -1.
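
A short sketch of the recommended check, assuming a hypothetical wrapper get_pts_name(): comparing the return value against 0 works whether the libc reports failure as -1 with errno set or by returning the error number directly.

```c
#include <stddef.h>
#include <stdlib.h>

/* Declared explicitly in case feature-test macros hide it in <stdlib.h>. */
int ptsname_r(int fd, char *buf, size_t buflen);

/*
 * Hypothetical wrapper: returns 0 on success and a nonzero value on any
 * failure, independent of which error convention ptsname_r() follows.
 */
int
get_pts_name(int fd, char *buf, size_t buflen)
{
    int error;

    error = ptsname_r(fd, buf, buflen);
    if (error != 0)  /* matches both "-1 and errno" and "returns errno" */
        return (error);
    return (0);
}
```

A caller that tested for == -1 instead would treat a positive error number as success and go on to use an unfilled buffer.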
The loader tunables 'net.inet.sctp.tcbhashsize' and 'net.inet.sctp.chunkscale'
are only used during vnet initialization, so it makes no sense to make them
writable tunables.
Validate the values of the loader tunables during vnet initialization, and
reset them to their defaults if invalid to prevent potential kernel panics.
Alan Somers [Wed, 4 Oct 2023 18:48:01 +0000 (12:48 -0600)]
fusefs: sanitize FUSE_READLINK results for embedded NULs
If VOP_READLINK returns a path that contains a NUL, it will trigger an
assertion in vfs_lookup. Sanitize such paths in fusefs, rejecting any
such path and warning the user about the misbehaving server.
Mateusz Guzik [Wed, 11 Oct 2023 09:42:12 +0000 (09:42 +0000)]
vfs: further speed up continuous free vnode recycle
The primary bottleneck *was* vnode_list mtx, which got artificially
worsened due to the following work done with the lock held:
1. the global heavily modified numvnodes counter was being read,
inducing massive cache line ping pong
2. should the value fit limits (which it normally did) there would be an
avoidable write to vn_alloc_cyclecount, which is being read outside
of the lock, once more inducing traffic
But if vn_alloc_cyclecount is 0, which it normally is even when facing
vnode shortage, there is no need to check numvnodes nor set it to 0 again.
Another problem was numvnodes adjustment (which made the locked read
much worse). While it fundamentally does not scale as it is not
distributed in any fashion, it was avoidably slow. When bumping over the
vnode limit, it would be modified with atomics 3 times: inc + dec to
backpedal in vn_alloc, then final inc in vn_alloc_hard.
One can let some slop persist over calls to vnlru_free instead.
In principle each thread in the system could get here and bump it, so a
limit is put in place to keep things sane.
Bench setup same as in prior commits: zfs, 20 separate directory trees
each with 1 million files in total and 20 find(1) processes stating them
in parallel (one per tree).
Total run time (in seconds) goes down as follows:
vnode limit    8388608    400000
before         ~20        ~35
after          ~8         ~15
With this in place the primary bottleneck is now ZFS.
vfs: prefix regular vnlru with a special case for free vnodes
Works around severe performance problems in certain corner cases, see
the commentary added.
Modifying vnlru logic has proven rather error prone in the past and a
release is near, thus take the easy way out and fix it without having to
dig into the current machinery.
Mateusz Guzik [Tue, 10 Oct 2023 16:19:53 +0000 (16:19 +0000)]
vfs: consult freevnodes in vnlru_kick_cond
If the count is high enough there is no point trying to produce more.
Not going there reduces traffic on the vnode_list mtx.
This further shaves total real time in a test mentioned in
74be676d87745eb7 ("vfs: drop one vnode list lock trip during vnlru free
recycle") -- 20 instances of find each creating 1 million vnodes, while
the total limit is set to 400k.