Fetch the sigfastblock value in syscalls that wait for signals
We have seen several cases of processes which have become "stuck" in
kern_sigsuspend(). When this occurs, the kernel's td_sigblock_val
is set to 0x10 (one block outstanding) and the userspace copy of the
word is set to 0 (unblocked). Because the kernel's cached value
shows that signals are blocked, kern_sigsuspend() blocks almost all
signals, which means the process hangs indefinitely in sigsuspend().
It is not entirely clear what is causing this condition to occur.
However, it seems to make sense to add some protection against this
case by fetching the latest sigfastblock value from userspace for
syscalls which will sleep waiting for signals. Here, the change is
applied to kern_sigsuspend() and kern_sigtimedwait().
John Baldwin [Fri, 12 Mar 2021 17:47:31 +0000 (09:47 -0800)]
amd64: Cleanups to setting TLS registers for Linux binaries.
- Use update_pcb_bases() when updating FS or GS base addresses to
permit use of FSBASE and GSBASE in Linux processes. This also sets
PCB_FULL_IRET. linux32 was setting PCB_32BIT which should be a
no-op (exec sets it).
- Remove write-only variables to construct unused segment descriptors
for linux32.
John Baldwin [Fri, 12 Mar 2021 17:45:18 +0000 (09:45 -0800)]
amd64: Only update fsbase/gsbase in pcb for curthread.
Before the pcb is copied to the new thread during cpu_fork() and
cpu_copy_thread(), the kernel re-reads the current register values in
case they are stale. This is done by setting PCB_FULL_IRET in
pcb_flags.
This works fine for user threads, but the creation of kernel processes
and kernel threads do not follow the normal synchronization rules for
pcb_flags. Specifically, new kernel processes are always forked from
thread0, not from curthread, so adjusting pcb_flags via a simple
instruction without the LOCK prefix can race with thread0 running on
another CPU. Similarly, kthread_add() clones from the first thread in
the relevant kernel process, not from curthread. In practice, Netflix
encountered a panic where the pcb_flags in the first kthread of the
KTLS process were trashed due to update_pcb_bases() in
cpu_copy_thread() running from thread0 to create one of the other KTLS
threads racing with the first KTLS kthread calling fpu_kern_thread()
on another CPU. In the panicking case, the write to update pcb_flags
in fpu_kern_thread() was lost triggering an "Unregistered use of FPU
in kernel" panic when the first KTLS kthread later tried to use the
FPU.
Caleb St. John [Fri, 26 Mar 2021 18:00:14 +0000 (14:00 -0400)]
rpc.lockd: Unconditionally close fds as daemon
When lockd is configured with a debug level of > 0 and foreground == 0,
the process is daemonized with a truth noclose argument to daemon().
This doesn't seem to be the desired behavior because that prevents
stdout and stderr from being closed, however, stdout and stderr aren't
used anywhere else. Furthermore, the man pages state that with a higher
debug level it will use the syslog facilities to do so.
Submitted by: Caleb St. John
Discussed with: rmacklem
MFC after: 3 days
Sponsored by: iXsystems, Inc.
Differential Revision: https://reviews.freebsd.org/D29415
Caleb St. John [Wed, 24 Mar 2021 20:33:41 +0000 (16:33 -0400)]
align nfsdumpstate column output
There are scenarios where an NFS client will mount an NFSv4 export
without specifying a callback address.
When running nfsdumpstate under this circumstance, the column output is
shifted incorrectly which places the "ClientID" value underneath the
"Clientaddr" column.
This diff is a small cosmetic change that prints a blank in the
"Clientaddr" column and ensures the data for the columns are aligned
appropriately.
Submitted by: Caleb St. John
Reviewed by: sef (previous version)
MFC after: 3 days
Sponsored by: iXsystems, Inc.
Differential Revision: https://reviews.freebsd.org/D18958
Wei Hu [Fri, 12 Mar 2021 04:35:16 +0000 (04:35 +0000)]
Hyper-V: hn: Enable vSwitch RSC support in hn netvsc driver
Receive Segment Coalescing (RSC) in the vSwitch is a feature available in
Windows Server 2019 hosts and later. It reduces the per packet processing
overhead by coalescing multiple TCP segments when possible. This happens
mostly when TCP traffics are among different guests on same host.
This patch adds netvsc driver support for this feature.
The patch also updates NVS version to 6.1 as needed for RSC
enablement.
MFC after: 2 weeks
Sponsored by: Microsoft
Differential Revision: https://reviews.freebsd.org/D29075
Wei Hu [Wed, 24 Feb 2021 05:07:46 +0000 (05:07 +0000)]
Hyper-V: hn: Store host hash value in flowid
When rx packet contains hash value sent from host, store it in
the mbuf's flowid field so when the same mbuf is on the tx path,
the hash value can be used by the host to determine the outgoing
network queue.
Zero `struct weightened_nhop` fields in nhgrp_get_addition_group().
`struct weightened_nhop` has spare 32bit between the fields due to
the alignment (on amd64).
Not zeroing these spare bits results in duplicating nhop groups
in the kernel due to the way how comparison works.
Mark Johnston [Thu, 25 Mar 2021 21:55:20 +0000 (17:55 -0400)]
accept_filter: Fix filter parameter handling
For filters which implement accf_create, the setsockopt(2) handler
caches the filter name in the socket, but it also incorrectly frees the
buffer containing the copy, leaving a dangling pointer. Note that no
accept filters provided in the base system are susceptible to this, as
they don't implement accf_create.
Reported by: Alexey Kulaev <alex.qart@gmail.com>
Discussed with: emaste
Security: kernel use-after-free
MFC after: 3 days
Sponsored by: The FreeBSD Foundation
Mark Johnston [Tue, 23 Mar 2021 13:38:59 +0000 (09:38 -0400)]
pf: Handle unmapped mbufs when computing checksums
PR: 254419
Reviewed by: gallatin, kp
Tested by: Igor A. Valkov <viaprog@gmail.com>
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D29378
Rick Macklem [Tue, 9 Mar 2021 00:08:02 +0000 (16:08 -0800)]
mountd(8): generate a syslog message when the "V4:" line is missing
Daniel reported that NFSv4 mounts were not working despite having
set "nfsv4_server_enable=YES" in /etc/rc.conf. Mountd was logging a
message that there was no /etc/exports file.
He noted that creating a /etc/exports file with a "V4:" line in it
was needed make NFSv4 mounts work.
At least one "V4:" line in one of the exports(5) file(s) is needed to
make NFSv4 mounts work. This patch fixes mountd.c so that it logs a
message indicting that there is no "V4:" line in any exports(5)
file when NFSv4 mounts are enabled.
To avoid this message being generated erroneously, /etc/rc.d/mountd
is updated to make sure vfs.nfsd.server_max_nfsvers is properly set
before mountd(8) is started.
Emmanuel Vadot [Sat, 27 Mar 2021 11:04:51 +0000 (12:04 +0100)]
release: amd64: Fix ISO/USB hybrid image
Recent mkimg changes forces to have partitions given in explicit order.
This is so we can have the first partition starting at a specific offset
and the next ones starting after without having to specify an offset.
Switch the partition in the mkisoimage.sh script so the first one created
is the isoboot one.
PR: 254490
Reported by: Michael Dexter <editor@callfortesting.org
Tested by: Vincent Milum Jr <freebsd@darkain.com>
MFC after: Right now
Jessica Clarke [Sat, 20 Mar 2021 17:58:10 +0000 (17:58 +0000)]
elftoolchain: Support building on Arm-based Macs
Currently macOS and DragonFlyBSD get their own special case and only
handle x86. Since all the FreeBSD cases should be general enough for
macOS and DragonFlyBSD (and the x86 ones are identical to the existing
ones) we can just delete the special cases and reuse the FreeBSD ones.
Note that upstream has since removed all the architecture-specific
checks in this file, with the only code relevant to us being an
endianness check that uses the generic compiler-provided macros. Thus
this patch will not be upstreamed, and will be dropped in a future
vendor import.
Jessica Clarke [Sat, 20 Mar 2021 13:00:34 +0000 (13:00 +0000)]
tools/build: Improve host-symlinks failure mode
Since set -e is enabled by sys.mk, if the tool cannot be found in PATH
then the entire shell command line fails, causing us to not print the
error message below and instead silently (due to the @) fail, only
getting the usual "Error code 1" print from bmake. Thus, provide a dummy
default that will never exist (the same as is used by meta2deps.sh) if
which fails so that we get the error message as intended.
D Scott Phillips [Thu, 18 Mar 2021 16:08:52 +0000 (00:08 +0800)]
bhyve: support relocating fbuf and passthru data BARs
We want to allow the UEFI firmware to enumerate and assign
addresses to PCI devices so we can boot from NVMe[1]. Address
assignment of PCI BARs is properly handled by the PCI emulation
code in general, but a few specific cases need additional support.
fbuf and passthru map additional objects into the guest physical
address space and so need to handle address updates. Here we add a
callback to emulated PCI devices to inform them of a BAR
configuration change. fbuf and passthru then watch for these BAR
changes and relocate the frame buffer memory segment and passthru
device mmio area respectively.
We also add new VM_MUNMAP_MEMSEG and VM_UNMAP_PPTDEV_MMIO ioctls
to vmm(4) to facilitate the unmapping needed for addres updates.
[1]: https://github.com/freebsd/uefi-edk2/pull/9/
Originally by: scottph
Sponsored by: Intel Corporation
Reviewed by: grehan
Approved by: philip (mentor)
Differential Revision: https://reviews.freebsd.org/D24066
Plug nexthop group refcount leak.
In case with batch route delete via rib_walk_del(), when
some paths from the multipath route gets deleted, old
multipath group were not freed.
Robert Watson [Sun, 21 Mar 2021 00:01:54 +0000 (00:01 +0000)]
Tune DTrace 'aframes' for the FBT and profile providers on arm64.
In both cases, too few frames were trimmed, leading to exception handling
or DTrace internals being exposed in stack traces exposed by D's stack()
primitive.
Reviewed by: emaste, andrew
Differential Revision: https://reviews.freebsd.org/D29356
Lawrence Stewart [Wed, 24 Mar 2021 04:25:49 +0000 (15:25 +1100)]
random(9): Restore historical [0,2^31-1] output range and related man documention.
Commit SVN r364219 / Git 8a0edc914ffd changed random(9) to be a shim around
prng32(9) and inadvertently caused random(9) to begin returning numbers in the
range [0,2^32-1] instead of [0,2^31-1], where the latter has been the documented
range for decades.
The increased output range has been identified as the source of numerous bugs in
code written against the historical output range e.g. ipfw "prob" rules and
stats(3) are known to be affected, and a non-exhaustive audit of the tree
identified other random(9) consumers which are also likely affected.
As random(9) is deprecated and slated for eventual removal in 14.0, consumers
should gradually be audited and migrated to prng(9).
On FreeBSD/arm fill_fpregs, fill_dbregs are stubs that zero the reg
struct and return success. set_fpregs and set_dbregs do nothing and
return success.
Provide the same implementation for arm64 COMPAT_FREEBSD32.
Reviewed by: andrew
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D29314
Mark Johnston [Sun, 21 Mar 2021 18:18:10 +0000 (14:18 -0400)]
rtsold: Fix validation of RDNSS options
The header specifies the size of the option in multiples of eight bytes.
The option consists of an eight-byte header followed by one or more IPv6
addresses, so the option is invalid if the size is not equal to 1+2n for
some n>0. Check this.
The bug can cause random stack data to be formatted as an IPv6 address
and passed to resolvconf(8), but a host able to trigger the bug may also
specify arbitrary addresses this way.
Reported by: Q C <cq674350529@gmail.com>
Sponsored by: The FreeBSD Foundation
wpa: import fix for P2P provision discovery processing vulnerability
Latest version available from: https://w1.fi/security/2021-1/
Vulnerability
A vulnerability was discovered in how wpa_supplicant processes P2P
(Wi-Fi Direct) provision discovery requests. Under a corner case
condition, an invalid Provision Discovery Request frame could end up
reaching a state where the oldest peer entry needs to be removed. With
a suitably constructed invalid frame, this could result in use
(read+write) of freed memory. This can result in an attacker within
radio range of the device running P2P discovery being able to cause
unexpected behavior, including termination of the wpa_supplicant process
and potentially code execution.
Vulnerable versions/configurations
wpa_supplicant v1.0-v2.9 with CONFIG_P2P build option enabled
An attacker (or a system controlled by the attacker) needs to be within
radio range of the vulnerable system to send a set of suitably
constructed management frames that trigger the corner case to be reached
in the management of the P2P peer table.
Alexander Motin [Wed, 17 Mar 2021 14:30:40 +0000 (10:30 -0400)]
nvme: Replace potentially long DELAY() with pause().
In some cases like broken hardware nvme(4) may wait minutes for
controller response before timeout. Doing so in a tight spin loop
made whole system unresponsive.
- Call vm_object_reference() before vm_map_lookup_done().
- Use vm_mmap_to_errno() to convert vm_map_* return values to errno.
- Fix memory leak of e->obj.
Nathan Whitehorn [Tue, 23 Mar 2021 13:19:42 +0000 (09:19 -0400)]
Fix scripted installs on EFI systems after default mounting of the ESP.
Because the ESP mount point (/boot/efi) is in mtree, tar will attempt to
extract a directory at that point post-mount when the system is installed.
Normally, this is fine, since tar can happily set whatever properties it
wants. For FAT32 file systems, however, like the ESP, tar will attempt to
set mtime on the root directory, which FAT does not support, and tar will
interpret this as a fatal error, breaking the install (see
https://github.com/libarchive/libarchive/issues/1516). This issue would
also break scripted installs on bare-metal POWER8, POWER9, and PS3
systems, as well as some ARM systems.
This patch solves the problem in two ways:
- If stdout is a TTY, use the distextract stage instead of tar, as in
interactive installs. distextract solves this problem internally and
provides a nicer UI to boot, but requires a TTY.
- If stdout is not a TTY, use tar but, as a stopgap for 13.0, exclude
boot/efi from tarball extraction and then add it by hand. This is a
hack, and better solutions (as in the libarchive ticket above) will
obsolete it, but it solves the most common case, leaving only
unattended TTY-less installs on a few tier-2 platforms broken.
In addition, fix a bug with fstab generation uncovered once the tar issue
is fixed that umount(8) can depend on the ordering of lines in fstab in a
way that mount(8) does not. The partition editor now writes out fstab in
mount order, making sure umount (run at the end of scripted, but not
interactive, installs) succeeds.
PR: 254395
Approved by: re (gjb)
Reviewed by: gjb, imp
MFC after: 3 days
Differential Revision: https://reviews.freebsd.org/D29380
Kristof Provost [Thu, 11 Mar 2021 10:37:05 +0000 (11:37 +0100)]
pf: pool/kpool conversion code
stuct pf_pool and struct pf_kpool are different. We should not simply
bcopy() them.
Happily it turns out that their differences were all pointers, and the
userspace provided pointers were overwritten by the kernel, so this did
actually work correctly, but we should fix it anyway.
MFC 6eb60f5b7f7d:
Use the word "LinuxKPI" instead of "Linux compatibility", to not confuse with
user-space Linux compatibility support. No functional change.
John Baldwin [Thu, 11 Feb 2021 21:51:20 +0000 (13:51 -0800)]
iscsi: Mark iSCSI CAM sims as non-pollable.
Previously, iscsi_poll() just panicked. This meant if you got a panic
on a box when using the iSCSI initiator, the attempt to shutdown would
trigger a nested panic and never write out a core. Now, CCB's sent to
iSCSI devices (such as the sychronize-cache request in dashutdown())
just fail with a timeout during a panic shutdown.
John Baldwin [Thu, 11 Feb 2021 21:51:01 +0000 (13:51 -0800)]
cam: Don't permit crashdumps on non-pollable devices.
If a disk's SIM doesn't support polling, then it can't be used to
store crashdumps. Leave d_dump NULL in that case so that dumpon(8)
fails gracefully rather than having dumps fail at crash time.
John Baldwin [Thu, 11 Feb 2021 21:49:43 +0000 (13:49 -0800)]
cam: Permit non-pollable sims.
Some CAM sim drivers do not support polling (notably iscsi(4)).
Rather than using a no-op poll routine that always times out requests,
permit a SIM to set a NULL poll callback. cam_periph_runccb() will
fail polled requests non-pollable sims immediately as if they had
timed out.
Mitchell Horne [Mon, 15 Mar 2021 13:46:03 +0000 (10:46 -0300)]
armv8crypto: note derivation in armv8_crypto_wrap.c
This file inherits some boilerplate and structure from the analogous
file in aesni(4), aesni_wrap.c. Note the derivation and the copyright
holders of that file.
For example, the AES-XTS bits added in 4979620ece984 were ported from
aesni(4).
Mark Johnston [Mon, 8 Mar 2021 17:39:06 +0000 (12:39 -0500)]
iflib: Make if_shared_ctx_t a pointer to const
This structure is shared among multiple instances of a driver, so we
should ensure that it doesn't somehow get treated as if there's a
separate instance per interface. This is especially important for
software-only drivers like wg.
DEVICE_REGISTER() still returns a void * and so the per-driver sctx
structures are not yet defined with the const qualifier.
Reviewed by: gallatin, erj
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D29102
Leandro Lupori [Tue, 9 Mar 2021 15:11:58 +0000 (12:11 -0300)]
ofwfb: fix boot on LE
Some framebuffer properties obtained from the device tree were not being
properly converted to host endian.
Replace OF_getprop calls by OF_getencprop where needed to fix this.
This fixes boot on PowerPC64 LE, when using ofwfb as the system console.
Reviewed by: bdragon
Sponsored by: Eldorado Research Institute (eldorado.org.br)
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D27475
Mike Karels [Thu, 18 Mar 2021 00:19:24 +0000 (19:19 -0500)]
genet: Fix problem with forwarding some TCP/IPv6 packets
TCP/IPv6 packets to be forwarded can be laid out with only the Ethernet
header in the first mbuf, and these packets are lost. There was a
previous hack to pullup ICMPv6 packets with such a layout for the
same reason. Generalize, and pullup any IPv6 packets with only the
Ethernet header in the first mbuf. Possibly this should also include
IPv4, but that situation has not been observed to fail.
PR: 254060
Reported by: denis at h3q.com
MFC after: 3 days
Alan Somers [Sat, 27 Feb 2021 15:59:40 +0000 (08:59 -0700)]
Speed up geom_stats_resync in the presence of many devices
The old code had a O(n) loop, where n is the size of /dev/devstat.
Multiply that by another O(n) loop in devstat_mmap for a total of
O(n^2).
This change adds DIOCGMEDIASIZE support to /dev/devstat so userland can
quickly determine the right amount of memory to map, eliminating the
O(n) loop in userland.
This change decreases the time to run "gstat -bI0.001" with 16,384 md
devices from 29.7s to 4.2s.
Also, fix a memory leak first reported as PR 203097.
Gordon Bergling [Sat, 13 Mar 2021 18:28:26 +0000 (19:28 +0100)]
find(1): Refine the HISTORY within the manual page.
A simple find command appeared in Version 1 AT&T UNIX and was removed in
Version 3 AT&T UNIX. It was rewritten for Version 5 AT&T UNIX and later
be enhanced for the Programmer's Workbench (PWB). These changes were
later incorporated in AT&T UNIX v7.
Kristof Provost [Wed, 10 Mar 2021 21:56:11 +0000 (22:56 +0100)]
pf: Fully remove interrupt events on vnet cleanup
swi_remove() removes the software interrupt handler but does not remove
the associated interrupt event.
This is visible when creating and remove a vnet jail in `procstat -t
12`.
We can remove it manually with intr_event_destroy().
Kristof Provost [Wed, 10 Mar 2021 14:11:59 +0000 (15:11 +0100)]
uma: allow uma_zfree_pcu(..., NULL)
We already allow free(NULL) and uma_zfree(..., NULL). Make
uma_zfree_pcpu(..., NULL) work as well.
This also means that counter_u64_free(NULL) will work.
Kristof Provost [Wed, 3 Mar 2021 10:06:49 +0000 (11:06 +0100)]
altq: Increase maximum number of CBQ and HFSC classes
In some configurations we need more classes than ALTQ supports by
default. Increase the maximum number of classes we allow.
This will only cost us a comparatively trivial amount of memory, so
there's little reason not to do so.
If ever we find we want even more we may want to consider turning these
defines into a tunable, but for now do the easy thing.
shu [Wed, 3 Feb 2021 16:51:45 +0000 (16:51 +0000)]
linux: make timerfd_settime(2) set expirations count to zero
On Linux, read(2) from a timerfd file descriptor returns an unsigned
8-byte integer (uint64_t) containing the number of expirations
that have occurred, if the timer has already expired one or more
times since its settings were last modified using timerfd_settime(),
or since the last successful read(2). That's to say, once we do
a read or call timerfd_settime(), timer fd's expiration count should
be zero. Some Linux applications create timerfd and add it to epoll
with LT mode, when event comes, they do timerfd_settime instead
of read to stop event source from trigger. On FreeBSD,
timerfd_settime(2) didn't set the count to zero, which caused high
CPU utilization.
Mariusz Zaborski [Sat, 13 Mar 2021 11:56:17 +0000 (12:56 +0100)]
zfs: bring back possibility to rewind the checkpoint from
Add parsing of the rewind options.
When I was upstreaming the change [1], I omitted the part where we
detect that the pool should be rewind. When the FreeBSD repo has
synced with the OpenZFS, this part of the code was removed.
Michael Tuexen [Thu, 18 Mar 2021 20:25:47 +0000 (21:25 +0100)]
vtnet: fix TSO for TCP/IPv6
The decision whether a TCP packet is sent over IPv4 or IPv6 was
based on ethertype, which works correctly. In D27926 the criteria
was changed to checking if the CSUM_IP_TSO flag is set in the
csum-flags and then considering it to be TCP/IPv4.
However, the TCP stack sets the flag to CSUM_TSO for IPv4 and IPv6,
where CSUM_TSO is defined as CSUM_IP_TSO|CSUM_IP6_TSO.
Therefore TCP/IPv6 packets gets mis-classified as TCP/IPv4,
which breaks TSO for TCP/IPv6.
This patch bases the check again on the ethertype.
This fix is instantly MFCed.
netmap: fix memory leak in NETMAP_REQ_PORT_INFO_GET
The netmap_ioctl() function has a reference counting bug in case of
NETMAP_REQ_PORT_INFO_GET command. When `hdr->nr_name[0] == '\0'`,
the function does not decrease the refcount of "nmd", which is
increased by netmap_mem_find(), causing a refcount leak.
Reported by: Xiyu Yang <sherllyyang00@gmail.com>
Submitted by: Carl Smith <carl.smith@alliedtelesis.co.nz>
MFC after: 3 days
PR: 254311
Mateusz Guzik [Wed, 17 Mar 2021 21:33:47 +0000 (22:33 +0100)]
vfs: fix vnlru marker handling for filtered/unfiltered cases
The global list has a marker with an invariant that free vnodes are
placed somewhere past that. A caller which performs filtering (like ZFS)
can move said marker all the way to the end, across free vnodes which
don't match. Then a caller which does not perform filtering will fail to
find them. This makes vn_alloc_hard sleep for 1 second instead of
reclaiming, resulting in significant stalls.
Fix the problem by requiring an explicit marker by callers which do
filtering.
As a temporary measure extend vnlru_free to restart if it fails to
reclaim anything.
Big thanks go to the reporter for testing several iterations of the
patch.
Scott Long [Thu, 18 Mar 2021 07:07:56 +0000 (07:07 +0000)]
base: remove if_wg(4) and associated utilities, manpage
After length decisions, we've decided that the if_wg(4) driver and
related work is not yet ready to live in the tree. This driver has
larger security implications than many, and thus will be held to
more scrutiny than other drivers.