John Baldwin [Thu, 25 Jun 2020 23:57:30 +0000 (23:57 +0000)]
Enter and exit the network epoch for async IPsec callbacks.
When an IPsec packet has been encrypted or decrypted, the next step in
the packet's traversal through the network stack is invoked from a
crypto worker thread, not from the original calling thread. These
threads need to enter the network epoch before passing packets down to
IP output routines or up to transport protocols.
David Bright [Thu, 25 Jun 2020 21:34:43 +0000 (21:34 +0000)]
Add CAP_EVENT to pidfiles.
CAP_EVENT was omitted on pidfiles (in
pidfile_open()). There seems no reason why a process that creates
and writes a pidfile cannot monitor events on that file. This mod adds
the capability.
Mark Johnston [Thu, 25 Jun 2020 20:30:30 +0000 (20:30 +0000)]
Implement an approximation of Linux MADV_DONTNEED semantics.
Linux MADV_DONTNEED is not advisory: it has side effects for anonymous
memory, and some system software depends on that. In particular,
MADV_DONTNEED causes anonymous pages to be discarded. If the mapping is
a private mapping of a named object then subsequent faults are to
repopulate the range from that object, otherwise pages will be
zero-filled. For mappings of non-anonymous objects, Linux MADV_DONTNEED
can be implemented in the same way as our MADV_DONTNEED.
This implementation differs from Linux semantics in its handling of
private mappings, inherited through fork(), of non-anonymous objects.
After applying MADV_DONTNEED, subsequent faults will repopulate the
mapping from the parent object rather than the root of the shadow chain.
PR: 230160
Reviewed by: alc, kib
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D25330
John Baldwin [Thu, 25 Jun 2020 20:17:34 +0000 (20:17 +0000)]
Use zfree() instead of explicit_bzero() and free().
In addition to reducing lines of code, this also ensures that the full
allocation is always zeroed avoiding possible bugs with incorrect
lengths passed to explicit_bzero().
Dimitry Andric [Thu, 25 Jun 2020 20:04:35 +0000 (20:04 +0000)]
Fix copy/paste mistake in kvm_getswapinfo(3)
It seems this manpage was copied from kvm_getloadavg(3), but the
DIAGNOSTICS section was not updated completely. Update the section with
correct information about a return value of -1.
Gordon Tetlow [Thu, 25 Jun 2020 19:35:37 +0000 (19:35 +0000)]
Revert OPENSSL_NO_SSL3_METHOD to keep ABI compatibility.
This define caused a couple of symbols to disappear. To keep ABI
compatibility, we are going to keep the symbols exposed, but leave SSLv3 as
not in the default config (this is what OPENSSL_NO_SSL3 achieves). As a
result, an application can still use SSLv3 if it specifically calls the
SSLv3_method family of APIs.
Doug Moore [Thu, 25 Jun 2020 17:44:14 +0000 (17:44 +0000)]
Eliminate the color field from the RB element struct. Identify the
color of a node (or, really, the color of the link from the parent to
the node) by using one of the last two bits of the parent pointer in
that parent node. Adjust rebalancing methods to account for where
colors are stored, and the fact that null children have a color too.
Adjust RB_PARENT and RB_SET_PARENT to account for this change.
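A minimal user-space sketch of the idea, not the actual tree.h macros: the
two low bits of a node's parent word record the colors of the links to its
left and right children, so a null child still has a color. The struct, the
function names, and the choice of which bit belongs to which child are all
hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical flag bits kept in the low bits of a node's parent word. */
#define LINK_RED_LEFT   ((uintptr_t)1)  /* link to the left child is red */
#define LINK_RED_RIGHT  ((uintptr_t)2)  /* link to the right child is red */
#define LINK_RED_MASK   (LINK_RED_LEFT | LINK_RED_RIGHT)

struct tnode {
    uintptr_t up;               /* parent pointer plus this node's link colors */
    struct tnode *left, *right;
};

/* Nodes must be at least 4-byte aligned so the two low bits are free. */
_Static_assert(_Alignof(struct tnode) >= 4, "need two spare pointer bits");

/* Read the parent pointer, masking off the color bits. */
static struct tnode *tnode_parent(const struct tnode *n)
{
    return (struct tnode *)(n->up & ~LINK_RED_MASK);
}

/* Write the parent pointer, preserving the colors of n's own child links. */
static void tnode_set_parent(struct tnode *n, struct tnode *p)
{
    n->up = (uintptr_t)p | (n->up & LINK_RED_MASK);
}

/* Color of the link from p to its left (right == 0) or right child. */
static int tnode_link_is_red(const struct tnode *p, int right)
{
    return (p->up & (right ? LINK_RED_RIGHT : LINK_RED_LEFT)) != 0;
}

/* Set or clear the color of one child link; the child may be NULL. */
static void tnode_set_link_red(struct tnode *p, int right, int red)
{
    uintptr_t bit = right ? LINK_RED_RIGHT : LINK_RED_LEFT;

    p->up = red ? (p->up | bit) : (p->up & ~bit);
}
```

Because every parent read and write goes through the two accessors, the
color bits never leak into a dereferenced pointer.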
Mark Johnston [Thu, 25 Jun 2020 15:21:21 +0000 (15:21 +0000)]
Call swap_pager_freespace() from vm_object_page_remove().
All vm_object_page_remove() callers, except
linux_invalidate_mapping_pages() in the LinuxKPI, free swap space when
removing a range of pages from an object. The LinuxKPI case appears to
be an unintentional omission that could result in leaked swap blocks, so
unconditionally free swap space in vm_object_page_remove() to protect
against similar bugs in the future.
Reviewed by: alc, kib
Tested by: pho
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D25329
Pawel Biernacki [Thu, 25 Jun 2020 12:35:20 +0000 (12:35 +0000)]
bhyve: allow for automatic destruction on power-off
Introduce -D flag that allows for the VM to be destroyed on guest initiated
power-off by the bhyve(8) process itself.
This is a quality-of-life change that allows for simpler deployments without
the need for bhyvectl --destroy.
Conrad Meyer [Thu, 25 Jun 2020 00:18:42 +0000 (00:18 +0000)]
bhyve(8): For prototyping, reattempt decode in userspace
If userspace has a newer bhyve than the kernel, it may be able to decode
and emulate some instructions vmm.ko is unaware of. In this scenario,
reset decoder state and try again.
Enji Cooper [Wed, 24 Jun 2020 18:51:01 +0000 (18:51 +0000)]
Add `kern.features.witness`
Adding `kern.features.witness` helps expose whether or not the kernel has
`options WITNESS` enabled, so the `feature_present(3)` API can be used
to query whether or not witness(9) is built into the kernel.
This support is helpful with userspace applications (generally speaking,
tests), as it can be queried to determine whether or not tests related
to WITNESS should be run.
Conrad Meyer [Wed, 24 Jun 2020 17:03:42 +0000 (17:03 +0000)]
Add WITH_CLANG_FORMAT option
clang-format is enabled conditional on either WITH_CLANG_EXTRAS or
WITH_CLANG_FORMAT. Some sources in libclang are built conditional on
either rule, and obviously the clang-format binary itself depends on the
rule.
Mitchell Horne [Wed, 24 Jun 2020 15:21:12 +0000 (15:21 +0000)]
Only invalidate the early DTB mapping if it exists
This temporary mapping will become optional. Booting via loader(8)
means that the DTB will have already been copied into the kernel's
staging area, and is therefore covered by the early KVA mappings.
Mitchell Horne [Wed, 24 Jun 2020 15:20:00 +0000 (15:20 +0000)]
Handle load from loader(8)
In locore, we must detect and handle different arguments passed by
loader(8) compared to what we receive when booting directly via SBI
firmware. Currently we receive the hart ID in a0 and a pointer to the
device tree blob in a1. loader(8) provides only a pointer to its
metadata in a0.
The solution to this is to add an additional entry point, _alt_start.
This will be placed first in the .text section, so SBI firmware will
enter here, and jump to the common pagetable setup shortly after. Since
loader(8) understands our ELF kernel, it will enter at the ELF's entry
address, which points to _start. This approach leads to very little
guesswork as to which way we booted.
Fix-up initriscv() to parse the loader's metadata, continuing to use
fake_preload_metadata() in the SBI direct boot case.
Michael Tuexen [Wed, 24 Jun 2020 14:47:51 +0000 (14:47 +0000)]
Fix the accounting for fragmented unordered messages when using
interleaving.
This was reported for the userland stack in
https://bugs.chromium.org/p/oss-fuzz/issues/detail?id=19321
TCP: make after-idle work for transactional sessions.
The use of t_rcvtime as proxy for the last transmission
fails for transactional IO, where the client requests
data before the server can respond with a bulk transfer.
Set aside a dedicated variable to actually track the last
locally sent segment going forward.
Mitchell Horne [Wed, 24 Jun 2020 13:11:19 +0000 (13:11 +0000)]
Enable long double tests on RISC-V
Some of the NetBSD contributed tests are gated behind the
__HAVE_LONG_DOUBLE flag. This flag seems to be defined only for
platforms whose long double is larger than their double. I could not
find this explicitly documented anywhere, but it is implied by the
definitions in NetBSD's sys/arch/${arch}/include/math.h headers, and the
following assertion from the UBSAN code:
    #ifdef __HAVE_LONG_DOUBLE
        long double LD;
        ASSERT(sizeof(LD) > sizeof(uint64_t));
    #endif
RISC-V has 128-bit long doubles, so enable the tests on this platform,
and update the comments to better explain the purpose of this flag.
Marcin Wojtas [Wed, 24 Jun 2020 12:15:27 +0000 (12:15 +0000)]
Fix AccessWidth and BitWidth parsing in SPCR table
The ACPI Specification defines a Generic Address Structure (GAS),
which is used to describe UART controller register layout in the
SPCR table. The driver responsible for parsing it (uart_cpu_acpi)
wrongly associates the Access Size field with the uart_bas's regshft
and the register BitWidth with the regiowidth; according to
the definitions it should be the opposite.
This problem remained hidden most likely because the majority of platforms
use 32-bit registers (BitWidth) which are accessed with the matching
size (Dword). However, on Marvell Armada 8k / Cn913x platforms,
the 32-bit registers should be accessed with Byte granularity, which
exposed the issue.
This patch fixes the above by assigning the proper values and slightly
improving the parsing.
Note that handling of the AccessWidth set to EFI_ACPI_6_0_UNDEFINED is
needed to work around a buggy SPCR table on EC2 x86 "bare metal" instances.
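The corrected mapping can be sketched as follows. This is an illustration
under stated assumptions, not the uart_cpu_acpi code: the struct and function
names are hypothetical, and treating an undefined access size as byte access
is an assumption made here for the example. The access-size encodings
(0 undefined, 1 byte, 2 word, 3 dword, 4 qword) are the ones the ACPI
specification gives for the GAS.

```c
#include <assert.h>
#include <stdint.h>

/* GAS Access Size encodings from the ACPI specification. */
enum gas_access_size {
    GAS_UNDEFINED = 0,
    GAS_BYTE = 1,
    GAS_WORD = 2,
    GAS_DWORD = 3,
    GAS_QWORD = 4
};

/* Hypothetical stand-in for the two uart_bas fields discussed above. */
struct bas_sketch {
    unsigned regshft;     /* log2 of the register stride in bytes */
    unsigned regiowidth;  /* width of a single access in bytes */
};

/* Returns 0 on success, -1 if either field holds an unsupported value. */
static int gas_to_bas(uint8_t bit_width, uint8_t access_size,
    struct bas_sketch *bas)
{
    /* Access Size selects how wide each bus access is. */
    switch (access_size) {
    case GAS_UNDEFINED:   /* assumption: tolerate buggy tables as byte access */
    case GAS_BYTE:  bas->regiowidth = 1; break;
    case GAS_WORD:  bas->regiowidth = 2; break;
    case GAS_DWORD: bas->regiowidth = 4; break;
    case GAS_QWORD: bas->regiowidth = 8; break;
    default:        return -1;
    }

    /* Register Bit Width selects the spacing between registers. */
    switch (bit_width) {
    case 8:  bas->regshft = 0; break;
    case 16: bas->regshft = 1; break;
    case 32: bas->regshft = 2; break;
    case 64: bas->regshft = 3; break;
    default: return -1;
    }
    return 0;
}
```

With this mapping, the Armada case described above (32-bit registers, byte
access) yields a stride shift of 2 and an I/O width of 1, while the common
32-bit/Dword case yields 2 and 4.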
Cy Schubert [Wed, 24 Jun 2020 01:51:05 +0000 (01:51 +0000)]
MFV r362565:
Update 4.2.8p14 --> 4.2.8p15
Summary: Systems that use a CMAC algorithm in ntp.keys will not release
a bit of memory on each packet that uses a CMAC keyid, eventually causing
ntpd to run out of memory and fail. The CMAC cleanup from
https://bugs.ntp.org/3447, part of ntp-4.2.8p11, introduced a bug whereby
the CMAC data structure was no longer completely removed.
Kyle Evans [Tue, 23 Jun 2020 23:52:43 +0000 (23:52 +0000)]
stand: remove redundant declarations
These are picked out by the amd64-gcc6 build; time() is declared in <time.h>
and delay() is declared in <bootstrap.h>. These are the correct places for
these in stand/, so remove the duplicate declarations and make sure the
delay() consumer in libefi that depended on the extra delay() declaration
includes <bootstrap.h>.
Doug Moore [Tue, 23 Jun 2020 22:47:54 +0000 (22:47 +0000)]
In r362552, RB_SET_PARENT is defined, and used in parens in
RB_CLEAR_NODE. But it is not an expression, and ought not to be
enclosed in parens. Remove them.
Kirk McKusick [Tue, 23 Jun 2020 21:44:00 +0000 (21:44 +0000)]
Optimize g_journal's superblock update by noting that the summary
information is neither read nor written so it need not be written
out when updating the superblock.
Colin Percival [Tue, 23 Jun 2020 21:11:40 +0000 (21:11 +0000)]
Clean up some function and variable names.
The change from "slave" processes to "minion" processes to "worker"
processes left some less-than-coherent names:
1. "enslave" turned into the ungrammatical "enworker".
2. "slp" (SLave Pointer) turned into "mlp" (Minion [L] Pointer?).
Convert "enworker" to "create_workers" (the function in question forks
off 3 worker processes), and replace "mlp" with "wp" (Worker Pointer)
and "tmlp" with "twp" (Temporary Worker Pointer).
In the current iflib_netmap_rxsync, there is nothing that prevents
kring->nr_hwtail from overrunning kring->nr_hwcur during the descriptor
import phase. This may cause errors in netmap applications, such as:
em1 RX0: fail 'head < kring->nr_hwcur || head > kring->nr_hwtail'
h 795 c 795 t 282 rh 795 rc 795 rt 282 hc 282 ht 282
Doug Moore [Tue, 23 Jun 2020 20:02:55 +0000 (20:02 +0000)]
Define RB_SET_PARENT to do all assignments to rb parent
pointers. Define RB_SWAP_CHILD to replace the child of a parent with
its twin, and use it in 4 places. Use RB_SET in rb_link_node to remove
the only linuxkpi reference to color, and then drop color- and
parent-related definitions that are defined and used only in rbtree.h.
This is intended to be entirely cosmetic, with no impact on program
behavior, and leave RB_PARENT and RB_SET_PARENT as the only ways to
read and write rb parent pointers.
Conrad Meyer [Tue, 23 Jun 2020 18:25:31 +0000 (18:25 +0000)]
kmod.mk: Don't split out debug symbols if requested
Ports bsd.kmod.mk explicitly sets MK_KERNEL_SYMBOLS=no to prevent auto-
splitting of debuginfo from kernel modules. If that knob is set, don't
split out a .ko.debug and .ko from .ko.full; just generate a .ko with
debuginfo and leave it be.
Otherwise, with DEBUG_FLAGS set and MK_KERNEL_SYMBOLS=no, we would helpfully
strip out the debuginfo from the .ko.full and then not install it. That is
not the desired result for a WITH_DEBUG port kmod build.
Conrad Meyer [Tue, 23 Jun 2020 16:43:48 +0000 (16:43 +0000)]
sort(1): Fix two wchar-related bugs in radixsort
Sort(1)'s radixsort implementation was broken for multibyte LC_CTYPEs in at
least two ways:
* In actual radix sort, it would only bucket the least significant
byte from each wchar, ignoring the 24 most-significant bits of each
Unicode character.
* In degenerate cases / "fast paths," it would fall back to another
sorting algorithm (default: mergesort) with a bogus comparator
offset. The string comparison functions in sort(1) take an offset
in units of the operating character size. However, radixsort was
passing an offset in units of bytes. The byte offset must be
divided by sizeof(wchar_t).
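Both points can be illustrated with a short sketch. It is not the sort(1)
source: the function names are hypothetical, and the convention that a radix
level counts bytes starting from the most significant byte of the first
character is an assumption made for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <wchar.h>

/*
 * Bucket index for radix pass `level`. The level counts bytes, so it is
 * split into a character index and a byte position inside that character.
 * Every byte of the wchar_t takes part, not only the least significant one.
 */
static unsigned radix_digit(const wchar_t *s, size_t len, size_t level)
{
    size_t ci = level / sizeof(wchar_t);   /* which character */
    size_t bi = level % sizeof(wchar_t);   /* which byte of it, MSB first */
    unsigned long c;
    unsigned shift;

    if (ci >= len)
        return 0;                          /* past the end sorts first */
    c = (unsigned long)s[ci];
    shift = 8 * (unsigned)(sizeof(wchar_t) - 1 - bi);
    return (unsigned)((c >> shift) & 0xff);
}

/*
 * Convert a byte offset reached by the radix passes into the character
 * offset expected by a wide-string comparison function.
 */
static size_t byte_to_char_offset(size_t byte_off)
{
    return byte_off / sizeof(wchar_t);
}
```

When the radix pass hands off to a comparison sort, the offset passed to the
comparator would be byte_to_char_offset(level) rather than level itself.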
Ryan Moeller [Tue, 23 Jun 2020 16:29:59 +0000 (16:29 +0000)]
libdevdctl: Force full match of "timestamp" field name
OpenZFS generates events with a "zio_timestamp" field, which gets mistaken for
"timestamp" by libdevdctl due to imprecise string matching. Then later it is
assumed a "timestamp" field exists when it doesn't and an exception is thrown.
Add a space to the search string so we match exactly "timestamp" rather than
anything with that as a suffix.
Approved by: mav (mentor)
MFC after: 3 days
Sponsored by: iXsystems, Inc.
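The difference between the two matches can be sketched in plain C. This is
not the libdevdctl code (which is C++): the function names and the sample
event layout of space-separated key=value pairs are assumptions for
illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Loose match: also hits "zio_timestamp=" because it only checks a suffix. */
static const char *find_timestamp_loose(const char *event)
{
    return strstr(event, "timestamp=");
}

/*
 * Stricter match: require the separating space before the key, then skip
 * it so the result points at the key itself. Returns NULL if absent.
 * A key at the very start of the string is not matched by this sketch.
 */
static const char *find_timestamp_exact(const char *event)
{
    const char *p = strstr(event, " timestamp=");

    return p != NULL ? p + 1 : NULL;
}
```

For an event carrying only zio_timestamp, the loose search reports a hit
inside that field while the stricter one correctly reports no match.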
Tom Jones [Tue, 23 Jun 2020 15:14:54 +0000 (15:14 +0000)]
pkg: Provide a friendlier message when bootstrap fails due to address resolution
The current message when bootstrapping pkg fails for any reason implies that pkg
is not available. We have the error code from fetch so if bootstrap failed due
to address resolution say so.
Toomas Soome [Tue, 23 Jun 2020 06:42:39 +0000 (06:42 +0000)]
MFOpenZFS: Add basic zfs ioc input nvpair validation
We want newer versions of libzfs_core to run against an existing
zfs kernel module (i.e. a deferred reboot or module reload after
an update).
Programmatically document, via a zfs_ioc_key_t, the valid arguments
for the ioc commands that rely on nvpair input arguments (i.e. non
legacy commands from libzfs_core). Automatically verify the expected
pairs before dispatching a command.
This initial phase focuses on the non-legacy ioctls. A follow-on
change can address the legacy ioctl input from the zfs_cmd_t.
The zfs_ioc_key_t for zfs_keys_channel_program looks like:
Introduce four input errors to identify specific input failures
(in addition to generic argument value errors like EINVAL, ERANGE,
EBADF, and E2BIG).
ZFS_ERR_IOC_CMD_UNAVAIL the ioctl number is not supported by kernel
ZFS_ERR_IOC_ARG_UNAVAIL an input argument is not supported by kernel
ZFS_ERR_IOC_ARG_REQUIRED a required input argument is missing
ZFS_ERR_IOC_ARG_BADTYPE an input argument has an invalid type
Andriy Gapon [Tue, 23 Jun 2020 04:58:36 +0000 (04:58 +0000)]
teach ena driver about RSS kernel option
Networking is broken if the driver configures its (virtual) hardware to
use a hash algorithm (or a key) different from the one that the network
stack (software RSS) uses. This can be seen with connections initiated
from the host. The PCB will be placed into the hash table based on the
hash value calculated by the software. The hardware-calculated hash
value in response packets will be different, so the PCB won't be found.
Tested with a kernel compiled with 'options RSS' on an instance with ena
driver.
John Baldwin [Mon, 22 Jun 2020 23:20:43 +0000 (23:20 +0000)]
Add support to the crypto framework for separate AAD buffers.
This permits requests to provide the AAD in a separate side buffer
instead of as a region in the crypto request input buffer. This is
useful when the main data buffer might not contain the full AAD
(e.g. for TLS or IPsec with ESN).
Unlike separate IVs which are constrained in size and stored in an
array in struct cryptop, separate AAD is provided by the caller
setting a new crp_aad pointer to the buffer. The caller must ensure
the pointer remains valid and the buffer contents static until the
request is completed (e.g. when the callback routine is invoked).
As with separate output buffers, not all drivers support this feature.
Consumers must request use of this feature via a new session flag.
To aid in driver testing, kern.crypto.cryptodev_separate_aad can be
set to force /dev/crypto requests to use a separate AAD buffer.
Discussed with: cem
Sponsored by: Chelsio Communications
Differential Revision: https://reviews.freebsd.org/D25288
The assumption in zio_ddt_free() is that ddt_phys_select() must
always find a match. However, if that fails due to a damaged
DDT or some other reason the code will NULL dereference in
ddt_phys_decref().
While this should never happen it has been observed on various
platforms. The result is that unless you're willing to patch the
ZFS code the pool is inaccessible. Therefore, we're choosing
to more gracefully handle this case rather than leave it fatal.
Eugene Grosbein [Mon, 22 Jun 2020 17:52:13 +0000 (17:52 +0000)]
Followup to r362502: rc.conf(5): unobsolete gif_interfaces
There are cases when gif_interfaces cannot be replaced
with cloned_interfaces, such as tunnels with external IPv6 addresses
and internal IPv4 or vice versa. Such a configuration requires an
extra invocation of ifconfig(8) and is supported with gif_interfaces only.
Eugene Grosbein [Mon, 22 Jun 2020 17:25:21 +0000 (17:25 +0000)]
network.subr: unobsolete gif_interfaces
There are cases when gif_interfaces cannot be replaced
with cloned_interfaces, such as tunnels with external IPv6 addresses
and internal IPv4 or vice versa. Such a configuration requires an
extra invocation of ifconfig(8) and is supported with gif_interfaces only.
Mark Johnston [Mon, 22 Jun 2020 14:01:31 +0000 (14:01 +0000)]
Move the definition of SCTP's system_base_info into sctp_crc32.c.
This file is the only SCTP source file compiled into the kernel when
SCTP_SUPPORT is configured. sctp_delayed_checksum() references a couple
of counters defined in system_base_info, so the change allows these
counters to be referenced in a kernel compiled without "options SCTP".
Andrew Turner [Mon, 22 Jun 2020 10:49:50 +0000 (10:49 +0000)]
Translate the PCI address when activating a resource
When the PCI address != physical address we need to translate from the
former to the latter before passing to the parent to map into the kernel's
virtual address space.