bhyve: implement rdmsr for MSR_IA32_FEATURE_CONTROL
Without the -w option, Windows guests crash on boot. This is caused by a rdmsr
of MSR_IA32_FEATURE_CONTROL. Windows checks this MSR to determine enabled VMX
features. This MSR isn't emulated in bhyve, so a #GP exception is injected
which causes Windows to crash.
Fix by emulating rdmsr of MSR_IA32_FEATURE_CONTROL and returning a value with
the Lock Bit set and VMX disabled, to inform Windows that VMX isn't available.
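A minimal sketch of the emulation described above, not the bhyve code itself. The MSR index and bit positions follow the Intel SDM; the function name and return convention are hypothetical.

```c
#include <stdint.h>

/* Architectural MSR index and bit positions (Intel SDM). */
#define MSR_IA32_FEATURE_CONTROL       0x3a
#define IA32_FEATURE_CONTROL_LOCK      (1ULL << 0)
#define IA32_FEATURE_CONTROL_VMX_EN    (1ULL << 2)   /* VMX outside SMX */

/*
 * Hypothetical handler: returns 0 and fills *val when the MSR is
 * emulated, or -1 when the caller should inject #GP into the guest.
 */
int
emulate_rdmsr_sketch(uint32_t msr, uint64_t *val)
{
    switch (msr) {
    case MSR_IA32_FEATURE_CONTROL:
        /* Locked, with the VMX enable bits clear: VMX is unavailable. */
        *val = IA32_FEATURE_CONTROL_LOCK;
        return (0);
    default:
        return (-1);
    }
}
```

Because the lock bit is set, a guest cannot try to turn VMX on by writing the MSR back.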
Ensure that the mount command shows "with quotas" when quotas are enabled.
When quotas are enabled with the quotaon(8) command, it sets the
MNT_QUOTA flag in the mount structure mnt_flag field. The mount
structure holds a cached copy of the filesystem statfs structure
in mnt_stat that includes a copy of the mnt_flag field in
mnt_stat.f_flags. The mnt_stat structure may not be updated for
hours. Since the mount command requests mount details using the
MNT_NOWAIT option, it gets the mount's mnt_stat statfs structure
whose f_flags field does not yet show the MNT_QUOTA flag being set
in mnt_flag.
The fix is to have quotaon(8) set the MNT_QUOTA flag in both mnt_flag
and in mnt_stat.f_flags so that it will be immediately visible to
callers of statfs(2).
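The idea can be sketched as follows. The structures are simplified stand-ins for the kernel's mount and statfs structures, and the flag value and function name are illustrative assumptions.

```c
#include <stdint.h>

#define MNT_QUOTA_SKETCH 0x2000ULL      /* illustrative flag value */

/* Simplified stand-ins for struct statfs and struct mount. */
struct statfs_sketch {
    uint64_t f_flags;
};

struct mount_sketch {
    uint64_t mnt_flag;                  /* authoritative flags */
    struct statfs_sketch mnt_stat;      /* cached copy seen by MNT_NOWAIT */
};

/* Set the quota flag in both places so the cached copy is never stale. */
void
quota_enable_sketch(struct mount_sketch *mp)
{
    mp->mnt_flag |= MNT_QUOTA_SKETCH;
    mp->mnt_stat.f_flags |= MNT_QUOTA_SKETCH;
}
```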
Mark Johnston [Wed, 14 Apr 2021 16:57:24 +0000 (12:57 -0400)]
uma: Introduce per-domain reclamation functions
Make it possible to reclaim items from a specific NUMA domain.
- Add uma_zone_reclaim_domain() and uma_reclaim_domain().
- Permit parallel reclamations. Use a counter instead of a flag to
synchronize with zone_dtor().
- Use the zone lock to protect cache_shrink() now that parallel reclaims
can happen.
- Add a sysctl that can be used to trigger reclamation from a specific
domain.
Currently the new KPIs are unused, so there should be no functional
change.
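Why a counter rather than a flag: with parallel reclaims, the first one to finish would clear a boolean while another is still running. A rough sketch using C11 atomics follows; the names are hypothetical, and the real code may synchronize differently (for example, under the zone lock).

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical zone with a count of in-progress reclamations. */
struct zone_sketch {
    atomic_uint reclaims;
};

void
reclaim_begin_sketch(struct zone_sketch *z)
{
    atomic_fetch_add(&z->reclaims, 1);
}

void
reclaim_end_sketch(struct zone_sketch *z)
{
    atomic_fetch_sub(&z->reclaims, 1);
}

/* The destructor may proceed only when no reclaim is in flight. */
bool
zone_dtor_may_proceed_sketch(struct zone_sketch *z)
{
    return (atomic_load(&z->reclaims) == 0);
}
```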
Reviewed by: mav
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D29685
Mark Johnston [Wed, 14 Apr 2021 16:56:39 +0000 (12:56 -0400)]
domainset: Define additional global policies
Add global definitions for first-touch and interleave policies. The
former may be useful for UMA, which implements a similar policy without
using domainset iterators.
No functional change intended.
Reviewed by: mav
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D29104
Notable upstream pull request merges:
#11742 When specifying raidz vdev name, parity count should match
#11744 Use a helper function to clarify gang block size
#11771 Support running FreeBSD buildworld on Arm-based macOS hosts
This is the last update that will be MFCed into stable/13.
From now on, the tracking of OpenZFS branches will be different:
- main continues tracking openzfs/zfs/master
- stable/13 is going to track openzfs/zfs/zfs-2.1-release
Found by: syzkaller
Reported and reviewed by: markj
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D29746
realtimer_expire: avoid proc lock recursion when called from itimer_proc_continue()
It is fine to drop the process lock there, since the process cannot exit
until its timers are cleared.
Found by: syzkaller
Reported and reviewed by: markj
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D29746
Martin Matuska [Wed, 14 Apr 2021 06:03:07 +0000 (08:03 +0200)]
Update vendor/openzfs to openzfs/zfs/master@3522f57b6
Notable upstream pull request merges:
#11742 When specifying raidz vdev name, parity count should match
#11744 Use a helper function to clarify gang block size
#11771 Support running FreeBSD buildworld on Arm-based macOS hosts
Mark Johnston [Tue, 13 Apr 2021 21:40:27 +0000 (17:40 -0400)]
malloc: Add state transitions for KASAN
- Reuse some REDZONE bits to keep track of the requested and allocated
sizes, and use that to provide red zones.
- As in UMA, disable memory trashing to avoid unnecessary CPU overhead.
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D29461
Mark Johnston [Tue, 13 Apr 2021 21:40:11 +0000 (17:40 -0400)]
vfs: Add KASAN state transitions for vnodes
vnodes are a bit special in that they may exist on per-CPU lists even
while free. Add a KASAN-only destructor that poisons regions of each
vnode that are not expected to be accessed after a free.
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D29459
Mark Johnston [Tue, 13 Apr 2021 21:40:01 +0000 (17:40 -0400)]
kmem: Add KASAN state transitions
Memory allocated with kmem_* is unmapped upon free, so KASAN doesn't
provide a lot of benefit, but since allocations are always a multiple of
the page size we can create a redzone when the allocation request size
is not a multiple of the page size.
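The redzone size is simply the slack between the request and the next page boundary. A small sketch, assuming a 4 KB page and hypothetical helper names:

```c
#include <stddef.h>

#define PAGE_SIZE_SKETCH 4096u          /* assumed page size */

/* Round a size up to the next multiple of the page size. */
size_t
round_page_sketch(size_t size)
{
    return ((size + PAGE_SIZE_SKETCH - 1) &
        ~(size_t)(PAGE_SIZE_SKETCH - 1));
}

/* Bytes between the end of the request and the end of the allocation. */
size_t
kmem_redzone_size_sketch(size_t size)
{
    return (round_page_sketch(size) - size);
}
```

A 5000-byte request occupies two pages, leaving 3192 trailing bytes that can be poisoned; a 4096-byte request leaves none.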
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D29458
Mark Johnston [Tue, 13 Apr 2021 21:39:50 +0000 (17:39 -0400)]
uma: Add KASAN state transitions
- Add a UMA_ZONE_NOKASAN flag to indicate that items from a particular
zone should not be sanitized. This is applied implicitly for NOFREE
and cache zones.
- Add KASAN callbacks which get invoked:
1) when a slab is imported into a keg
2) when an item is allocated from a zone
3) when an item is freed to a zone
4) when a slab is freed back to the VM
In state transitions 1 and 3, memory is poisoned so that accesses will
trigger a panic. In state transitions 2 and 4, memory is marked
valid.
- Disable trashing if KASAN is enabled. It just adds extra CPU overhead
to catch problems that are detected by KASAN.
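The four transitions listed above can be summarized as a small table. This sketch uses hypothetical names and only models whether the memory is accessible after each event.

```c
#include <stdbool.h>

/* The four events, numbered as in the list above. */
enum uma_kasan_event {
    SLAB_IMPORTED = 1,  /* slab imported into a keg */
    ITEM_ALLOCATED = 2, /* item allocated from a zone */
    ITEM_FREED = 3,     /* item freed to a zone */
    SLAB_RELEASED = 4   /* slab freed back to the VM */
};

/* Returns true if memory is marked valid after the event, false if poisoned. */
bool
valid_after_event_sketch(enum uma_kasan_event ev)
{
    switch (ev) {
    case ITEM_ALLOCATED:
    case SLAB_RELEASED:
        return (true);
    case SLAB_IMPORTED:
    case ITEM_FREED:
        return (false);
    }
    return (false);
}
```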
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D29456
Mark Johnston [Tue, 13 Apr 2021 21:39:35 +0000 (17:39 -0400)]
amd64: Add MD bits for KASAN
- Initialize KASAN before executing SYSINITs.
- Add a GENERIC-KASAN kernel config, akin to GENERIC-KCSAN.
- Increase the kernel stack size if KASAN is enabled. Some of the
ASAN instrumentation increases stack usage and it's enough to
trigger stack overflows in ZFS.
- Mark the trapframe as valid in interrupt handlers if it is
assigned to td_intr_frame. Otherwise, an interrupt in a function
which creates a poisoned alloca region can trigger false positives.
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D29455
Mark Johnston [Tue, 13 Apr 2021 20:30:05 +0000 (16:30 -0400)]
amd64: Implement a KASAN shadow map
The idea behind KASAN is to use a region of memory to track the validity
of buffers in the kernel map. This region is the shadow map. The
compiler inserts calls to the KASAN runtime for every emitted load
and store, and the runtime uses the shadow map to decide whether the
access is valid. Various kernel allocators call kasan_mark() to update
the shadow map.
Since the shadow map tracks only accesses to the kernel map, accesses to
other kernel maps are not validated by KASAN. UMA_MD_SMALL_ALLOC is
disabled when KASAN is configured to reduce usage of the direct map.
Currently we have no mechanism to completely eliminate uses of the
direct map, so KASAN's coverage is not comprehensive.
The shadow map uses one byte per eight bytes in the kernel map. In
pmap_bootstrap() we create an initial set of page tables for the kernel
and preloaded data.
When pmap_growkernel() is called, we call kasan_shadow_map() to extend
the shadow map. kasan_shadow_map() uses pmap_kasan_enter() to allocate
memory for the shadow region and map it.
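The one-to-eight ratio gives a simple address translation. In this sketch the two base addresses are made-up illustrative values, not the amd64 layout, and the function names are hypothetical.

```c
#include <stdint.h>

#define KASAN_SCALE_SHIFT_SKETCH 3              /* 8 bytes per shadow byte */
#define KMAP_BASE_SKETCH         0x01000000ULL  /* illustrative only */
#define SHADOW_BASE_SKETCH       0x08000000ULL  /* illustrative only */

/* Address of the shadow byte that describes the given kernel address. */
uint64_t
shadow_addr_sketch(uint64_t kva)
{
    return (SHADOW_BASE_SKETCH +
        ((kva - KMAP_BASE_SKETCH) >> KASAN_SCALE_SHIFT_SKETCH));
}

/* Shadow bytes needed to cover map_bytes of kernel map, rounded up. */
uint64_t
shadow_bytes_needed_sketch(uint64_t map_bytes)
{
    return ((map_bytes + 7) >> KASAN_SCALE_SHIFT_SKETCH);
}
```

Growing the kernel map by one 4096-byte page therefore requires 512 more bytes of shadow.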
Reviewed by: kib
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D29417
Mark Johnston [Tue, 13 Apr 2021 21:39:19 +0000 (17:39 -0400)]
Add the KASAN runtime
KASAN enables the use of LLVM's AddressSanitizer in the kernel. This
feature makes use of compiler instrumentation to validate memory
accesses in the kernel and detect several types of bugs, including
use-after-frees and out-of-bounds accesses. It is particularly
effective when combined with test suites or syzkaller. KASAN has high
CPU and memory usage overhead and so is not suited for production
environments.
The runtime and pmap maintain a shadow of the kernel map to store
information about the validity of memory mapped at a given kernel
address.
The runtime implements a number of functions defined by the compiler
ABI. These are prefixed by __asan. The compiler emits calls to
__asan_load*() and __asan_store*() around memory accesses, and the
runtime consults the shadow map to determine whether a given access is
valid.
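A simplified model of the check, assuming the usual AddressSanitizer shadow encoding (0 means all eight bytes are valid, 1 to 7 means only that many leading bytes are valid, a negative value means poisoned). The function name and the offset-based interface are hypothetical; the real runtime works on addresses and is more optimized.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Check an access of `size` bytes starting at byte offset `off` within a
 * region whose validity is described by `shadow`, one shadow byte per
 * eight bytes of the region.
 */
bool
access_ok_sketch(const int8_t *shadow, size_t off, size_t size)
{
    for (size_t i = off; i < off + size; i++) {
        int8_t s = shadow[i / 8];

        if (s == 0)
            continue;               /* whole granule is valid */
        if (s < 0 || (i % 8) >= (size_t)s)
            return (false);         /* poisoned or past the valid prefix */
    }
    return (true);
}
```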
kasan_mark() is called by various kernel allocators to update state in
the shadow map. Updates to those allocators will come in subsequent
commits.
The runtime also defines various interceptors. Some low-level routines
are implemented in assembly and are thus not amenable to compiler
instrumentation. To handle this, the runtime implements these routines
on behalf of the rest of the kernel. The sanitizer implementation
validates memory accesses manually before handing off to the real
implementation.
The sanitizer in a KASAN-configured kernel can be disabled by setting
the loader tunable debug.kasan.disable=1.
Obtained from: NetBSD
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D29416
Alexander Motin [Tue, 13 Apr 2021 15:19:10 +0000 (11:19 -0400)]
Fix race in case of device destruction.
During device destruction it is possible that open() succeeds but
fdevname() returns NULL, which can't be assigned to a string variable.
Fix that by adding an explicit NULL check.
Also, while there, switch from fdevname() to fdevname_r().
linux: adjust ordering of Linux auxv and add dummy AT_HWCAP2
This should be a no-op; the purpose of this is to reduce
a spurious difference between Linuxulator and Linux, to make
debugging core dumps slightly easier.
Note that AT_HWCAP2 we pass to Linux binaries is always 0,
instead of being equal to 'cpu_feature2'. This matches what
I've observed under Ubuntu Focal VM.
Alex Richardson [Tue, 13 Apr 2021 11:36:24 +0000 (12:36 +0100)]
Remove history.immutable from .arcconfig
The `history.immutable` setting prevents arcanist from updating
the commit messages with the Differential URL and therefore
makes updating patches awkward with a rebase workflow.
In case this new behaviour is not wanted the old one can be restored
by running `arc set-config --local history.immutable true`.
Test Plan: `arc diff --create HEAD^` adds the metadata now.
pf: Implement the NAT source port selection of MAP-E Customer Edge
MAP-E (RFC 7597) requires special care when selecting source ports
for NAT on the Customer Edge, because some bits of the port number
are used by the Border Relay to identify the other side of the
IPv4-over-IPv6 tunnel.
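For illustration, RFC 7597 splits the 16-bit port into a leading offset field of a bits, a PSID of k bits, and the remaining low-order bits. The sketch below checks whether a port belongs to a given PSID. It follows the RFC's layout, not pf's implementation, and the function name is hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Port layout per RFC 7597: | A (a bits) | PSID (k bits) | j (m bits) |,
 * with a + k + m = 16. A must be non-zero when a > 0, which excludes the
 * lowest port range. Assumes a + k <= 16.
 */
bool
mape_port_in_psid_sketch(uint16_t port, unsigned a, unsigned k, uint16_t psid)
{
    unsigned m = 16 - a - k;
    unsigned port_psid = ((unsigned)port >> m) & ((1u << k) - 1);
    unsigned port_a = (a == 0) ? 1 : (unsigned)port >> (16 - a);

    return (port_psid == psid && port_a != 0);
}
```

With a = 6, k = 8, and PSID 0x34, the first usable ports are 1232 through 1235, and the next group starts at 2256.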
John Baldwin [Mon, 12 Apr 2021 21:27:42 +0000 (14:27 -0700)]
OCF: Remove support for asymmetric cryptographic operations.
There haven't been any non-obscure drivers that supported this
functionality and it has been impossible to test to ensure that it
still works. The only known consumer of this interface was the engine
in OpenSSL < 1.1. Modern OpenSSL versions do not include support for
this interface as it was not well-documented.
Reviewed by: cem
Sponsored by: Chelsio Communications
Differential Revision: https://reviews.freebsd.org/D29736
Warner Losh [Mon, 12 Apr 2021 19:41:20 +0000 (13:41 -0600)]
hptmv: use .o files directly
uudecode the .o.uu files and commit directly to the tree. Adjust the build
infrastructure to cope with the new location, both for the kernel and modules.
Warner Losh [Mon, 12 Apr 2021 19:41:14 +0000 (13:41 -0600)]
hpt27xx: store the .o files directly in the tree
Store the .o files directly in the tree. We no longer need to play uuencode
games like we did in the CVS days. Adjust the build infrastructure to match.
Warner Losh [Mon, 12 Apr 2021 19:40:43 +0000 (13:40 -0600)]
hptnr: Store the .o files directly in the repo
We no longer need to uuencode files in our tree. Store the .o
file directly instead. Adjust the build to cope with the new arrangement.
John Baldwin [Mon, 12 Apr 2021 18:43:34 +0000 (11:43 -0700)]
bhyve: Move the gdb_active check to gdb_cpu_suspend().
The check needs to be in the public routine (gdb_cpu_suspend()), not
in the internal routine called from various places
(_gdb_cpu_suspend()). All the other callers of _gdb_cpu_suspend()
already check gdb_active, and this breaks the use of snapshots when
the debug server is not enabled since gdb_cpu_suspend() tries to lock
an uninitialized mutex.
Reported by: Darius Mihai, Elena Mihailescu
Reviewed by: elenamihailescu22_gmail.com
Fixes: 621b5090487de9fed1b503769702a9a2a27cc7bb
Differential Revision: https://reviews.freebsd.org/D29538
Gleb Smirnoff [Fri, 19 Mar 2021 05:05:22 +0000 (22:05 -0700)]
syncache: simplify syncache_add() KPI to return struct socket pointer
directly, not overwriting the listen socket pointer argument.
Not a functional change.
Gleb Smirnoff [Fri, 19 Mar 2021 02:06:13 +0000 (19:06 -0700)]
tcp_input/syncache: acquire only read lock on PCB for SYN,!ACK packets
When a packet is a SYN packet, we don't need to modify any existing PCB.
Normally a SYN arrives on a listening socket, and we either create a
syncache entry or generate a syncookie, but we don't modify anything in
the listening socket or associated PCB. Thus create a new PCB lookup
mode - rlock if listening. This removes the primary contention point
under SYN flood - the listening socket PCB.
Sidenote: when a SYN arrives on a synchronized connection, we still
don't need write access to the PCB to send a challenge ACK or just to
drop. There is only one exception - tcptw recycling. However, the
existing entanglement of tcp_input + stacks doesn't allow this change
to be made small. Consider this patch a first approach to the problem.
It was unused since 405c3050f10, which removed iBCS support.
This also moves the 'linux' rc script slightly earlier, which
might help in some setups. The original version of this patch
moved it even more, before 'mountcritlocal', which would fix
mount(8) errors due to missing /dev/shm in setups with entries
for /path/to/chroot/dev/shm without the "late" flag; however,
in the end 'kldxref' turned out to depend on 'mountcritlocal'
anyway.
Mark Johnston [Mon, 12 Apr 2021 13:32:30 +0000 (09:32 -0400)]
Rename struct device to struct _device
types.h defines device_t as a typedef of struct device *. struct device
is defined in subr_bus.c and almost all of the kernel uses device_t.
The LinuxKPI also defines a struct device, so type confusion can occur.
This causes bugs and ambiguity for debugging tools. Rename the FreeBSD
struct device to struct _device.
Mark Johnston [Mon, 12 Apr 2021 13:32:08 +0000 (09:32 -0400)]
qlnxr: Properly initialize the Linux device structure
The driver needs to provide a LinuxKPI device structure to register
itself with the IB subsystem. It was erroneously using a copy of its
FreeBSD device structure for this purpose.
Use linux_pci_attach_device() instead, following the example of the
Chelsio iwarp driver. Also ensure that we don't leak the faked device
during detach.
Reviewed by: hselasky
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D29595
Rick Macklem [Sun, 11 Apr 2021 23:51:25 +0000 (16:51 -0700)]
nfsd: cut the Linux NFSv4.1/4.2 some slack w.r.t. RFC5661
Recent testing of network partitioning a FreeBSD NFSv4.1
server from a Linux NFSv4.1 client identified problems
with both the FreeBSD server and Linux client.
Sometimes, after some Linux NFSv4.1/4.2 clients establish
a new TCP connection, they will advance the sequence number
for a session slot by 2 instead of 1.
RFC5661 specifies that a server should reply
NFS4ERR_SEQ_MISORDERED for this case.
This might result in a system call error in the client and
seems to disable future use of the slot by the client.
Since advancing the sequence number by 2 seems harmless,
allow this case if vfs.nfs.linuxseqsesshack is non-zero.
Note that, if the order of RPCs is actually reversed,
a subsequent RPC with a smaller sequence number value
for the slot will be received. This will result in
an NFS4ERR_SEQ_MISORDERED reply.
This has not been observed during testing.
Setting vfs.nfs.linuxseqsesshack to 0 will provide
RFC5661 compliant behaviour.
This fix affects the fairly rare case where an NFSv4
Linux client does a TCP reconnect and then apparently
erroneously increments the sequence number for the
session slot twice during the reconnect cycle.
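The slot check described above might look roughly like this. It is a sketch with hypothetical names, not the nfsd code; the boolean parameter stands in for a non-zero vfs.nfs.linuxseqsesshack.

```c
#include <stdbool.h>
#include <stdint.h>

enum seq_result_sketch {
    SEQ_NEW_REQUEST,    /* accept and advance the slot */
    SEQ_REPLAY,         /* same sequence number: retransmission */
    SEQ_MISORDERED      /* reply NFS4ERR_SEQ_MISORDERED */
};

/* Compare a request's sequence number with the slot's last-seen value. */
enum seq_result_sketch
check_slot_seq_sketch(uint32_t slot_seq, uint32_t req_seq, bool linux_hack)
{
    if (req_seq == slot_seq)
        return (SEQ_REPLAY);
    if (req_seq == (uint32_t)(slot_seq + 1))
        return (SEQ_NEW_REQUEST);
    /* Relaxation: tolerate a Linux client that advanced by 2. */
    if (linux_hack && req_seq == (uint32_t)(slot_seq + 2))
        return (SEQ_NEW_REQUEST);
    return (SEQ_MISORDERED);
}
```

With the relaxation off, only an advance of exactly 1 is accepted, which is the RFC5661-compliant behaviour.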
Rick Macklem [Sun, 11 Apr 2021 21:47:36 +0000 (14:47 -0700)]
param.h: bump __FreeBSD_version for commit 7763814fc9c2
Commit 7763814fc9c2 changed the internal KAPI between the krpc
and NFS. As such, the krpc, nfscommon and nfscl modules must
all be rebuilt from sources.
Rick Macklem [Sun, 11 Apr 2021 21:34:57 +0000 (14:34 -0700)]
nfsv4 client: do the BindConnectionToSession as required
During a recent testing event, it was reported that the NFSv4.1/4.2
server erroneously bound the back channel to a new TCP connection.
RFC5661 specifies that the fore channel is implicitly bound to a
new TCP connection when an RPC with Sequence (almost any of them)
is done on it. For the back channel to be bound to the new TCP
connection, an explicit BindConnectionToSession must be done as
the first RPC on the new connection.
Since new TCP connections are created by the "reconnect" layer
(sys/rpc/clnt_rc.c) of the krpc, this patch adds an optional
upcall done by the krpc whenever a new connection is created.
The patch also adds the specific upcall function that does a
BindConnectionToSession and configures the krpc to call it
when required.
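The optional upcall can be pictured as a function pointer that the reconnect layer invokes for each new connection. Every name below is hypothetical, and the callback only counts invocations where the real one would issue BindConnectionToSession.

```c
#include <stddef.h>

typedef void (*reconnect_upcall_sketch_t)(void *arg);

/* Hypothetical reconnect-layer client state. */
struct rc_client_sketch {
    reconnect_upcall_sketch_t upcall;   /* NULL when not configured */
    void *upcall_arg;
    int connections;
};

/* Called whenever the reconnect layer creates a new TCP connection. */
void
rc_new_connection_sketch(struct rc_client_sketch *rc)
{
    rc->connections++;
    if (rc->upcall != NULL)
        rc->upcall(rc->upcall_arg);     /* first action on the connection */
}

/* Stand-in for the upcall that would do BindConnectionToSession. */
void
bind_conn_upcall_sketch(void *arg)
{
    (*(int *)arg)++;
}
```

Leaving the upcall NULL keeps the previous behaviour for consumers that do not need it.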
This is necessary for correct interoperability with NFSv4.1/NFSv4.2
servers when the nfscbd daemon is running.
If doing NFSv4.1/NFSv4.2 mounts without this patch, it is
recommended that the nfscbd daemon not be running and that
the "pnfs" mount option not be specified.