amd64: stop doing special allocation for the AP startup trampoline
There is no reason now why do we need to allocate trampoline page very
early in the boot process. The only requirement for the page is that
it is below 1M to be usable by the real mode during init. This can be
handled by vm_alloc_contig() when we do the startup.
Also assert that startup trampoline fits into single page. In principle
we can do multi-page allocation if needed, but it is not.
Move the alloc_ap_trampoline() function and the boot_address variable to
i386/mp_machdep.c. Keep existing mechanism of early alloc on i386.
Reviewed by: markj
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
Differential revision: https://reviews.freebsd.org/D31343
Mark Johnston [Thu, 29 Jul 2021 14:22:37 +0000 (10:22 -0400)]
amd64: Set GS.base before calling init_secondary() on APs
KMSAN instrumentation requires thread-local storage to track
initialization state for function parameters and return values. This
buffer is accessed as part of each function prologue. It is provided by
the KMSAN runtime, which looks up a pointer in the current thread's
structure.
When KMSAN is configured, init_secondary() is instrumented, but this
means that GS.base must be initialized first, otherwise the runtime
cannot safely access curthread. Work around this by loading GS.base
before calling init_secondary(), so that the runtime can at least check
curthread == NULL and return a pointer to some dummy storage. Note that
init_secondary() still must reload GS.base after calling lgdt(), which
loads a selector into %gs, which in turn clears the base register.
Reviewed by: kib
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D31336
Mark Johnston [Thu, 29 Jul 2021 13:46:25 +0000 (09:46 -0400)]
link_elf_obj: Invoke fini callbacks
This is required for KASAN: when a module is unloaded, poisoned regions
(e.g., pad areas between global variables) are left as such, so if they
are reused as KLDs are loaded, false positives can arise.
Reported by: pho, Jenkins
Reviewed by: kib
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D31339
FUTEX_LOCK_PI2 was added to support clock selection as FUTEX_LOCK_PI uses a
CLOCK_REALTIME based absolute value since it was implemented, but it does not
require that the FUTEX_CLOCK_REALTIME bit is set, because that was introduced
later.
Linux futex documentation explicitly states that EINVAL is returned if
the futex is not 4-byte aligned. Check futex alignment as a Linux do
and return EINVAL.
The bitset is a Linux emulation layer extension. This 32-bit mask, in which at
least one bit must be set, is used to select which threads should be woken up.
The bitset is stored in the umtx_q structure, which is used to enqueue the waiter
into the umtx waitqueue. Put the bitset into the hole, that appeared on LP64 due
to data alignment, to prevent the growth of the struct umtx_q.
Wojciech Macek [Thu, 29 Jul 2021 09:02:43 +0000 (11:02 +0200)]
Fix mac_veriexec version mismatch
mac_veriexec sets its version to 1, but the mac_veriexec_shaX modules which depend on it expect MAC_VERIEXEC_VERSION = 2.
Be consistent and use MAC_VERIEXEC_VERSION everywhere.
This unbreaks loading of mac_veriexec modules at boot time.
virtio: enable VTNET_LEGACY_TX when ALTQ is enabled.
ALTQ only works on network drivers which use if_start (rather than
if_transmit). vtnet uses if_start if built with VTNET_LEGACY_TX. Default
to that the kernel is built with ALTQ enabled, to reduce user surprise.
gcc failed as it didn't inlined the builtins and generates calls to
the libgcc, ld can't find libgcc as cross-toolchain libgcc is not installed.
To avoid this add internal vDSO ffs functions without optimized builtins.
The left side of the MIN() expression is the (signed) result of pointer
subtraction (ptrdiff_t). The right hand side is the also the (signed)
result of pointer subtraction, additionally subtracting the element size
('es'), which is unsigned size_t. This coerces the right-hand
expression into an unsigned value. MIN(signed, unsigned) triggers
-Wsign-compare.
Sorting elements of size greater than SSIZE_MAX is nonsensical, so we
can instead treat the element size as ssize_t, leaving the right-hand
result the same signedness as the left.
makesyscalls was rewritten in Lua and introduced in d3276301ab. In the
time since, no objections have risen and a warning was introduced long
ago on invocation of makesyscalls.sh that it would be removed before
FreeBSD 13. Belatedly follow through on that.
Alexander Motin [Thu, 29 Jul 2021 01:18:50 +0000 (21:18 -0400)]
Refactor/optimize cpu_search_*().
Remove cpu_search_both(), unused for many years. Without it there is
less sense for the trick of compiling common cpu_search() into separate
cpu_search_lowest() and cpu_search_highest(), so split them completely,
making code more readable. While there, split iteration over children
groups and CPUs, complicating code for very small deduplication.
Stop passing cpuset_t arguments by value and avoid some manipulations.
Since MAXCPU bump from 64 to 256, what was a single register turned
into 32-byte memory array, requiring memory allocation and accesses.
Splitting struct cpu_search into parameter and result parts allows to
even more reduce stack usage, since the first can be passed through
on recursion.
Remove CPU_FFS() from the hot paths, precalculating first and last CPU
for each CPU group in advance during initialization. Again, it was
not a problem for 64 CPUs before, but for 256 FFS needs much more code.
With these changes on 80-thread system doing ~260K uncached ZFS reads
per second I observe ~30% reduction of time spent in cpu_search_*().
debugnet: Fix false-positive assertions for dp_state
debugnet_handle_arp:
An assertion is present to ensure the pcb is only modified when the state is
DN_STATE_INIT. Because debugnet_arp_gw() is asynchronous it is possible for
ARP replies to come in after the gateway address is known and the state
already changed.
debugnet_handle_ip:
Similarly it is possible for packets to come in, from the expected
server, during the gateway mac discovery phase. This can happen from
testing disconnects / reconnects in quick succession. This later
causes some acks to be sent back but hit an assertion because the
state is wrong.
Rick Macklem [Wed, 28 Jul 2021 22:48:27 +0000 (15:48 -0700)]
nfscl: Cache an open stateid for the "oneopenown" mount option
For NFSv4.1/4.2, if the "oneopenown" mount option is used,
there is, at most, only one open stateid for each NFS vnode.
When an open stateid for a file is acquired, set a pointer to
the open structure in the NFS vnode. This pointer can be used to
acquire the open stateid without searching the open linked list
when the following is true:
- No delegations have been issued for the file. Since delegations
can outlive an NFS vnode for a file, use the global
NFSMNTP_DELEGISSUED flag on the mount to determine this.
- No lock stateid has been issued for the file. To determine
this, a new NFS vnode flag called NMIGHTBELOCKED is set when a lock
stateid is issued, which can then be tested.
When this open structure pointer can be used, it avoids the need to
acquire the NFSCLSTATELOCK() and searching the open structure list for
an open. The NFSCLSTATELOCK() can be highly contended when there are
a lot of opens issued for the NFSv4.1/4.2 mount.
This patch only affects NFSv4.1/4.2 mounts when the "oneopenown"
mount option is used.
Rick Macklem [Wed, 28 Jul 2021 22:23:05 +0000 (15:23 -0700)]
nfscl: Set correct lockowner for "oneopenown" mount option
For NFSv4.1/4.2, the client may use either an open, lock or
delegation stateid as the stateid argument for an I/O operation.
RFC 5661 defines an order of preference of delegation, then lock
and finally open stateid for the argument, although NFSv4.1/4.2
servers are expected to handle any stateid type.
For the "oneopenown" mount option, the lock owner was not being
correctly generated and, as such, the I/O operation would use an
open stateid, even when a lock stateid existed. Although this
did not and should not affect an NFSv4.1/4.2 server's behaviour,
this patch makes the behaviour for "oneopenown" the same as when
the mount option is not specified.
Found during inspection of packet captures. No failure during
testing against NFSv4.1/4.2 servers of the unpatched code occurred.
Ed Maste [Wed, 28 Jul 2021 20:02:49 +0000 (16:02 -0400)]
pkgbase: improve pkg --version parsing
In some cases `pkg --version` might produce unexpected or additional
output. Use a regex /^[0-9.]+$/ to match only the line containing the
version number.
Reported by: Michael Butler on freebsd-current@
Fixes: 4e224e4be7c3 ("pkgbase: accommodate pkg < 1.17")
Sponsored by: The FreeBSD Foundation
Alexander Motin [Wed, 28 Jul 2021 20:15:43 +0000 (16:15 -0400)]
Do not expose to scheduler caches of single CPU.
Before this change my dual-Xeon(R) Gold 6242R always reported 3 levels
or topology (root, package/L3 and core/L2). But with SMT disabled
core/L2 matches thread, so additional topology level only causes more
traversal work. With this change SMT case is reported same as before,
while non-SMT is reported with only 2 much more simple levels.
compilert-rt: build out-of-line LSE atomics helpers for aarch64
Both clang >= 12 and gcc >= 10.1 now default to -moutline-atomics for
aarch64. This requires a bunch of helper functions in libcompiler_rt.a,
to avoid link errors like "undefined symbol: __aarch64_ldadd8_acq_rel".
(Note: of course you can use -mno-outline-atomics as a workaround too,
but this would negate the potential performance benefit of the faster
LSE instructions.)
Bump __FreeBSD_version so ports maintainers can easily detect this.
if_bridge used to only allow MTU changes if the new MTU matched that of
all member interfaces. This doesn't really make much sense, in that we
really shouldn't be allowed to change the MTU of bridge member in the
first place.
Instead we now change the MTU of all member interfaces. If one fails we
revert all interfaces back to the original MTU.
We do not address the issue where bridge member interface MTUs can be
changed here.
John Hood [Wed, 28 Jul 2021 19:43:02 +0000 (13:43 -0600)]
loader: support.4th resets the read buffer incorrectly
Large nextboot.conf files (over 80 bytes) are not read correctly by the
Forth loader, causing file parsing to abort, and nextboot configuration
fails to apply.
Simple repro:
nextboot -e foo=xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
shutdown -r now
That will cause the bug to cause a parse failure but shouldn't otherwise
affect the boot. Depending on your loader configuration, you may also
have to set beastie_disable and/or reduce the number of modules loaded
to see the error on a small console screen. 12.0 or CURRENT users will
also have to explicitly use the Forth loader instead of the Lua loader.
The error will look something like:
Warning: syntax error on file /boot/loader.conf.local
foo="xxxxxxxxxxxxxxnextboot_enable="YES"
^
/boot/support.4th has crude file I/O buffering, which uses a buffer
'read_buffer', defined to be 80 bytes by the 'read_buffer_size'
constant. The loader first tastes nextboot.conf, reading and parsing
the first line in it for nextboot_enable="YES". If this is true, then
it reopens the file and parses it like other loader .conf files.
Unfortunately, the file I/O buffering code does not fully reset the
buffer state in the reset_line_reading word. If the last file was read
to the end, that doesn't matter; the file buffer is treated as empty
anyway. But in the nextboot.conf case, the loader will not read to the
end of file if it is over 80 bytes, and the file buffer may be reused
when reading the next file. When the file is reread, the corrupt text
may cause file parsing to abort on bad syntax (if the corrupt line has
<>2 quotes in it), the wrong variable to be set, no variable to be set
at all, or (if the splice happens to land at a line ending) something
approximating normal operation.
The bug is very old, dating back to at least 2000 if not before, and is
still present in 12.0 and CURRENT r345863 (though it is now hidden by
the Lua loader by default).
Suggested one-line attached. This does change the behavior of the
reset_line_reading word, which is exported in the line-reading
dictionary (though the export is not documented in loader man pages).
But repo history shows it was probably exported for the PNP support
code, which was never included in the loader build, and was removed 5
months ago.
One thing that puzzles me: how has this bug gone unnoticed/unfixed for
nearly 2 decades? I find it hard to believe that nobody's tried to do
something interesting with nextboot, like load a kernel and filesystem,
which is what I'm doing.
Warner Losh [Wed, 28 Jul 2021 19:47:05 +0000 (13:47 -0600)]
genoffset: simplify and rewrite in sh
genoffset used the fully generic ASSYM macro to generate the offsets
needed for the thread_lite structure. However, since these are offsets
into a structure, they will always be necessarily small and positive. As
such, just create a simple character array of the right size and use a
naming convention such that we can recover the field name, structure
name and type. Use nm -t d and sort -n to sort these into order, then
loop over the resutls to generate the thread_lite structure.
Roy Marples [Wed, 28 Jul 2021 15:46:59 +0000 (08:46 -0700)]
socket: Implement SO_RERROR
SO_RERROR indicates that receive buffer overflows should be handled as
errors. Historically receive buffer overflows have been ignored and
programs could not tell if they missed messages or messages had been
truncated because of overflows. Since programs historically do not
expect to get receive overflow errors, this behavior is not the
default.
This is really really important for programs that use route(4) to keep
in sync with the system. If we loose a message then we need to reload
the full system state, otherwise the behaviour from that point is
undefined and can lead to chasing bogus bug reports.
Add zfskeys rc.d script for auto-loading encryption keys
ZFS in 13 supports encryption, but for the use case where keys are
available in plaintext on disk there is no mechanism for automatically
loading keys on startup.
This script will, by default, look for any dataset with encryption and
keylocation prefixed with file://. It will attempt to unlock, timing
out after 10 seconds for each dataset found.
User can optionally specify explicitly which datasets to attempt to
unlock.
Also supports (optionally by force) unmounting filesystems and unloading
associated keys.
This is no more or less than returning the smaller of two values. Since
this is what min() does, use that to shrink max_nr_grant_frames() down
to the single line.
xen/xen-os: move inclusion of machine/xen-os.h later
Several of x86 enable/disable functions depend upon the xen*domain()
functions. As such the xen*domain() functions need to be declared
before machine/xen-os.h.
Officially declare direct inclusion of machine/xen/xen-os.h verboten as
such will break these functions/macros. Remove one such soon to be
broken inclusion.
For embedded devices reserved addresses will be known in advance. More
recently added devices will also likely be correctly updated. As a
result using any available address is reasonable on non-x86.
Functions tend to get renamed and unless the developer is careful
often debugging messages are missed. As such using func is far
superior. Replace several instances of hard-coded function names.
xen/intr: use struct xenisrc * as xen_intr_handle_t
Since xen_intr_handle_t is meant to be an opaque handle and the only
use is retrieving the associated struct xenisrc *, directly use it as
the opaque handler.
Also add a wrapper function for converting the other direction. If some
other value becomes appropriate in the future, these two functions will
be the only spots needing modification.
xen/control: gate x86 specific code in the preprocessor
Commit 152265223048 was implemented strictly for x86. Unfortunately
one of the pieces was mixed into a common area breaking other
architectures. For now disable these bits on !x86, this should be
cleaned up later.
xen: create VM_MEMATTR_XEN for Xen memory mappings
The requirements for pages shared with Xen/other VMs may vary from
architecture to architecture. As such create a macro which various
architectures can use.
Remove a use of PAT_WRITE_BACK in xenstore.c. This is a x86-ism which
shouldn't have been present in a common area.
Julien Grall [Tue, 14 Jan 2014 01:41:10 +0000 (17:41 -0800)]
xen: move x86/xen/xenpv.c to dev/xen/bus/xenpv.c
Minor changes are necessary to make this processor-independent, but
moving the file out of x86 and into common is the first step (so
others don't add /more/ x86-isms).
Mark Johnston [Wed, 28 Jul 2021 14:16:42 +0000 (10:16 -0400)]
pf: Validate user string nul-termination before copying
Some pf ioctl handlers use strlcpy() to copy strings when converting
from user structures to their in-kernel representations. strlcpy()
ensures that the destination will be nul-terminated, but it assumes that
the source is nul-terminated. In particular, it returns the full length
of the source string, so if the source is not nul-terminated, strlcpy()
will keep scanning until it finds a nul byte, and it may encounter an
unmapped page first. Add a helper to validate user strings before
copying.
There are also places where we look up a ruleset using a user-provided
anchor string. In some ioctl handlers we were already nul-terminating
the string, avoiding the same problem, but in other places we were not.
Fix those by nul-terminating as well. Aside from being consistent,
anchors have a maximum length of MAXPATHLEN - 1 so calling strnlen()
might not be so desirable.
Reported by: syzbot+35a1549b4663e9483dd1@syzkaller.appspotmail.com
Reviewed by: kp
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D31169
Mark Johnston [Wed, 28 Jul 2021 14:16:25 +0000 (10:16 -0400)]
pf: Initialize arrays before copying out to userland
A number of pf ioctls populate an array of structures and copy it out.
They have the following structures:
- caller specifies the size of its output buffer
- ioctl handler allocates a kernel buffer of the same size
- ioctl handler populates the buffer, possibly leaving some items
initialized if the caller provided more space than needed
- ioctl handler copies the entire buffer out to userland
Thus, if more space was provided than is required, we end up copying out
uninitialized kernel memory. Simply zero the buffer at allocation time
to prevent this.
Reported by: KMSAN
Reviewed by: kp
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D31313
Add fsleep() function now required by rtw88. This seems to be
making a decision depending on time to sleep on how to sleep.
Given our compat framework already is lenient on how long to sleep,
this is a cut down version.
MFC after: 10 days
Reviewed by: hselasky
Differential Revision: https://reviews.freebsd.org/D31322
Michal Meloun [Fri, 9 Jul 2021 17:33:36 +0000 (19:33 +0200)]
booti: Enable loading the kernel image to any address aligned to 2 MB
We've supported this for a long time, plus most u-boot setups quietly expect
it. Otherwise they fail with different levels of memory overwrites.
MFC after: 2 weeks
ibcore: Kernel space update based on Linux 5.7-rc1.
Overview:
This is the first stage of a RDMA stack upgrade introducing kernel
changes only based on Linux 5.7-rc1.
This patch is based on about four main areas of work:
- Update of the IB uobjects system:
- The memory holding so-called AH, CQ, PD, SRQ and UCONTEXT objects
is now managed by ibcore. This also require some changes in the
kernel verbs API. The updated verbs changes are typically about
initialize and deinitialize objects, and remove allocation and
free of memory.
- Update of the uverbs IOCTL framework:
- The parsing and handling of user-space commands has been
completely refactored to integrate with the updated IB uobjects
system.
- Various changes and updates to the generic uverbs interfaces in
device drivers including the new uAPI surface.
- The mlx5_ib_devx.c in mlx5ib and related mlx5 core changes.
Dependencies:
- The mlx4ib driver code has been updated with the minimum changes
needed.
- The mlx5ib driver code has been updated with the minimum changes
needed including DV support.
Compatibility:
- All user-space facing APIs are backwards compatible after this
change.
- All kernel-space facing RDMA APIs are backwards compatible after
this change, with exception of ib_create_ah() and ib_destroy_ah()
which takes a new flag.
- The "ib_device_ops" structure exist, but only contains the driver ID
and some structure sizes.
Differences from Linux:
- Infiniband drivers must use the INIT_IB_DEVICE_OPS() macro to set
the sizes needed for allocating various IB objects, when adding
IB device instances.
Security:
- PRIV_NET_RAW is needed to use raw ethernet transmit features.
- PRIV_DRIVER is needed to use other privileged operations.
to restore ABI compatibility for pre-10.x binaries.
It restores _umtx_lock() and _umtx_unlock() syscalls, and UMTX_OP_LOCK/
UMTX_OP_UNLOCK umtx_op(2) operations. UMUTEX_ERROR_CHECK flag is left
out for now, I do not think it makes a difference.
PR: 218571
Reviewed by: brooks (previous version)
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D31220
Andrew Turner [Tue, 27 Jul 2021 19:28:33 +0000 (19:28 +0000)]
Teach the arm64 kernel to identify the Arm AEM
The Arm Architecture Envelope Model is a simulator that models the
architecture rather than any specific implementation. Add its part ID
macro and add it to the list of Arm CPUs we can decode.