Brooks Davis [Tue, 10 Nov 2020 21:12:32 +0000 (21:12 +0000)]
Be more tolerant of share/mk and kern.mk mismatch
When building out-of-tree modules, it appears that the system share/mk
is used, but sys/conf/kern.mk is used. That results in MK_INIT_ALL_ZERO
being undefined. In the interest of maximum compatability, check
that MK_INIT_ALL_* and COMPILER_FEATURES are defined before comparing
their values.
John Baldwin [Tue, 10 Nov 2020 19:54:39 +0000 (19:54 +0000)]
Clear tp->tod in t4_pcb_detach().
Otherwise, a socket can have a non-NULL tp->tod while TF_TOE is clear.
In particular, if a newly accepted socket falls back to non-TOE due to
an active open failure, the non-TOE socket will still have tp->tod set
even though TF_TOE is clear.
Brooks Davis [Tue, 10 Nov 2020 19:15:13 +0000 (19:15 +0000)]
Support initializing stack variables on function entry
There are two options:
- WITH_INIT_ALL_ZERO: Zero all variables on the stack.
- WITH_INIT_ALL_PATTERN: Initialize variables with well-defined patterns.
The exact pattern are a compiler implementation detail and vary by type.
They are somewhat documented in the LLVM commit message:
https://reviews.llvm.org/rL349442
I've used WITH_INIT_ALL_* to match Microsoft's InitAll feature rather
than naming them after the LLVM specific compiler flags.
In a range of consumer products, options like these are used in
both debug and production builds with debugs builds using patterns
(intended to provoke crashes on use of uninitialized values) and
production using zeros (deemed more likely to lead to harmless
misbehavior or NULL-pointer dereferences).
When destroying a UMA zone which has a reserve (set with
uma_zone_reserve()), messages like the following appear on the console:
"Freed UMA keg (Test zone) was not empty (0 items). Lost 528 pages of
memory."
When keg_drain_domain() is draining the zone, it tries to keep the number
of items specified in the reservation. However, when we are destroying the
UMA zone, we do not need to keep those items. Therefore, when destroying a
non-secondary and non-cache zone, we should reset the keg reservation to 0
prior to draining the zone.
Perhaps it made sense in 1998 (r32836), but now it feels a bit out of
place. We tend to avoid documenting non-essential ports variables in
the manual page (we try to document them in the Porter's Handbook instead).
Bjoern A. Zeeb [Mon, 9 Nov 2020 23:34:32 +0000 (23:34 +0000)]
arm64: bs_sr_<N> take II
In r367327 generic_bs_sr_<n> were derived from mips. Given we are calling
generic_bs_w_<n> and no write directly, we do not have to do the address
calculations ourselves as eneric_bs_w_<n> will do a str val [bsh, offset].
All we actually have to do is increment offset.
Mateusz Guzik [Mon, 9 Nov 2020 23:05:28 +0000 (23:05 +0000)]
threads: reimplement tid allocation on top of a bitmap
There are workloads with very bursty tid allocation and since unr tries very
hard to have small-sized bitmaps it keeps reallocating memory. Just doing
buildkernel gives almost 150k calls to free coming from unr.
This also gets rid of the hack which tried to postpone TID reuse.
Michael Tuexen [Mon, 9 Nov 2020 21:49:40 +0000 (21:49 +0000)]
RFC 7323 specifies that:
* TCP segments without timestamps should be dropped when support for
the timestamp option has been negotiated.
* TCP segments with timestamps should be processed normally if support
for the timestamp option has not been negotiated.
This patch enforces the above.
Emmanuel Vadot [Mon, 9 Nov 2020 13:20:14 +0000 (13:20 +0000)]
LinuxKPI: Implement ACPI bits required by drm-kmod in base system
It includes:
ACPI_HANDLE() implementation.
AC and VIDEO ACPI events notification support.
Replacement of hand-rolled GPLed _DSM method evaluation helpers
with in-base ones.
Brandon Bergren [Sun, 8 Nov 2020 23:34:06 +0000 (23:34 +0000)]
[PowerPC] Fix powerpc64le boot after HPT superpages addition
The HPT is always stored in big-endian, as it is accessed directly by the
hardware as well as the kernel. As such, it is necessary to convert values
to and from native endian when running on LE.
Some unconverted accesses snuck in accidentally with r367417.
Apply the appropriate conversions to fix boot hanging on powerpc64le.
Mitchell Horne [Sun, 8 Nov 2020 18:49:23 +0000 (18:49 +0000)]
igmp: convert igmpstat to use PCPU counters
Currently there is no locking done to protect this structure. It is
likely okay due to the low-volume nature of IGMP, but allows for
the possibility of underflow. This appears to be one of the only
holdouts of the conversion to counter(9) which was done for most
protocol stat structures around 2013.
This also updates the visibility of this stats structure so that it can
be consumed from elsewhere in the kernel, consistent with the vast
majority of VNET_PCPUSTAT structures.
Reviewed by: kp
Sponsored by: NetApp, Inc.
Sponsored by: Klara, Inc.
Differential Revision: https://reviews.freebsd.org/D27023
Prevent premature SACK block transmission during loss recovery
Under specific conditions, a window update can be sent with
outdated SACK information. Some clients react to this by
subsequently delaying loss recovery, making TCP perform very
poorly.
The goal of the fib support is to provide multiple independent
routing tables, isolated from each other.
net.add_addr_allfibs default tries to shift gears in the opposite
direction, unconditionally inserting all addresses to all of the fibs.
There are use cases when this is necessary, however this is not a
default expected behaviour, especially compared to other implementations.
Provide WARNING message for the setups with multiple fibs to notify
potential users of the feature.
Dimitry Andric [Sun, 8 Nov 2020 12:47:35 +0000 (12:47 +0000)]
Merge commit 354d3106c from llvm git (by Kai Luo):
[PowerPC] Skip combining (uint_to_fp x) if x is not simple type
Current powerpc64le backend hits
```
Combining: t7: f64 = uint_to_fp t6
llc: llvm-project/llvm/include/llvm/CodeGen/ValueTypes.h:291:
llvm::MVT llvm::EVT::getSimpleVT() const: Assertion `isSimple() &&
"Expected a SimpleValueType!"' failed.
```
This patch fixes it by skipping combination if `t6` is not simple
type.
Fixed https://bugs.llvm.org/show_bug.cgi?id=47660.
- add more linux socket options (sorted by value)
- map those IPv4 / IPv6 socket options which exist in FreeBSD
+ most of them visually verified to have the same type/layout of arguments
+ not tested with linux programs to behave as intended
- be more human readable for known options which are not handled
- be more verbose for unhandled socket message flags we know about
- print the jail ID in linux_msg if run in a jail
- add possibility to print debug message about known missing parts only once
- add multiple levels of sysctl linux.debug:
1: print debug messages, tell about unimplemented stuff (only once)
2: like 1, but also print messages about implemented but not tested
stuff (only once)
3+: like 2, but no rate limiting of messages
- increase default linux debug level from 1 to 3
We are a lot more verbose in as we need to be (e.g. some of the IP socket
options which are the same, and share the same memory layout, and are
believed to work). The reason is that we have no good testsuite to test those
linux-bits. The LTP or other test suites like the python one, are not fully
up to the task we need. As such the excessive messages about emulated but not
tested socket options.
IMO any MFC (possible, but most probably not by me) should set the default
debug level to 1.
Kyle Evans [Sun, 8 Nov 2020 04:24:29 +0000 (04:24 +0000)]
imgact_binmisc: limit the extent of match on incoming entries
imgact_binmisc matches magic/mask from imgp->image_header, which is only a
single page in size mapped from the first page of an image. One can specify
an interpreter that matches on, e.g., --offset 4096 --size 256 to read up to
256 bytes past the mapped first page.
The limitation is that we cannot specify a magic string that exceeds a
single page, and we can't allow offset + size to exceed a single page
either. A static assert has been added in case someone finds it useful to
try and expand the size, but it does seem a little unlikely.
While this looks kind of exploitable at a sideways squinty-glance, there are
a couple of mitigating factors:
1.) imgact_binmisc is not enabled by default,
2.) entries may only be added by the superuser,
3.) trying to exploit this information to read what's mapped past the end
would be worse than a root canal or some other relatably painful
experience, and
4.) there's no way one could pull this off without it being completely
obvious.
The first page is mapped out of an sf_buf, the implementation of which (or
lack thereof) depends on your platform.
Thomas Munro [Sun, 8 Nov 2020 02:50:34 +0000 (02:50 +0000)]
Add collation version support to querylocale(3).
Provide a way to ask for an opaque version string for a locale_t, so
that potential changes in sort order can be detected. Similar to
ICU's ucol_getVersion() and Windows' GetNLSVersionEx(), this API is
intended to allow databases to detect when text order-based indexes
might need to be rebuilt.
The CLDR version is extracted from CLDR source data by the Makefile
under tools/tools/locale, written into the machine-generated Makefile
under shared/colldef, passed to localedef -V, and then written into
LC_COLLATE file headers. The initial version is 34.0.
tools/tools/locale was recently updated to pull down 35.0, but the
output hasn't been committed under share/colldef yet, so that will
provide the first observable change when it happens. Other versioning
schemes are possible in future, because the format is unspecified.
Reviewed by: bapt, 0mp, kib, yuripv (albeit a long time ago)
Differential Revision: https://reviews.freebsd.org/D17166
Warner Losh [Sun, 8 Nov 2020 02:20:21 +0000 (02:20 +0000)]
Be explicit about recompiling all the modules...
Add a note about always recompiling all modules on every new kernel
change / update. In addition, suggest using /usr/local/sys/modules
so this happens automatically.
Michael Tuexen [Sat, 7 Nov 2020 21:17:49 +0000 (21:17 +0000)]
The ioctl() calls using FIONREAD, FIONWRITE, FIONSPACE, and SIOCATMARK
access the socket send or receive buffer. This is not possible for
listening sockets since r319722.
Because send()/recv() calls fail on listening sockets, fail also ioctl()
indicating EINVAL.
Kyle Evans [Sat, 7 Nov 2020 18:07:55 +0000 (18:07 +0000)]
imgact_binmisc: move some calculations out of the exec path
The offset we need to account for in the interpreter string comes in two
variants:
1. Fixed - macros other than #a that will not vary from invocation to
invocation
2. Variable - #a, which is substitued with the argv0 that we're replacing
Note that we don't have a mechanism to modify an existing entry. By
recording both of these offset requirements when the interpreter is added,
we can avoid some unnecessary calculations in the exec path.
Most importantly, we can know up-front whether we need to grab
calculate/grab the the filename for this interpreter. We also get to avoid
walking the string a first time looking for macros. For most invocations,
it's a swift exit as they won't have any, but there's no point entering a
loop and searching for the macro indicator if we already know there will not
be one.
While we're here, go ahead and only calculate the argv0 name length once per
invocation. While it's unlikely that we'll have more than one #a, there's no
reason to recalculate it every time we encounter an #a when it will not
change.
I have not bothered trying to benchmark this at all, because it's arguably a
minor and straightforward/obvious improvement.
Bryan Drewery [Sat, 7 Nov 2020 17:18:44 +0000 (17:18 +0000)]
syslogd: Stop trying to send remote messages through special sockets
Specifically this was causing the /dev/klog fd and the signal pipe
handling fd to get a sendmsg(2) called on them and always returned
[ENOTSOCK].
r310350 combined these sockets into the main socket list and properly
skipped AF_UNSPEC at the sendmsg(2) call but later in r344739 it was
broken such that these special sockets were no longer excluded since
the AF_UNSPEC check specifically excluded these special sockets. Only
these special sockets have sl_sa = NULL. The sl_family checks should
be redundant now but are left in case of future changes so the intent
is clearer.
Mateusz Guzik [Sat, 7 Nov 2020 16:57:53 +0000 (16:57 +0000)]
rms: several cleanups + debug read lockers handling
This adds a dedicated counter updated with atomics when INVARIANTS
is used. As a side effect one can reliably determine the lock is held
for reading by at least one thread, but it's still not possible to
find out whether curthread has the lock in said mode.
Kyle Evans [Sat, 7 Nov 2020 16:41:59 +0000 (16:41 +0000)]
imgact_binmisc: reorder members of struct imgact_binmisc_entry (NFC)
This doesn't change anything at the moment since the out-of-order elements
were a pair of uint32_t, but future additions may have caused unnecessary
padding by following the existing precedent.
Kyle Evans [Sat, 7 Nov 2020 15:38:01 +0000 (15:38 +0000)]
vt: resolve conflict between VT_ALT_TO_ESC_HACK and DBG
When using the ALT+CTRL+ESC sequence to break into kdb, the keyboard is
completely borked when you return. watch(8) shows that it's working, but
it's inserting escape sequences.
Further investigation revealed that VT_ALT_TO_ESC_HACK is the default and
directly conflicts with this sequence, so upon return from the debugger
ALKED is set.
If they triggered the break to debugger, it's safe to assume they didn't
mean to use VT_ALT_TO_ESC_HACK, so just unset it to reduce the surprise when
the keyboard seems non-functional upon return.
If we hadn't been traced in the first place when syscallenter()
started executing, we can ignore TDB_USERWR. TDB_USERWR can get set,
sure, but if it does, it's because the debugger raced with the syscall,
and it cannot depend on winning that race.
Kyle Evans [Sat, 7 Nov 2020 05:10:46 +0000 (05:10 +0000)]
imgact_binmisc: abstract away the list lock (NFC)
This module handles relatively few execs (initial qemu-user-static, then
qemu-user-static handles exec'ing itself for binaries it's already running),
but all execs pay the price of at least taking the relatively expensive
sx/slock to check for a match when this module is loaded. Future work will
almost certainly swap this out for another lock, perhaps an rmslock.
The RLOCK/WLOCK phrasing was chosen based on what the callers are really
wanting, rather than using the verbiage typically appropriate for an sx.
Kyle Evans [Sat, 7 Nov 2020 04:10:23 +0000 (04:10 +0000)]
imgact_binmisc: validate flags coming from userland
We may want to reserve bits in the future for kernel-only use, so start
rejecting any that aren't the two that we're currently expecting from
userland.
Kyle Evans [Sat, 7 Nov 2020 03:43:45 +0000 (03:43 +0000)]
binmiscctl(8): miscellaneous cleanup
- Bad whitespace in Makefile.
- Reordered headers, sys/ first.
- Annotated fatal/usage __dead2 to help `make analyze` out a little bit.
- Spell a couple of sizeof constructs as "nitems" and "howmany" instead.
Kyle Evans [Sat, 7 Nov 2020 03:29:04 +0000 (03:29 +0000)]
epoch: support non-preemptible epochs checking in_epoch()
Previously, non-preemptible epochs could not check; in_epoch() would always
fail, usually because non-preemptible epochs don't imply THREAD_NO_SLEEPING.
For default epochs, it's easy enough to verify that we're in the given
epoch: if we're in a critical section and our record for the given epoch
is active, then we're in it.
This patch also adds some additional INVARIANTS bookkeeping. Notably, we set
and check the recorded thread in epoch_enter/epoch_exit to try and catch
some edge-cases for the caller. It also checks upon freeing that none of the
records had a thread in the epoch, which may make it a little easier to
diagnose some improper use if epoch_free() took place while some other
thread was inside.
This version differs slightly from what was just previously reviewed by the
below-listed, in that in_epoch() will assert that no CPU has this thread
recorded even if it *is* currently in a critical section. This is intended
to catch cases where the caller might have somehow messed up critical
section nesting, we can catch both if they exited the critical section or if
they exited, migrated, then re-entered (on the wrong CPU).
Kyle Evans [Sat, 7 Nov 2020 03:28:32 +0000 (03:28 +0000)]
imgact_binmisc: minor re-organization of imgact_binmisc_exec exits
Notably, streamline error paths through the existing 'done' label, making it
easier to quickly verify correct cleanup.
Future work might add a kernel-only flag to indicate that a interpreter uses
#a. Currently, all executions via imgact_binmisc pay the penalty of
constructing sname/fname, even if they will not use it. qemu-user-static
doesn't need it, the stock rc script for qemu-user-static certainly doesn't
use it, and I suspect these are the vast majority of (if not the only)
current users.
Conrad Meyer [Fri, 6 Nov 2020 22:04:57 +0000 (22:04 +0000)]
linux(4): Fix loadable modules after r367395
Move dtrace SDT definitions into linux_common module code. Also, build
linux_dummy.c into the linux_common kld -- we don't need separate
versions of these stubs for 32- and 64-bit emulation.
Mateusz Guzik [Fri, 6 Nov 2020 21:33:59 +0000 (21:33 +0000)]
malloc: move malloc_type_internal into malloc_type
According to code comments the original motivation was to allow for
malloc_type_internal changes without ABI breakage. This can be trivially
accomplished by providing spare fields and versioning the struct, as
implemented in the patch below.
The upshots are one less memory indirection on each alloc and disappearance
of mt_zone.
Toomas Soome [Fri, 6 Nov 2020 21:27:54 +0000 (21:27 +0000)]
efifb: vt_generate_cons_palette() takes max color, not mask
vt_generate_cons_palette() does take max values of RGB component colours, not
mask. Also we need to set info->fb_cmsize, or vt_fb_init() will re-initialize
the info->fb_cmap.
Remove 'struct trapframe' pointer from mips64's 'struct syscall_args'.
While here, use MAXARGS. This brings its 'struct syscall_args' in sync
with most other architectures.
Leandro Lupori [Fri, 6 Nov 2020 18:50:00 +0000 (18:50 +0000)]
Fix powerpc and LINT builds
Fix build errors introduced by r367417 and r367390:
- Guard label reached only by powerpc64
- Guard vm_reserv_level_iffullpop call, that is not defined on powerpc
variants that don't support superpages
- Add missing hwpmc file, for when hwpmc is built into kernel
Rick Macklem [Fri, 6 Nov 2020 16:33:42 +0000 (16:33 +0000)]
Add support for the new mountd -R option.
r376026 added a new "-R" option to mountd, which tells it to
not support the Mount protocol (not used by NFSv4) and not
register with rpcbind.
Rpcbind is considered a security issue by some sites now.
This patch adds a new yes/no variable called nfsv4_server_only.
When that is set, make vfs.nfsd.server_min_vers=4 and set "=R"
for mountd.
Setting vfs.nfsd.server_min_vers=4 tells nfsd to not register with rpcbind.
While here, add a check for "load_kld nfsd" failing to nfsd.
Leandro Lupori [Fri, 6 Nov 2020 14:12:45 +0000 (14:12 +0000)]
Implement superpages for PowerPC64 (HPT)
This change adds support for transparent superpages for PowerPC64
systems using Hashed Page Tables (HPT). All pmap operations are
supported.
The changes were inspired by RISC-V implementation of superpages,
by @markj (r344106), but heavily adapted to fit PPC64 HPT architecture
and existing MMU OEA64 code.
While these changes are not better tested, superpages support is disabled by
default. To enable it, use vm.pmap.superpages_enabled=1.
In this initial implementation, when superpages are disabled, system
performance stays at the same level as without these changes. When
superpages are enabled, buildworld time increases a bit (~2%). However,
for workloads that put a heavy pressure on the TLB the performance boost
is much bigger (see HPC Challenge and pgbench on D25237).
Reviewed by: jhibbits
Sponsored by: Eldorado Research Institute (eldorado.org.br)
Differential Revision: https://reviews.freebsd.org/D25237