bdrewery [Sat, 7 Nov 2020 17:18:44 +0000 (17:18 +0000)]
syslogd: Stop trying to send remote messages through special sockets
Specifically this was causing sendmsg(2) to be called on the /dev/klog fd
and the signal pipe handling fd, which always returned
[ENOTSOCK].
r310350 combined these sockets into the main socket list and properly
skipped AF_UNSPEC at the sendmsg(2) call, but r344739 later broke this:
the special sockets were no longer excluded, because the AF_UNSPEC check
was what had excluded them. Only
these special sockets have sl_sa = NULL. The sl_family checks should
be redundant now but are left in case of future changes so the intent
is clearer.
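The filtering described above can be sketched as follows. This is a minimal illustration, not the syslogd source: struct socklist_sketch and can_send_remote() are hypothetical, and only the field names sl_sa and sl_family come from the commit message.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <sys/socket.h>

/* Hypothetical stand-in for an entry in syslogd's socket list. */
struct socklist_sketch {
	int		 sl_socket;	/* file descriptor */
	int		 sl_family;	/* AF_INET, AF_INET6, AF_UNSPEC, ... */
	struct sockaddr	*sl_sa;		/* NULL for special fds (klog, signal pipe) */
};

/* Return true only for entries that are real network sockets. */
bool
can_send_remote(const struct socklist_sketch *sl)
{
	assert(sl != NULL);
	if (sl->sl_sa == NULL)		/* special fd: sendmsg(2) would fail */
		return (false);
	if (sl->sl_family == AF_UNSPEC)	/* redundant, kept to state the intent */
		return (false);
	return (true);
}
```

A sender loop would call can_send_remote() on each list entry and skip the entry when it returns false.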
mjg [Sat, 7 Nov 2020 16:57:53 +0000 (16:57 +0000)]
rms: several cleanups + debug read lockers handling
This adds a dedicated counter updated with atomics when INVARIANTS
is used. As a side effect one can reliably determine the lock is held
for reading by at least one thread, but it's still not possible to
find out whether curthread has the lock in said mode.
kevans [Sat, 7 Nov 2020 16:41:59 +0000 (16:41 +0000)]
imgact_binmisc: reorder members of struct imgact_binmisc_entry (NFC)
This doesn't change anything at the moment since the out-of-order elements
were a pair of uint32_t, but future additions may have caused unnecessary
padding by following the existing precedent.
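As a generic illustration of the padding concern (the structs below are hypothetical and are not the members of struct imgact_binmisc_entry), placing a narrow member between wider ones can force the compiler to insert padding, while ordering members widest first avoids it:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical layout: narrow members separated by a wide one. */
struct entry_unordered {
	uint8_t		a;
	uint64_t	b;	/* typically preceded by alignment padding */
	uint8_t		c;
};

/* Same members, widest first. */
struct entry_ordered {
	uint64_t	b;
	uint8_t		a;
	uint8_t		c;
};

/* Reordering never makes this pair larger. */
static_assert(sizeof(struct entry_ordered) <= sizeof(struct entry_unordered),
    "widest-first ordering should not add padding");
```

On a typical LP64 ABI the first layout occupies 24 bytes and the second 16. Swapping two adjacent uint32_t members, as in this commit, leaves the size unchanged.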
kevans [Sat, 7 Nov 2020 15:38:01 +0000 (15:38 +0000)]
vt: resolve conflict between VT_ALT_TO_ESC_HACK and DBG
When using the ALT+CTRL+ESC sequence to break into kdb, the keyboard is
completely borked when you return. watch(8) shows that it's working, but
it's inserting escape sequences.
Further investigation revealed that VT_ALT_TO_ESC_HACK is the default and
directly conflicts with this sequence, so upon return from the debugger
ALKED is set.
If they triggered the break to debugger, it's safe to assume they didn't
mean to use VT_ALT_TO_ESC_HACK, so just unset it to reduce the surprise when
the keyboard seems non-functional upon return.
trasz [Sat, 7 Nov 2020 13:09:51 +0000 (13:09 +0000)]
Move TDB_USERWR check under 'if (traced)'.
If we hadn't been traced in the first place when syscallenter()
started executing, we can ignore TDB_USERWR. TDB_USERWR can get set,
sure, but if it does, it's because the debugger raced with the syscall,
and it cannot depend on winning that race.
kevans [Sat, 7 Nov 2020 05:10:46 +0000 (05:10 +0000)]
imgact_binmisc: abstract away the list lock (NFC)
This module handles relatively few execs (initial qemu-user-static, then
qemu-user-static handles exec'ing itself for binaries it's already running),
but all execs pay the price of at least taking the relatively expensive
sx/slock to check for a match when this module is loaded. Future work will
almost certainly swap this out for another lock, perhaps an rmslock.
The RLOCK/WLOCK phrasing was chosen based on what the callers are really
wanting, rather than using the verbiage typically appropriate for an sx.
kevans [Sat, 7 Nov 2020 04:10:23 +0000 (04:10 +0000)]
imgact_binmisc: validate flags coming from userland
We may want to reserve bits in the future for kernel-only use, so start
rejecting any that aren't the two that we're currently expecting from
userland.
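A minimal sketch of this kind of validation, assuming two userland-visible flag bits. The SK_* names and values and validate_user_flags() are hypothetical; the real definitions live in the kernel headers.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Hypothetical flag bits accepted from userland. */
#define	SK_FLAG_ENABLED		0x0001u
#define	SK_FLAG_USE_MASK	0x0002u
#define	SK_USER_FLAGS		(SK_FLAG_ENABLED | SK_FLAG_USE_MASK)

static_assert((SK_FLAG_ENABLED & SK_FLAG_USE_MASK) == 0,
    "flags must be distinct bits");

/* Reject any bit outside the set userland is allowed to pass. */
int
validate_user_flags(uint32_t flags)
{
	if ((flags & ~SK_USER_FLAGS) != 0)
		return (EINVAL);
	return (0);
}
```

Rejecting unknown bits now means a bit reserved later for kernel-only use cannot already be set by existing userland callers.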
kevans [Sat, 7 Nov 2020 03:43:45 +0000 (03:43 +0000)]
binmiscctl(8): miscellaneous cleanup
- Bad whitespace in Makefile.
- Reordered headers, sys/ first.
- Annotated fatal/usage __dead2 to help `make analyze` out a little bit.
- Spell a couple of sizeof constructs as "nitems" and "howmany" instead.
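For readers unfamiliar with the two macros named in the last item, the sketch below shows what they replace. The macro bodies match the usual sys/param.h definitions and are defined locally only so the example stands alone; the cmds array and the helper functions are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#ifndef nitems
#define	nitems(x)	(sizeof((x)) / sizeof((x)[0]))	/* array element count */
#endif
#ifndef howmany
#define	howmany(x, y)	(((x) + ((y) - 1)) / (y))	/* x / y, rounded up */
#endif

/* Hypothetical table, standing in for any fixed-size array. */
static const char *const cmds[] = { "add", "remove", "list" };

static_assert(howmany(10, 4) == 3, "howmany rounds up");

/* Instead of sizeof(cmds) / sizeof(cmds[0]). */
size_t
cmd_count(void)
{
	return (nitems(cmds));
}

/* Instead of (nbytes + sizeof(long) - 1) / sizeof(long). */
size_t
longs_for_bytes(size_t nbytes)
{
	return (howmany(nbytes, sizeof(long)));
}
```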
kevans [Sat, 7 Nov 2020 03:29:04 +0000 (03:29 +0000)]
epoch: support non-preemptible epochs checking in_epoch()
Previously, non-preemptible epochs could not check; in_epoch() would always
fail, usually because non-preemptible epochs don't imply THREAD_NO_SLEEPING.
For default epochs, it's easy enough to verify that we're in the given
epoch: if we're in a critical section and our record for the given epoch
is active, then we're in it.
This patch also adds some additional INVARIANTS bookkeeping. Notably, we set
and check the recorded thread in epoch_enter/epoch_exit to try and catch
some edge-cases for the caller. It also checks upon freeing that none of the
records had a thread in the epoch, which may make it a little easier to
diagnose some improper use if epoch_free() took place while some other
thread was inside.
This version differs slightly from what was just previously reviewed by the
below-listed, in that in_epoch() will assert that no CPU has this thread
recorded even if it *is* currently in a critical section. This is intended
to catch cases where the caller might have somehow messed up critical
section nesting; we can catch both if they exited the critical section or if
they exited, migrated, then re-entered (on the wrong CPU).
kevans [Sat, 7 Nov 2020 03:28:32 +0000 (03:28 +0000)]
imgact_binmisc: minor re-organization of imgact_binmisc_exec exits
Notably, streamline error paths through the existing 'done' label, making it
easier to quickly verify correct cleanup.
Future work might add a kernel-only flag to indicate that an interpreter uses
#a. Currently, all executions via imgact_binmisc pay the penalty of
constructing sname/fname, even if they will not use it. qemu-user-static
doesn't need it, the stock rc script for qemu-user-static certainly doesn't
use it, and I suspect these are the vast majority of (if not the only)
current users.
cem [Fri, 6 Nov 2020 22:04:57 +0000 (22:04 +0000)]
linux(4): Fix loadable modules after r367395
Move dtrace SDT definitions into linux_common module code. Also, build
linux_dummy.c into the linux_common kld -- we don't need separate
versions of these stubs for 32- and 64-bit emulation.
mjg [Fri, 6 Nov 2020 21:33:59 +0000 (21:33 +0000)]
malloc: move malloc_type_internal into malloc_type
According to code comments the original motivation was to allow for
malloc_type_internal changes without ABI breakage. This can be trivially
accomplished by providing spare fields and versioning the struct, as
implemented in the patch below.
The upshots are one less memory indirection on each alloc and disappearance
of mt_zone.
tsoome [Fri, 6 Nov 2020 21:27:54 +0000 (21:27 +0000)]
efifb: vt_generate_cons_palette() takes max color, not mask
vt_generate_cons_palette() does take max values of RGB component colours, not
mask. Also we need to set info->fb_cmsize, or vt_fb_init() will re-initialize
the info->fb_cmap.
trasz [Fri, 6 Nov 2020 19:19:51 +0000 (19:19 +0000)]
Remove 'struct trapframe' pointer from mips64's 'struct syscall_args'.
While here, use MAXARGS. This brings its 'struct syscall_args' in sync
with most other architectures.
luporl [Fri, 6 Nov 2020 18:50:00 +0000 (18:50 +0000)]
Fix powerpc and LINT builds
Fix build errors introduced by r367417 and r367390:
- Guard label reached only by powerpc64
- Guard vm_reserv_level_iffullpop call, that is not defined on powerpc
variants that don't support superpages
- Add missing hwpmc file, for when hwpmc is built into kernel
rmacklem [Fri, 6 Nov 2020 16:33:42 +0000 (16:33 +0000)]
Add support for the new mountd -R option.
r367026 added a new "-R" option to mountd, which tells it to
not support the Mount protocol (not used by NFSv4) and not
register with rpcbind.
Rpcbind is considered a security issue by some sites now.
This patch adds a new yes/no variable called nfsv4_server_only.
When that is set, make vfs.nfsd.server_min_vers=4 and set "-R"
for mountd.
Setting vfs.nfsd.server_min_vers=4 tells nfsd to not register with rpcbind.
While here, add a check for "load_kld nfsd" failing to nfsd.
luporl [Fri, 6 Nov 2020 14:12:45 +0000 (14:12 +0000)]
Implement superpages for PowerPC64 (HPT)
This change adds support for transparent superpages for PowerPC64
systems using Hashed Page Tables (HPT). All pmap operations are
supported.
The changes were inspired by RISC-V implementation of superpages,
by @markj (r344106), but heavily adapted to fit PPC64 HPT architecture
and existing MMU OEA64 code.
Until these changes are better tested, superpages support is disabled by
default. To enable it, use vm.pmap.superpages_enabled=1.
In this initial implementation, when superpages are disabled, system
performance stays at the same level as without these changes. When
superpages are enabled, buildworld time increases a bit (~2%). However,
for workloads that put a heavy pressure on the TLB the performance boost
is much bigger (see HPC Challenge and pgbench on D25237).
Reviewed by: jhibbits
Sponsored by: Eldorado Research Institute (eldorado.org.br)
Differential Revision: https://reviews.freebsd.org/D25237
jhb [Thu, 5 Nov 2020 23:31:58 +0000 (23:31 +0000)]
Check cipher key lengths during probesession.
OCF drivers in general should perform as many session parameter checks
as possible during probesession rather than when creating a new
session. I got this wrong for aesni(4) in r359374. In addition,
aesni(4) was performing the check for digest-only requests and failing
to create digest-only sessions as a result.
kib [Thu, 5 Nov 2020 20:52:49 +0000 (20:52 +0000)]
Suspend all writeable local filesystems on power suspend.
This ensures that no writes are pending in memory, either metadata or
user data, but not including dirty pages not yet converted to fs writes.
Only filesystems declared local are suspended.
Note that this does not guarantee absence of the metadata errors or
leaks if resume is not done: for instance, on UFS unlinked but opened
inodes are leaked and require fsck to gc.
allanjude [Thu, 5 Nov 2020 17:10:14 +0000 (17:10 +0000)]
VirtIO: Make sure the guest knows the TRIM alignment requirements
If bhyve is used to emulate 512e access in a guest OS, then discard addresses should be properly aligned.
Otherwise ioctl DIOCGDELETE fails for 512b requests on devices with 4K sector size.
see g_dev_ioctl() in sys/geom/geom_dev.c
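The failure mode can be sketched as an alignment test. This assumes the check amounts to requiring both offset and length to be multiples of the device sector size; discard_is_aligned() is a hypothetical helper, not code copied from g_dev_ioctl().

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* True if a discard request is sector-aligned in both offset and length. */
bool
discard_is_aligned(uint64_t offset, uint64_t length, uint32_t sectorsize)
{
	assert(sectorsize != 0);
	if (length == 0)
		return (false);
	return (offset % sectorsize == 0 && length % sectorsize == 0);
}
```

Under this assumption a 512-byte discard at offset 512 is rejected on a 4K-sector device but accepted on a 512-byte-sector one, which is why the guest needs to know the real alignment.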
luporl [Thu, 5 Nov 2020 16:47:23 +0000 (16:47 +0000)]
pmcstat: fix PPC kernel symbol resolution
PowerPC kernel is of DYN type and it has a base address where it is
initially loaded, before being relocated. As the start address passed to
pmcstat_image_link() is where the kernel was relocated to, but the symbols
always use the original base address, we need to subtract it to get the
correct offset.
Reviewed by: jhibbits
Sponsored by: Eldorado Research Institute (eldorado.org.br)
Differential Revision: https://reviews.freebsd.org/D26114
markj [Thu, 5 Nov 2020 15:55:23 +0000 (15:55 +0000)]
Add qat(4)
This provides an OpenCrypto driver for Intel QuickAssist devices. The
driver was initially ported from NetBSD and comes with a few
improvements:
- support for GMAC/AES-GCM, AES-CTR and AES-XTS, and support for
SHA/HMAC-authenticated encryption
- support for detaching the driver
- various bug fixes
- DH895X support
andrew [Thu, 5 Nov 2020 09:55:55 +0000 (09:55 +0000)]
Stop trying to bounce in memory allocated by bus dma
Memory allocated by bus_dmamem_alloc will take into account any alignment
requirements of the CPU it's running on. Stop trying to bounce in this case
as there is no bounce zone allocated.
Reported by: manu, tuexen
Tested by: manu
Sponsored by: Innovate UK
cem [Thu, 5 Nov 2020 06:48:51 +0000 (06:48 +0000)]
Add sbuf streaming mode to pseudofs(9), use in linprocfs(5)
Add a pseudofs node flag 'PFS_AUTODRAIN', which automatically emits sbuf
contents to the caller when the sbuf buffer fills. This is only
permissible if the corresponding PFS node fill function can sleep
whenever it appends to the sbuf.
linprocfs' /proc/self/maps node happens to meet this requirement.
Streaming out the file as it is composed avoids truncating the output
and also avoids preallocating a very large buffer.
kevans [Thu, 5 Nov 2020 04:19:48 +0000 (04:19 +0000)]
imgact_binmisc: fix up some minor nits
- Removed a bunch of redundant headers
- Don't explicitly initialize to 0
- The !error check prior to setting imgp->interpreter_name is redundant, all
error paths should and do return or go to 'done'. We have larger problems
otherwise.
mjg [Thu, 5 Nov 2020 03:25:23 +0000 (03:25 +0000)]
zfs: lz4: add optional kmem_alloc support
lz4 port from illumos to Linux added a 16KB per-CPU cache to accommodate for
the missing 16KB malloc. FreeBSD supports this size, making the extra cache
harmful as it can't share buckets.
mhorne [Thu, 5 Nov 2020 00:52:52 +0000 (00:52 +0000)]
riscv: set kernel_pmap hart mask more precisely
In pmap_bootstrap(), we fill kernel_pmap->pm_active since it is
invariably active on all harts. However, this marks it as active even
for harts that don't exist in the system, which can cause issues when the
mask is passed to the SBI firmware via sbi_remote_sfence_vma().
Specifically, the SBI spec allows SBI_ERR_INVALID_PARAM to be returned
when an invalid hart is set in the mask.
The latest version of OpenSBI does not have this issue, but v0.6 does,
and this is triggering a recently added KASSERT in CI. Switch to only
setting bits in pm_active for harts that enter the system.
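The difference between the two approaches can be sketched with a plain 64-bit mask. This is an illustration only: the kernel uses its own CPU set type, and active_mask() and the hart ID array are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Build a mask with one bit per hart that actually entered the system,
 * rather than filling the mask with all ones.
 */
uint64_t
active_mask(const unsigned *hart_ids, size_t n)
{
	uint64_t mask = 0;

	assert(hart_ids != NULL || n == 0);
	for (size_t i = 0; i < n; i++) {
		if (hart_ids[i] < 64)	/* ignore IDs the sketch cannot represent */
			mask |= UINT64_C(1) << hart_ids[i];
	}
	return (mask);
}
```

With harts 0, 1 and 3 present the result is 0xb, so firmware that validates the mask never sees a bit for a nonexistent hart.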
jmg [Wed, 4 Nov 2020 23:26:15 +0000 (23:26 +0000)]
fix the docs, this was always wrong... In some cases, DISTDIR is set
automatically by tools via /etc/make.conf, so remind people (me) where
to find where it's set..
It would be nice for someone to document what DISTDIR is better than:
where the file for a distribution gets installed
mjg [Wed, 4 Nov 2020 23:11:54 +0000 (23:11 +0000)]
pipe: fix POLLHUP handling if no events were specified
Linux allows polling without any events specified and it happens to be the case
in FreeBSD as well. POLLHUP has to be delivered regardless of the event mask
and this works fine if the condition is already present. However, if it is
missing, selrecord is only called if the eventmask has relevant bits set. This
in particular leads to a condition where pipe_poll can return 0 events and
neglect to selrecord, while kern_poll takes it as an indication it has to go to
sleep, but then there is nobody to wake it up.
While the problem seems systemic to *_poll handlers the least we can do is fix
it up for pipes.
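The userland behaviour in question can be exercised with a small sketch. It covers only the case the message says already worked (the hangup is present before poll(2) is called); the bug concerned a hangup arriving after the poller went to sleep. poll_closed_pipe_no_events() is a hypothetical helper name.

```c
#include <assert.h>
#include <poll.h>
#include <unistd.h>

/*
 * Poll the read end of a pipe whose write end is already closed while
 * requesting no events. Returns revents, or -1 on error.
 */
int
poll_closed_pipe_no_events(void)
{
	struct pollfd pfd;
	int fds[2];
	int n;

	if (pipe(fds) != 0)
		return (-1);
	close(fds[1]);			/* hang up the writer first */

	pfd.fd = fds[0];
	pfd.events = 0;			/* no events requested */
	pfd.revents = 0;
	n = poll(&pfd, 1, 1000);	/* bounded timeout so a bug cannot hang us */
	close(fds[0]);
	if (n < 0)
		return (-1);
	assert(n <= 1);			/* only one descriptor was polled */
	return (pfd.revents);
}
```

POLLHUP should be set in the returned value even though events was 0, since it is reported regardless of the requested mask.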
jkim [Wed, 4 Nov 2020 22:41:54 +0000 (22:41 +0000)]
Make the tests work without COMPAT_FREEBSD12 in kernel.
sysctl 'kern.cryptodevallowsoft' was renamed to 'kern.crypto.allow_soft' in
r359374 and the previous one is only available in kernels built with
"options COMPAT_FREEBSD12".
Fix one case where #else was not correctly processed and simplify the
conditions logic.
Fix parsing of day and month names in the locale specified in the calendar
file. The previous version would expect those names to match the locale of
the user.
Mention that comments are now correctly processed and that // is supported
in addition to /* ... */.
wulf [Wed, 4 Nov 2020 21:52:10 +0000 (21:52 +0000)]
atkbdc(4): Add quirk for "System76 lemur Pro" laptops.
Currently atkbdc(4) assumes all coreboot BIOSes belong to Chromebooks
and unconditionally sets a number of quirks to work around known issues.
Exclude "System76" laptops from this set as they appear to be
traditional hardware ("lemur Pro" is a rebranded Clevo chassis) with
coreboot firmware on board. The KBDC_QUIRK_KEEP_ACTIVATED quirk activated for
the Chromebook platform makes the keyboard on these devices inoperable.
"Purism Librem" laptops may require the same exclusion too.
mjg [Wed, 4 Nov 2020 21:18:08 +0000 (21:18 +0000)]
rms: fixup concurrent writer handling and add more features
Previously the code had one wait channel for all pending writers.
This could result in a buggy scenario where after a writer switches
the lock mode from readers to writers and goes off CPU, another writer
queues itself and then the last reader wakes up the latter instead
of the former.
Use a separate channel.
While here add features to reliably detect whether curthread has
the lock write-owned. This will be used by ZFS.
markj [Wed, 4 Nov 2020 16:42:20 +0000 (16:42 +0000)]
amd64: Make it easier to configure exception stack sizes
The amd64 kernel handles certain types of exceptions on a dedicated
stack. Currently the sizes of these stacks are all hard-coded to
PAGE_SIZE, but for at least NMI handling it can be useful to use larger
stacks. Add constants to intr_machdep.h to make this easier to tweak.
No functional change intended.
Reviewed by: kib
MFC after: 1 week
Sponsored by: NetApp, Inc.
Sponsored by: Klara, Inc.
Differential Revision: https://reviews.freebsd.org/D27076
markj [Wed, 4 Nov 2020 16:30:56 +0000 (16:30 +0000)]
vmspace: Convert to refcount(9)
This is mostly mechanical except for vmspace_exit(). There, use the new
refcount_release_if_last() to avoid switching to vmspace0 unless other
processes are sharing the vmspace. In that case, upon switching to
vmspace0 we can unconditionally release the reference.
Remove the volatile qualifier from vm_refcnt now that accesses are
protected using refcount(9) KPIs.
Reviewed by: alc, kib, mmel
MFC after: 1 month
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D27057
markj [Wed, 4 Nov 2020 16:30:30 +0000 (16:30 +0000)]
refcount(9): Add refcount_release_if_last() and refcount_load()
The former is intended for use in vmspace_exit(). The latter is to
encourage use of explicit loads rather than relying on the volatile
qualifier. This works better with kernel sanitizers, which can
intercept atomic(9) calls, and makes tricky lockless code easier to read
by not forcing the reader to remember which variables are declared
volatile.
Reviewed by: kib, mjg, mmel
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D27056
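The semantics of the two additions can be approximated in userland with C11 atomics. This is a sketch under stated assumptions, not the kernel implementation: release_if_last() and load_count() are hypothetical analogues, and the memory ordering used by the real KPIs is not reproduced here.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Drop the reference only if the caller holds the last one.
 * Returns true if the count went from 1 to 0, false if it was left alone.
 */
bool
release_if_last(atomic_uint *count)
{
	unsigned expected = 1;

	assert(count != NULL);
	return (atomic_compare_exchange_strong(count, &expected, 0));
}

/* Explicit load, instead of relying on a volatile-qualified field. */
unsigned
load_count(atomic_uint *count)
{
	assert(count != NULL);
	return (atomic_load_explicit(count, memory_order_relaxed));
}
```

A caller that gets false back knows other holders remain and can take a slower path before releasing its reference unconditionally.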
arichardson [Wed, 4 Nov 2020 14:31:52 +0000 (14:31 +0000)]
Fix bad libxo format strings in jls
The existing format string for the empty case was trying to read varargs
values that weren't passed to xo_emit. This appears to work on x86 (since
the next argument is probably a pointer to an empty string), but for CHERI
we can bound variadic arguments and detect a read past the end.
While touching these lines also use the libxo 'a' modifier to avoid having to
construct the libxo format string using asprintf.
Found by: CHERI
Reviewed By: allanjude
Differential Revision: https://reviews.freebsd.org/D26885
Implement the bs_sr_<N> generic functions based on the generic
mips implementation calling the generic bs_w_<N> functions in a loop.
ral(4) (rt2860.c) panics in RAL_SET_REGION_4() because bs_sr_4()
is NULL. It seems ral(4) and ti(4) might be the only consumers of
these functions I could find quickly so keeping them in C rather than asm.
Reported by: Steve Wheeler (https://redmine.pfsense.org/issues/11021)
Reviewed by: mmel
MFC after: 3 days
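The "set region" operation built from single writes can be sketched against an ordinary memory buffer. The sketch_* functions are hypothetical and take a plain pointer instead of a bus space tag and handle; real device access would also need volatile accesses and barriers, which are omitted here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Write one 32-bit value at the given byte offset. */
void
sketch_bs_w_4(uint8_t *base, size_t offset, uint32_t value)
{
	assert(base != NULL);
	memcpy(base + offset, &value, sizeof(value));
}

/* Write the same value to count consecutive 32-bit locations. */
void
sketch_bs_sr_4(uint8_t *base, size_t offset, uint32_t value, size_t count)
{
	for (; count > 0; count--, offset += sizeof(uint32_t))
		sketch_bs_w_4(base, offset, value);
}
```

Keeping the loop in C costs one function call per element, which matters little if, as the message suggests, only a couple of drivers use these functions.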
dim [Wed, 4 Nov 2020 11:23:19 +0000 (11:23 +0000)]
Turn on WITH_LLVM_CXXFILT by default
LLVM's demangler supports more modern C++ constructs such as lambdas and
unnamed types, and is actively maintained. The command line tool is
usable as a drop-in replacement for GNU c++filt, or elftoolchain's
cxxfilt. The latter is still available by using WITHOUT_LLVM_CXXFILT, if
needed.
dim [Wed, 4 Nov 2020 11:13:36 +0000 (11:13 +0000)]
Update libcxxrt's private copy of elftoolchain demangler
This updates the private copy of libelftc_dem_gnu3.c in libcxxrt with
the most recent version from upstream r3877. Similar to r367322, this
fixes a number of possible assertions, and allows it to correctly
demangle several names that it could not handle before.
dim [Wed, 4 Nov 2020 11:02:05 +0000 (11:02 +0000)]
Merge elftoolchain r3877 (by jkoshy):
Incorporate fixes from Dimitry Andric:
- Use a BUFFER_GROW() macro to avoid rounding errors in capacity
calculations.
- Fix a bug introduced in [r3531].
- Fix handling of nested template parameters.
Ticket: #581
This should fix a number of assertions on elftoolchain's cxxfilt, and
allow it to correctly demangle several names that it could not handle
before.
Obtained from: https://sourceforge.net/p/elftoolchain/code/3877/
PR: 250702
MFC after: 3 days
andrew [Wed, 4 Nov 2020 10:21:30 +0000 (10:21 +0000)]
Allow the creation of 3 level page tables on arm64
The stage 2 arm64 page tables may need to start at a lower level. This
is because we may only be able to map a limited IPA range and trying
to use a full 4 levels will cause the CPU to fault in an unrecoverable
way.
To simplify the code we still allocate the full 4 levels, however level 0
will only ever be used to find the level 1 table used as the base. Handle
this by creating a dummy entry in the level 0 table to point to the level 1
table.
Sponsored by: Innovate UK
Differential Revision: https://reviews.freebsd.org/D26066