This change allows opening Netlink sockets in non-vnet jails, even for
unprivileged processes.
The security model largely follows the existing one. To be more specific:
* By default, every `NETLINK_ROUTE` command is **NOT** allowed in a non-VNET
jail UNLESS the `RTNL_F_ALLOW_NONVNET_JAIL` flag is specified in the command
handler.
* All notifications are **disabled** for non-vnet jails (requests to
subscribe to notifications are ignored). This will change to a more
fine-grained model once the first netlink provider requiring this gets
committed.
* Listing interfaces (RTM_GETLINK) is **allowed** w/o limits (**including**
interfaces w/o any addresses attached to the jail). The value of this is
questionable, but it follows the existing approach.
* Listing ARP/NDP neighbours is **forbidden**. This is a **change** from the
current approach - currently we list static ARP/ND entries belonging to the
addresses attached to the jail.
* Listing interface addresses is **allowed**, but the addresses are filtered
to match only ones attached to the jail.
* Listing routes is **allowed**, but the routes are filtered to provide only
host routes matching the addresses attached to the jail.
* By default, every `NETLINK_GENERIC` command is **allowed** in a non-VNET
jail (as sub-families may be unrelated to networking at all).
It is the responsibility of the family author to implement the restriction
if necessary.
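The default-deny rule for `NETLINK_ROUTE` can be sketched as below. This is
only an illustration under stated assumptions: the structures, the
`demo_` names and the flag value are hypothetical stand-ins, not the kernel's
definitions, and all other privilege checks are omitted.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Stand-in for RTNL_F_ALLOW_NONVNET_JAIL; the value is an assumption. */
#define DEMO_RTNL_F_ALLOW_NONVNET_JAIL	0x02

/* Hypothetical, simplified command handler descriptor. */
struct demo_rtnl_handler {
	int cmd;	/* message type handled */
	int flags;	/* per-command flags chosen by the handler author */
};

/* Hypothetical description of the calling process. */
struct demo_caller {
	bool jailed;	/* runs inside a jail */
	bool vnet;	/* the jail has its own network stack */
};

/*
 * In a non-VNET jail a command is allowed only if its handler opted in.
 * Other callers are not affected by this particular check.
 */
bool
demo_rtnl_cmd_allowed(const struct demo_caller *c,
    const struct demo_rtnl_handler *h)
{
	assert(c != NULL && h != NULL);
	if (!c->jailed || c->vnet)
		return (true);
	return ((h->flags & DEMO_RTNL_F_ALLOW_NONVNET_JAIL) != 0);
}
```

With this shape, a new command stays invisible to non-VNET jails until its
author explicitly sets the flag.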
Both vnode_pager_input_smlfs() and vnode_pager_generic_getpages()
increment runningbufspace, but also both delegate io completion handling
on the pbuf to either plain bdone() or filesystem-specific strategy
routine. As it happens, for e.g. UFS it is g_vfs_strategy()/g_vfs_done().
The latter calls bufdone(), which handles runningbufspace reclamation.
For plain bdone() io done handler, nothing would return
accounted b_runningbufspace back. Do it in the new
helper vnode_pager_input_bdone(), as well as in
vnode_pager_generic_getpages_done() explicitly.
Note that potential multiple calls to runningbufwakeup() for the same
pbuf or buf completion are safe. runningbufwakeup() clears accounting
for the buffer, so the second and later calls are no-ops.
The problem was found due to tarfs using small vnode pager input but not
g_vfs_strategy().
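Why repeated wakeups are harmless can be sketched as follows. The types and
`demo_` functions are hypothetical userland stand-ins for struct buf and the
global counter; the real runningbufwakeup() also handles locking and waiter
wakeup, which is left out here.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the global runningbufspace counter. */
static long demo_runningbufspace;

/* Hypothetical stand-in for the relevant part of struct buf. */
struct demo_buf {
	long b_runningbufspace;	/* bytes this buffer has accounted */
};

/* Account the I/O size when the request is issued. */
void
demo_runningbuf_account(struct demo_buf *bp, long bytes)
{
	assert(bp != NULL && bytes >= 0);
	bp->b_runningbufspace = bytes;
	demo_runningbufspace += bytes;
}

/* Give the space back; clearing the field makes later calls no-ops. */
void
demo_runningbufwakeup(struct demo_buf *bp)
{
	long space;

	assert(bp != NULL);
	space = bp->b_runningbufspace;
	if (space == 0)
		return;
	bp->b_runningbufspace = 0;
	demo_runningbufspace -= space;
}

long
demo_runningbufspace_total(void)
{
	return (demo_runningbufspace);
}
```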
Reported by: des
Reviewed by: markj, sjg
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D39263
Mateusz Guzik [Sat, 25 Mar 2023 14:40:55 +0000 (14:40 +0000)]
vm: add unlocked page lookup before trying vm_fault_soft_fast
Shaves a read lock + tryupgrade trip most of the time.
Stats from doing a kernel build (counters not present in the tree):
vm.fault_soft_fast_ok: 262653
vm.fault_soft_fast_failed_other: 41
vm.fault_soft_fast_failed_no_page: 39595772
vm.fault_soft_fast_failed_page_busy: 1929
vm.fault_soft_fast_failed_page_invalid: 22183
Warner Losh [Tue, 14 Mar 2023 21:28:05 +0000 (15:28 -0600)]
checkstyle9.pl: Perl script to check if a change is approximately style(9)
This code is adapted from the QEMU checkpatch.pl script. It can check
either a patch, a file or a git branch. It tries to warn about things
that I believe might be style(9) violations. It's experimental, since I
heavily hacked on the qemu version to get it to not complain (much)
about iconic code in the tree. At the moment, its use should be
considered experimental. It will likely miss violations, and complain
about code that's perfectly fine. It's offered as an experiment
and to make it easier for contributors to submit patches.
Andrew Gallatin [Sat, 25 Mar 2023 15:51:51 +0000 (11:51 -0400)]
LRO: Add missing checks for invalid IP addresses
LRO bypasses normal ip_input()/tcp_input() and lacks several checks
that are present in the normal path. Without these checks, it
is possible to trigger assertions added in b0ccf53f2455.
Mateusz Guzik [Tue, 21 Mar 2023 04:23:15 +0000 (04:23 +0000)]
vfs: trylock vnode requeue
The quasi-LRU still gets in the way for example when doing an
incremental bzImage build, with vnode_list lock being at the
top of the profile. Further damage-control the problem by trylocking.
Note the entire mechanism desperately wants to be reaped out in favor
of something(tm) which both scales in a multicore setting and provides
sensible replacement policy.
With this change almost everything vfs disappears from the on-CPU
flamegraph; what is left is tons of contention in the VM.
Ed Maste [Fri, 24 Mar 2023 17:53:59 +0000 (13:53 -0400)]
makefs: emit NM records for all directory entries
We previously attempted to emit Rock Ridge NM records only when the name
represented by the Rock Ridge extensions would actually differ. We would
omit the record for an all-upper-case directory name, however Linux (and
perhaps other operating systems) map names with no NM record to
lowercase.
This affected only directories, as file names have an implicit ";1"
version number appended and thus always differ. To solve this, just emit
NM records for all entries other than DOT and DOTDOT.
We could continue to omit the NM record for directories that would avoid
mapping (for example, one named 1234.567) but this does not seem worth
the complexity.
PR: 203531
Reported by: Thomas Schmitt <scdbackup@gmx.net>
Reviewed by: kevans
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D39258
Vico Chen [Sat, 25 Mar 2023 10:41:04 +0000 (13:41 +0300)]
linsysfs(4): Keep Linux compatible sysfs the same as Ubuntu
Ubuntu has no `/sys/subsystem' in sysfs. To stay compatible with Ubuntu,
delete the creation of `subsystem' in the Linux compatibility module.
Moreover, the presence of `/sys/subsystem' causes failures in some
Linux udev cases. The Linux udev source code has a function named
`scan_devices_all' that scans `/sys/subsystem' if it exists. Since
there is currently nothing in `/sys/subsystem', the scan returns an
empty result, which causes some use cases to fail.
John Baldwin [Fri, 24 Mar 2023 18:49:06 +0000 (11:49 -0700)]
bhyve: Remove vmctx member from struct vm_snapshot_meta.
This is a userland-only pointer that isn't relevant to the kernel and
doesn't belong in the ioctl structure shared between userland and the
kernel. For the kernel, the old structure for the ioctl is still
supported under COMPAT_FREEBSD13.
This changes vm_snapshot_req() in libvmmapi to accept an explicit
vmctx argument.
It also changes vm_snapshot_guest2host_addr to take an explicit vmctx
argument. As part of this change, move the declaration for this
function and its wrapper macro from vmm_snapshot.h to snapshot.h as it
is a userland-only API.
John Baldwin [Fri, 24 Mar 2023 18:49:06 +0000 (11:49 -0700)]
libvmmapi: Add a struct vcpu and use it in most APIs.
This replaces the 'struct vm, int vcpuid' tuple passed to most API
calls and is similar to the changes recently made in vmm(4) in the
kernel.
struct vcpu is an opaque type managed by libvmmapi. For now it stores
a pointer to the VM context and an integer id.
As an immediate effect this removes the divergence between the kernel
and userland for the instruction emulation code introduced by the
recent vmm(4) changes.
Since this is a major change to the vmmapi API, bump VMMAPI_VERSION to
0x200 (2.0) and the shared library major version.
While here (and since the major version is bumped), remove unused
vcpu argument from vm_setup_pptdev_msi*().
Add new functions vm_suspend_all_cpus() and vm_resume_all_cpus() for
use by the debug server. The underlying ioctl (which uses a vcpuid of
-1) remains unchanged, but the userlevel API now uses separate
functions for global CPU suspend/resume.
Joseph Koshy [Fri, 24 Mar 2023 09:39:08 +0000 (09:39 +0000)]
pmcstat: Warn about text output format stability.
The formats for pmcstat(8)'s human-readable output are not part of its
user interface definition, and may change in the future. Highlight
this in its manual page.
Just owning the interlock is not enough for vget() to operate on the
vnode race-free with vgone(); the vnode should be held. Use
vget_prep()/vget_finish() to avoid vholding the vnode explicitly, and
drop LK_INTERLOCK.
Reviewed by: markj
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D39207
Cy Schubert [Fri, 24 Mar 2023 14:32:41 +0000 (07:32 -0700)]
rc: Chase bfb202c4554a and remove ifconfig down/up for wpa_supplicant
bfb202c4554a addresses the CTRL-EVENT-SCAN-FAILED. Upstream d807e289d
caused a FreeBSD regression in driver_bsd.c, which this rc.d patch
worked around. As of bfb202c4554a this workaround is no longer needed.
Eric Joyner [Fri, 24 Mar 2023 07:01:01 +0000 (00:01 -0700)]
ice(4): Restore old conditional overwritten by last update
Commit 8923de590543 ("ice(4): Update to 1.37.7-k", 2023-02-13)
unintentionally overwrote the change made in commit 52f45d8acee9 ("net:
iflib: let the drivers use isc_capenable", 2021-12-28).
Signed-off-by: Eric Joyner <erj@FreeBSD.org>
Reported by: jhibbits@
MFC after: 3 days
Sponsored by: Intel Corporation
Jose Luis Duran [Fri, 24 Mar 2023 04:53:54 +0000 (21:53 -0700)]
ping: Fix an uninitialized variable
The variable oicmp, which holds the original ("quoted packet") ICMP
packet in a structured way, did not have a copy of the original ICMP
packet obtained from the raw data.
The code was accidentally removed in 20b41303140e. Bring it back.
Bjoern A. Zeeb [Thu, 23 Mar 2023 22:37:12 +0000 (22:37 +0000)]
WPA: driver_bsd.c: backout upstream IFF_ change and add logging
This reverts the state to our old supplicant logic setting or clearing
IFF_UP if needed. In addition this adds logging for the cases in which
we do (not) change the interface state.
Depending on testing this seems to help bringing WiFi up or not log
any needed changes (which would be the expected wpa_supplicant logic
now). People should look out for ``(changed)`` log entries (at least
if debugging the issue; this way we will at least have data points).
There is a hypothesis still pondered that the entire IFF_UP toggling
only exploits a race in net80211 (for further discussion, more
debugging, and alternative solutions see D38508 and D38753).
That may also explain why the changes to the rc startup script [1]
only helped partially for some people to no longer see the
continuous CTRL-EVENT-SCAN-FAILED.
It is highly likely that we will want further changes and until
we know for sure that people are seeing ``(changed)`` events
this should stay local. Should we need to upstream this we'll
likely need #ifdef __FreeBSD__ around this code.
Kyle Evans [Thu, 23 Mar 2023 21:26:06 +0000 (16:26 -0500)]
arm64: add KASAN support
This entails:
- Marking some obvious candidates for __nosanitizeaddress
- Similar trap frame markings as amd64, for similar reasons
- Shadow map implementation
The shadow map implementation is roughly similar to what was done on
amd64, with some exceptions. Attempting to use available space at
preinit_map_va + PMAP_PREINIT_MAPPING_SIZE (up to the end of that range,
as depicted in the physmap) results in odd failures, so we instead
search the physmap for free regions that we can carve out, fragmenting
the shadow map as necessary to try and fit as much as we need for the
initial kernel map. pmap_bootstrap_san() is thus called after
pmap_bootstrap(), which still included some technically reserved areas
of the memory map that needed to be included in the DMAP.
The odd failure noted above may be a bug, but I haven't investigated it
all that much.
Initial work by mhorne with additional fixes from kevans and markj.
Reviewed by: andrew, markj
Sponsored by: Juniper Networks, Inc.
Sponsored by: Klara, Inc.
Differential Revision: https://reviews.freebsd.org/D36701
Ed Maste [Thu, 23 Mar 2023 17:02:44 +0000 (13:02 -0400)]
makefs: correct El Torito boot record
The boot catalog pointer is a DWord, but we previously populated it via
cd9660_bothendian_dword which overwrote four unused bytes following it.
See El Torito 1.0 (1995) Figure 7 for details.
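The size mismatch can be illustrated with a small sketch. ISO 9660
both-endian 32-bit fields occupy eight bytes (little-endian copy followed by
big-endian copy), while a plain DWord occupies four. The `demo_` helpers below
are hypothetical and are not makefs functions; the sketch assumes the El
Torito DWord is stored little-endian only.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Store a 32-bit value as 4 little-endian bytes. */
void
demo_put_le32(uint8_t *buf, uint32_t v)
{
	assert(buf != NULL);
	buf[0] = (uint8_t)(v & 0xff);
	buf[1] = (uint8_t)((v >> 8) & 0xff);
	buf[2] = (uint8_t)((v >> 16) & 0xff);
	buf[3] = (uint8_t)((v >> 24) & 0xff);
}

/* Both-endian form: 4 LE bytes followed by 4 BE bytes, 8 in total. */
void
demo_put_both32(uint8_t *buf, uint32_t v)
{
	demo_put_le32(buf, v);
	buf[4] = (uint8_t)((v >> 24) & 0xff);
	buf[5] = (uint8_t)((v >> 16) & 0xff);
	buf[6] = (uint8_t)((v >> 8) & 0xff);
	buf[7] = (uint8_t)(v & 0xff);
}
```

Writing a four-byte field with the both-endian helper therefore touches the
four bytes after it, which is the kind of overwrite described above.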
PR: 203531
Reported by: Coverity Scan
Reported by: Thomas Schmitt <scdbackup@gmx.net>
Reviewed by: kevans
CID: 977470
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D39231
Kirk McKusick [Thu, 23 Mar 2023 20:03:20 +0000 (13:03 -0700)]
Improve chance of finding an alternate superblock in sbsearch(3).
When requesting a superblock read for the sole purpose of getting
the parameters needed to find if backup parameters have been stored,
specify UFS_NOCSUM as only the base superblock is needed. This
change reduces the number of checks that the superblock must pass.
Zachary Leaf [Thu, 2 Mar 2023 14:15:54 +0000 (14:15 +0000)]
arm64: add fault address to trapframe
It was previously possible for the fault address register to get
clobbered before it was saved. This small window occurred when an
additional exception was encountered inside the exception handler,
overwriting the previous value.
Commit f29942229d24 ("Read the arm64 far early in el0 exceptions")
patched this issue, but avoided changing the trapframe since this could
be considered a KBI change in FreeBSD 13.
Revert the above fix and save the fault address in the trapframe
instead. This saves the fault address even earlier in the exception
handling process, and is a more robust and simple fix.
Zachary Leaf [Fri, 24 Feb 2023 08:35:08 +0000 (08:35 +0000)]
arm64: extend ESR/SPSR registers to 64b
For the Exception Syndrome Register, ESR_ELx, the upper 32b were
previously unused, but now may contain additional exception info as of
Armv8.7 (FEAT_LS64).
Extend ESR from u32->u64 in exception handling code to support this. In
addition, also extend Saved Program Status Register SPSR_ELx in the same
way to allow for future extensions.
Reviewed by: andrew
Sponsored by: Arm Ltd
Differential Revision: https://reviews.freebsd.org/D38983
John Baldwin [Thu, 23 Mar 2023 16:31:29 +0000 (09:31 -0700)]
libpmc: Use LIB_CXX instead of explicit LDADD to link a C++ library.
This uses the C++ compiler as the linker instead of the C compiler
letting the compiler driver pick the right libraries. This is a no-op
on main and stable/13 but matters for stable/12 where the current
logic breaks for external GCC since it tries to use a non-existent
libstdc++.
Justin Hibbits [Thu, 16 Mar 2023 20:24:56 +0000 (16:24 -0400)]
IfAPI: Add iterator to complement if_foreach()
Summary:
Sometimes an if_foreach() callback can be trivial, or need a lot of
outer context. In this case a regular `for` loop makes more sense. To
keep things hidden in the new API, use an opaque `if_iter` structure
that can still be instantiated on the stack. The current implementation
uses just a single pointer out of the 4 allotted to the opaque context,
and the cleanup does nothing, but may be used in the future.
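The iterator pattern described here can be sketched in userland as below.
Everything is hypothetical: the `demo_` types and functions only mirror the
idea of a stack-allocated context with four pointer slots of which one is
used, and they are not the kernel's IfAPI declarations.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical interface object on a singly linked list. */
struct demo_ifnet {
	const char *name;
	struct demo_ifnet *next;
};

/* Context that callers treat as opaque; small enough for the stack. */
struct demo_if_iter {
	void *context[4];	/* only context[0] is used in this sketch */
};

/* Begin iteration and return the first element (or NULL). */
struct demo_ifnet *
demo_if_iter_start(struct demo_if_iter *it, struct demo_ifnet *head)
{
	assert(it != NULL);
	it->context[0] = head;
	return (head);
}

/* Advance and return the next element, or NULL at the end. */
struct demo_ifnet *
demo_if_iter_next(struct demo_if_iter *it)
{
	struct demo_ifnet *cur;

	assert(it != NULL);
	cur = it->context[0];
	if (cur == NULL)
		return (NULL);
	it->context[0] = cur->next;
	return (cur->next);
}

/* Cleanup hook; nothing to release in this sketch. */
void
demo_if_iter_finish(struct demo_if_iter *it)
{
	(void)it;
}

/* A trivial body expressed as a plain for loop instead of a callback. */
int
demo_count_ifs(struct demo_ifnet *head)
{
	struct demo_if_iter it;
	struct demo_ifnet *ifp;
	int n = 0;

	for (ifp = demo_if_iter_start(&it, head); ifp != NULL;
	    ifp = demo_if_iter_next(&it))
		n++;
	demo_if_iter_finish(&it);
	return (n);
}
```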
Ihor Antonov [Thu, 23 Mar 2023 02:37:12 +0000 (21:37 -0500)]
daemon: decouple init logic from main loop
main() func contained both initialization and main loop logic.
This made certain operations like restarting problematic and
required dirty hacks in form of goto jumps.
This commit moves the main loop logic into daemon_eventloop(),
cleans up main, and makes restart logic clear: daemon_eventloop()
is run in a loop with a restart condition checked at the end.
Bjoern A. Zeeb [Tue, 21 Mar 2023 21:25:28 +0000 (21:25 +0000)]
ifconfig: ifieee80211: print bssid name
In certain setups (e.g., autonomous APs) it is extremely helpful to have
a way to map the BSSIDs to names for both normal status output as well
as the scan list. This often allows a quicker overview than remembering
(or manually looking up) BSSIDs.
Call ether_ntohost() on the bssid and consult /etc/ethers
and print "(name)" after the bssid for the status output and "(name)"
at the end of the line after the IE list.
Mateusz Guzik [Tue, 21 Mar 2023 07:27:25 +0000 (07:27 +0000)]
vfs: decouple freevnodes from vnode batching
In principle one cpu can keep vholding vnodes, while another vdrops
them. In this case it may be that the local count will keep growing in an
unbounded manner. Roll it up after a threshold instead.
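A minimal sketch of the roll-up idea, assuming a per-CPU delta that is folded
into a shared total once its magnitude reaches a threshold. The names and the
threshold value are hypothetical, and locking of the shared counter is omitted.

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_ROLLUP_THRESHOLD	32	/* assumed value, illustration only */

/* Shared total; real code would protect it with a lock or atomics. */
static long demo_global_free;

struct demo_pcpu {
	long local_free;	/* delta not yet folded into the total */
};

/* Fold the local delta into the shared total and reset it. */
void
demo_rollup(struct demo_pcpu *pc)
{
	assert(pc != NULL);
	demo_global_free += pc->local_free;
	pc->local_free = 0;
}

/* Adjust the local delta; roll up once it drifts past the threshold. */
void
demo_adjust(struct demo_pcpu *pc, long delta)
{
	assert(pc != NULL);
	pc->local_free += delta;
	if (pc->local_free >= DEMO_ROLLUP_THRESHOLD ||
	    pc->local_free <= -DEMO_ROLLUP_THRESHOLD)
		demo_rollup(pc);
}

long
demo_global(void)
{
	return (demo_global_free);
}
```

Even if one CPU only ever decrements and another only ever increments, each
local delta stays bounded by the threshold.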
John Baldwin [Wed, 22 Mar 2023 19:35:09 +0000 (12:35 -0700)]
sys: Stop enabling -Wnested-externs.
clang doesn't implement this warning, so violations are only caught by
GCC. It is also no longer a common practice to use this as it was in
the original BSD code, so the need for the warning is not as important
as when it was used to do cleanups 20 years ago. A recent commit
(c3179891f897d840f578a5139839fcacb587c96d) triggers this warning on
GCC, but that commit uses nested externs purposefully.
John Baldwin [Wed, 22 Mar 2023 19:34:34 +0000 (12:34 -0700)]
bhyve: Accept a variable-length string name for qemu_fwcfg_add_file.
It is illegal (UB?) to pass a shorter array to a function argument
that takes a fixed-length array. Do a runtime check for names that
are too long via strlen() instead.
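The runtime check can look roughly like this. It is a sketch only:
`demo_add_file()`, `struct demo_file`, the length limit and the choice of
ENAMETOOLONG are assumptions for illustration and do not reproduce bhyve's
qemu_fwcfg_add_file().

```c
#include <assert.h>
#include <errno.h>
#include <string.h>

#define DEMO_MAX_NAME	56	/* assumed limit, illustration only */

struct demo_file {
	char name[DEMO_MAX_NAME];
};

/*
 * Take the name as a plain string and validate its length at run time,
 * instead of declaring the parameter as a fixed-length array.
 */
int
demo_add_file(struct demo_file *f, const char *name)
{
	size_t len;

	assert(f != NULL && name != NULL);
	len = strlen(name);
	if (len >= sizeof(f->name))
		return (ENAMETOOLONG);
	memcpy(f->name, name, len + 1);	/* includes the terminating NUL */
	return (0);
}
```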
Val Packett [Mon, 6 Feb 2023 19:03:58 +0000 (16:03 -0300)]
arpa: garbage collect ns_newmsg/ns_rdata decls
These were brought in by the libbind import, but these functions were
never actually implemented anywhere, only header declarations and symbol
map entries were imported.
Fixes: 046c3635cdb2 ("Bring final version of libbind:")
Fixes: e45764721aed ("Update our stub resolver to final version of ...")
Reported by: ld.lld 16 being --no-undefined-version by default
Sponsored by: https://www.patreon.com/valpackett
Reviewed by: emaste
Pull request: https://github.com/freebsd/freebsd-src/pull/700
Differential Revision: https://reviews.freebsd.org/D38407
Brooks Davis [Wed, 22 Mar 2023 16:23:57 +0000 (16:23 +0000)]
amd64: reduce header pollution in _stdint.h
In 38d1ac34ff82bd2aeb308b52a65b686060e52873 SIGATOMIC_{MIN,MAX} were
defined in terms of LONG_{MIN,MAX}. Later, they were switched to
__LONG_{MIN,MAX} in 78fe75bc280264e7471b3069e148cae32e8ae211 where an
include of machine/_limits.h was added. Switch to using fixed width
INT64_{MIN,MAX} and remove the header pollution.
Brooks Davis [Wed, 22 Mar 2023 16:23:22 +0000 (16:23 +0000)]
riscv: Fix sig_atomic_t limit definitions
sig_atomic_t is defined as a long and thus is 64-bit on riscv. For some
reason its limit was incorrectly specified as a 32-bit number. This had
the unfortunate side effect of causing gnulib to override most of the
definitions in stdint.h. On CheriBSD this breaks all software that uses
gnulib in annoying and hard to debug ways.
Technically updating the limits might be an ABI change, but these
defines are largely unused (the only use in tree is in the libc++ test
suite where it's used in an assertion that will fail due to this bug).
Further, since the underlying type remains the same, we're just
increasing the range of values a paranoid program might use.
Brooks Davis [Wed, 22 Mar 2023 16:22:21 +0000 (16:22 +0000)]
arm64: Fix sig_atomic_t limit definitions
sig_atomic_t is defined as a long and thus is 64-bit on arm64. For some
reason its limit was incorrectly specified as a 32-bit number. This had
the unfortunate side effect of causing gnulib to override most of the
definitions in stdint.h. On CheriBSD this breaks all software that uses
gnulib in annoying and hard to debug ways.
Technically updating the limits might be an ABI change, but these
defines are largely unused (the only use in tree is in the libc++ test
suite where it's used in an assertion that will fail due to this bug).
Further, since the underlying type remains the same, we're just
increasing the range of values a paranoid program might use.
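A consistency check of the kind that trips over such a mismatch can be
sketched as below. It is a hypothetical helper, not the libc++ test or the
gnulib probe, and it assumes sig_atomic_t has no padding bits.

```c
#include <assert.h>
#include <limits.h>
#include <signal.h>
#include <stdint.h>

/* Return 1 when SIG_ATOMIC_MAX matches the width of sig_atomic_t. */
int
demo_sig_atomic_max_consistent(void)
{
	int is_signed = ((sig_atomic_t)-1 < 0);
	unsigned int bits;
	uintmax_t expect;

	/* Value bits: every bit if unsigned, one fewer if signed. */
	bits = (unsigned int)(sizeof(sig_atomic_t) * CHAR_BIT) -
	    (is_signed ? 1u : 0u);
	assert(bits > 0);
	if (bits >= sizeof(uintmax_t) * CHAR_BIT)
		expect = UINTMAX_MAX;
	else
		expect = ((uintmax_t)1 << bits) - 1;
	return ((uintmax_t)SIG_ATOMIC_MAX == expect);
}
```

On a platform where a 64-bit sig_atomic_t is paired with a 32-bit limit, this
function would return 0.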
Ed Maste [Thu, 3 Nov 2022 17:17:40 +0000 (13:17 -0400)]
sftp: avoid leaking path arg in calls to make_absolute_pwd_glob
As Coverity reports:
Overwriting tmp in tmp = make_absolute_pwd_glob(tmp, remote_path)
leaks the storage that tmp points to.
Consume the first arg in make_absolute_pwd_glob, and add xstrdup() to
the one case which did not assign to the same variable that was passed
in. With this change make_absolute() and make_absolute_pwd_glob() have
the same semantics with respect to freeing the input string.
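The "consume the first argument" convention can be sketched as follows. The
`demo_` helpers are hypothetical and much simpler than the sftp functions;
they only show who owns which string after the call.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Duplicate a string or abort, in the spirit of xstrdup(). */
char *
demo_xstrdup(const char *s)
{
	size_t len;
	char *copy;

	assert(s != NULL);
	len = strlen(s) + 1;
	copy = malloc(len);
	if (copy == NULL)
		abort();
	memcpy(copy, s, len);
	return (copy);
}

/*
 * Hypothetical helper that consumes its first argument: the caller must
 * not use or free p afterwards, only the returned string.
 */
char *
demo_make_absolute(char *p, const char *pwd)
{
	size_t len;
	char *out;

	assert(p != NULL && pwd != NULL);
	if (p[0] == '/')
		return (p);	/* already absolute: ownership passes through */
	len = strlen(pwd) + 1 + strlen(p) + 1;
	out = malloc(len);
	if (out == NULL)
		abort();
	snprintf(out, len, "%s/%s", pwd, p);
	free(p);		/* input consumed here */
	return (out);
}
```

With these semantics `tmp = demo_make_absolute(tmp, pwd)` does not leak, and
a caller that still needs the original passes `demo_xstrdup(orig)` instead.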
This change was reported to OpenSSH in
https://lists.mindrot.org/pipermail/openssh-unix-dev/2022-November/040497.html
but was not acted on. It appears that OpenBSD subsequently received a
Coverity report for the same issue (their Coverity ID 405196) but fixed
only the specific instance reported by Coverity.
This change reverts OpenBSD's sftp.c 1.228 / OpenSSH-portable
commit 36c6c3eff5e4.
Mark Johnston [Wed, 22 Mar 2023 13:02:54 +0000 (09:02 -0400)]
bhyve: Sleep briefly in the VMEXIT_DEBUG handler
As of commit 0bda8d3e9f7a ("vmm: permit some IPIs to be handled by
userspace") and commit 9cc9abf409cc ("bhyve: create all vcpus on
startup"), we have a misbehaviour where AP vCPU threads spin until they
receive a SIPI. In particular, since they are "suspended", they simply
call the VMEXIT_DEBUG handler in a loop, but the handler is a no-op by
default.
This is tricky to fix since the gdb stub isn't aware of whether a given
vCPU is supposed to be running. For 13.2's sake, introduce a simple
workaround wherein the VMEXIT_DEBUG handler sleeps for a short period.
This ensures that host CPU usage remains sane when VMs are starting
without penalizing users of VMEXIT_DEBUG too much.
Reviewed by: corvink, jhb
MFC after: 3 days
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D39174
Mark Johnston [Wed, 22 Mar 2023 12:52:57 +0000 (08:52 -0400)]
fdescfs: Fix a file ref leak
In fdesc_lookup(), vn_vget_ino_gen() may fail without invoking the
callback, in which case the ref on fp is leaked. This happens if the
fdescfs mount is being concurrently unmounted. Moreover, we cannot
safely drop the ref while the dvp is locked.
So:
- Use a flag variable to indicate whether the ref is dropped.
- Reorganize things to handle the leak.
Reported by: C Turt <ecturt@gmail.com>
Reviewed by: mjg, kib
Tested by: pho
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D39189
Warner Losh [Wed, 22 Mar 2023 02:25:58 +0000 (20:25 -0600)]
_endian.h: Include sys/cdefs.h for visibility macros
BYTE_ORDER, LITTLE_ENDIAN and BIG_ENDIAN will be required by the
forthcoming POSIX Issue 8. In addition, they are provided in the BSD
compilation environments. However, depending on the order in which
includes happen, sys/cdefs.h may or may not be included when endian.h is
included. Include it here so we can safely test __BSD_VISIBLE. Add
visibility when we're compiling in the future for issue 8, but since the
date number for issue 8 hasn't been fixed, use strictly greater than the
issue 7 date of 200809.
This had the side effect of sometimes (in the traditional BSD
compilation environment)
#if BYTE_ORDER == LITTLE_ENDIAN
and
#if BYTE_ORDER == BIG_ENDIAN
both being true because none of these were defined. This fixes
that. It also fixes including it after <stdio.h> but not before.
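The "both being true" effect follows from how `#if` treats identifiers that
are not defined as macros: each is replaced by 0. The sketch below uses
hypothetical `DEMO_` names that are deliberately never defined, so it does not
depend on what the system headers provide.

```c
#include <assert.h>

/*
 * DEMO_BYTE_ORDER, DEMO_LITTLE_ENDIAN and DEMO_BIG_ENDIAN are never
 * defined here. In #if each undefined identifier becomes 0, so both
 * comparisons read "0 == 0" and both branches are compiled in.
 */
int
demo_true_branches(void)
{
	int n = 0;

#if DEMO_BYTE_ORDER == DEMO_LITTLE_ENDIAN
	n++;
#endif
#if DEMO_BYTE_ORDER == DEMO_BIG_ENDIAN
	n++;
#endif
	return (n);
}

/* Defensive variant: only compare when the macro actually exists. */
int
demo_guarded_branches(void)
{
	int n = 0;

#if defined(DEMO_BYTE_ORDER) && DEMO_BYTE_ORDER == DEMO_LITTLE_ENDIAN
	n++;
#endif
#if defined(DEMO_BYTE_ORDER) && DEMO_BYTE_ORDER == DEMO_BIG_ENDIAN
	n++;
#endif
	assert(n <= 1);	/* at most one byte order can match */
	return (n);
}
```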
The previous code unsuccessfully attempted to report a precise error for
each option in the user list. Moreover, commit 253b2ec199b broke some
ctrl-api-test (see https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=260547).
With this patch we bail out as soon as an unrecoverable error is detected and
we properly check for copy boundaries. EOPNOTSUPP no longer immediately
returns an error, so that any other option in the list may be examined
by the caller code and a precise report of the (un)supported options can
be returned to the user.
With this patch, all ctrl-api-test unit tests pass again.
Mark Johnston [Tue, 21 Mar 2023 19:51:24 +0000 (15:51 -0400)]
ktls: Fix interlocking between ktls_enable_rx() and listen(2)
The TCP_TXTLS_ENABLE and TCP_RXTLS_ENABLE socket option handlers check
whether the socket is a listening socket and fail if so, but this check is
racy. Since we have to lock the socket buffer later anyway, defer the
check to that point.
ktls_enable_tx() locks the send buffer's I/O lock, which will fail if
the socket is a listening socket, so no explicit checks are needed. In
ktls_enable_rx(), which does not acquire the I/O lock (see the review
for some discussion on this), use an explicit SOLISTENING() check after
locking the recv socket buffer.
Otherwise, a concurrent solisten_proto() call can trigger crashes and
memory leaks by wiping out socket buffers as ktls_enable_*() is
modifying them.
Also make sure that a KTLS-enabled socket can't be converted to a
listening socket, and use SOCK_(SEND|RECV)BUF_LOCK macros instead of the
old ones while here.
Add some simple regression tests involving listen(2).
Reported by: syzkaller
MFC after: 2 weeks
Reviewed by: gallatin, glebius, jhb
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D38504