The two main uses of dev_t are in struct stat and as a parameter of the
mknod system calls.
As of version 2.6.0 of the Linux kernel, dev_t is a 32-bit quantity
with 12 bits set asaid for the major number and 20 for the minor number.
The in-kernel dev_t encoded as MMMmmmmm, where M is a hex digit of the
major number and m is a hex digit of the minor number.
The user-space dev_t encoded as mmmM MMmm, where M and m is the major
and minor numbers accordingly. This is downward compatible with legacy
systems where dev_t is 16 bits wide, encoded as MMmm.
In glibc dev_t is a 64-bit quantity, with 32-bit major and minor numbers,
encoded as MMMM Mmmm mmmM MMmm. This is downward compatible with the Linux
kernel and with legacy systems where dev_t is 16 bits wide.
In the FreeBSD dev_t is a 64-bit quantity. The major and minor numbers
are encoded as MMMmmmMm, therefore conversion of the device numbers between
Linux user-space and FreeBSD kernel required.
linux(4): Use Linux dev_t type for mknod syscalls dev argument
As of version 2.6.0 of the Linux kernel, dev_t is a 32-bit unsigned integer
on all platforms. Prior the 2.6 kernel dev_t type was an unsigned short.
However, since the firs commit of the Linuxulator, mknod syscall get int dev
argument.
Also, there is some confusion here, while the kernel declares a dev_t type
as a 32-bit sized, the user-space dev_t type can be size of 64 bits, e.g.,
in the Glibc library.
To avoid confusion and to help porting of the Linuxulator to other platforms
use explicit l_dev_t for dev argument of mknod syscalls.
linux(4): Move statx_copyout() close to linux_statx()
Just for future changes of the conditional Linuxulator build. We need
a small refactoring of the MI code to help porting Linuxulator to other
platforms.
Fix vfs_emptydir(). It would consider directories containing directories
with name of the form 'X.' (X being any authorized byte) as empty. Also,
it would cause VOP_READDIR() to return an error on directories
containing enough whiteouts. While here, use a more decently sized
buffer as done elsewhere.
Remove ad-hoc iteration on the directory's content and instead use the
newly exported vn_dir_next_dirent() function (this is what fixes the
second problem mentioned above).
vfs: vn_dir_next_dirent(): Simplify interface and harden
Simplify the old interface (one less argument, simpler termination test)
and add documentation about it. Add more sanity checks (mostly under
INVARIANTS, but also in the general case to prevent infinite
loops). Drop the explicit test on minimum directory entry size (without
INVARIANTS).
Deal with the impacts in callers (dirent_exists() and vop_stdvptocnp()).
dirent_exists() has been simplified a bit, preserving the exact same
semantics but for the return code whose meaning has been reversed (0 now
means the entry exists, ENOENT that it doesn't and other values are
genuine errors). While here, suppress gratuitous casts of malloc return
values.
vn_dir_next_dirent() has been tested by a 'make -j4 buildkernel' with a
temporary modification to the VFS cache causing vn_vptocnp() to always
call VOP_VPTOCNP() and finally vop_stdvptocnp() (observed with temporary
debug counters).
Export new _GENERIC_MINDIRSIZ and _GENERIC_MAXDIRSIZ on __BSD_VISIBLE,
and GENERIC_MINDIRSIZ and GENERIC_MAXDIRSIZ on _KERNEL.
vfs: Export get_next_dirent() as vn_dir_next_dirent()
Move internal-to-'vfs_default.c' get_next_dirent() to 'vfs_vnops.c' and
export it for use by other parts of the VFS. This is a preparatory
change for using it in vfs_emptydir().
- assumption that single-zone countries do not have description
is no longer correct; do not try to optimize this case as it's
only going to make the code more confusing and we now have menus
with a single zone selection because of this
- remove the single-country continent short cut, it also only serves
to confuse users as we now have such a continent
- instead add a single-zone contry short cut (see above), now all
single-zone countries fall here
- use the #@ continent overrides that zone1970.tab introduces (this is
visible at least fixing Iceland being currently listed under Africa)
- add Arctic Ocean "continent" coming only from the overrides at the
moment
- update baseline with the changes
Reviewed by: bapt, philip
Differential Revision: https://reviews.freebsd.org/D39606
Mark Johnston [Thu, 27 Apr 2023 16:58:56 +0000 (12:58 -0400)]
sockbuf: Add KMSAN checks to sbappend*()
Otherwise KMSAN only detects uninitialized memory when the contents of
the buffer are copied out to userspace or transmitted to a network
interface. At that point the KMSAN violation will be far removed from
its origin, so let's try to make debugging such problems a bit easier.
Reviewed by: glebius
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D38101
Mark Johnston [Thu, 27 Apr 2023 13:42:36 +0000 (09:42 -0400)]
cap_net tests: Skip tests if there is no connectivity
When testing cap_connect() and name/addr lookup functions, skip tests if
we fail and the error is not ENOTCAPABLE. This makes the tests amenable
to running in CI without Internet connectivity.
Elliott Mitchell [Wed, 14 Dec 2022 21:59:17 +0000 (13:59 -0800)]
arm: remove passing trapframe to intr_ipi_dispatch()
This was needed before INTRNG was in place and handling the push of
curthread->td_intr_frame. Since INTRNG now handles this, there is no
longer and need for playing around with the frame inside IPI interrupts.
Elliott Mitchell [Wed, 14 Dec 2022 20:36:47 +0000 (12:36 -0800)]
arm: remove interrupt nesting by ipi_preempt()/ipi_hardclock()
This was needed when intr_ipi_dispatch() was called by hardware-specific
IPI interrupt routines which didn't save the trap frame. Now all ARM
interrupts pass through INTRNG which will have already saved the trap
frame and disabled preemption.
Remove the conditional trapframe/argument passing to the handlers.
Reviewed by: andrew
Differential Revision: https://reviews.freebsd.org/D37938
Gang ABDs without childred are legal, and they do have zero size.
For other ABD types zero size doesn't have much sense and likely
not working correctly now.
Reviewed-by: Igor Kozhukhov <igor@dilos.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #14795
Ed Maste [Mon, 24 Apr 2023 19:41:45 +0000 (15:41 -0400)]
ipv6: disable RFC 4620 nodeinfo by default
RFC 4620 is an experimental RFC that can be used to request information
about a host, including:
- the fully-qualified or single-component name
- some set of the Responder's IPv6 unicast addresses
- some set of the Responder's IPv4 unicast addresses
This is not something that should be made available by default.
PR: 257709
Submitted by: ruben@verweg.com
Reviewed by: melifaro
Relnotes: Yes
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D39778
This is part one of a fix for booting with ZFS on arm64 using
accelerated checksum implementations. Checksum benchmarking will
attempt to use the FPU, so we currently panic quickly on boot. BLAKE3
is still broken, as it clobbers x18 and we promptly discover that fact
as soon as we attempt to fetch curthread in kfpu_end().
Note that _STANDALONE is special-cased here, but ideally we wouldn't be
building the code that uses kfpu_begin()/kfpu_end() at all in the loader
environment.
Discussed with: imp (a bit)
Differential Revision: https://reviews.freebsd.org/D39448
Similar to the PF_TAG_DUMMYNET we must also clear the route tag if
dummynet didn't keep the packet. In that case we'd continue immediately
and there'd be no need for the route tag. Keeping it could lead to
unexpected routing of traffic.
Mark Johnston [Wed, 26 Apr 2023 14:09:09 +0000 (10:09 -0400)]
callout: Move per-CPU callout state into the dpcpu region
This eliminates some static bloat in amd64 kernels and reduces the
penalty of increasing MAXCPU. The structures now also maintain NUMA
affinity. No functional change intended.
PR: 269572
Reviewed by: mjg, kib
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D39807
linux(4): Don't relie on process osreldata when testing features
The ELF note identifyies the operating-system ABI that the executable
was created for. The note data of the Glibc executable contains the
earliest release number of the Linux kernel that supports this ABI.
As of a current 2.37 version of Glibc, it is 3.2.0 for x86, 3.7.0
for Aarch64.
Glibc does not use this release number and the current kernel's
LINUX_VERSION_CODE to detect kernel features, using fallbacks to known
previous way in case of ENOSYS or something else instead.
A dynamically linked Glibc reads the current kernel's LINUX_VERSION_CODE
from the ELF note in the vDSO or fallback to uname syscall if the vDSO
can't be located and parse the release field in struct utsname. Glibc
uses the current kernel's LINUX_VERSION_CODE for "kernel too old" check.
While here use inlined LINUX_KERNVER for tests to improve readability,
as suggested by emaste@.
* Move LLT_ADDEDPROXY handling into lltable_link_entry() to
reduct duplication
* Use standard lltable_delete_addr() for entry deletion
* Add (forgotten) call to llt_post_resolved handler after
adding the entry via netlink.
vmm: fix HLT loop while vcpu has requested virtual interrupts
This fixes the detection of pending interrupts when pirval is 0 and the
pending bit is set
More information how this situation occurs, can be found here:
https://github.com/freebsd/freebsd-src/blob/c5b5f2d8086f540fefe4826da013dd31d4e45fe8/sys/amd64/vmm/intel/vmx.c#L4016-L4031
Reviewed by: corvink, markj
Fixes: 02cc877968bbcd57695035c67114a67427f54549 ("Recognize a pending virtual interrupt while emulating the halt instruction.")
MFC after: 1 week
Sponsored by: vStack
Differential Revision: https://reviews.freebsd.org/D39620
There are some use cases where bhyve has to prepare some special memory
regions. E.g. GPU passthrough for Intel integrated graphic devices needs
to reserve some memory for the graphic device. So, bhyve has to inform
the guest about those memory regions. This information can be passed by
the qemu fwcfg interface. As qemu creates an E820 table, we can reuse
the existing fwcfg item "etc/e820".
This commit is the first one of a series. It only adds a basic
implementation for the creation of the E820 table. Some subsequent
commits will add more items to the E820 table and register it as fwcfg
item.
Note that static hints no longer break loader hints
This commentary was carried over from the x86 version of the same code,
but has actually been inaccurate for a while now. As of FreeBSD 12.x,
all environments are used unless they disable each other. See 39d44f7f15c ("kern_environment: use any provided environments [...]")
for details.
In the early days of gbde, it linked against libmd. Shortly after
conception, phk replaced ARC4 with SHA-512, but libmd did not have SHA2
at the time thus he built a copy of sha2.c for gbde.
Fast forward 3 years, cperciva adds SHA2 to libmd -- this makes gbde's
build of sha2.c redundant, but it's (understandably) overlooked. Let's
simplify the gbde build now and just assume that libmd includes the most
optimal implementation.
Independent of all of the commands, bectl itself takes an `-r` flag that
specifies the BE root to use. This was originally added to facilitate
testing, but it was later discovered to be incredibly useful in other
scenarios; e.g., trying to recover some boot environments in rescue
media.
The "BE root" described here is the parent dataset that holds boot
environments, but I've no idea if that's an accepted definition for that
dataset.
VOP_CLOSE(): MNTK_EXTENDED_SHARED filesystems do not need excl lock
All in-tree implementations of VOP_CLOSE() for filesystems proclaiming
MNTK_EXTENDED_SHARED, are fine with the shared lock for the closed
vnode. I checked the following implementations:
ffs
ext2
ufs
null
tmpfs
devfs
fdescfs
cd9660
zfs
It seems that initial addition of FWRITE check was due to necessity of
handling the VV_TEXT vnode vflag. Since VOP_ADD_WRITECOUNT() only
requires shared lock, we can relax the locking requirement there.
Reviewed by: markj, Olivier Certner <olce.freebsd@certner.fr>
Tested by: Olivier Certner
Sponsored by: The FreeBSD Foundation
Differential revision: https://reviews.freebsd.org/D39784
Cheng Cui [Tue, 25 Apr 2023 11:52:28 +0000 (07:52 -0400)]
Change the unit of srtt and rto to usec, inspired by these in struct "tcp_info". Therefore, no need hz and tcp_rtt_scale in the headline of the log. Update the man page as well.
Summary: Simplify srtt and rto values in siftr log.
The bit values are numbers given in octal representation, not decimal,
as one might assume from the description. Same goes for the base,
although this has an example.
Reviewed by: emaste
MFC after: 3 days
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D39815
arm64/disassem.c: Fix typo sxts to sxts and amount for TYPE_02
The current implementation is wrong, since it unconditionally sets the
amount equal to the <size> field of the instruction. However, when the
<S> bit (scale) is not set, it must be zero.
Also fix a typo, sxts to sxtx, according to the Arm64 documentation.
tcp_hpts: move HPTS related fields from inpcb to tcpcb
This makes inpcb lighter and allows future cache line optimizations
of tcpcb. The reason why HPTS originally used inpcb is the compressed
TIME-WAIT state (see 0d7445193ab), that used to free a tcpcb, while the
associated connection is still on the HPTS ring.
The purge was intentionally removed in a540cdca3183. My assumption
was that the stacks that use the input queue always call the
tcp_handle_orphaned_packets() in their tfb_tcp_fb_fini method.
However, rack will skip doing that if t_fb_ptr is NULL and there are
scenarios when it is NULL, e.g. close(2) on a socket (but some
special close(2)). Instead of working out all possible scenarios
let's put this safebelt back.
al_eth: make function definitions consistent with declarations
The declarations for al_eth_lm_retimer_ds25_signal_detect() and
al_eth_lm_retimer_ds25_cdr_lock() say that these functions return
'al_bool', but the definitions actually return 'boolean_t'.
boolean_t: change to unsigned int to avoid signed bitfield warnings
This is the final part, which actually makes boolean_t unsigned. Note
that we do not change its size, nor do we try to change it directly to
bool, since that results in a lot of regressions.
Converting the remaining instances of boolean_t to plain C99 bool can
now be done in a piecemeal fashion, after which boolean_t may hopefully
be retired.
vm: fix a number of functions to match the expected prototypes
Noticed while attempting to make boolean_t unsigned: some vm-related
function declarations and defintions were using boolean_t where they
should have used int, and vice versa.
zfs: make zfs_vfs_held() definition consistent with declaration
Noticed while attempting to change boolean_t into an actual bool: in
include/sys/zfs_ioctl_impl.h, zfs_vfs_held() is declared to return a
boolean_t, but in module/os/freebsd/zfs/zfs_ioctl_os.c it is defined to
return an int. Make the definition match the declaration.
Mark Johnston [Tue, 25 Apr 2023 17:33:08 +0000 (13:33 -0400)]
vmm: Expose some more AVX512 CPUID bits to guests
This is required to announce support for some accelerated AES
operations. AVX512BW indicates support for the AVX512-FP16 extension
and AVX512VL indicates support for the use of AVX512 instructions with
vector lengths smaller than 512 bits.
VAES and VPCLMULQDQ extensions indicate that VEX-prefixed AES-NI and
pclmulqdq instructions are supported.
All of these bits are needed for OpenSSL to use VAES to accelerate
AES-GCM transforms.
A signed one-bit wide bit-field can take only the values 0 and -1. Clang
16 introduced a warning that "implicit truncation from 'int' to a
one-bit wide bit-field changes value from 1 to -1". Fix the warnings by
using C99 bool.
To quote a pending upstream PR:
This reverts commit 4c856fb to resolve a newly introduced deadlock which
in practice is more disruptive that the issue this commit intended to
address.
Causes deadlocks described in https://github.com/openzfs/zfs/issues/14775