Fix vfs_emptydir(). It would consider directories containing directories
with names of the form 'X.' (X being any authorized byte) as empty. Also,
it would cause VOP_READDIR() to return an error on directories
containing enough whiteouts. While here, use a more decently sized
buffer as done elsewhere.
Remove ad-hoc iteration on the directory's content and instead use the
newly exported vn_dir_next_dirent() function (this is what fixes the
second problem mentioned above).
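The first problem above comes down to how "." and ".." are recognized.
A minimal userland sketch of a check that inspects every byte of the
name follows; is_dot_or_dotdot() is a hypothetical helper used only for
illustration, not the code in vfs_emptydir().

```c
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical helper (not the kernel's code): true only for the two
 * special entries "." and "..".  Both bytes of a two-byte name are
 * checked, so a name such as "a." counts as a real entry.
 */
static bool
is_dot_or_dotdot(const char *name, size_t namlen)
{
	if (namlen == 1)
		return (name[0] == '.');
	if (namlen == 2)
		return (name[0] == '.' && name[1] == '.');
	return (false);
}
```

A directory is then empty only if every entry it contains satisfies
this predicate (whiteouts aside).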
vfs: vn_dir_next_dirent(): Simplify interface and harden
Simplify the old interface (one less argument, simpler termination test)
and add documentation about it. Add more sanity checks (mostly under
INVARIANTS, but also in the general case to prevent infinite
loops). Drop the explicit test on minimum directory entry size (without
INVARIANTS).
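To illustrate the kind of check that prevents infinite loops when
walking variable-length records, here is a userland sketch. struct
rec_hdr and count_records() are assumptions made for this example; they
are not struct dirent nor the vn_dir_next_dirent() interface.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical fixed header of a variable-length record. */
struct rec_hdr {
	uint16_t reclen;	/* total record length, header included */
	uint16_t namlen;	/* length of the name following the header */
};

/*
 * Count the records in buf[0..len).  Returns -1 on a malformed record.
 * Rejecting a reclen smaller than the header (zero included) guarantees
 * forward progress, which is what rules out an infinite loop.
 */
static int
count_records(const unsigned char *buf, size_t len)
{
	size_t off = 0;
	int n = 0;

	while (off < len) {
		struct rec_hdr h;

		if (len - off < sizeof(h))
			return (-1);	/* truncated header */
		memcpy(&h, buf + off, sizeof(h));
		if (h.reclen < sizeof(h) || h.reclen > len - off)
			return (-1);	/* no progress, or overrun */
		if (h.namlen > h.reclen - sizeof(h))
			return (-1);	/* name does not fit in record */
		off += h.reclen;
		n++;
	}
	return (n);
}
```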
Deal with the impacts in callers (dirent_exists() and vop_stdvptocnp()).
dirent_exists() has been simplified a bit, preserving the exact same
semantics but for the return code whose meaning has been reversed (0 now
means the entry exists, ENOENT that it doesn't and other values are
genuine errors). While here, suppress gratuitous casts of malloc return
values.
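The return convention described above (0 for "exists", ENOENT for "does
not exist", anything else a genuine error) can be sketched in userland
as follows. name_exists() is a hypothetical stand-in that searches an
array of strings; it is not dirent_exists() itself.

```c
#include <errno.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical lookup following the same convention: 0 if 'want' is
 * present, ENOENT if it is absent, another errno value on real errors.
 */
static int
name_exists(const char *const *names, size_t n, const char *want)
{
	size_t i;

	if (names == NULL || want == NULL)
		return (EINVAL);	/* genuine error, not "absent" */
	for (i = 0; i < n; i++) {
		if (strcmp(names[i], want) == 0)
			return (0);
	}
	return (ENOENT);
}
```

With this convention a caller can test "error != 0 && error != ENOENT"
to separate real failures from a plain miss.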
vn_dir_next_dirent() has been tested by a 'make -j4 buildkernel' with a
temporary modification to the VFS cache causing vn_vptocnp() to always
call VOP_VPTOCNP() and finally vop_stdvptocnp() (observed with temporary
debug counters).
Export new _GENERIC_MINDIRSIZ and _GENERIC_MAXDIRSIZ on __BSD_VISIBLE,
and GENERIC_MINDIRSIZ and GENERIC_MAXDIRSIZ on _KERNEL.
vfs: Export get_next_dirent() as vn_dir_next_dirent()
Move internal-to-'vfs_default.c' get_next_dirent() to 'vfs_vnops.c' and
export it for use by other parts of the VFS. This is a preparatory
change for using it in vfs_emptydir().
- assumption that single-zone countries do not have description
is no longer correct; do not try to optimize this case as it's
only going to make the code more confusing and we now have menus
with a single zone selection because of this
- remove the single-country continent short cut, it also only serves
to confuse users as we now have such a continent
- instead add a single-zone country short cut (see above); now all
single-zone countries fall here
- use the #@ continent overrides that zone1970.tab introduces (the
visible effect includes fixing Iceland, currently listed under Africa)
- add Arctic Ocean "continent" coming only from the overrides at the
moment
- update baseline with the changes
Reviewed by: bapt, philip
Differential Revision: https://reviews.freebsd.org/D39606
Mark Johnston [Thu, 27 Apr 2023 16:58:56 +0000 (12:58 -0400)]
sockbuf: Add KMSAN checks to sbappend*()
Otherwise KMSAN only detects uninitialized memory when the contents of
the buffer are copied out to userspace or transmitted to a network
interface. At that point the KMSAN violation will be far removed from
its origin, so let's try to make debugging such problems a bit easier.
Reviewed by: glebius
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D38101
Mark Johnston [Thu, 27 Apr 2023 13:42:36 +0000 (09:42 -0400)]
cap_net tests: Skip tests if there is no connectivity
When testing cap_connect() and name/addr lookup functions, skip tests if
we fail and the error is not ENOTCAPABLE. This makes the tests amenable
to running in CI without Internet connectivity.
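The skip rule can be sketched as a small classifier. The enum and
classify() are hypothetical names for this example and do not come from
the cap_net tests; the fallback definition of ENOTCAPABLE exists only so
that the sketch compiles on systems lacking it.

```c
#include <errno.h>

#ifndef ENOTCAPABLE
#define	ENOTCAPABLE	93	/* FreeBSD's value; fallback for this sketch only */
#endif

enum outcome { TEST_PASS, TEST_FAIL, TEST_SKIP };

/*
 * Classify the result of an operation that the capability limits are
 * expected to allow.  'ret' is its return value and 'err' the errno
 * observed on failure.
 */
static enum outcome
classify(int ret, int err)
{
	if (ret == 0)
		return (TEST_PASS);
	if (err == ENOTCAPABLE)
		return (TEST_FAIL);	/* the limits wrongly denied it */
	return (TEST_SKIP);		/* e.g. no connectivity in CI */
}
```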
Elliott Mitchell [Wed, 14 Dec 2022 21:59:17 +0000 (13:59 -0800)]
arm: remove passing trapframe to intr_ipi_dispatch()
This was needed before INTRNG was in place and handling the push of
curthread->td_intr_frame. Since INTRNG now handles this, there is no
longer any need to play around with the frame inside IPI interrupts.
Elliott Mitchell [Wed, 14 Dec 2022 20:36:47 +0000 (12:36 -0800)]
arm: remove interrupt nesting by ipi_preempt()/ipi_hardclock()
This was needed when intr_ipi_dispatch() was called by hardware-specific
IPI interrupt routines which didn't save the trap frame. Now all ARM
interrupts pass through INTRNG which will have already saved the trap
frame and disabled preemption.
Remove the conditional trapframe/argument passing to the handlers.
Reviewed by: andrew
Differential Revision: https://reviews.freebsd.org/D37938
Gang ABDs without children are legal, and they do have zero size.
For other ABD types zero size makes little sense and likely does
not work correctly now.
Reviewed-by: Igor Kozhukhov <igor@dilos.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #14795
Ed Maste [Mon, 24 Apr 2023 19:41:45 +0000 (15:41 -0400)]
ipv6: disable RFC 4620 nodeinfo by default
RFC 4620 is an experimental RFC that can be used to request information
about a host, including:
- the fully-qualified or single-component name
- some set of the Responder's IPv6 unicast addresses
- some set of the Responder's IPv4 unicast addresses
This is not something that should be made available by default.
PR: 257709
Submitted by: ruben@verweg.com
Reviewed by: melifaro
Relnotes: Yes
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D39778
This is part one of a fix for booting with ZFS on arm64 using
accelerated checksum implementations. Checksum benchmarking will
attempt to use the FPU, so we currently panic quickly on boot. BLAKE3
is still broken, as it clobbers x18 and we promptly discover that fact
as soon as we attempt to fetch curthread in kfpu_end().
Note that _STANDALONE is special-cased here, but ideally we wouldn't be
building the code that uses kfpu_begin()/kfpu_end() at all in the loader
environment.
Discussed with: imp (a bit)
Differential Revision: https://reviews.freebsd.org/D39448
Similar to the PF_TAG_DUMMYNET we must also clear the route tag if
dummynet didn't keep the packet. In that case we'd continue immediately
and there'd be no need for the route tag. Keeping it could lead to
unexpected routing of traffic.
Mark Johnston [Wed, 26 Apr 2023 14:09:09 +0000 (10:09 -0400)]
callout: Move per-CPU callout state into the dpcpu region
This eliminates some static bloat in amd64 kernels and reduces the
penalty of increasing MAXCPU. The structures now also maintain NUMA
affinity. No functional change intended.
PR: 269572
Reviewed by: mjg, kib
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D39807
linux(4): Don't rely on process osreldata when testing features
The ELF note identifies the operating-system ABI that the executable
was created for. The note data of the Glibc executable contains the
earliest release number of the Linux kernel that supports this ABI.
As of a current 2.37 version of Glibc, it is 3.2.0 for x86, 3.7.0
for Aarch64.
Glibc does not use this release number and the current kernel's
LINUX_VERSION_CODE to detect kernel features; instead, it falls back to
the previously known method on ENOSYS or a similar error.
A dynamically linked Glibc reads the current kernel's LINUX_VERSION_CODE
from the ELF note in the vDSO or, if the vDSO can't be located, falls
back to the uname syscall and parses the release field in struct
utsname. Glibc uses the current kernel's LINUX_VERSION_CODE for the
"kernel too old" check.
While here use inlined LINUX_KERNVER for tests to improve readability,
as suggested by emaste@.
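As a rough illustration of a "kernel too old" style comparison, the
sketch below packs a release string into a version code the way Linux's
KERNEL_VERSION() does. SKETCH_KERNVER, parse_release() and
kernel_too_old() are hypothetical names; this is not Glibc's or the
linuxulator's code.

```c
#include <stdio.h>

/* Same packing as Linux's KERNEL_VERSION(); the name is local to this sketch. */
#define	SKETCH_KERNVER(a, b, c)	(((a) << 16) + ((b) << 8) + (c))

/*
 * Parse a utsname-style release such as "5.15.0-generic".
 * Returns the packed version code, or -1 if it cannot be parsed.
 */
static int
parse_release(const char *release)
{
	unsigned int a = 0, b = 0, c = 0;

	if (sscanf(release, "%u.%u.%u", &a, &b, &c) < 2)
		return (-1);
	if (a > 255 || b > 255)
		return (-1);
	if (c > 255)
		c = 255;	/* keep the patch level within its 8 bits */
	return ((int)SKETCH_KERNVER(a, b, c));
}

/* Nonzero if 'release' is unparsable or older than 'minimum'. */
static int
kernel_too_old(const char *release, int minimum)
{
	int code = parse_release(release);

	return (code < 0 || code < minimum);
}
```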
* Move LLT_ADDEDPROXY handling into lltable_link_entry() to
reduce duplication
* Use standard lltable_delete_addr() for entry deletion
* Add (forgotten) call to llt_post_resolved handler after
adding the entry via netlink.
vmm: fix HLT loop while vcpu has requested virtual interrupts
This fixes the detection of pending interrupts when pirval is 0 and the
pending bit is set.
More information on how this situation occurs can be found here:
https://github.com/freebsd/freebsd-src/blob/c5b5f2d8086f540fefe4826da013dd31d4e45fe8/sys/amd64/vmm/intel/vmx.c#L4016-L4031
Reviewed by: corvink, markj
Fixes: 02cc877968bbcd57695035c67114a67427f54549 ("Recognize a pending virtual interrupt while emulating the halt instruction.")
MFC after: 1 week
Sponsored by: vStack
Differential Revision: https://reviews.freebsd.org/D39620
There are some use cases where bhyve has to prepare some special memory
regions. E.g. GPU passthrough for Intel integrated graphic devices needs
to reserve some memory for the graphic device. So, bhyve has to inform
the guest about those memory regions. This information can be passed by
the qemu fwcfg interface. As qemu creates an E820 table, we can reuse
the existing fwcfg item "etc/e820".
This commit is the first one of a series. It only adds a basic
implementation for the creation of the E820 table. Some subsequent
commits will add more items to the E820 table and register it as fwcfg
item.
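For readers unfamiliar with the format, an E820 entry is a 20-byte
record holding a base address, a length and a type. The sketch below
shows one way to build such a table; the structure and function names
are assumptions for illustration and are not bhyve's implementation.

```c
#include <stddef.h>
#include <stdint.h>

#define	E820_TYPE_RAM		1	/* usable memory */
#define	E820_TYPE_RESERVED	2	/* reserved, not usable by the guest OS */
#define	E820_MAX_ENTRIES	16	/* arbitrary limit for this sketch */

/* One E820 entry: 8-byte base, 8-byte length, 4-byte type. */
struct e820_entry {
	uint64_t base;
	uint64_t length;
	uint32_t type;
} __attribute__((packed));

struct e820_table {
	size_t count;
	struct e820_entry entries[E820_MAX_ENTRIES];
};

/* Append a region; returns 0 on success, -1 if it is invalid or the table is full. */
static int
e820_add(struct e820_table *t, uint64_t base, uint64_t length, uint32_t type)
{
	if (t->count == E820_MAX_ENTRIES || length == 0 ||
	    base + length < base)	/* reject address wrap-around */
		return (-1);
	t->entries[t->count].base = base;
	t->entries[t->count].length = length;
	t->entries[t->count].type = type;
	t->count++;
	return (0);
}
```

The entries array is what a consumer of "etc/e820" would read; sorting
and merging of overlapping regions is left out of this sketch.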
Note that static hints no longer break loader hints
This commentary was carried over from the x86 version of the same code,
but has actually been inaccurate for a while now. As of FreeBSD 12.x,
all environments are used unless they disable each other. See 39d44f7f15c ("kern_environment: use any provided environments [...]")
for details.
In the early days of gbde, it linked against libmd. Shortly after
conception, phk replaced ARC4 with SHA-512, but libmd did not have SHA2
at the time, so he built a copy of sha2.c for gbde.
Fast forward 3 years, cperciva adds SHA2 to libmd -- this makes gbde's
build of sha2.c redundant, but it's (understandably) overlooked. Let's
simplify the gbde build now and just assume that libmd includes the most
optimal implementation.
Independent of all of the commands, bectl itself takes an `-r` flag that
specifies the BE root to use. This was originally added to facilitate
testing, but it was later discovered to be incredibly useful in other
scenarios; e.g., trying to recover some boot environments in rescue
media.
The "BE root" described here is the parent dataset that holds boot
environments, but I've no idea if that's an accepted definition for that
dataset.
VOP_CLOSE(): MNTK_EXTENDED_SHARED filesystems do not need excl lock
All in-tree implementations of VOP_CLOSE() for filesystems proclaiming
MNTK_EXTENDED_SHARED are fine with the shared lock for the closed
vnode. I checked the following implementations:
ffs
ext2
ufs
null
tmpfs
devfs
fdescfs
cd9660
zfs
It seems that the initial addition of the FWRITE check was due to the
need to handle the VV_TEXT vnode vflag. Since VOP_ADD_WRITECOUNT() only
requires a shared lock, we can relax the locking requirement there.
Reviewed by: markj, Olivier Certner <olce.freebsd@certner.fr>
Tested by: Olivier Certner
Sponsored by: The FreeBSD Foundation
Differential revision: https://reviews.freebsd.org/D39784
Cheng Cui [Tue, 25 Apr 2023 11:52:28 +0000 (07:52 -0400)]
Change the unit of srtt and rto to usec, inspired by the same fields in
struct "tcp_info". As a result, hz and tcp_rtt_scale are no longer
needed in the headline of the log. Update the man page as well.
Summary: Simplify srtt and rto values in siftr log.
The bit values are numbers given in octal representation, not decimal,
as one might assume from the description. The same goes for the base,
although that one has an example.
Reviewed by: emaste
MFC after: 3 days
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D39815
arm64/disassem.c: Fix typo sxts to sxtx and amount for TYPE_02
The current implementation is wrong, since it unconditionally sets the
amount equal to the <size> field of the instruction. However, when the
<S> bit (scale) is not set, it must be zero.
Also fix a typo, sxts to sxtx, according to the Arm64 documentation.
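The rule for the amount can be sketched as below. The function names
are hypothetical and the code is not taken from disassem.c; the option
encodings listed are those of the A64 register-offset load/store forms.

```c
#include <stddef.h>
#include <string.h>

/* Shift amount for a register-offset operand: <size> only when <S> is set. */
static unsigned int
ext_amount(unsigned int size, unsigned int s_bit)
{
	return (s_bit != 0 ? size : 0);
}

/* Extend/shift name for the 3-bit option field, NULL if unallocated. */
static const char *
ext_name(unsigned int option)
{
	switch (option) {
	case 2:
		return ("uxtw");
	case 3:
		return ("lsl");
	case 6:
		return ("sxtw");
	case 7:
		return ("sxtx");	/* the name the typo fix restores */
	default:
		return (NULL);
	}
}
```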
tcp_hpts: move HPTS related fields from inpcb to tcpcb
This makes inpcb lighter and allows future cache line optimizations
of tcpcb. The reason why HPTS originally used inpcb is the compressed
TIME-WAIT state (see 0d7445193ab), that used to free a tcpcb, while the
associated connection is still on the HPTS ring.
The purge was intentionally removed in a540cdca3183. My assumption
was that the stacks that use the input queue always call the
tcp_handle_orphaned_packets() in their tfb_tcp_fb_fini method.
However, rack will skip doing that if t_fb_ptr is NULL, and there are
scenarios when it is NULL, e.g. close(2) on a socket (though only in
some special cases of close(2)). Instead of working out all possible
scenarios let's put this safety belt back.
al_eth: make function definitions consistent with declarations
The declarations for al_eth_lm_retimer_ds25_signal_detect() and
al_eth_lm_retimer_ds25_cdr_lock() say that these functions return
'al_bool', but the definitions actually return 'boolean_t'.
boolean_t: change to unsigned int to avoid signed bitfield warnings
This is the final part, which actually makes boolean_t unsigned. Note
that we do not change its size, nor do we try to change it directly to
bool, since that results in a lot of regressions.
Converting the remaining instances of boolean_t to plain C99 bool can
now be done in a piecemeal fashion, after which boolean_t may hopefully
be retired.
vm: fix a number of functions to match the expected prototypes
Noticed while attempting to make boolean_t unsigned: some vm-related
function declarations and definitions were using boolean_t where they
should have used int, and vice versa.
zfs: make zfs_vfs_held() definition consistent with declaration
Noticed while attempting to change boolean_t into an actual bool: in
include/sys/zfs_ioctl_impl.h, zfs_vfs_held() is declared to return a
boolean_t, but in module/os/freebsd/zfs/zfs_ioctl_os.c it is defined to
return an int. Make the definition match the declaration.
Mark Johnston [Tue, 25 Apr 2023 17:33:08 +0000 (13:33 -0400)]
vmm: Expose some more AVX512 CPUID bits to guests
This is required to announce support for some accelerated AES
operations. AVX512BW indicates support for the AVX512 byte and word
instructions and AVX512VL indicates support for the use of AVX512
instructions with vector lengths smaller than 512 bits.
VAES and VPCLMULQDQ extensions indicate that VEX-prefixed AES-NI and
pclmulqdq instructions are supported.
All of these bits are needed for OpenSSL to use VAES to accelerate
AES-GCM transforms.
A signed one-bit wide bit-field can take only the values 0 and -1. Clang
16 introduced a warning that "implicit truncation from 'int' to a
one-bit wide bit-field changes value from 1 to -1". Fix the warnings by
using C99 bool.
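A small sketch of the difference, with hypothetical structure names.
Storing 1 into a signed one-bit field is an out-of-range conversion
whose result is implementation-defined; with the usual two's complement
behaviour of GCC and Clang it reads back as -1, while a bool field
reads back as 1.

```c
#include <stdbool.h>

struct flags_signed {
	signed int on : 1;	/* can only represent 0 and -1 */
};

struct flags_bool {
	bool on : 1;		/* represents 0 and 1 */
};

/* Store v into the signed one-bit field and read it back. */
static int
store_signed(int v)
{
	struct flags_signed f;

	f.on = v;	/* implementation-defined when v is 1 */
	return (f.on);
}

/* Store v into the bool field and read it back. */
static int
store_bool(int v)
{
	struct flags_bool f;

	f.on = v;	/* any nonzero value converts to true */
	return (f.on);
}
```

Code that tests such a field with "== 1" therefore silently fails in
the signed case, which is what the warning is pointing at.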
To quote a pending upstream PR:
This reverts commit 4c856fb to resolve a newly introduced deadlock which
in practice is more disruptive than the issue this commit intended to
address.
Causes deadlocks described in https://github.com/openzfs/zfs/issues/14775
Mark Johnston [Tue, 21 Mar 2023 13:36:58 +0000 (09:36 -0400)]
dtrace: Sync dis_tables.c with illumos
This brings in the following commits:
commit 584b574a3b16c6772c8204ec1d1c957c56f22a87
12174 i86pc: variable may be used uninitialized
Author: Toomas Soome <tsoome@me.com>
Reviewed by: John Levon <john.levon@joyent.com>
Reviewed by: Andrew Stormont <astormont@racktopsystems.com>
Approved by: Dan McDonald <danmcd@joyent.com>
commit a25e615d76804404e5fc63897a9196d4f92c3f5e
12371 dis x86 EVEX prefix mishandled
12372 dis EVEX encoding SIB mishandled
12373 dis support for EVEX vaes instructions
12374 dis support for EVEX vpclmulqdq instructions
12375 dis support for gfni instructions
Author: Robert Mustacchi <rm@fingolfin.org>
Reviewed by: Jerry Jelinek <jerry.jelinek@joyent.com>
Approved by: Joshua M. Clulow <josh@sysmgr.org>
commit c1e9bf00765d7ac9cf1986575e4489dd8710d9b1
12369 dis WBNOINVD support
Author: Robert Mustacchi <rm@joyent.com>
Reviewed by: Hans Rosenfeld <hans.rosenfeld@joyent.com>
Reviewed by: Jerry Jelinek <jerry.jelinek@joyent.com>
Reviewed by: Andy Fiddaman <andy@omniosce.org>
Reviewed by: Toomas Soome <tsoome@me.com>
Approved by: Dan McDonald <danmcd@joyent.com>
commit e4f6ce7088a7dd335b9edf4774325f888692e5fb
10893 Need support for new Cascade Lake Instructions
Author: Robert Mustacchi <rm@joyent.com>
Reviewed by: Hans Rosenfeld <hans.rosenfeld@joyent.com>
Reviewed by: Dan McDonald <danmcd@joyent.com>
Reviewed by: Richard Lowe <richlowe@richlowe.net>
Approved by: Gordon Ross <gwr@nexenta.com>
commit cff040f3ef42d16ae655969398f5a5e6e700b85e
10226 Need support for new EPYC ISA extensions
Author: Robert Mustacchi <rm@joyent.com>
Reviewed by: Hans Rosenfeld <hans.rosenfeld@joyent.com>
Reviewed by: Jason King <jason.king@joyent.com>
Reviewed by: Richard Lowe <richlowe@richlowe.net>
Approved by: Dan McDonald <danmcd@joyent.com>
commit d242cdf5288b86d9070d88791c8ee696612becdc
8492 AVX512 dis - legacy logical instructions
Author: Jerry Jelinek <jerry.jelinek@joyent.com>
Reviewed by: Robert Mustacchi <rm@joyent.com>
Reviewed by: Gordon Ross <gordon.w.ross@gmail.com>
Approved by: Richard Lowe <richlowe@richlowe.net>
commit 81b505b772ab015c588c56bb116239ee549b6eee
8384 AVX512 dis - EVEX prefix support
8385 32-bit avx dis test mishandles EVEX prefix
8386 32-bit bound dis is incorrect
Author: Jerry Jelinek <jerry.jelinek@joyent.com>
Reviewed by: Robert Mustacchi <rm@joyent.com>
Reviewed by: Gordon Ross <gordon.w.ross@gmail.com>
Approved by: Richard Lowe <richlowe@richlowe.net>
commit 92381362ae635a3bea638d87b7119f1623b6212e
8319 dis support for new xsave instructions
Author: Jerry Jelinek <jerry.jelinek@joyent.com>
Reviewed by: Robert Mustacchi <rm@joyent.com>
Reviewed by: Gordon Ross <gordon.w.ross@gmail.com>
Approved by: Richard Lowe <richlowe@richlowe.net>
commit a4e73d5d60e566669c550027fae2b1d87b4be2b4
8240 AVX512 dis - opmask instruction support
Author: Jerry Jelinek <jerry.jelinek@joyent.com>
Reviewed by: Robert Mustacchi <rm@joyent.com>
Reviewed by: Toomas Soome <tsoome@me.com>
Approved by: Gordon Ross <gordon.w.ross@gmail.com>
commit 959b2dfd39979fe8a9a315a52741d009eb168822
7825 want avx dis tests
7826 PCLMULQDQ psuedo-ops aren't properly described in dis
7827 dis tests for f16c, movbe, cpuid, msr, tsc, fence instrs
7828 sysenter and sysexit dis should be allowed in 64-bit x86
Author: Robert Mustacchi <rm@joyent.com>
Reviewed by: Jerry Jelinek <jerry.jelinek@joyent.com>
Approved by: Richard Lowe <richlowe@richlowe.net>
Boris Lytochkin [Tue, 25 Apr 2023 12:38:36 +0000 (12:38 +0000)]
ipfw: add [fw]mark implementation for ipfw
Packet Mark is an analogue to ipfw tags with O(1) lookup from mbuf while
regular tags require a single-linked list traversal.
Mark is a 32-bit number that can be looked up in a table
[with 'number' table-type], matched or compared with a number with optional
mask applied before comparison.
Being generic in nature, Mark can serve a variety of needs.
For example, it could be used as a security group: the mark holds a
security group id and represents a group of packet flows that share the
same access control policy.
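A masked comparison of this kind might look as follows. This is a
sketch of one plausible reading (the mask is applied to both sides);
mark_matches() is a hypothetical name and not the ipfw implementation.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Compare a packet's 32-bit mark with 'value', considering only the
 * bits selected by 'mask'.  A mask of 0xffffffff gives an exact match.
 */
static bool
mark_matches(uint32_t mark, uint32_t value, uint32_t mask)
{
	return ((mark & mask) == (value & mask));
}
```

For the security group example, a layout that keeps the group id in the
upper 16 bits (an assumption made here) would be matched with a mask of
0xffff0000, leaving the lower bits free for other uses.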
This change adds netlink create/modify/dump interfaces to `if_clone.c`.
The previous attempt at storing the logic inside `netlink/route/iface_drivers.c`
did not quite work, as, for example, dumping interface-specific state
(like vlan id or vlan parent) required some peeking into the private interfaces.
The new interfaces are added in a compatible way - callers don't have to do anything
unless they are extended with Netlink.
The change is intended to be fully transparent to the users.
Similarly to route(8) and netstat(8), ndp can be built without
netlink by defining WITHOUT_NETLINK in make.conf.
The change is intended to be fully transparent to the users.
Similarly to route(8) and netstat(8), arp can be built without
netlink by defining WITHOUT_NETLINK in make.conf.
ipfw.8: improve description for interface matching
The manual describes only the "if*" form, while the kernel uses
fnmatch(3) and so allows more versatile shell-like patterns.
State that explicitly and provide an example.
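The kind of matching fnmatch(3) provides can be tried from userland
with the sketch below. ifname_matches() is a hypothetical wrapper, and
calling fnmatch() with flags set to 0 is an assumption made for this
example rather than a statement about the kernel code.

```c
#include <fnmatch.h>
#include <stdbool.h>

/* True if the interface name matches the shell-style pattern. */
static bool
ifname_matches(const char *pattern, const char *ifname)
{
	return (fnmatch(pattern, ifname, 0) == 0);
}
```

Beyond "em*", this accepts patterns such as "vlan[0-9]" (a single
digit after "vlan") or "ix?" (exactly one character after "ix").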