Bjoern A. Zeeb [Fri, 1 Dec 2023 01:37:25 +0000 (01:37 +0000)]
tools/net80211: add mlme_assoc
mlme_assoc is a tool to trigger net80211::ieee80211_sta_join1() calls
which in certain conditions cause problems for the LinuxKPI 802.11 compat
code (and are believed to possibly cause problems, in case of a race, for
other firmware-based drivers as well). This has proven to be a good reproducer
for the problem even on setups which otherwise could run for days without
hitting it.
Bjoern A. Zeeb [Fri, 3 Nov 2023 21:19:26 +0000 (21:19 +0000)]
Revert "Widen EPOCH(9) usage in PCI WLAN drivers."
This reverts commit b65f813c1ab99448278961c5ca80dc422b1eae29.
As a side effect this also seems to fix wtap which seems to have
lost the epoch over the input path in between.
Sponsored by: The FreeBSD Foundation
MFC after: 3 days
Bjoern A. Zeeb [Sun, 29 Oct 2023 14:25:23 +0000 (14:25 +0000)]
net80211: move net_epoch into net80211
Move the net_epoch into net80211 around the if_input calls and out of
the driver (in this first case LinuxKPI). This reduces coverage but
also allows us to alloc in calls like (*ampdu_rx_start) which do not
actually pass data up the stack.
Sponsored by: The FreeBSD Foundation
MFC after: 3 days
Tested by: few (rtwn, ath, iwlwifi, ...)
Reviewed by: adrian
Differential Revision: https://reviews.freebsd.org/D42427
Many languages use different case endings depending on whether the month
is referenced as a standalone word (nominative case), or in date context
(genitive, partitive, etc.). sort(1)'s -M option currently sorts months
by testing input against only the abbreviation format, which is
essentially a substring of the full format. While this works fine for
languages like English, where there are no cases, for languages where
there is a different case ending between the abbreviation/full and
standalone formats, it is not sufficient.
For example, in Greek, "May" can take the following forms:
RTLD_DEEPBIND: make lookup not just symbolic, but walk all refobj' DAGs
before starting the walk over the global list. Effectively we visit
needed objects first as well, instead of just the object itself.
This seems to better match the semantic offered by the glibc flag.
Reported by: kevans
PR: 275393
Reviewed by: kevans
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D42841
Mark Johnston [Thu, 30 Nov 2023 17:46:08 +0000 (12:46 -0500)]
ossl: Add support for armv7
OpenSSL provides implementations of several AES modes which use
bitslicing and can be accelerated on CPUs which support the NEON
extension. This patch adds arm platform support to ossl(4) and provides
an AES-CBC implementation, though bsaes_cbc_encrypt() only implements
decryption. The real goal is to provide an accelerated AES-GCM
implementation; this will be added in a subsequent patch.
Initially derived from https://reviews.freebsd.org/D37420.
Mark Johnston [Wed, 29 Nov 2023 20:08:12 +0000 (15:08 -0500)]
ossl: Fix some bugs in the fallback AES-GCM implementation
gcm_*_aesni() are used when the AVX512 implementation is not available.
Fix two bugs which manifest when handling operations spanning multiple
segments:
- Avoid underflow when the length of the input is smaller than the
residual.
- In gcm_decrypt_aesni(), ensure that we begin the operation at the
right offset into the input and output buffers.
Reviewed by: jhb
Fixes: 9b1d87286c78 ("ossl: Add a fallback AES-GCM implementation using AES-NI")
MFC after: 3 days
Differential Revision: https://reviews.freebsd.org/D42838
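The underflow fix described above can be illustrated with a minimal sketch. This is not the ossl(4) code: the helper names residual_take() and bulk_len(), and the reading of "residual" as the number of bytes still needed to finish a partially processed block, are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>

#define BLOCK_LEN 16	/* AES block size in bytes */

/*
 * Hypothetical helper: "used" is how many bytes of the current block were
 * consumed by the previous segment; "len" is the length of this segment.
 * Returns how many bytes of this segment go toward finishing that block.
 */
static size_t
residual_take(size_t used, size_t len)
{
	size_t res;

	assert(used < BLOCK_LEN);
	res = (BLOCK_LEN - used) % BLOCK_LEN;	/* bytes needed to finish the block */
	return (len < res ? len : res);		/* clamp to what is available */
}

/* Bytes left for the whole-block path once the residual is handled. */
static size_t
bulk_len(size_t used, size_t len)
{
	/* residual_take() never exceeds len, so this subtraction cannot wrap. */
	return (len - residual_take(used, len));
}
```

Computing len - res directly with unsigned types would wrap to a huge value when a segment is shorter than the residual; clamping first avoids that.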
Gleb Smirnoff [Thu, 30 Nov 2023 16:30:55 +0000 (08:30 -0800)]
sockets: don't malloc/free sockaddr memory on getpeername/getsockname
Just like it was done for accept(2) in cfb1e92912b4, use the same approach
for two simpler syscalls that return socket addresses. Although
these two syscalls aren't performance critical, this change generalizes
some code between 3 syscalls, trimming code size.
Following example of accept(2), provide VNET-aware and INVARIANT-checking
wrappers sopeeraddr() and sosockaddr() around protosw methods.
Gleb Smirnoff [Thu, 30 Nov 2023 16:30:55 +0000 (08:30 -0800)]
sockets: don't malloc/free sockaddr memory on accept(2)
Let the accept functions provide stack memory for protocols to fill it in.
Generic code should provide sockaddr_storage, specialized code may provide
smaller structure.
While rewriting accept(2) make 'addrlen' a true in/out parameter, reporting
the required length if the provided length was insufficient. Our manual
page accept(2) and POSIX don't explicitly require that, but one can read
the text as they do. Linux also does that. Update tests accordingly.
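The in/out behaviour of 'addrlen' can be sketched as follows. The helper addr_copyout() is hypothetical and works on plain memory rather than on user space; it only shows the contract: copy as much as fits, then report the full length so the caller can detect truncation.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper. On entry *lenp is the size of the caller's buffer;
 * on return it is the length the address actually requires.
 */
static void
addr_copyout(void *dst, size_t *lenp, const void *src, size_t srclen)
{
	size_t n;

	assert(lenp != NULL);
	n = *lenp < srclen ? *lenp : srclen;
	memcpy(dst, src, n);	/* truncate if the caller's buffer is short */
	*lenp = srclen;		/* always report the required length */
}
```

A caller that sees a returned length larger than the buffer it passed knows the address was truncated and can retry with a bigger buffer.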
Jamie Gritton [Thu, 30 Nov 2023 00:12:13 +0000 (16:12 -0800)]
jail: Don't allow jail_set(2) to resurrect dying jails.
Currently, a prison in "dying" state (removed but still holding
resources) can be brought back to alive state via "jail -d", or
the JAIL_DYING flag to jail_set(2). This seemed like a good idea
at the time.
Its main use was to improve support for specifying the jid when
creating a jail, which also seemed like a good idea at the time.
But resurrecting a jail that was partway through the process of
shutting down is trouble waiting to happen.
This patch deprecates that flag, leaving it as a no-op for creating
jails (but still useful for looking at dying jails). It still allows
creating a new jail with the same jid as a dying one, but will renumber
the old one in that case. That's imperfect, but allows for current
behavior.
The number of events we track can vary over time, but we only allocate
enough space for the exact number of events we are tracking when we
first begin, resulting in a trivially reproducible heap overflow. Fix
this by allocating enough space for the greatest possible number of
events (two per file) and clean up the code a bit.
Also add a test case which triggers the aforementioned heap overflow,
although we don't currently have a way to detect it.
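A minimal sketch of the sizing rule above, assuming a hypothetical struct event_slot and helper alloc_event_slots() (neither is taken from the patched program): the buffer is sized for the worst case of two events per file rather than for the count in use at startup.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define EVENTS_PER_FILE	2	/* worst case named in the text above */

/* Hypothetical stand-in for whatever per-event record is tracked. */
struct event_slot {
	int fd;
	int kind;
};

/* Allocate room for the greatest possible number of events. */
static struct event_slot *
alloc_event_slots(size_t nfiles, size_t *nslots)
{
	assert(nslots != NULL);
	if (nfiles > SIZE_MAX / EVENTS_PER_FILE)
		return (NULL);		/* the multiplication would overflow */
	*nslots = nfiles * EVENTS_PER_FILE;
	return (calloc(*nslots, sizeof(struct event_slot)));
}
```

With the array sized once for the maximum, later changes in the number of tracked events never write past the end of the allocation.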
Bjoern A. Zeeb [Wed, 29 Nov 2023 21:33:23 +0000 (21:33 +0000)]
iwlwififw: add firmware for the Bz/B200 chipset
The iwlwifi driver already supports the chipset as "Bz TBD"
(also in 14.0). Add the firmware for it. Successfully tested
for 0x8086/0x272b/0x8086/0x00f4 on arm64 thanks to donated
hardware [1].
vt(4): Call post-switch callback after replacing the backend
[Why]
For instance, it gives a chance to the new backend to refresh the
screen. This is needed by the vt_drmfb backend and `drm_fb_helper`.
This change was lost when I posted changes to reviews.freebsd.org and it
broke the amdgpu driver... Thanks to manu@ for reporting the problem
and wulf@ for finding the missing change!
Tested by: manu
Reviewed by: manu
Approved by: manu
Differential Revision: https://reviews.freebsd.org/D42834
zil_claim_clone_range() takes references on cloned blocks before ZIL
replay. Later zil_free_clone_range() drops them after replay or on
dataset destroy. The total balance is neutral. It means on actual
replay we must take additional references, which would stay in BRT.
Without this blocks could be freed prematurely when either original
file or its clone are destroyed. I've observed BRT being emptied
and the feature being deactivated after ZIL replay completion, which
should not have happened. With the patch I see expected stats.
Reviewed-by: Kay Pedersen <mail@mkwg.de>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Rob Norris <robn@despairlabs.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #15603
John Baldwin [Wed, 29 Nov 2023 18:31:47 +0000 (10:31 -0800)]
pci_cfgreg: Add a PCI domain argument to the low-level register API
This commit changes the API of pci_cfgreg(read|write) to add a domain
argument (referred to as a segment in ACPI parlance) (note that this
is not the same as a NUMA domain, but something PCI-specific). This
does not yet enable access to domains other than 0, but updates the
API to support domains.
Places that use hard-coded bus/slot/function addresses have been
updated to hardcode a domain of 0. A few places that have the PCI
domain (segment) available such as the acpi_pcib_acpi.c Host-PCI
bridge driver pass the PCI domain.
The hpt27xx(4) and hptnr(4) drivers fail to attach to a device not on
domain 0 since they provide APIs to their binary blobs that only
permit bus/slot/function addressing.
The x86 non-ACPI PCI bus drivers all hardcode a domain of 0 as they do
not support multiple domains.
Wraithh [Wed, 29 Nov 2023 17:55:17 +0000 (19:55 +0200)]
Fix zoneid when USER_NS is disabled
getzoneid() should return GLOBAL_ZONEID instead of 0 when USER_NS is disabled.
Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ilkka Sovanto <github@ilkka.kapsi.fi>
Closes #15560
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Warner Losh <imp@FreeBSD.org>
Signed-off-by: Martin Matuska <mm@FreeBSD.org>
Closes #15606
Igor Ostapenko [Wed, 29 Nov 2023 12:35:41 +0000 (13:35 +0100)]
pf: fix mem leaks upon vnet destroy
Add missing cleanup actions:
- remove user defined anchor rulesets
- remove user defined ether anchor rulesets
- remove tables linked to user defined anchors
- deal with wildcard anchor peculiarities to get them removed correctly
Mark Johnston [Wed, 29 Nov 2023 17:51:55 +0000 (12:51 -0500)]
ossl: Keep mutable AES-GCM state on the stack
ossl(4)'s AES-GCM implementation keeps mutable state in the session
structure, together with the key schedule. This was done for
convenience, as both are initialized together. However, some OCF
consumers, particularly ZFS, assume that requests may be dispatched to
the same session in parallel. Without serialization, this results in
incorrect output.
Fix the problem by explicitly copying per-session state onto the stack
at the beginning of each operation.
PR: 275306
Reviewed by: jhb
Fixes: 9a3444d91c70 ("ossl: Add a VAES-based AES-GCM implementation for amd64")
MFC after: 3 days
Differential Revision: https://reviews.freebsd.org/D42783
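The fix can be sketched in miniature as follows. The structures and the toy XOR "operation" are hypothetical and bear no relation to the real AES-GCM context; the point is only that the session is read-only during a request and every call mutates its own stack copy.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical mutable per-operation state. */
struct op_state {
	uint64_t counter;
	uint8_t acc;
};

/* Hypothetical session: key material plus an initial state template. */
struct session {
	uint8_t key[16];		/* read-only after setup */
	struct op_state initial;	/* never modified by a request */
};

static uint8_t
session_process(const struct session *s, const uint8_t *in, size_t len)
{
	struct op_state st;
	size_t i;

	assert(s != NULL);
	st = s->initial;	/* private per-call copy on the stack */
	for (i = 0; i < len; i++)
		st.acc ^= in[i] ^ s->key[i % sizeof(s->key)] ^
		    (uint8_t)st.counter++;
	return (st.acc);
}
```

Because nothing in the session is written, two requests dispatched to the same session in parallel cannot corrupt each other's state, and no serialization is needed.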
Warner Losh [Wed, 29 Nov 2023 15:26:29 +0000 (08:26 -0700)]
openzfs: unbreak 32-bit builds.
32-bit builds are broken. Fix that by using PRIu64 instead of a
bare '%lu.'
Feel free to revert when upstream has this fixed. I'm agnostic as to the
proper fix, but don't have the time to fight upstreaming this on top of
everything else.
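For reference, a minimal sketch of the portable form. The helper format_u64() is hypothetical; the PRIu64 macro comes from <inttypes.h> and expands to the correct conversion for uint64_t on both 32-bit and 64-bit platforms, whereas '%lu' only matches where unsigned long is 64 bits wide.

```c
#include <assert.h>
#include <inttypes.h>
#include <stdio.h>

/* Format a 64-bit value without assuming the width of unsigned long. */
static int
format_u64(char *buf, size_t buflen, uint64_t v)
{
	assert(buf != NULL);
	return (snprintf(buf, buflen, "%" PRIu64, v));
}
```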
Alan Somers [Wed, 12 Jul 2023 20:46:27 +0000 (14:46 -0600)]
zfsd: fault disks that generate too many I/O delay events
If ZFS reports that a disk had at least 8 I/O operations over 60s that
were each delayed by at least 30s (implying a queue depth > 4 or I/O
aggregation, obviously), fault that disk. Disks that respond this
slowly can degrade the entire system's performance.
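The threshold rule above (at least 8 delay events within 60s) can be sketched with a small ring buffer. This is not zfsd's implementation; struct delay_tracker and delay_event() are hypothetical, and treating the window boundary as inclusive is an assumption.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define DELAY_THRESHOLD	8	/* events needed to fault the disk */
#define DELAY_WINDOW	60	/* window length in seconds */

struct delay_tracker {
	int64_t stamps[DELAY_THRESHOLD];	/* most recent event times */
	uint64_t count;				/* total events recorded */
};

/* Record one delay event at time "now"; return true if the disk should fault. */
static bool
delay_event(struct delay_tracker *t, int64_t now)
{
	int64_t oldest;

	assert(t != NULL);
	t->stamps[t->count % DELAY_THRESHOLD] = now;
	t->count++;
	if (t->count < DELAY_THRESHOLD)
		return (false);
	/* The slot that would be overwritten next holds the oldest of the last 8. */
	oldest = t->stamps[t->count % DELAY_THRESHOLD];
	return (now - oldest <= DELAY_WINDOW);
}
```

Keeping only the last 8 timestamps is enough: the disk qualifies exactly when the eighth most recent event is no older than the window.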
Alexander Motin [Wed, 29 Nov 2023 01:50:30 +0000 (18:50 -0700)]
mpi3mr: Make these bus_dmamap_load calls synchronous
These calls "should" all be synchronous. There's no bouncing that's
needed for them (at least in the typical case that we have a sane card
that has more bits of dma addresses decoded than we have memory), so
there are no errors possible. Ensure these calls are really synchronous
with BUS_DMA_NOWAIT flags (which should never fail now that the
bus_dmamem_alloc() has succeeded).
Warner Losh [Wed, 29 Nov 2023 01:49:49 +0000 (18:49 -0700)]
mpi3mr: Honor the dma mask from IOCFacts
The number of significant bits that are decoded is returned in the flags
field of the IOCFacts structure from the device. Rather than assume the
worst with a pessimal 32-bit maximum, look at this value and pass it
along to all the dma map creation requests.
A lot of those creations are repetitive and could just inherit from the
base tag if we moved to the templated interface. This is called out as
desirable future work not done at this time.
In addition, due to a chicken and an egg problem, we have to allocate
some of the maps with a 32-bit lowaddr. These are the ones we need to
read iocfacts. And they are fine to be so restricted: they are little
used after startup, and when they are used, bouncing is fine.
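Turning a count of decoded address bits into an upper address limit can be sketched as below. How the driver extracts the bit count from the IOCFacts flags is not shown here; the helper dma_limit_from_bits() is hypothetical, and falling back to a 32-bit limit for an out-of-range count is an assumption.

```c
#include <assert.h>
#include <stdint.h>

/* Highest address reachable with "bits" significant DMA address bits. */
static uint64_t
dma_limit_from_bits(unsigned bits)
{
	uint64_t limit;

	if (bits == 0 || bits > 64)
		bits = 32;			/* assumed conservative fallback */
	if (bits == 64)
		limit = UINT64_MAX;		/* shifting by 64 would be undefined */
	else
		limit = (UINT64_C(1) << bits) - 1;
	assert(limit != 0);
	return (limit);
}
```

The resulting value is the kind of limit a driver would pass as the low address boundary when creating its DMA tags.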
Warner Losh [Wed, 29 Nov 2023 01:49:39 +0000 (18:49 -0700)]
mpi3mr: Fix EINPROGRESS errors hanging the card
Move enqueueing of commands to bus_dmamap_load_ccb callback
Fix a fundamental difference between FreeBSD and Linux. On Linux, your dma
load callback always happens before the load returns, so drivers are written
to load the map, then submit to hardware. On FreeBSD, the callback may
be deferred and the load returns EINPROGRESS. This means the callback is
responsible for queueing the request to the hardware, after the
SGL list is created. Make a number of interrelated changes:
At the end of mpi3mr_prepare_sgls, add a call to mpi3mr_enqueue_request.
Split the hardware submission out from the end of mpi3mr_action_scsiio
and move it into a new routine mpi3mr_enqueue_request.
Move all error completion from the end of mpi3mr_action_scsiio to where
the error is detected. We cannot pass errors back from the
mpi3mr_enqueue_request to do this on a 'failed' mpi3mr in a centralized
place (since it has to be fire and forget).
Add comments about zero length SGLs never making it into
mpi3mr_prepare_sgls. Keep the code there for the moment, but we only set
cm->data to non-NULL when scsiio_req->DataLength is not zero. So the
datalength can't be zero and we can't send the zero SGLs.
Add comments about other "impossible" tests in mpi3mr_prepare_sgls that
really should be simple asserts of some flavor.
Eliminate cm->error_code, since we can't pass data back from the
mpi3mr_prepare_sgl callback anymore.
In mpi3mr_map_request, call mpi3mr_enqueue_request for the no data case.
This seems to work even though we've not done the special zero length
handling that was in mpi3mr_prepare_sgls, giving further evidence to it
not actually being needed. This is needed for SCSI CDBs that have no
data to pass to the drive like TEST UNIT READY.
With this change, and the prior ones, we're now able to run with mpi3mr
on 128GB systems and very heavy disk load (so many buffers land > 4GB:
the driver instructs busdma to never use memory above 4GB, which may be
too conservative, but an issue for another time).
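The callback-driven submission pattern described in this commit can be sketched with a mock loader. Nothing below is the busdma or mpi3mr API: struct mock_dma, mock_dmamap_load(), and the request fields are hypothetical stand-ins that only model "the callback may run now or later".

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

struct request {
	bool sgl_ready;
	bool enqueued;
};

typedef void load_cb_t(void *arg, int error);

/* Hypothetical loader that may defer its callback. */
struct mock_dma {
	bool defer;
	load_cb_t *pending_cb;
	void *pending_arg;
};

static int
mock_dmamap_load(struct mock_dma *d, load_cb_t *cb, void *arg)
{
	assert(cb != NULL);
	if (d->defer) {
		d->pending_cb = cb;	/* run later, once resources exist */
		d->pending_arg = arg;
		return (EINPROGRESS);
	}
	cb(arg, 0);
	return (0);
}

/* Simulate resources becoming available for a deferred load. */
static void
mock_dma_run_deferred(struct mock_dma *d)
{
	if (d->pending_cb != NULL) {
		d->pending_cb(d->pending_arg, 0);
		d->pending_cb = NULL;
	}
}

/* The callback builds the SGL and then submits; nobody else submits. */
static void
prepare_sgls_cb(void *arg, int error)
{
	struct request *r = arg;

	if (error != 0)
		return;		/* error completion would happen here */
	r->sgl_ready = true;
	r->enqueued = true;	/* stands in for handing off to hardware */
}

static int
submit(struct mock_dma *d, struct request *r)
{
	int error = mock_dmamap_load(d, prepare_sgls_cb, r);

	/* A deferred load is not a failure; the callback finishes the job. */
	return (error == EINPROGRESS ? 0 : error);
}
```

Since the submit path never touches the hardware queue itself, the request is enqueued exactly once whether the callback runs immediately or after a deferral.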
Warner Losh [Wed, 29 Nov 2023 01:49:30 +0000 (18:49 -0700)]
mpi3mr: Cleanup setting of status in processing scsiio requests
More uniformly use mpi3mr_set_ccbstatus in mpi3mr_action_scsiio. The
routine mostly used it, but also has setting of status by hand. In those
cases where we want to error out the request, use this routine.
As part of this, move setting CAM_SIM_QUEUED later in the function to
when we're sure it's been queued. Remove the places we clear it before
this.
Warner Losh [Wed, 29 Nov 2023 01:49:24 +0000 (18:49 -0700)]
mpi3mr: Only set callout_owned when we create a timeout
Since we assume there's a timeout to cancel when this is true, only set
it true when we set the timeout. Otherwise we may try to cancel a timeout
when there's been an error in submission.
Warner Losh [Wed, 29 Nov 2023 01:49:08 +0000 (18:49 -0700)]
mpi3mr: Reduce the scope of the reset_mutext
Reduce the scope of reset_mutext to protect the msleep in the watch dog
thread as well as the MPI3MR_FLAGS_SHUTDOWN bit. Use it to protect the
wakeup in mpi3mr_detach so this thread can exit sooner when we're trying
to do an orderly shutdown. Optimize the flow to check the sleep and
other conditions before going to sleep.
It's an open question if this should protect sc->unrecoverable, and if
we should wakeup the watchdog thread when we set it. We might also want
to move to booleans for the three flags that we have now in
mpi3mr_flags. There are a number of U8s that should really be bools and
we might want to also group them together to pack softc better.
Warner Losh [Wed, 29 Nov 2023 01:49:01 +0000 (18:49 -0700)]
mpi3mr: Remove unused fields in struct mpi3mr_cmd
All of these fields are either unused, or just initialized. Remove
them. This saves about 1MB of memory for the cards that I have which can
do 8k transactions at once.
Warner Losh [Wed, 29 Nov 2023 01:48:48 +0000 (18:48 -0700)]
mpi3mr: Don't hold fwevt_lock over call to taskqueue_drain
Holding fwevt_lock when we call taskqueue_drain can lead to deadlock
because the queue being drained needs fwevt_lock to do its work, so that
other thread will try to take the lock and block, making the thread never
finish and taskqueue_drain never complete. There's a witness
warning/error for this which was exposed when the lock was converted to
a MTX_DEF lock from a MTX_SPIN prior to committing to the FreeBSD tree.
The lock appears to be to protect against additional items being added
to the event list while we're doing a reset. Since the taskqueue is
blocked, items can get added to the list, but won't be processed during
the reset, but there is still a (likely small) race between the
taskqueue_drain and the taskqueue_block calls where an interrupt could
fire on another CPU, resulting in a task being enqueued and started
before the block can take effect. The only way to fix that race is to
turn off interrupt processing during a reset. So we replace a deadlock
with a smaller race.
Shengqi Chen [Wed, 22 Nov 2023 13:58:47 +0000 (21:58 +0800)]
module/icp/asm-arm/sha2: auto detect __ARM_ARCH
This patch uses __ARM_ARCH set by compiler (both
GCC and Clang have this) whenever possible instead
of hardcoding it to 7. This change allows code to
compile on earlier ARM architectures such as armv5te.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Shengqi Chen <harry-chen@outlook.com>
Closes #15557
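The detection described above can be sketched with the preprocessor. The macro name SHA2_ARM_ARCH and the accessor sha2_arm_arch() are hypothetical; only __ARM_ARCH itself is a compiler-provided macro, and the fallback of 7 mirrors the previously hardcoded value.

```c
#include <assert.h>

/* Prefer the compiler's value; fall back to the old hardcoded default. */
#if defined(__ARM_ARCH)
#define SHA2_ARM_ARCH	__ARM_ARCH
#else
#define SHA2_ARM_ARCH	7
#endif

static int
sha2_arm_arch(void)
{
	assert(SHA2_ARM_ARCH > 0);	/* a valid architecture version is positive */
	return (SHA2_ARM_ARCH);
}
```

Code that needs instructions from a particular architecture level can then be guarded with a comparison against this value instead of assuming armv7.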
"route add <host> -iface <netif>" for a netif without an IPv4/IPv6
address fails with EINVAL. Need to use a link-level ifaddr for gw if
an ifaddr for dst is not found as the rtsock-based implementation does.
Notable upstream pull request merges:
#15532 c1a47de86 zdb: Fix zdb '-O|-r' options with -e/exported zpool
#15535 cf3316633 ZVOL: Minor code cleanup
#15541 803a9c12c brt: lift internal definitions into _impl header
#15541 213d68296 zdb: show BRT statistics and dump its contents
#15543 a49087510 ZIL: Refactor TX_WRITE encryption similar to
TX_CLONE_RANGE
#15543 27d8c23c5 ZIL: Do not encrypt block pointers in lr_clone_range_t
#15549 67894a597 unnecessary alloc/free in dsl_scan_visitbp()
#15551 126efb588 FreeBSD: Fix the build on FreeBSD 12
#15563 acb33ee1c FreeBSD: Fix ZFS so that snapshots under .zfs/snapshot are
NFS visible
#15564 7bbd42ef4 Don't allow attach to a raidz child vdev
#15566 688514e47 dmu_buf_will_clone: fix race in transition back to NOFILL
#15571 30d581121 dnode_is_dirty: check dnode and its data for dirtiness
Mike Karels [Tue, 28 Nov 2023 19:47:37 +0000 (13:47 -0600)]
ifconfig: add -D option to print driver name for interface
Add -D option to add the drivername and unit number to ifconfig output
for normal display, including -a. Use ifconfig_get_orig_name() from
libifconfig to fetch the name. Note that this is the original name
for many drivers, but not for some exceptions like epair (which appends
'a' or 'b' to the unit number). epair interface pairs both display
as "epair0", etc. Make -v imply -D; might as well be fully verbose.
Mark Johnston [Tue, 28 Nov 2023 19:35:49 +0000 (14:35 -0500)]
ossl: Fix handling of separate AAD buffers in ossl_aes_gcm()
Consumers may optionally provide a reference to a separate buffer
containing AAD, but ossl_aes_gcm() didn't handle this and would thus
compute an incorrect digest.
Fixes: 9a3444d91c70 ("ossl: Add a VAES-based AES-GCM implementation for amd64")
Reviewed by: jhb
MFC after: 3 days
Sponsored by: Klara, Inc.
Sponsored by: Stormshield
Differential Revision: https://reviews.freebsd.org/D42736
Linux 6.6 compat: fix configure error with clang (#15558)
With Linux v6.6.x and clang 16, a configure step fails on a warning that
later results in an error while building, due to 'ts' being
uninitialized. Add a trivial initialization to silence the warning.
Signed-off-by: Jaron Kent-Dobias <jaron@kent-dobias.com>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Rob N [Tue, 28 Nov 2023 17:53:04 +0000 (04:53 +1100)]
dmu_buf_will_clone: fix race in transition back to NOFILL
Previously, dmu_buf_will_clone() would roll back any dirty record, but
would not clean out the modified data nor reset the state before
releasing the lock. That leaves the last-written data in db_data, but
the dbuf in the wrong state.
This is eventually corrected when the dbuf state is made NOFILL, and
dbuf_noread() called (which clears out the old data), but at this point
it's too late, because the lock was already dropped with that invalid
state.
Any caller acquiring the lock before the call into
dmu_buf_will_not_fill() can find what appears to be a clean, readable
buffer, and would take the wrong state from it: it should be getting the
data from the cloned block, not from earlier (unwritten) dirty data.
Even after the state was switched to NOFILL, the old data was still not
cleaned out until dbuf_noread(), which is another gap for a caller to
take the lock and read the wrong data.
This commit fixes all this by properly cleaning up the previous state
and then setting the new state before dropping the lock. The
DBUF_VERIFY() calls confirm that the dbuf is in a valid state when the
lock is down.
Sponsored-by: Klara, Inc.
Sponsored-By: OpenDrives Inc.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Pawel Jakub Dawidek <pawel@dawidek.net>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Closes #15566
Closes #15526
It is similar to VOP_GETWRITEMOUNT(), and for given vnode vp should
return the lower vnode which would actually handle write to vp.
Flags allow to specify FREAD or FWRITE for benefit of possible unionfs
implementation.
Reviewed by: markj, Olivier Certner <olce.freebsd@certner.fr>
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D42603
EVFILT_TIMER: initialize stop timer list in type-stable proc init, instead of fork
Since kqueue timer may exist after the process that created it exited
(same scenario with rfork(2) as in PR 275286), make the tailq
p_kqtim_stop accessed by filt_timerdetach() type-stable.
Noted and reviewed by: markj
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D42777
Revert "kqueue: on process exit, force-clear its registered signal events"
This reverts commit 393ac29f0b8be068c8e46f76c2eeee07d20ea4df. A
different fix is following, which preserves semantic, required by the
sys.kqueue.proc3_test.proc3 test.
Reviewed by: markj
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
PR: 275286
Differential revision: https://reviews.freebsd.org/D42777
Matthew Ahrens [Tue, 28 Nov 2023 17:20:48 +0000 (09:20 -0800)]
unnecessary alloc/free in dsl_scan_visitbp()
Clean up code in dsl_scan_visitbp() by removing an unnecessary
alloc/free and `goto`. This has the side benefit of reducing CPU usage,
which is only really noticeable if we are not doing i/o for the leaf
blocks, like when `zfs_no_scrub_io` is set.
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Mark Maybee <mark.maybee@delphix.com>
Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes #15549
Brooks Davis [Mon, 27 Nov 2023 17:07:06 +0000 (17:07 +0000)]
memfd_create: don't allocate heap memory
Rather than calling calloc() to allocate space for a page size array to
pass to getpagesizes(), just follow the getpagesizes() implementation
and allocate MAXPAGESIZES elements on the stack. This avoids the need
for the allocation.
While this does mean that a new libc is required to take advantage of a
new huge page size, that was already true due to getpagesizes() using a
static buffer of MAXPAGESIZES elements.
Brooks Davis [Mon, 27 Nov 2023 17:06:33 +0000 (17:06 +0000)]
memfd_create: move implementation to libc/gen
Due to memfd_create(3)'s construction of a path to pass to shm_open2(2),
it has a much larger than typical dependency footprint for a system
call wrapper (the list currently includes calloc, memset, sprintf, and
strlen). As such, split it off into its own file under libc/gen to
lighten libc/sys's dependency list.
Brooks Davis [Mon, 27 Nov 2023 17:06:25 +0000 (17:06 +0000)]
getpagesize(3): drop support for non-ELF kernels
AT_PAGESZ was introduced with ELF support in 1996 (commit e1743d02cd14069f69a50bb8a6c626c1c6f47ddd) so we can safely count on
being able to use it to get our page size via elf_aux_info(). As such
we don't need a fallback sysctl query.
Save a few bytes of bss by dropping caching as elf_aux_info() runs
in constant time for a given query.
Brooks Davis [Mon, 27 Nov 2023 17:06:01 +0000 (17:06 +0000)]
getpagesizes(3): drop support for kernels before 9.0
AT_PAGESIZES and elf_aux_info were added prior to FreeBSD 9.0 in commit ee235befcb8253fab9beea27b916f1bc46b33147. It's safe to say that a
FreeBSD 15 libc won't work on a 8.x kernel so drop sysctl fallback.
Rob N [Tue, 28 Nov 2023 17:07:57 +0000 (04:07 +1100)]
dnode_is_dirty: check dnode and its data for dirtiness
Over its history the dirty dnode test has been changed between
checking for a dnode being on `os_dirty_dnodes` (`dn_dirty_link`) and
`dn_dirty_record`.
de198f2d9 Fix lseek(SEEK_DATA/SEEK_HOLE) mmap consistency
2531ce372 Revert "Report holes when there are only metadata changes"
ec4f9b8f3 Report holes when there are only metadata changes
454365bba Fix dirty check in dmu_offset_next()
66aca2473 SEEK_HOLE should not block on txg_wait_synced()
In the case of appending data to a newly created file, the dnode proper
is dirtied (at least to change the blocksize) and dirty records are
added. Thus, a single logical operation is represented by separate
dirty indicators, and must not be separated.
The incorrect dirty check becomes a problem when the first block of a
file is being appended to while another process is calling lseek to skip
holes. There is a small window where the dnode part is undirtied while
there are still dirty records. In this case, `lseek(fd, 0, SEEK_DATA)`
would not know that the file is dirty, and would go to
`dnode_next_offset()`. Since the object has no data blocks yet, it
returns `ESRCH`, indicating no data found, which results in `ENXIO`
being returned to `lseek()`'s caller.
Since coreutils 9.2, `cp` performs sparse copies by default, that is, it
uses `SEEK_DATA` and `SEEK_HOLE` against the source file and attempts to
replicate the holes in the target. When it hits the bug, its initial
search for data fails, and it goes on to call `fallocate()` to create a
hole over the entire destination file.
This has come up more recently as users upgrade their systems, getting
OpenZFS 2.2 as well as a newer coreutils. However, this problem has been
reproduced against 2.1, as well as on FreeBSD 13 and 14.
This change simply updates the dirty check to check both types of dirty.
If there's anything dirty at all, we immediately go to the "wait for
sync" stage. It doesn't really matter after that; both changes are on
disk, so the dirty fields should be correct.
Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Rich Ercolani <rincebrain@gmail.com>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Closes #15571
Closes #15526
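The combined dirty check described in this commit can be sketched with a toy structure. This is not the OpenZFS dnode: struct toy_dnode and its two fields are hypothetical stand-ins for "on the dirty dnode list" and "has dirty records".

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct toy_dnode {
	bool on_dirty_list;	/* stands in for the dn_dirty_link test */
	int ndirty_records;	/* stands in for the dirty-record test */
};

/* Dirty if either indicator says so; checking only one leaves a window. */
static bool
toy_dnode_is_dirty(const struct toy_dnode *dn)
{
	assert(dn != NULL);
	return (dn->on_dirty_list || dn->ndirty_records > 0);
}
```

The interesting case is the window the commit message describes: the dnode itself is already undirtied while dirty records remain, which a check of the first field alone would report as clean.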
The vm_guest variable readily covers all uses of xen_domain_type, so
merge them together. Since support for PV domains has been removed,
hard-code xen_pv_domain() to return false.