Jorgen Lundman [Mon, 6 Feb 2023 17:34:59 +0000 (02:34 +0900)]
Restore FreeBSD to use .rodata
In https://github.com/openzfs/zfs/pull/14228 the FreeBSD
SECTION_STATIC was set to ".data" instead of ".rodata". This
commit just restores it back to .rodata.
Reviewed-by: Attila Fülöp <attila@fueloep.org> Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Jorgen Lundman <lundman@lundman.net>
Closes #14460
Jorgen Lundman [Mon, 6 Feb 2023 17:27:55 +0000 (02:27 +0900)]
Unify assembly files with macOS
The remaining changes needed to make the assembly files work
with macOS.
Reviewed-by: Attila Fülöp <attila@fueloep.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Jorgen Lundman <lundman@lundman.net>
Closes #14451
Paul Dagnelie [Thu, 2 Feb 2023 23:19:26 +0000 (15:19 -0800)]
Prevent error messages when running tests with no timeout
Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Paul Dagnelie <pcd@delphix.com>
Closes #14450
George Amanakis [Thu, 2 Feb 2023 23:17:37 +0000 (00:17 +0100)]
Teach zdb about DMU_OT_ERROR_LOG objects
With the persistent error log feature we need to account for
spa_errlog_{scrub, last} containing mappings to other error log objects,
which need to be marked as in-use as well.
Reviewed-by: Matthew Ahrens <mahrens@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: George Amanakis <gamanakis@gmail.com>
Closes #14442
Closes #14434
rob-wing [Thu, 2 Feb 2023 23:16:40 +0000 (14:16 -0900)]
zfs_main.c: fix unused variable error with GCC
zfs_setproctitle_init() is stubbed out on FreeBSD.
Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Reviewed-by: Ameer Hamza <ahamza@ixsystems.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Signed-off-by: Rob Wing <rob.fx907@gmail.com>
Closes #14441
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Signed-off-by: Allan Jude <allan@klarasystems.com>
Closes #14439
George Wilson [Thu, 2 Feb 2023 23:11:35 +0000 (18:11 -0500)]
rootdelay on zfs should be adaptive
The 'rootdelay' boot option currently pauses the boot for a specified
amount of time. The original intent was to ensure that slower
configurations would have ample time to enumerate the devices to make
importing the root pool successful. This, however, causes unnecessary
boot delay for environments like Azure which set this parameter by
default.
This commit changes the initramfs logic to pause until it can
successfully load the 'zfs' module. The timeout specified by
'rootdelay' now becomes the maximum amount of time that initramfs will
wait before failing the boot.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Prakash Surya <prakash.surya@delphix.com> Signed-off-by: George Wilson <gwilson@delphix.com>
Closes #14430
Ameer Hamza [Thu, 2 Feb 2023 23:09:57 +0000 (04:09 +0500)]
Fix console progress reporting for recursive send
After commit 19d3961, progress reporting (-v) with replication flag
enabled does not report the progress on the console. This commit
fixes the issue by updating the logic to check for pa->progress
instead of pa_verbosity in send_progress_thread().
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Signed-off-by: Ameer Hamza <ahamza@ixsystems.com>
Closes #14448
Brian Behlendorf [Tue, 24 Jan 2023 23:23:32 +0000 (15:23 -0800)]
Increase default zfs_rebuild_vdev_limit to 64MB
When testing distributed rebuild performance with more capable
hardware it was observed than increasing the zfs_rebuild_vdev_limit
to 64M reduced the rebuild time by 17%. Beyond 64MB there was
some improvement (~2%) but it was not significant when weighed
against the increased memory usage. Memory usage is capped at 1/4
of arc_c_max.
Additionally, vr_bytes_inflight_max has been moved so it's updated
per-metaslab to allow the size to be adjust while a rebuild is
running.
Reviewed-by: Akash B <akash-b@hpe.com> Reviewed-by: Tony Nguyen <tony.nguyen@delphix.com> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #14428
Brian Behlendorf [Tue, 24 Jan 2023 22:05:45 +0000 (14:05 -0800)]
Increase default zfs_scan_vdev_limit to 16MB
For HDD based pools the default zfs_scan_vdev_limit of 4M
per-vdev can significantly limit the maximum scrub performance.
Increasing the default to 16M can double the scrub speed from
80 MB/s per disk to 160 MB/s per disk.
This does increase the memory footprint during scrub/resilver
but given the performance win this is a reasonable trade off.
Memory usage is capped at 1/4 of arc_c_max. Note that number
of outstanding I/Os has not changed and is still limited by
zfs_vdev_scrub_max_active.
Reviewed-by: Akash B <akash-b@hpe.com> Reviewed-by: Tony Nguyen <tony.nguyen@delphix.com> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #14428
Alexander Motin [Wed, 25 Jan 2023 19:30:24 +0000 (14:30 -0500)]
Prefetch on deadlists merge
During snapshot deletion ZFS may issue several reads for each deadlist
to merge them into next snapshot's or pool's bpobj. Number of the dead
lists increases with number of snapshots. On HDD pools it may take
significant time during which sync thread is blocked.
This patch introduces prescient prefetch of required blocks for up to
128 deadlists ahead. Tests show reduction of time required to delete
dataset with 720 snapshots with randomly overwritten file on wide HDD
pool from 75-85 to 22-28 seconds.
Reviewed-by: Allan Jude <allan@klarasystems.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Issue #14276
Closes #14402
Brian Behlendorf [Wed, 25 Jan 2023 19:28:54 +0000 (11:28 -0800)]
Improve resilver ETAs
When resilvering the estimated time remaining is calculated using
the average issue rate over the current pass. Where the current
pass starts when a scan was started, or restarted, if the pool
was exported/imported.
For dRAID pools in particular this can result in wildly optimistic
estimates since the issue rate will be very high while scanning
when non-degraded regions of the pool are scanned. Once repair
I/O starts being issued performance drops to a realistic number
but the estimated performance is still significantly skewed.
To address this we redefine a pass such that it starts after a
scanning phase completes so the issue rate is more reflective of
recent performance. Additionally, the zfs_scan_report_txgs
module option can be set to reset the pass statistics more often.
Reviewed-by: Akash B <akash-b@hpe.com> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #14410
Coleman Kane [Tue, 24 Jan 2023 19:20:50 +0000 (14:20 -0500)]
linux 6.2 compat: zpl_set_acl arg2 is now struct dentry
Linux 6.2 changes the second argument of the set_acl operation to be a
"struct dentry *" rather than a "struct inode *". The inode* parameter
is still available as dentry->d_inode, so adjust the call to the _impl
function call to dereference and pass that pointer to it.
Also document that the get_acl -> get_inode_acl member name change from
commit 884a693 was an API change also introduced in Linux 6.2.
Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Signed-off-by: Coleman Kane <ckane@colemankane.org>
Closes #14415
Alexander Motin [Tue, 24 Jan 2023 17:20:32 +0000 (12:20 -0500)]
Introduce minimal ZIL block commit delay
Despite all optimizations, tests on actual hardware show that FreeBSD
kernel can't sleep for less then ~2us. Similar tests on Linux show
~50us delay at least from nanosleep() (haven't tested inside kernel).
It means that on very fast log device ZIL may not be able to satisfy
zfs_commit_timeout_pct block commit timeout, increasing log latency
more than desired.
Handle that by introduction of zil_min_commit_timeout parameter,
specifying minimal timeout value where additional delays to aggregate
writes may be skipped. Also skip delays if the LWB is more than 7/8
full, that often happens if I/O sizes are constant and match one of
LWB sizes. Both things are applied only if there were no already
outstanding log blocks, that may indicate single-threaded workload,
that by definition can not benefit from the commit delays.
While there, add short time moving average to zl_last_lwb_latency to
make it more stable.
Tests of single-threaded 4KB writes to NVDIMM SLOG on FreeBSD show IOPS
increase by 9% instead of expected 5%. For zfs_commit_timeout_pct of
1 there IOPS increase by 5.5% instead of expected 1%.
Reviewed-by: Allan Jude <allan@klarasystems.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Prakash Surya <prakash.surya@delphix.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #14418
Attila Fülöp [Mon, 23 Jan 2023 19:25:21 +0000 (20:25 +0100)]
x86 asm: Replace .align with .balign
The .align directive used to align storage locations is
ambiguous. On some platforms and assemblers it takes a byte count,
on others the argument is interpreted as a shift value. The current
usage expects the first interpretation.
Replace it with the unambiguous .balign directive which always
expects a byte count, regardless of platform and assembler.
Reviewed-by: Jorgen Lundman <lundman@lundman.net> Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de> Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Signed-off-by: Attila Fülöp <attila@fueloep.org>
Closes #14422
Attila Fülöp [Mon, 23 Jan 2023 18:48:39 +0000 (19:48 +0100)]
IPC: blake3 x86 asm: fix placement of .size directives
The .size directive used by the SET_SIZE C macro uses the special
dot symbol to calculate the size of a function. The dot symbol
refers to the current address, so for the calculation to be
meaningful the SET_SIZE macro must be placed immediately after the
end of the function the size is calculated for.
Reviewed-by: Jorgen Lundman <lundman@lundman.net> Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de> Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Signed-off-by: Attila Fülöp <attila@fueloep.org>
Closes #14422
David Hedberg [Mon, 23 Jan 2023 21:19:43 +0000 (22:19 +0100)]
Wait for txg sync if the last DRR_FREEOBJECTS might result in a hole
If we receive a DRR_FREEOBJECTS as the first entry in an object range,
this might end up producing a hole if the freed objects were the
only existing objects in the block.
If the txg starts syncing before we've processed any following
DRR_OBJECT records, this leads to a possible race where the backing
arc_buf_t gets its psize set to 0 in the arc_write_ready() callback
while still being referenced from a dirty record in the open txg.
To prevent this, we insert a txg_wait_synced call if the first
record in the range was a DRR_FREEOBJECTS that actually
resulted in one or more freed objects.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: David Hedberg <david.hedberg@findity.com>
Sponsored by: Findity AB
Closes #11893
Closes #14358
Richard Yao [Mon, 23 Jan 2023 21:16:22 +0000 (16:16 -0500)]
Reject streams that set ->drr_payloadlen to unreasonably large values
In the zstream code, Coverity reported:
"The argument could be controlled by an attacker, who could invoke the
function with arbitrary values (for example, a very high or negative
buffer size)."
It did not report this in the kernel. This is likely because the
userspace code stored this in an int before passing it into the
allocator, while the kernel code stored it in a uint32_t.
However, this did reveal a potentially real problem. On 32-bit systems
and systems with only 4GB of physical memory or less in general, it is
possible to pass a large enough value that the system will hang. Even
worse, on Linux systems, the kernel memory allocator is not able to
support allocations up to the maximum 4GB allocation size that this
allows.
This had already been limited in userspace to 64MB by
`ZFS_SENDRECV_MAX_NVLIST`, but we need a hard limit in the kernel to
protect systems. After some discussion, we settle on 256MB as a hard
upper limit. Attempting to receive a stream that requires more memory
than that will result in E2BIG being returned to user space.
Reported-by: Coverity (CID-1529836) Reported-by: Coverity (CID-1529837) Reported-by: Coverity (CID-1529838) Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Closes #14285
rob-wing [Mon, 23 Jan 2023 21:14:25 +0000 (12:14 -0900)]
Configure zed's diagnosis engine with vdev properties
Introduce four new vdev properties:
checksum_n
checksum_t
io_n
io_t
These properties can be used for configuring the thresholds of zed's
diagnosis engine and are interpeted as <N> events in T <seconds>.
When this property is set to a non-default value on a top-level vdev,
those thresholds will also apply to its leaf vdevs. This behavior can be
overridden by explicitly setting the property on the leaf vdev.
Note that, these properties do not persist across vdev replacement. For
this reason, it is advisable to set the property on the top-level vdev
instead of the leaf vdev.
The default values for zed's diagnosis engine (10 events, 600 seconds)
remains unchanged.
Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Reviewed-by: Allan Jude <allan@klarasystems.com> Signed-off-by: Rob Wing <rob.wing@klarasystems.com> Sponsored-by: Seagate Technology LLC
Closes #13805
Richard Yao [Mon, 23 Jan 2023 21:12:37 +0000 (16:12 -0500)]
free_blocks(): Fix reports from 2016 PVS Studio FreeBSD report
In 2016, the authors of PVS Studio ran it on the FreeBSD kernel, which
identified a number of bugs / cleanup opportunities in the FreeBSD ZFS kernel
code. A few of them persist to the present day:
In particular, we have the following in free_blocks():
\sys\cddl\contrib\opensolaris\uts\common\fs\zfs\dnode_sync.c (174): error V547: Expression '__left >= __right' is always true. Unsigned type value is always >= 0.
\sys\cddl\contrib\opensolaris\uts\common\fs\zfs\dnode_sync.c (171): error V634: The priority of the '*' operation is higher than that of the '<<' operation. It's possible that parentheses should be used in the expression.
\sys\cddl\contrib\opensolaris\uts\common\fs\zfs\dnode_sync.c (175): error V547: Expression '__left >= __right' is always true. Unsigned type value is always >= 0.
A couple of assertions accidentally typecast the arguments they check to
unsigned in such a way that the result is always true. Also, parentheses
are missing around `1<<epbs` in `(db->db_blkid * 1<<epbs)`. This works
out to be okay due to multiplication not caring what order of operations
we use, but it is better to fix it to be `(db->db_blkid << epbs)`.
A few of the function local variables probably never should have been
32-bit in the first place, so we make them 64-bit. We also replace the
existing assertions with additional assertions to ensure that 64-bit
unsigned arithmetic is safe.
Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Closes #14407
Chunwei Chen [Fri, 20 Jan 2023 19:49:56 +0000 (11:49 -0800)]
Fix reading uninitialized variable in receive_read
When zfs_file_read returns error, resid may be uninitialized.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Signed-off-by: Chunwei Chen <david.chen@nutanix.com>
Closes #14404
Allan Jude [Fri, 20 Jan 2023 19:11:54 +0000 (14:11 -0500)]
zfs_receive_one: Check for the more likely error first
If zfs_receive_one() gets back EINVAL, check for the more likely case,
embedded block pointers + encryption and return that error, before
falling back to the less likely case, a resumable stream when the
kernel has not been upgraded to support resume.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Allan Jude <allan@klarasystems.com> Sponsored-by: rsync.net Sponsored-by: Klara Inc.
Closes #14379
Richard Yao [Fri, 20 Jan 2023 19:10:15 +0000 (14:10 -0500)]
Cleanup ->dd_space_towrite should be unsigned
This is only ever used with unsigned data, so the type itself should be
unsigned. Also, PVS Studio's 2016 FreeBSD kernel report correctly
identified the following assertion as always being true, so we can drop
it:
The reason it was always true is because it would do casts to give us
unsigned comparisons. This could have been fixed by switching to
`ASSERT3S()`, but upon inspection, it turned out that this variable
never should have been allowed to be signed in the first place.
Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Closes #14408
Mark Johnston [Mon, 16 Jan 2023 12:53:16 +0000 (07:53 -0500)]
Micro-optimize dsl_prop_get_dd()
Use the saved property index instead of looking it up once per DSL
directory when traversing up towards the root.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Igor Kozhukhov <igor@dilos.org> Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Reviewed-by: Allan Jude <allan@klarasystems.com> Reviewed-by: Akash B <akash-b@hpe.com> Signed-off-by: Mark Johnston <markj@FreeBSD.org> Sponsored-by: The FreeBSD Foundation
Closes #14397
Mark Johnston [Sun, 15 Jan 2023 22:05:19 +0000 (17:05 -0500)]
Avoid passing an uninitialized index to dsl_prop_known_index
Reported-by: KMSAN Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Igor Kozhukhov <igor@dilos.org> Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Reviewed-by: Allan Jude <allan@klarasystems.com> Reviewed-by: Akash B <akash-b@hpe.com> Signed-off-by: Mark Johnston <markj@FreeBSD.org> Sponsored-by: The FreeBSD Foundation
Closes #14397
Chunwei Chen [Fri, 20 Jan 2023 00:59:05 +0000 (16:59 -0800)]
Fix unprotected zfs_znode_dmu_fini
In original code, zfs_znode_dmu_fini is called in zfs_rmnode without
zfs_znode_hold_enter. It seems to assume it's ok to do so when the znode
is unlinked. However this assumption is not correct, as zfs_zget can be
called by NFS through zpl_fh_to_dentry as pointed out by Christian in
https://github.com/openzfs/zfs/pull/12767, which could result in a
use-after-free bug.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Co-authored-by: Ryan Moeller <ryan@iXsystems.com> Signed-off-by: Chunwei Chen <david.chen@nutanix.com> Signed-off-by: Ryan Moeller <ryan@iXsystems.com>
Closes #12767
Closes #14364
Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Signed-off-by: George Melikov <mail@gmelikov.ru>
Closes #14400
Jorgen Lundman [Tue, 17 Jan 2023 19:09:19 +0000 (04:09 +0900)]
Unify Assembler files between Linux and Windows
Add new macro ASMABI used by Windows to change
calling API to "sysv_abi".
Reviewed-by: Attila Fülöp <attila@fueloep.org> Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Jorgen Lundman <lundman@lundman.net>
Closes #14228
Ameer Hamza [Tue, 17 Jan 2023 18:17:35 +0000 (23:17 +0500)]
Use setproctitle to report progress of zfs send
This allows parsing of zfs send progress by checking the process
title.
Doing so requires some changes to the send code in libzfs_sendrecv.c;
primarily these changes move some of the accounting around, to allow
for the code to be verbose as normal, or set the process title. Unlike
BSD, setproctitle() isn't standard in Linux; thus, borrowed it from
libbsd with slight modifications.
Authored-by: Sean Eric Fagan <sef@FreeBSD.org> Co-authored-by: Ryan Moeller <ryan@iXsystems.com> Co-authored-by: Ameer Hamza <ahamza@ixsystems.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Signed-off-by: Ameer Hamza <ahamza@ixsystems.com>
Closes #14376
Richard Yao [Tue, 17 Jan 2023 17:57:12 +0000 (12:57 -0500)]
Cleanup of dead code suggested by Clang Static Analyzer (#14380)
I recently gained the ability to run Clang's static analyzer on the
linux kernel modules via a few hacks. This extended coverage to code
that was previously missed since Clang's static analyzer only looked at
code that we built in userspace. Running it against the Linux kernel
modules built from my local branch produced a total of 72 reports
against my local branch. Of those, 50 were reports of logic errors and
22 were reports of dead code. Since we already had cleaned up all of
the previous dead code reports, I felt it would be a good next step to
clean up these dead code reports. Clang did a further breakdown of the
dead code reports into:
Dead assignment 15
Dead increment 2
Dead nested assignment 5
The benefit of cleaning these up, especially in the case of dead nested
assignment, is that they can expose places where our error handling is
incorrect. A number of them were fairly straight forward. However
several were not:
In vdev_disk_physio_completion(), not only were we not using the return
value from the static function vdev_disk_dio_put(), but nothing used it,
so I changed it to return void and removed the existing (void) cast in
the other area where we call it in addition to no longer storing it to a
stack value.
In FSE_createDTable(), the function is dead code. Its helper function
FSE_freeDTable() is also dead code, as are the CPP definitions in
`module/zstd/include/zstd_compat_wrapper.h`. We just delete it all.
In zfs_zevent_wait(), we have an optimization opportunity. cv_wait_sig()
returns 0 if there are waiting signals and 1 if there are none. The
Linux SPL version literally returns `signal_pending(current) ? 0 : 1)`
and FreeBSD implements the same semantics, we can just do
`!cv_wait_sig()` in place of `signal_pending(current)` to avoid
unnecessarily calling it again.
zfs_setattr() on FreeBSD version did not have error handling issue
because the code was removed entirely from FreeBSD version. The error is
from updating the attribute directory's files. After some thought, I
decided to propapage errors on it to userspace.
In zfs_secpolicy_tmp_snapshot(), we ignore a lack of permission from the
first check in favor of checking three other permissions. I assume this
is intentional.
In zfs_create_fs(), the return value of zap_update() was not checked
despite setting an important version number. I see no backward
compatibility reason to permit failures, so we add an assertion to catch
failures. Interestingly, Linux is still using ASSERT(error == 0) from
OpenSolaris while FreeBSD has switched to the improved ASSERT0(error)
from illumos, although illumos has yet to adopt it here. ASSERT(error ==
0) was used on Linux while ASSERT0(error) was used on FreeBSD since the
entire file needs conversion and that should be the subject of
another patch.
dnode_move()'s issue was caused by us not having implemented
POINTER_IS_VALID() on Linux. We have a stub in
`include/os/linux/spl/sys/kmem_cache.h` for it, when it really should be
in `include/os/linux/spl/sys/kmem.h` to be consistent with
Illumos/OpenSolaris. FreeBSD put both `POINTER_IS_VALID()` and
`POINTER_INVALIDATE()` in `include/os/freebsd/spl/sys/kmem.h`, so we
copy what it did.
Whenever a report was in platform-specific code, I checked the FreeBSD
version to see if it also applied to FreeBSD, but it was only relevant a
few times.
Lastly, the patch that enabled Clang's static analyzer to be run on the
Linux kernel modules needs more work before it can be put into a PR. I
plan to do that in the future as part of the on-going static analysis
work that I am doing.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Closes #14380
Brian Behlendorf [Tue, 17 Jan 2023 17:53:53 +0000 (09:53 -0800)]
ZTS: Annotate additonal flaky test cases
Update several flaky test cases in zts-report.py.in until they
can be made entirely reliable.
Reviewed-by: George Melikov <mail@gmelikov.ru> Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #14392
Brian Behlendorf [Tue, 17 Jan 2023 17:52:27 +0000 (09:52 -0800)]
CI: Reclaim space after package operations
Rather than reclaiming space before updating the packages do
it afterwards. This avoids issues with apt returning an
error due to missing files on the system.
Rob Wing [Wed, 21 Dec 2022 06:52:26 +0000 (21:52 -0900)]
zpool-set: print error message when pool or vdev is not valid
Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Allan Jude <allan@klarasystems.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Rob Wing <rob.wing@klarasystems.com> Sponsored-by: Seagate Technology Submitted-by: Klara, Inc.
Closes #14310
Rob Wing [Wed, 21 Dec 2022 06:33:11 +0000 (21:33 -0900)]
zpool-set: update usage text
Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Allan Jude <allan@klarasystems.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Rob Wing <rob.wing@klarasystems.com> Sponsored-by: Seagate Technology Submitted-by: Klara, Inc.
Closes #14310
Richard Yao [Fri, 13 Jan 2023 18:58:58 +0000 (13:58 -0500)]
Linux ppc64le ieee128 compat: Do not redefine __asm on external headers
There is an external assembly declaration extension in GNU C that glibc
uses when building with ieee128 floating point support on ppc64le.
Marking that as volatile makes no sense, so the build breaks.
It does not make sense to only mark this as volatile on Linux, since if
do not want the compiler reordering things on Linux, we do not want the
compiler reordering things on any other platform, so we stop treating
Linux specially and just manually inline the CPP macro so that we can
eliminate it. This should fix the build on ppc64le.
Tested-by: @gyakovlev Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Closes #14308
Closes #14384
Richard Yao [Tue, 10 Jan 2023 21:41:26 +0000 (16:41 -0500)]
Cleanup: Use MIN() macro
The Linux 5.16.14 kernel's coccicheck caught this. The semantic
patch that caught it was:
./scripts/coccinelle/misc/minmax.cocci
There was a third opportunity to use `MIN()`, but that was in
`FSE_minTableLog()` in `module/zstd/lib/compress/fse_compress.c`.
Upstream zstd has yet to make this change and I did not want to change
header includes just for MIN, or do a one off, so I left it alone.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Closes #14372
When !IS_DEVVP(ZTOV(zp)) is false, IS_DEVVP(ZTOV(zp)) is true under the
law of the excluded middle since we are not doing pseudoboolean alegbra.
Therefore doing:
Richard Yao [Tue, 10 Jan 2023 20:44:35 +0000 (15:44 -0500)]
Cleanup: Replace oldstyle struct hack with C99 flexible array members
The Linux 5.16.14 kernel's coccicheck caught this. The semantic
patch that caught it was:
./scripts/coccinelle/misc/flexible_array.cocci
However, unlike the cases where the GNU zero length array extension had
been used, coccicheck would not suggest patches for the older style
single member arrays. That was good because blindly changing them would
break size calculations in most cases.
Therefore, this required care to make sure that we did not break size
calculations. In the case of `indirect_split_t`, we use
`offsetof(indirect_split_t, is_child[is->is_children])` to calculate
size. This might be subtly wrong according to an old mailing list
thread:
That is because the C99 specification should consider the flexible array
members to start at the end of a structure, but compilers prefer to put
padding at the end. A suggestion was made to allow compilers to allocate
padding after the VLA like compilers already did:
However, upon thinking about it, whether or not we allocate end of
structure padding does not matter, so using offsetof() to calculate the
size of the structure is fine, so long as we do not mix it with sizeof()
on structures with no array members.
In the case that we mix them and padding causes offsetof(struct_t,
vla_member[0]) to differ from sizeof(struct_t), we would be doing unsafe
operations if we underallocate via `offsetof()` and then overcopy via
sizeof().
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Closes #14372
d_alias may need to be converted to du.d_alias
depending on the kernel version.
d_alias is currently in only one place in the code which
changes
"hlist_for_each_entry(dentry, &inode->i_dentry, d_alias)"
to
"hlist_for_each_entry(dentry, &inode->i_dentry, d_u.d_alias)"
as neccesary.
This effectively results in a double macro expansion
for code that uses the zfs headers but already has its
own macro for just d_alias (lustre in this case).
Remove the conditional code for hlist_for_each_entry
and have a macro for "d_alias -> du.d_alias" instead.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Gian-Carlo DeFazio <defazio1@llnl.gov>
Closes #14377
George Amanakis [Thu, 12 Jan 2023 02:00:39 +0000 (03:00 +0100)]
Activate filesystem features only in syncing context
When activating filesystem features after receiving a snapshot, do
so only in syncing context.
Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: George Amanakis <gamanakis@gmail.com>
Closes #14304
Closes #14252
Brian Behlendorf [Wed, 11 Jan 2023 23:18:51 +0000 (15:18 -0800)]
CI: remove unused packages/snaps
Removing portions of packages/snaps directly with rm can result in
unexpected errors when running `apt update`. Free up the additional
space by removing (some) packages with the proper tools.
This change frees up slightly less space than before, but it is
expected to still be sufficient.
Reviewed-by: George Melikov <mail@gmelikov.ru> Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #14374
rob-wing [Wed, 11 Jan 2023 23:14:35 +0000 (14:14 -0900)]
zpool: do guid-based comparison in is_vdev_cb()
is_vdev_cb() uses string comparison to find a matching vdev and
will fallback to comparing the guid via a string. These changes
drop the string comparison and compare the guids instead.
Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Reviewed-by: Allan Jude <allan@klarasystems.com> Signed-off-by: Rob Wing <rob.wing@klarasystems.com> Co-authored-by: Rob Wing <rob.wing@klarasystems.com> Sponsored-by: Seagate Technology Submitted-by: Klara, Inc.
Closes #14311
Turn default_bs and default_ibs into ZFS_MODULE_PARAMs
The default_bs and default_ibs tunables control the default block size
and indirect block size.
So far, default_bs and default_ibs were tunable only on FreeBSD, e.g.,
sysctl vfs.zfs.default_ibs
Remove the FreeBSD-specific sysctl code and expose default_bs and
default_ibs as tunables on both Linux and FreeBSD using
ZFS_MODULE_PARAM.
One of the use cases for changing the values of those tunables is to
lower the indirect block size, which may improve performance of large
directories (as discussed during the OpenZFS Leadership Meeting
on 2022-08-16).
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Signed-off-by: Mateusz Piotrowski <mateusz.piotrowski@klarasystems.com> Sponsored-by: Wasabi Technology, Inc.
Closes #14293
Add tunable to allow changing micro ZAP's max size
This change turns `MZAP_MAX_BLKSZ` into a `ZFS_MODULE_PARAM()` called
`zap_micro_max_size`. As a result, we can experiment with different
micro ZAP sizes to improve directory size scaling.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Co-authored-by: Mateusz Piotrowski <mateuszpiotrowski@klarasystems.com> Co-authored-by: Toomas Soome <toomas.soome@klarasystems.com> Signed-off-by: Mateusz Piotrowski <mateuszpiotrowski@klarasystems.com> Sponsored-by: Wasabi Technology, Inc.
Closes #14292
Alyssa Ross [Tue, 10 Jan 2023 21:40:31 +0000 (21:40 +0000)]
etc/systemd/zfs-mount-generator: avoid strndupa
The non-standard strndupa function is not implemented by musl libc,
and can be dangerous due to its potential to blow the stack. (musl
_does_ implement strdupa, used elsewhere in this function.)
With a similar amount of code, we can use a heap allocation to
construct the pool name, which is musl-friendly and doesn't have
potential stack problems.
(Why care about musl when systemd only supports glibc? Some distros
patch systemd with portability fixes, and it would be nice to be able
to use ZFS on those distros.)
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alyssa Ross <alyssa.ross@unikie.com>
Closes #14327
Matthew Ahrens [Tue, 10 Jan 2023 21:39:22 +0000 (13:39 -0800)]
Batch enqueue/dequeue for bqueue
The Blocking Queue (bqueue) code is used by zfs send/receive to send
messages between the various threads. It uses a shared linked list,
which is locked whenever we enqueue or dequeue. For workloads which
process many blocks per second, the locking on the shared list can be
quite expensive.
This commit changes the bqueue logic to have 3 linked lists:
1. An enquing list, which is used only by the (single) enquing thread,
and thus needs no locks.
2. A shared list, with an associated lock.
3. A dequing list, which is used only by the (single) dequing thread,
and thus needs no locks.
The entire enquing list can be moved to the shared list in constant
time, and the entire shared list can be moved to the dequing list in
constant time. These operations only happen when the `fill_fraction` is
reached, or on an explicit flush request. Therefore, the lock only
needs to be acquired infrequently.
The API already allows for dequing to block until an explicit flush, so
callers don't need to be changed.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: George Wilson <george.wilson@delphix.com> Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes #14121
ECHRNG is returned when the channel program encounters a runtime
error. For example, this can happen when a snapshot doesn't exist.
We handle this error the same way as the existing EEXIST and ENOENT
error checks.
Additionally, improve the internal debug message to include the
error describing why a pool couldn't be opened.
Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #14351
ztest: update expectation for sparing a special device
Commit c23738c70eb86a7f04f93292caef2ed977047608 modified the expected
behavior of attach to prevent hot spares from being used as special
vdev replacements. We update ztest's expectations accordingly to
prevent it from failing when testing the updated behavior.
Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #14351
Matthew Ahrens [Tue, 10 Jan 2023 00:43:45 +0000 (16:43 -0800)]
ztest fails assertion in zio_write_gang_member_ready()
Encrypted blocks can have up to 2 DVA's, as the third DVA is reserved
for the salt+IV. However, dmu_write_policy() allows non-encrypted
blocks (e.g. DMU_OT_OBJSET) inside encrypted datasets to request and
allocate 3 DVA's, since they don't need a salt+IV (they are merely
authenicated).
However, if such a block becomes a gang block, the gang code incorrectly
limits the gang block header to 2 DVA's. This leads to a "NDVAs
inversion", where a parent block (the gang block header) has less DVA's
than its children (the gang members), causing an assertion failure in
zio_write_gang_member_ready().
This commit addresses the problem by only restricting the gang block
header to 2 DVA's if the block is actually encrypted (and thus its gang
block members can have at most 2 DVA's).
Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes #14250
Closes #14356
Charles Suh [Mon, 9 Jan 2023 20:49:35 +0000 (12:49 -0800)]
libzpool: fix ddi_strtoull to update nptr
Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Charles Suh <charles.suh@gmail.com>
Closes #14360
Ameer Hamza [Mon, 9 Jan 2023 20:43:03 +0000 (01:43 +0500)]
zed: add hotplug support for spare vdevs
This commit supports for spare vdev hotplug. The
spare vdev associated with all the pools will be
marked as "Removed" when the drive is physically
detached and will become "Available" when the
drive is reattached. Currently, the spare vdev
status does not change on the drive removal and
the same is the case with reattachment.
Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ameer Hamza <ahamza@ixsystems.com>
Closes #14295
Alexander Motin [Mon, 9 Jan 2023 18:45:17 +0000 (13:45 -0500)]
Remove some dead ARC code. (#14340)
Every ARC buffer holds a reference on the header. It means headers with
buffers are never evictable. When we are evicting a header, there can
be no more buffers to free. Just assert that.
b_evict_lock seems not protecting anything now. Remove it.
Buffers checksum should also be freed with the last uncompressed buffer,
so it should not be there also when we are evicting the header.
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Coleman Kane [Tue, 27 Dec 2022 03:59:13 +0000 (22:59 -0500)]
linux 6.2 compat: bio->bi_rw was renamed bio->bi_opf
The bi_rw member of struct bio was renamed to bi_opf in Linux 6.2.
As well, Linux's implementation of bio_set_op_attrs(...) has been
removed.
The HAVE_BIO_BI_OPF macro already appears to be defined, but the
removal of the bio_set_op_attrs(...) implementation makes the build
fall back on the locally-defined implementation, which isn't updated
for the bio->bi_opf change. This commit adds that update.
Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Signed-off-by: Coleman Kane <ckane@colemankane.org>
Closes #14324
Closes #14331
Coleman Kane [Tue, 27 Dec 2022 03:04:34 +0000 (22:04 -0500)]
linux 6.2 compat: get_acl() got moved to get_inode_acl() in 6.2
Linux 6.2 renamed the get_acl() operation to get_inode_acl() in
the inode_operations struct. This should fix Issue #14323.
Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Signed-off-by: Coleman Kane <ckane@colemankane.org>
Closes #14323
Closes #14331
Antonio Russo [Thu, 5 Jan 2023 04:41:16 +0000 (21:41 -0700)]
Introduce ZFS_LINUX_REQUIRE_API autoconf macro
Currently, if API tests fail, we either ignore the failures, or
unconditionally halt the kernel build. This leads to situations where
incompatibilities with existing APIs may develop, but not trip the
configure compatibility checks.
This introduces a new mechanism to require APIs for kernels above a
particular version. While not perfect, this at least guarantees
mainline kernels do not break existing APIs without at least providing
some warning.
Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Antonio Russo <aerusso@aerusso.net>
Closes #14343
Antonio Russo [Sat, 31 Dec 2022 14:51:32 +0000 (07:51 -0700)]
Linux 6.1 compat: open inside tmpfile()
Linux 863f144 modified the .tmpfile interface to pass a struct file,
rather than a struct dentry, and expect the tmpfile implementation to
open inside of tmpfile().
This patch implements a configuration test that checks for this new API
and appropriately sets a HAVE_TMPFILE_DENTRY flag that tracks this old
API. Contingent on this flag, the appropriate API is implemented.
Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Antonio Russo <aerusso@aerusso.net>
Closes #14301
Closes #14343
Antonio Russo [Fri, 6 Jan 2023 18:52:08 +0000 (11:52 -0700)]
ZTS: close in mmapwrite.c
mmapwrite is used during the ZTS to identify issues with mmap-ed files.
This helper program exercises this pathway by continuously writing to a
file. ee6bf97c7 modified the writing threads to terminate after a set
amount of total data is written. This change allows standard program
execution to reach the end of a writer thread without closing the file
descriptor, introducing a resource "leak."
This patch appeases resource leak analyses by close()-ing the file at
the end of the thread.
Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Antonio Russo <aerusso@aerusso.net>
Closes #14353
Antonio Russo [Thu, 5 Jan 2023 20:50:30 +0000 (13:50 -0700)]
ZTS: limit mmapwrite file size
mmapwrite spawns several threads, all of which perform writes on a file
for the purpose of testing the behavior of mmap(2)-ed files. One
thread performs an mmap and a write to the beginning of that region,
while the others perform regular writes after lseek(2)-ing the end of
the file.
Because these regular writes are set in a while (1) loop, they will
write an unbounded amount of data to disk. The mmap_write_001_pos test
script SIGKILLs them after 30 seconds, but on fast testbeds, this may
be enough time to exhaust the available space in the filesystem,
leading to spurious test failures.
Instead, limit the total file size by checking that the lseek return
value is no greater than 250 * 1024*1024 bytes, which is less than the
default minimum vdev size defined in includes/default.cfg .
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Antonio Russo <aerusso@aerusso.net>
Closes #14277
Closes #14345
Clemens Lang [Thu, 5 Jan 2023 20:07:43 +0000 (20:07 +0000)]
contrib: dracut: Do not timeout waiting for pw
systemd-ask-password has a default timeout of 90 seconds, which means
that dracut will fall back to the rescue shell 4.5 minutes after boot if
no password is entered.
This is undesirable when combined with, for example, unlocking remotely
using dracut-sshd and systemd-tty-ask-password-agent. See also
https://github.com/gsauthof/dracut-sshd#timeout and
https://bugzilla.redhat.com/show_bug.cgi?id=868421.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Clemens Lang <neverpanic@gmail.com>
Closes #14341
Authored by: Dan McDonald <danmcd@mnx.io>
Reviewed by: Patrick Mooney <pmooney@pfmooney.com>
Reviewed by: Richard Lowe <richlowe@richlowe.net>
Approved by: Joshua M. Clulow <josh@sysmgr.org> Ported-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Illumos-issue: https://www.illumos.org/issues/15286
Illumos-commit: https://github.com/illumos/illumos-gate/commit/f137b22e734e85642da3e56e8b94da3f5f027c73
Porting Notes:
The patch in illumos did not have much of a commit message, and did not
provide attribution to the reporter, while original patch proposed to
OpenZFS did, so I am listing the reporter (myself) and original patch
author (also myself) below while including the original commit message
with some minor corrections as part of the porting notes:
There, we have type promotion from int8_t to size_t, which is unsigned.
C will sign extend the value as part of the widening before treating the
value as unsigned and the negative values we can counter are error
values from U8_ILLEGAL_CHAR and U8_OUT_OF_RANGE_CHAR, which are -1 and
-2 respectively. The unsigned versions of these under two's complement
are SIZE_MAX and SIZE_MAX-1 respectively.
The bounds check is written under the assumption that `size <= 1` does a
signed comparison. This is followed by a pointer comparison to see if
the string has the correct length, which is fine.
A little further down we have:
for (i = 0; i < size; i++)
tc[i] = *p++;
When an error condition is encountered, this will attempt to iterate at
least SIZE_MAX-1 times, which will massively overflow the buffer, which
is not fine.
The kernel will kill the loop as soon as it hits the kernel stack guard
on Linux systems built with CONFIG_VMAP_STACK=y, which should be just
about all of them. That prevents arbitrary code execution and just about
any other bad thing that a black hat attacker might attempt with
knowledge of this buffer overflow. Other systems' kernels have
mitigations for unbounded in-kernel buffer overflows that will catch
this too.
Also, the patch in illumos-gate made an effort to fix C style issues
that had been fixed in the OpenZFS/ZFSOnLinux repository. Those issues
had been mentioned in the email that I originally sent them about this
issue. One of the fixes had not been already done, so it is included.
Another to collect_a_seq()'s arguments was handled differently in
OpenZFS. For the sake of avoiding unnecessary differences, it has been
adopted. This has the interesting effect that if you correct the paths
in the illumos-gate patch to match the current OpenZFS repository, you
can reverse apply it cleanly.
Original-patch-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Reported-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Co-authored-by: Dan McDonald <danmcd@mnx.io>
Closes #14318
Closes #14342
Matthew Ahrens [Thu, 5 Jan 2023 19:04:24 +0000 (11:04 -0800)]
removal of LegacyVersion broke ax_python_dev.m4
The 22.0 release of the python `packaging` package removed the
`LegacyVersion` trait, causing ZFS to no longer compile.
This commit replaces the sections of `ax_python_dev.m4` that rely on
`LegacyVersion` with updated implementations from the upstream
`autoconf-archive`.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes #14297
Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
Closes #14328
Martin Rüegg [Thu, 29 Dec 2022 09:59:12 +0000 (10:59 +0100)]
Fix shebang for helper script of deb-utils
Shebang was missing the `!` between `#` and the actual path.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Signed-off-by: Martin Rüegg <martin.rueegg@metaworx.ch>
Closes #14339
Martin Rüegg [Thu, 29 Dec 2022 09:56:39 +0000 (10:56 +0100)]
Add quotation marks around `$PATH` for deb-utils
Fix #14338, failing to build deb-utils if existing `$PATH` variable
would include a whitespace.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Signed-off-by: Martin Rüegg <martin.rueegg@metaworx.ch>
Closes #14339
Alexander Motin [Thu, 5 Jan 2023 17:31:55 +0000 (12:31 -0500)]
Pack zrlock_t by 8 bytes
On FreeBSD this reduces this structure size from 64 to 56 bytes.
dnode_handle_t respectively reduces from 72 to 64 bytes. It sounds
like a waste to need 72 bytes to be able to relocate 808 bytes of
dnode_t, which relocation on FreeBSD is not even supported.
Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #14317
Alexander Motin [Thu, 5 Jan 2023 17:29:13 +0000 (12:29 -0500)]
Update arc_summary and arcstat outputs
Recent ARC commits added new statistic counters, such as iohits,
uncached state, etc. Represent those. Also some of previously
reported numbers were confusing or even made no sense. Cleanup
and restructure existing reports.
Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Issue #14115
Issue #14123
Issue #14243
Closes #14320
Alexander Motin [Thu, 5 Jan 2023 17:15:31 +0000 (12:15 -0500)]
Hide b_freeze_* under ZFS_DEBUG
This saves 40 bytes per full ARC header, reducing it on FreeBSD from
240 to 200 bytes on production bits.
Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Matthew Ahrens <mahrens@delphix.com> Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Closes #14315
Alexander Motin [Thu, 5 Jan 2023 00:29:54 +0000 (19:29 -0500)]
Implement uncached prefetch
Previously the primarycache property was handled only in the dbuf
layer. Since the speculative prefetcher is implemented in the ARC,
it had to be disabled for uncacheable buffers.
This change gives the ARC knowledge about uncacheable buffers
via arc_read() and arc_write(). So when remove_reference() drops
the last reference on the ARC header, it can either immediately destroy
it, or if it is marked as prefetch, put it into a new arc_uncached state.
That state is scanned every second, evicting stale buffers that were
not demand read.
This change also tracks dbufs that were read from the beginning,
but not to the end. It is assumed that such buffers may receive further
reads, and so they are stored in dbuf cache. If a following
reads reaches the end of the buffer, it is immediately evicted.
Otherwise it will follow regular dbuf cache eviction. Since the dbuf
layer does not know actual file sizes, this logic is not applied to
the final buffer of a dnode.
Since uncacheable buffers should no longer stay in the ARC for long,
this patch also tries to optimize I/O by allocating ARC physical
buffers as linear to allow buffer sharing.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: George Wilson <george.wilson@delphix.com> Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #14243
Alexander Motin [Thu, 22 Dec 2022 20:10:24 +0000 (15:10 -0500)]
arc_read()/arc_access() refactoring and cleanup
ARC code was many times significantly modified over the years, that
created significant amount of tangled and potentially broken code.
This should make arc_access()/arc_read() code some more readable.
- Decouple prefetch status tracking from b_refcnt. It made sense
originally, but became highly cryptic over the years. Move all the
logic into arc_access(). While there, clean up and comment state
transitions in arc_access(). Some transitions were weird IMO.
- Unify arc_access() calls to arc_read() instead of sometimes calling
it from arc_read_done(). To avoid extra state changes and checks add
one more b_refcnt for ARC_FLAG_IO_IN_PROGRESS.
- Reimplement ARC_FLAG_WAIT in case of ARC_FLAG_IO_IN_PROGRESS with
the same callback mechanism to not falsely account them as hits. Count
those as "iohits", an intermediate between "hits" and "misses". While
there, call read callbacks in original request order, that should be
good for fairness and random speculations/allocations/aggregations.
- Introduce additional statistic counters for prefetch, accounting
predictive vs prescient and hits vs iohits vs misses.
- Remove hash_lock argument from functions not needing it.
- Remove ARC_FLAG_PREDICTIVE_PREFETCH, since it should be opposite
to ARC_FLAG_PRESCIENT_PREFETCH if ARC_FLAG_PREFETCH is set. We may
wish to add ARC_FLAG_PRESCIENT_PREFETCH to few more places.
- Fix few false positive tests found in the process.
Reviewed-by: George Wilson <gwilson@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #14123
Matthew Ahrens [Thu, 22 Dec 2022 19:48:49 +0000 (11:48 -0800)]
deadlock between spa_errlog_lock and dp_config_rwlock
There is a lock order inversion deadlock between `spa_errlog_lock` and
`dp_config_rwlock`:
A thread in `spa_delete_dataset_errlog()` is running from a sync task.
It is holding the `dp_config_rwlock` for writer (see
`dsl_sync_task_sync()`), and waiting for the `spa_errlog_lock`.
A thread in `dsl_pool_config_enter()` is holding the `spa_errlog_lock`
(see `spa_get_errlog_size()`) and waiting for the `dp_config_rwlock` (as
reader).
Note that this was introduced by #12812.
This commit address this by defining the lock ordering to be
dp_config_rwlock first, then spa_errlog_lock / spa_errlist_lock.
spa_get_errlog() and spa_get_errlog_size() can acquire the locks in this
order, and then process_error_block() and get_head_and_birth_txg() can
verify that the dp_config_rwlock is already held.
Additionally, a buffer overrun in `spa_get_errlog()` is corrected. Many
code paths didn't check if `*count` got to zero, instead continuing to
overwrite past the beginning of the userspace buffer at `uaddr`.
Tested by having some errors in the pool (via `zinject -t data
/path/to/file`), one thread running `zpool iostat 0.001`, and another
thread runs `zfs destroy` (in a loop, although it hits the first time).
This reproduces the problem easily without the fix, and works with the
fix.
Reviewed-by: Mark Maybee <mark.maybee@delphix.com> Reviewed-by: George Amanakis <gamanakis@gmail.com> Reviewed-by: George Wilson <gwilson@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes #14239
Closes #14289
Reviewed-by: George Melikov <mail@gmelikov.ru> Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #14306
George Melikov [Tue, 20 Dec 2022 00:08:15 +0000 (03:08 +0300)]
systemd: set restart=always for zfs-zed.service
Reviewed-by: Richard Laager <rlaager@wiktel.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: George Melikov <mail@gmelikov.ru> Co-authored-by: Attila Fülöp <attila@fueloep.org>
Closes #14294
Doug Rabson [Wed, 14 Dec 2022 01:35:07 +0000 (01:35 +0000)]
FreeBSD: Remove stray debug printf
Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Signed-off-by: Doug Rabson <dfr@rabson.org>
Closes #14286
Closes #14287
Umer Saleem [Wed, 14 Dec 2022 01:33:05 +0000 (06:33 +0500)]
Add native-deb* targets to build native Debian packages
In continuation of previous #13451, this commits adds native-deb*
targets for make to build native debian packages. Github workflows
are updated to build and test native Debian packages.
Native packages only build with pre-configured paths (see the
dh_auto_configure section in contrib/debian/rules.in). While
building native packages, paths should not be configured. Initial
config flags e.g. '--enable-debug' are replaced in
contrib/debian/rules.in.
Additional packages on top of existing zfs packages required to
build native packages include debhelper-compat, dh-python, dkms,
po-debconf, python3-all-dev, python3-sphinx.
Reviewed-by: George Melikov <mail@gmelikov.ru> Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Umer Saleem <usaleem@ixsystems.com>
Closes #14265
Richard Yao [Wed, 14 Dec 2022 01:31:47 +0000 (20:31 -0500)]
Zero end of embedded block buffer in dump_write_embedded()
This fixes a kernel stack leak.
Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Tested-by: Nicholas Sherlock <n.sherlock@gmail.com> Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Closes #13778
Closes #14255
Ameer Hamza [Wed, 14 Dec 2022 01:30:46 +0000 (06:30 +0500)]
Allow receiver to override encryption properties in case of replication
Currently, the receiver fails to override the encryption
property for the plain replicated dataset with the error:
"cannot receive incremental stream: encryption property
'encryption' cannot be set for incremental streams.". The
problem is resolved by allowing the receiver to override
the encryption property for plain replicated send.
Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ameer Hamza <ahamza@ixsystems.com>
Closes #14253
Closes #13533
Richard Yao [Wed, 14 Dec 2022 01:29:21 +0000 (20:29 -0500)]
Cache dbuf_hash() calculation
We currently compute a 64-bit hash three times, which consumes 0.8% CPU
time on ARC eviction heavy workloads. Caching the 64-bit value in the
dbuf allows us to avoid that overhead.
Sponsored-By: Wasabi Technology, Inc. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Matthew Ahrens <mahrens@delphix.com> Signed-off-by: Richard Yao <richard.yao@klarasystems.com>
Closes #14251
Allan Jude [Wed, 14 Dec 2022 01:27:54 +0000 (20:27 -0500)]
zfs list: Allow more fields in ZFS_ITER_SIMPLE mode
If the fields to be listed and sorted by are constrained to those
populated by dsl_dataset_fast_stat(), then zfs list is much faster,
as it does not need to open each objset and reads its properties.
A previous optimization by Pawel Dawidek
(0cee24064a79f9c01fc4521543c37acea538405f) took advantage
of this to make listing snapshot names sorted only by name much faster.
However, it was limited to `-o name -s name`, this work extends this
optimization to work with:
- name
- guid
- createtxg
- numclones
- inconsistent
- redacted
- origin
and could be further extended to any other properties supported by
dsl_dataset_fast_stat() or similar, that do not require extra locking
or reading from disk.
If the kernel does not populate zc->zc_objset_stats, we now fallback
to getting the properties via the slower interface, to avoid problems
with newer userland and older kernels.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Allan Jude <allan@klarasystems.com>
Closes #14110
Marcel Menzel [Wed, 14 Dec 2022 01:26:10 +0000 (02:26 +0100)]
Change ZEVENT_POOL_GUID to ZEVENT_POOL to display pool names
Outgoing mails for ZFS pool events include the pool GUID,
but not the actual pool name. Let's change this for better
readability, as it is already done in the mails for finished
pool resilvers.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by Richard Yao <richard.yao@alumni.stonybrook.edu> Signed-off-by: Marcel Menzel <mail@mcl.gg>
Closes #14272
Richard Yao [Mon, 12 Dec 2022 18:40:05 +0000 (13:40 -0500)]
Address theoretical uninitialized variable usage in zstream
Coverity has long complained about the checksum being uninitialized if
an END record is processed before its BEGIN record. This should not
happen, but there was no code to check for it. I had left this unfixed
since it was a low priority issue, but then 9f4ede63d23be4f43ba8dd0ca42c6a773a8eaa8d added another instance of this.
I am making an effort to "hold the line" to keep new coverity defect
reports from going unaddressed, so I find myself forced to fix this much
earlier than I had originally planned to address it.
The solution is to maintain a counter and a flag. Then use VERIFY
statements to verify the following runtime constraints:
* Every record either has a corresponding BEGIN record, is a BEGIN
record or is the end of stream END record for replication streams.
* BEGIN records cannot be nested. i.e. There must be an END record
before another BEGIN record may be seen.
Failure to meet these constraints will cause the program to exit.
This is sufficient to ensure that the checksum is never accessed when
uninitialized.
Reported-by: Coverity (CID 1524578) Reported-by: Coverity (CID 1524633) Reported-by: Coverity (CID 1527295) Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Damian Szuberski <szuberskidamian@gmail.com> Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Closes #14176
Ryan Moeller [Mon, 12 Dec 2022 18:23:06 +0000 (13:23 -0500)]
initramfs: Fix legacy mountpoint rootfs
Legacy mountpoint datasets should not pass `-o zfsutil` to `mount.zfs`.
Fix the logic in `mount_fs()` to not forget we have a legacy mountpoint
when checking for an `org.zol:mountpoint` userprop.
Reviewed-by: Richard Yao <ryao@gentoo.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ryan Moeller <ryan@iXsystems.com>
Closes #14274
Ameer Hamza [Mon, 12 Dec 2022 18:21:37 +0000 (23:21 +0500)]
Skip permission checks for extended attributes
zfs_zaccess_trivial() calls the generic_permission() to read
xattr attributes. This causes deadlock if called from
zpl_xattr_set_dir() context as xattr and the dent locks are
already held in this scenario. This commit skips the permissions
checks for extended attributes since the Linux VFS stack already
checks it before passing us the control.
Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Youzhong Yang <yyang@mathworks.com> Signed-off-by: Ameer Hamza <ahamza@ixsystems.com>
Closes #14220
Allan Jude [Fri, 9 Dec 2022 19:04:29 +0000 (14:04 -0500)]
Restrict visibility of per-dataset kstats inside FreeBSD jails
When inside a jail, visibility on datasets not "jailed" to the
jail is restricted. However, it was possible to enumerate all
datasets in the pool by looking at the kstats sysctl MIB.
Only the kstats corresponding to datasets that the user has
visibility on are accessible now.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Signed-off-by: Allan Jude <allan@klarasystems.com>
Closes #14254
Context:
We recently had a scenario where a customer with 2x10TB disks at 95+%
fragmentation and capacity, wanted to migrate their disks to a 2x20TB
setup. So they added the 2 new disks and submitted the removal of the
first 10TB disk. The removal took a lot more than expected (order of
more than a week to 2 weeks vs a couple of days) and once it was done it
generated a huge indirect mappign table in RAM (~16GB vs expected ~1GB).
Root-Cause:
The removal code calls `metaslab_alloc_dva()` to allocate a new block
for each evacuating block in the removing device and it tries to batch
them into 16MB segments. If it can't find such a segment it tries for
8MBs, 4MBs, all the way down to 512 bytes.
In our scenario what would happen is that `metaslab_alloc_dva()` from
the removal thread pick the new devices initially but wouldn't allocate
from them because of throttling in their metaslab allocation queue's
depth (see `metaslab_group_allocatable()`) as these devices are new and
favored for most types of allocations because of their free space. So
then the removal thread would look at the old fragmented disk for
allocations and wouldn't find any contiguous space and finally retry
with a smaller allocation size until it would to the low KB range. This
caused a lot of small mappings to be generated blowing up the size of
the indirect table. It also wasted a lot of CPU while the removal was
active making everything slow.
This patch:
Make all allocations coming from the device removal thread bypass the
throttle checks. These allocations are not even counted in the metaslab
allocation queues anyway so why check them?
Side-Fix:
Allocations with METASLAB_DONT_THROTTLE in their flags would not be
accounted at the throttle queues but they'd still abide by the
throttling rules which seems wrong. This patch fixes this by checking
for that flag in `metaslab_group_allocatable()`. I did a quick check to
see where else this flag is used and it doesn't seem like this change
would cause issues.
Reviewed-by: Matthew Ahrens <mahrens@delphix.com> Reviewed-by: Mark Maybee <mark.maybee@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Closes #14159
Richard Yao [Sun, 4 Dec 2022 22:03:14 +0000 (17:03 -0500)]
Do not pass -1 to strerror() from zfs_send_cb_impl()
`zfs_send_cb_impl()` calls `dump_filesystems()`, which calls
`dump_filesystem()`, which will return `-1` as an error when
`zfs_open()` returns `NULL`.
This will be passed to `zfs_standard_error()`, which passes it to
`zfs_standard_error_fmt()`, which passes it to `strerror()`.
To fix this, we modify zfs_open() to set `errno` whenever it returns
NULL. Most of the cases already have `errno` set (since they pass it to
`zfs_standard_error_fmt()`, which makes this easy. Then we modify
`dump_filesystem()` to pass `errno` instead of `-1`.
Reported-by: Coverity (CID-1524598) Reviewed-by: Damian Szuberski <szuberskidamian@gmail.com> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Closes #14264
Richard Yao [Sun, 4 Dec 2022 21:31:28 +0000 (16:31 -0500)]
Fix dereference after null check in enqueue_range
If the bp is NULL, we have a hole. However, when we build with
assertions, we will dereference bp when `blkid == DMU_SPILL_BLKID`. When
this happens on a hole, we will have a NULL pointer dereference.
Reported-by: Coverity (CID-1524670) Reviewed-by: Damian Szuberski <szuberskidamian@gmail.com> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Closes #14264