]> CyberLeo.Net >> Repos - FreeBSD/FreeBSD.git/log
FreeBSD/FreeBSD.git
17 months agoImprove resilver ETAs
Brian Behlendorf [Wed, 25 Jan 2023 19:28:54 +0000 (11:28 -0800)]
Improve resilver ETAs

When resilvering the estimated time remaining is calculated using
the average issue rate over the current pass.  Where the current
pass starts when a scan was started, or restarted, if the pool
was exported/imported.

For dRAID pools in particular this can result in wildly optimistic
estimates since the issue rate will be very high while scanning
when non-degraded regions of the pool are scanned.  Once repair
I/O starts being issued performance drops to a realistic number
but the estimated performance is still significantly skewed.

To address this we redefine a pass such that it starts after a
scanning phase completes so the issue rate is more reflective of
recent performance.  Additionally, the zfs_scan_report_txgs
module option can be set to reset the pass statistics more often.

Reviewed-by: Akash B <akash-b@hpe.com>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #14410

17 months agolinux 6.2 compat: zpl_set_acl arg2 is now struct dentry
Coleman Kane [Tue, 24 Jan 2023 19:20:50 +0000 (14:20 -0500)]
linux 6.2 compat:  zpl_set_acl arg2 is now struct dentry

Linux 6.2 changes the second argument of the set_acl operation to be a
"struct dentry *" rather than a "struct inode *". The inode* parameter
is still available as dentry->d_inode, so adjust the call to the _impl
function call to dereference and pass that pointer to it.

Also document that the get_acl -> get_inode_acl member name change from
commit 884a693 was an API change also introduced in Linux 6.2.

Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Signed-off-by: Coleman Kane <ckane@colemankane.org>
Closes #14415

17 months agoIntroduce minimal ZIL block commit delay
Alexander Motin [Tue, 24 Jan 2023 17:20:32 +0000 (12:20 -0500)]
Introduce minimal ZIL block commit delay

Despite all optimizations, tests on actual hardware show that FreeBSD
kernel can't sleep for less then ~2us.  Similar tests on Linux show
~50us delay at least from nanosleep() (haven't tested inside kernel).
It means that on very fast log device ZIL may not be able to satisfy
zfs_commit_timeout_pct block commit timeout, increasing log latency
more than desired.

Handle that by introduction of zil_min_commit_timeout parameter,
specifying minimal timeout value where additional delays to aggregate
writes may be skipped.  Also skip delays if the LWB is more than 7/8
full, that often happens if I/O sizes are constant and match one of
LWB sizes.  Both things are applied only if there were no already
outstanding log blocks, that may indicate single-threaded workload,
that by definition can not benefit from the commit delays.

While there, add short time moving average to zl_last_lwb_latency to
make it more stable.

Tests of single-threaded 4KB writes to NVDIMM SLOG on FreeBSD show IOPS
increase by 9% instead of expected 5%.  For zfs_commit_timeout_pct of
1 there IOPS increase by 5.5% instead of expected 1%.

Reviewed-by: Allan Jude <allan@klarasystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Prakash Surya <prakash.surya@delphix.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #14418

17 months agox86 asm: Replace .align with .balign
Attila Fülöp [Mon, 23 Jan 2023 19:25:21 +0000 (20:25 +0100)]
x86 asm: Replace .align with .balign

The .align directive used to align storage locations is
ambiguous. On some platforms and assemblers it takes a byte count,
on others the argument is interpreted as a shift value. The current
usage expects the first interpretation.

Replace it with the unambiguous .balign directive which always
expects a byte count, regardless of platform and assembler.

Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de>
Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Signed-off-by: Attila Fülöp <attila@fueloep.org>
Closes #14422

17 months agoIPC: blake3 x86 asm: fix placement of .size directives
Attila Fülöp [Mon, 23 Jan 2023 18:48:39 +0000 (19:48 +0100)]
IPC: blake3 x86 asm: fix placement of .size directives

The .size directive used by the SET_SIZE C macro uses the special
dot symbol to calculate the size of a function. The dot symbol
refers to the current address, so for the calculation to be
meaningful the SET_SIZE macro must be placed immediately after the
end of the function the size is calculated for.

Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de>
Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Signed-off-by: Attila Fülöp <attila@fueloep.org>
Closes #14422

17 months agoWait for txg sync if the last DRR_FREEOBJECTS might result in a hole
David Hedberg [Mon, 23 Jan 2023 21:19:43 +0000 (22:19 +0100)]
Wait for txg sync if the last DRR_FREEOBJECTS might result in a hole

If we receive a DRR_FREEOBJECTS as the first entry in an object range,
this might end up producing a hole if the freed objects were the
only existing objects in the block.

If the txg starts syncing before we've processed any following
DRR_OBJECT records, this leads to a possible race where the backing
arc_buf_t gets its psize set to 0 in the arc_write_ready() callback
while still being referenced from a dirty record in the open txg.

To prevent this, we insert a txg_wait_synced call if the first
record in the range was a DRR_FREEOBJECTS that actually
resulted in one or more freed objects.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: David Hedberg <david.hedberg@findity.com>
Sponsored by: Findity AB
Closes #11893
Closes #14358

17 months agoReject streams that set ->drr_payloadlen to unreasonably large values
Richard Yao [Mon, 23 Jan 2023 21:16:22 +0000 (16:16 -0500)]
Reject streams that set ->drr_payloadlen to unreasonably large values

In the zstream code, Coverity reported:

"The argument could be controlled by an attacker, who could invoke the
function with arbitrary values (for example, a very high or negative
buffer size)."

It did not report this in the kernel. This is likely because the
userspace code stored this in an int before passing it into the
allocator, while the kernel code stored it in a uint32_t.

However, this did reveal a potentially real problem. On 32-bit systems
and systems with only 4GB of physical memory or less in general, it is
possible to pass a large enough value that the system will hang. Even
worse, on Linux systems, the kernel memory allocator is not able to
support allocations up to the maximum 4GB allocation size that this
allows.

This had already been limited in userspace to 64MB by
`ZFS_SENDRECV_MAX_NVLIST`, but we need a hard limit in the kernel to
protect systems. After some discussion, we settle on 256MB as a hard
upper limit. Attempting to receive a stream that requires more memory
than that will result in E2BIG being returned to user space.

Reported-by: Coverity (CID-1529836)
Reported-by: Coverity (CID-1529837)
Reported-by: Coverity (CID-1529838)
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Closes #14285

17 months agoConfigure zed's diagnosis engine with vdev properties
rob-wing [Mon, 23 Jan 2023 21:14:25 +0000 (12:14 -0900)]
Configure zed's diagnosis engine with vdev properties

Introduce four new vdev properties:
    checksum_n
    checksum_t
    io_n
    io_t

These properties can be used for configuring the thresholds of zed's
diagnosis engine and are interpeted as <N> events in T <seconds>.

When this property is set to a non-default value on a top-level vdev,
those thresholds will also apply to its leaf vdevs. This behavior can be
overridden by explicitly setting the property on the leaf vdev.

Note that, these properties do not persist across vdev replacement. For
this reason, it is advisable to set the property on the top-level vdev
instead of the leaf vdev.

The default values for zed's diagnosis engine (10 events, 600 seconds)
remains unchanged.

Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Reviewed-by: Allan Jude <allan@klarasystems.com>
Signed-off-by: Rob Wing <rob.wing@klarasystems.com>
Sponsored-by: Seagate Technology LLC
Closes #13805

17 months agofree_blocks(): Fix reports from 2016 PVS Studio FreeBSD report
Richard Yao [Mon, 23 Jan 2023 21:12:37 +0000 (16:12 -0500)]
free_blocks(): Fix reports from 2016 PVS Studio FreeBSD report

In 2016, the authors of PVS Studio ran it on the FreeBSD kernel, which
identified a number of bugs / cleanup opportunities in the FreeBSD ZFS kernel
code. A few of them persist to the present day:

https://reviews.freebsd.org/D5245

Note that the scan was done against
freebsd/freebsd-src@46763fd4ca8a37f836c9bf2333f9d687509278f3.

In particular, we have the following in free_blocks():

\sys\cddl\contrib\opensolaris\uts\common\fs\zfs\dnode_sync.c (174): error V547: Expression '__left >= __right' is always true. Unsigned type value is always >= 0.
\sys\cddl\contrib\opensolaris\uts\common\fs\zfs\dnode_sync.c (171): error V634: The priority of the '*' operation is higher than that of the '<<' operation. It's possible that parentheses should be used in the expression.
\sys\cddl\contrib\opensolaris\uts\common\fs\zfs\dnode_sync.c (175): error V547: Expression '__left >= __right' is always true. Unsigned type value is always >= 0.

A couple of assertions accidentally typecast the arguments they check to
unsigned in such a way that the result is always true. Also, parentheses
are missing around `1<<epbs` in `(db->db_blkid * 1<<epbs)`. This works
out to be okay due to multiplication not caring what order of operations
we use, but it is better to fix it to be `(db->db_blkid << epbs)`.

A few of the function local variables probably never should have been
32-bit in the first place, so we make them 64-bit. We also replace the
existing assertions with additional assertions to ensure that 64-bit
unsigned arithmetic is safe.

Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Closes #14407

17 months agoFix reading uninitialized variable in receive_read
Chunwei Chen [Fri, 20 Jan 2023 19:49:56 +0000 (11:49 -0800)]
Fix reading uninitialized variable in receive_read

When zfs_file_read returns error, resid may be uninitialized.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Signed-off-by: Chunwei Chen <david.chen@nutanix.com>
Closes #14404

17 months agozfs_receive_one: Check for the more likely error first
Allan Jude [Fri, 20 Jan 2023 19:11:54 +0000 (14:11 -0500)]
zfs_receive_one: Check for the more likely error first

If zfs_receive_one() gets back EINVAL, check for the more likely case,
embedded block pointers + encryption and return that error, before
falling back to the less likely case, a resumable stream when the
kernel has not been upgraded to support resume.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Allan Jude <allan@klarasystems.com>
Sponsored-by: rsync.net
Sponsored-by: Klara Inc.
Closes #14379

17 months agoCleanup ->dd_space_towrite should be unsigned
Richard Yao [Fri, 20 Jan 2023 19:10:15 +0000 (14:10 -0500)]
Cleanup ->dd_space_towrite should be unsigned

This is only ever used with unsigned data, so the type itself should be
unsigned. Also, PVS Studio's 2016 FreeBSD kernel report correctly
identified the following assertion as always being true, so we can drop
it:

ASSERT3U(dd->dd_space_towrite[i & TXG_MASK], >=, 0);

The reason it was always true is because it would do casts to give us
unsigned comparisons. This could have been fixed by switching to
`ASSERT3S()`, but upon inspection, it turned out that this variable
never should have been allowed to be signed in the first place.

Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Closes #14408

17 months agoMicro-optimize dsl_prop_get_dd()
Mark Johnston [Mon, 16 Jan 2023 12:53:16 +0000 (07:53 -0500)]
Micro-optimize dsl_prop_get_dd()

Use the saved property index instead of looking it up once per DSL
directory when traversing up towards the root.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Igor Kozhukhov <igor@dilos.org>
Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Reviewed-by: Allan Jude <allan@klarasystems.com>
Reviewed-by: Akash B <akash-b@hpe.com>
Signed-off-by: Mark Johnston <markj@FreeBSD.org>
Sponsored-by: The FreeBSD Foundation
Closes #14397

17 months agoAvoid passing an uninitialized index to dsl_prop_known_index
Mark Johnston [Sun, 15 Jan 2023 22:05:19 +0000 (17:05 -0500)]
Avoid passing an uninitialized index to dsl_prop_known_index

Reported-by: KMSAN
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Igor Kozhukhov <igor@dilos.org>
Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Reviewed-by: Allan Jude <allan@klarasystems.com>
Reviewed-by: Akash B <akash-b@hpe.com>
Signed-off-by: Mark Johnston <markj@FreeBSD.org>
Sponsored-by: The FreeBSD Foundation
Closes #14397

17 months agoFix unprotected zfs_znode_dmu_fini
Chunwei Chen [Fri, 20 Jan 2023 00:59:05 +0000 (16:59 -0800)]
Fix unprotected zfs_znode_dmu_fini

In original code, zfs_znode_dmu_fini is called in zfs_rmnode without
zfs_znode_hold_enter. It seems to assume it's ok to do so when the znode
is unlinked. However this assumption is not correct, as zfs_zget can be
called by NFS through zpl_fh_to_dentry as pointed out by Christian in
https://github.com/openzfs/zfs/pull/12767, which could result in a
use-after-free bug.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Co-authored-by: Ryan Moeller <ryan@iXsystems.com>
Signed-off-by: Chunwei Chen <david.chen@nutanix.com>
Signed-off-by: Ryan Moeller <ryan@iXsystems.com>
Closes #12767
Closes #14364

17 months agoMan: fix defaults for zfs_dirty_data_max_max
George Melikov [Tue, 17 Jan 2023 21:12:29 +0000 (00:12 +0300)]
Man: fix defaults for zfs_dirty_data_max_max

It was changed in e99932f7dec6efeb006e225e0bf0901c30345cac,
but without docs update.

Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Signed-off-by: George Melikov <mail@gmelikov.ru>
Closes #14400

17 months agoUnify Assembler files between Linux and Windows
Jorgen Lundman [Tue, 17 Jan 2023 19:09:19 +0000 (04:09 +0900)]
Unify Assembler files between Linux and Windows

Add new macro ASMABI used by Windows to change
calling API to "sysv_abi".

Reviewed-by: Attila Fülöp <attila@fueloep.org>
Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Jorgen Lundman <lundman@lundman.net>
Closes #14228

17 months agoUse setproctitle to report progress of zfs send
Ameer Hamza [Tue, 17 Jan 2023 18:17:35 +0000 (23:17 +0500)]
Use setproctitle to report progress of zfs send

This allows parsing of zfs send progress by checking the process
title.
Doing so requires some changes to the send code in libzfs_sendrecv.c;
primarily these changes move some of the accounting around, to allow
for the code to be verbose as normal, or set the process title. Unlike
BSD, setproctitle() isn't standard in Linux; thus, borrowed it from
libbsd with slight modifications.

Authored-by: Sean Eric Fagan <sef@FreeBSD.org>
Co-authored-by: Ryan Moeller <ryan@iXsystems.com>
Co-authored-by: Ameer Hamza <ahamza@ixsystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Signed-off-by: Ameer Hamza <ahamza@ixsystems.com>
Closes #14376

17 months agoCleanup of dead code suggested by Clang Static Analyzer (#14380)
Richard Yao [Tue, 17 Jan 2023 17:57:12 +0000 (12:57 -0500)]
Cleanup of dead code suggested by Clang Static Analyzer (#14380)

I recently gained the ability to run Clang's static analyzer on the
linux kernel modules via a few hacks. This extended coverage to code
that was previously missed since Clang's static analyzer only looked at
code that we built in userspace. Running it against the Linux kernel
modules built from my local branch produced a total of 72 reports
against my local branch. Of those, 50 were reports of logic errors and
22 were reports of dead code. Since we already had cleaned up all of
the previous dead code reports, I felt it would be a good next step to
clean up these dead code reports. Clang did a further breakdown of the
dead code reports into:

Dead assignment 15

Dead increment 2

Dead nested assignment 5

The benefit of cleaning these up, especially in the case of dead nested
assignment, is that they can expose places where our error handling is
incorrect. A number of them were fairly straight forward. However
several were not:

In vdev_disk_physio_completion(), not only were we not using the return
value from the static function vdev_disk_dio_put(), but nothing used it,
so I changed it to return void and removed the existing (void) cast in
the other area where we call it in addition to no longer storing it to a
stack value.

In FSE_createDTable(), the function is dead code. Its helper function
FSE_freeDTable() is also dead code, as are the CPP definitions in
`module/zstd/include/zstd_compat_wrapper.h`. We just delete it all.

In zfs_zevent_wait(), we have an optimization opportunity. cv_wait_sig()
returns 0 if there are waiting signals and 1 if there are none. The
Linux SPL version literally returns `signal_pending(current) ? 0 : 1)`
and FreeBSD implements the same semantics, we can just do
`!cv_wait_sig()` in place of `signal_pending(current)` to avoid
unnecessarily calling it again.

zfs_setattr() on FreeBSD version did not have error handling issue
because the code was removed entirely from FreeBSD version. The error is
from updating the attribute directory's files. After some thought, I
decided to propapage errors on it to userspace.

In zfs_secpolicy_tmp_snapshot(), we ignore a lack of permission from the
first check in favor of checking three other permissions. I assume this
is intentional.

In zfs_create_fs(), the return value of zap_update() was not checked
despite setting an important version number. I see no backward
compatibility reason to permit failures, so we add an assertion to catch
failures. Interestingly, Linux is still using ASSERT(error == 0) from
OpenSolaris while FreeBSD has switched to the improved ASSERT0(error)
from illumos, although illumos has yet to adopt it here. ASSERT(error ==
0) was used on Linux while ASSERT0(error) was used on FreeBSD since the
entire file needs conversion and that should be the subject of
another patch.

dnode_move()'s issue was caused by us not having implemented
POINTER_IS_VALID() on Linux. We have a stub in
`include/os/linux/spl/sys/kmem_cache.h` for it, when it really should be
in `include/os/linux/spl/sys/kmem.h` to be consistent with
Illumos/OpenSolaris. FreeBSD put both `POINTER_IS_VALID()` and
`POINTER_INVALIDATE()` in `include/os/freebsd/spl/sys/kmem.h`, so we
copy what it did.

Whenever a report was in platform-specific code, I checked the FreeBSD
version to see if it also applied to FreeBSD, but it was only relevant a
few times.

Lastly, the patch that enabled Clang's static analyzer to be run on the
Linux kernel modules needs more work before it can be put into a PR. I
plan to do that in the future as part of the on-going static analysis
work that I am doing.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Closes #14380

17 months agoZTS: Annotate additonal flaky test cases
Brian Behlendorf [Tue, 17 Jan 2023 17:53:53 +0000 (09:53 -0800)]
ZTS: Annotate additonal flaky test cases

Update several flaky test cases in zts-report.py.in until they
can be made entirely reliable.

Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #14392

17 months agoCI: Reclaim space after package operations
Brian Behlendorf [Tue, 17 Jan 2023 17:52:27 +0000 (09:52 -0800)]
CI: Reclaim space after package operations

Rather than reclaiming space before updating the packages do
it afterwards.  This avoids issues with apt returning an
error due to missing files on the system.

This commit includes a revert for 6320b9e6.

Reviewed-by: George Melikov <mail@gmelikov.ru>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #14387

17 months agozpool-set: print error message when pool or vdev is not valid
Rob Wing [Wed, 21 Dec 2022 06:52:26 +0000 (21:52 -0900)]
zpool-set: print error message when pool or vdev is not valid

Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Allan Jude <allan@klarasystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rob Wing <rob.wing@klarasystems.com>
Sponsored-by: Seagate Technology
Submitted-by: Klara, Inc.
Closes #14310

17 months agozpool-set: update usage text
Rob Wing [Wed, 21 Dec 2022 06:33:11 +0000 (21:33 -0900)]
zpool-set: update usage text

Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Allan Jude <allan@klarasystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rob Wing <rob.wing@klarasystems.com>
Sponsored-by: Seagate Technology
Submitted-by: Klara, Inc.
Closes #14310

17 months agoLinux ppc64le ieee128 compat: Do not redefine __asm on external headers
Richard Yao [Fri, 13 Jan 2023 18:58:58 +0000 (13:58 -0500)]
Linux ppc64le ieee128 compat: Do not redefine __asm on external headers

There is an external assembly declaration extension in GNU C that glibc
uses when building with ieee128 floating point support on ppc64le.
Marking that as volatile makes no sense, so the build breaks.

It does not make sense to only mark this as volatile on Linux, since if
do not want the compiler reordering things on Linux, we do not want the
compiler reordering things on any other platform, so we stop treating
Linux specially and just manually inline the CPP macro so that we can
eliminate it. This should fix the build on ppc64le.

Tested-by: @gyakovlev
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Closes #14308
Closes #14384

17 months agoCleanup: Use NULL when doing NULL pointer comparisons
Richard Yao [Tue, 10 Jan 2023 21:58:21 +0000 (16:58 -0500)]
Cleanup: Use NULL when doing NULL pointer comparisons

The Linux 5.16.14 kernel's coccicheck caught this. The semantic
patch that caught it was:

./scripts/coccinelle/null/badzero.cocci

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Closes #14372

17 months agoCleanup: Remove unneeded semicolons
Richard Yao [Tue, 10 Jan 2023 21:54:23 +0000 (16:54 -0500)]
Cleanup: Remove unneeded semicolons

The Linux 5.16.14 kernel's coccicheck caught this. The semantic
patch that caught it was:

./scripts/coccinelle/misc/semicolon.cocci

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Closes #14372

17 months agoCleanup: Use MIN() macro
Richard Yao [Tue, 10 Jan 2023 21:41:26 +0000 (16:41 -0500)]
Cleanup: Use MIN() macro

The Linux 5.16.14 kernel's coccicheck caught this. The semantic
patch that caught it was:

./scripts/coccinelle/misc/minmax.cocci

There was a third opportunity to use `MIN()`, but that was in
`FSE_minTableLog()` in `module/zstd/lib/compress/fse_compress.c`.
Upstream zstd has yet to make this change and I did not want to change
header includes just for MIN, or do a one off, so I left it alone.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Closes #14372

17 months agoCleanup: !A || A && B is equivalent to !A || B
Richard Yao [Tue, 10 Jan 2023 21:34:29 +0000 (16:34 -0500)]
Cleanup: !A || A && B is equivalent to !A || B

In zfs_zaccess_dataset_check(), we have the following subexpression:

(!IS_DEVVP(ZTOV(zp)) ||
    (IS_DEVVP(ZTOV(zp)) && (v4_mode & WRITE_MASK_ATTRS)))

When !IS_DEVVP(ZTOV(zp)) is false, IS_DEVVP(ZTOV(zp)) is true under the
law of the excluded middle since we are not doing pseudoboolean alegbra.
Therefore doing:

(IS_DEVVP(ZTOV(zp)) && (v4_mode & WRITE_MASK_ATTRS))

Is unnecessary and we can just do:

(v4_mode & WRITE_MASK_ATTRS)

The Linux 5.16.14 kernel's coccicheck caught this. The semantic
patch that caught it was:

./scripts/coccinelle/misc/excluded_middle.cocci

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Closes #14372

17 months agoCleanup: Replace oldstyle struct hack with C99 flexible array members
Richard Yao [Tue, 10 Jan 2023 20:44:35 +0000 (15:44 -0500)]
Cleanup: Replace oldstyle struct hack with C99 flexible array members

The Linux 5.16.14 kernel's coccicheck caught this. The semantic
patch that caught it was:

./scripts/coccinelle/misc/flexible_array.cocci

However, unlike the cases where the GNU zero length array extension had
been used, coccicheck would not suggest patches for the older style
single member arrays. That was good because blindly changing them would
break size calculations in most cases.

Therefore, this required care to make sure that we did not break size
calculations. In the case of `indirect_split_t`, we use
`offsetof(indirect_split_t, is_child[is->is_children])` to calculate
size. This might be subtly wrong according to an old mailing list
thread:

https://inbox.sourceware.org/gcc-prs/20021226123454.27019.qmail@sources.redhat.com/T/

That is because the C99 specification should consider the flexible array
members to start at the end of a structure, but compilers prefer to put
padding at the end. A suggestion was made to allow compilers to allocate
padding after the VLA like compilers already did:

http://std.dkuug.dk/JTC1/SC22/WG14/www/docs/n983.htm

However, upon thinking about it, whether or not we allocate end of
structure padding does not matter, so using offsetof() to calculate the
size of the structure is fine, so long as we do not mix it with sizeof()
on structures with no array members.

In the case that we mix them and padding causes offsetof(struct_t,
vla_member[0]) to differ from sizeof(struct_t), we would be doing unsafe
operations if we underallocate via `offsetof()` and then overcopy via
sizeof().

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Closes #14372

17 months agoCleanup: Fix indentation in zfs_dbgmsg_t
Richard Yao [Tue, 10 Jan 2023 20:18:57 +0000 (15:18 -0500)]
Cleanup: Fix indentation in zfs_dbgmsg_t

fdc2d303710416868a05084e984fd8f231e948bd accidentally broke the
indentation.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Closes #14372

17 months agoCleanup: Use C99 flexible array members instead of zero length arrays
Richard Yao [Tue, 10 Jan 2023 20:14:59 +0000 (15:14 -0500)]
Cleanup: Use C99 flexible array members instead of zero length arrays

The Linux 5.16.14 kernel's coccicheck caught this. The semantic
patch that caught it was:

./scripts/coccinelle/misc/flexible_array.cocci

The Linux kernel's documentation makes a good case for why we should not
use these:

https://www.kernel.org/doc/html/latest/process/deprecated.html#zero-length-and-one-element-arrays

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Closes #14372

17 months agoCleanup: Use kmem_zalloc() instead of memset() to zero memory
Richard Yao [Tue, 10 Jan 2023 19:06:53 +0000 (14:06 -0500)]
Cleanup: Use kmem_zalloc() instead of memset() to zero memory

The Linux 5.16.14 kernel's coccicheck caught this. The semantic patch
that caught it was:

./scripts/coccinelle/api/alloc/zalloc-simple.cocci

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Closes #14372

17 months agoCleanup: Remove unnecessary explicit casts of pointers from allocators
Richard Yao [Tue, 10 Jan 2023 19:03:46 +0000 (14:03 -0500)]
Cleanup: Remove unnecessary explicit casts of pointers from allocators

The Linux 5.16.14 kernel's coccicheck caught these. The semantic patch
that caught them was:

./scripts/coccinelle/api/alloc/alloc_cast.cocci

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Closes #14372

17 months agochange how d_alias is replaced by du.d_alias
Gian-Carlo DeFazio [Thu, 12 Jan 2023 18:14:04 +0000 (10:14 -0800)]
change how d_alias is replaced by du.d_alias

d_alias may need to be converted to du.d_alias
depending on the kernel version.
d_alias is currently in only one place in the code which
changes
"hlist_for_each_entry(dentry, &inode->i_dentry, d_alias)"
to
"hlist_for_each_entry(dentry, &inode->i_dentry, d_u.d_alias)"
as neccesary.

This effectively results in a double macro expansion
for code that uses the zfs headers but already has its
own macro for just d_alias (lustre in this case).

Remove the conditional code for hlist_for_each_entry
and have a macro for "d_alias -> du.d_alias" instead.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Gian-Carlo DeFazio <defazio1@llnl.gov>
Closes #14377

17 months agoActivate filesystem features only in syncing context
George Amanakis [Thu, 12 Jan 2023 02:00:39 +0000 (03:00 +0100)]
Activate filesystem features only in syncing context

When activating filesystem features after receiving a snapshot, do
so only in syncing context.

Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: George Amanakis <gamanakis@gmail.com>
Closes #14304
Closes #14252

17 months agoCI: remove unused packages/snaps
Brian Behlendorf [Wed, 11 Jan 2023 23:18:51 +0000 (15:18 -0800)]
CI: remove unused packages/snaps

Removing portions of packages/snaps directly with rm can result in
unexpected errors when running `apt update`.  Free up the additional
space by removing (some) packages with the proper tools.

This change frees up slightly less space than before, but it is
expected to still be sufficient.

Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #14374

17 months agozpool: do guid-based comparison in is_vdev_cb()
rob-wing [Wed, 11 Jan 2023 23:14:35 +0000 (14:14 -0900)]
zpool: do guid-based comparison in is_vdev_cb()

is_vdev_cb() uses string comparison to find a matching vdev and
will fallback to comparing the guid via a string.  These changes
drop the string comparison and compare the guids instead.

Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Reviewed-by: Allan Jude <allan@klarasystems.com>
Signed-off-by: Rob Wing <rob.wing@klarasystems.com>
Co-authored-by: Rob Wing <rob.wing@klarasystems.com>
Sponsored-by: Seagate Technology
Submitted-by: Klara, Inc.
Closes #14311

17 months agoTurn default_bs and default_ibs into ZFS_MODULE_PARAMs
Mateusz Piotrowski [Wed, 11 Jan 2023 17:38:20 +0000 (18:38 +0100)]
Turn default_bs and default_ibs into ZFS_MODULE_PARAMs

The default_bs and default_ibs tunables control the default block size
and indirect block size.

So far, default_bs and default_ibs were tunable only on FreeBSD, e.g.,

    sysctl vfs.zfs.default_ibs

Remove the FreeBSD-specific sysctl code and expose default_bs and
default_ibs as tunables on both Linux and FreeBSD using
ZFS_MODULE_PARAM.

One of the use cases for changing the values of those tunables is to
lower the indirect block size, which may improve performance of large
directories (as discussed during the OpenZFS Leadership Meeting
on 2022-08-16).

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Signed-off-by: Mateusz Piotrowski <mateusz.piotrowski@klarasystems.com>
Sponsored-by: Wasabi Technology, Inc.
Closes #14293

17 months agoUpdate META to 6.1 kernel
Tony Hutter [Tue, 10 Jan 2023 23:53:33 +0000 (15:53 -0800)]
Update META to 6.1 kernel

ZFS successfully builds against the 6.1.4 kernel.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tony Hutter <hutter2@llnl.gov>
Closes #14371

17 months agoAdd tunable to allow changing micro ZAP's max size
Mateusz Piotrowski [Tue, 10 Jan 2023 21:41:54 +0000 (22:41 +0100)]
Add tunable to allow changing micro ZAP's max size

This change turns `MZAP_MAX_BLKSZ` into a `ZFS_MODULE_PARAM()` called
`zap_micro_max_size`. As a result, we can experiment with different
micro ZAP sizes to improve directory size scaling.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Co-authored-by: Mateusz Piotrowski <mateuszpiotrowski@klarasystems.com>
Co-authored-by: Toomas Soome <toomas.soome@klarasystems.com>
Signed-off-by: Mateusz Piotrowski <mateuszpiotrowski@klarasystems.com>
Sponsored-by: Wasabi Technology, Inc.
Closes #14292

17 months agoetc/systemd/zfs-mount-generator: avoid strndupa
Alyssa Ross [Tue, 10 Jan 2023 21:40:31 +0000 (21:40 +0000)]
etc/systemd/zfs-mount-generator: avoid strndupa

The non-standard strndupa function is not implemented by musl libc,
and can be dangerous due to its potential to blow the stack.  (musl
_does_ implement strdupa, used elsewhere in this function.)

With a similar amount of code, we can use a heap allocation to
construct the pool name, which is musl-friendly and doesn't have
potential stack problems.

(Why care about musl when systemd only supports glibc?  Some distros
patch systemd with portability fixes, and it would be nice to be able
to use ZFS on those distros.)

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alyssa Ross <alyssa.ross@unikie.com>
Closes #14327

17 months agoBatch enqueue/dequeue for bqueue
Matthew Ahrens [Tue, 10 Jan 2023 21:39:22 +0000 (13:39 -0800)]
Batch enqueue/dequeue for bqueue

The Blocking Queue (bqueue) code is used by zfs send/receive to send
messages between the various threads.  It uses a shared linked list,
which is locked whenever we enqueue or dequeue.  For workloads which
process many blocks per second, the locking on the shared list can be
quite expensive.

This commit changes the bqueue logic to have 3 linked lists:
1. An enquing list, which is used only by the (single) enquing thread,
   and thus needs no locks.
2. A shared list, with an associated lock.
3. A dequing list, which is used only by the (single) dequing thread,
   and thus needs no locks.

The entire enquing list can be moved to the shared list in constant
time, and the entire shared list can be moved to the dequing list in
constant time.  These operations only happen when the `fill_fraction` is
reached, or on an explicit flush request.  Therefore, the lock only
needs to be acquired infrequently.

The API already allows for dequing to block until an explicit flush, so
callers don't need to be changed.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Wilson <george.wilson@delphix.com>
Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes #14121

17 months agoztest: update ztest_dmu_snapshot_create_destroy()
Brian Behlendorf [Thu, 5 Jan 2023 00:05:36 +0000 (16:05 -0800)]
ztest: update ztest_dmu_snapshot_create_destroy()

ECHRNG is returned when the channel program encounters a runtime
error.  For example, this can happen when a snapshot doesn't exist.
We handle this error the same way as the existing EEXIST and ENOENT
error checks.

Additionally, improve the internal debug message to include the
error describing why a pool couldn't be opened.

Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #14351

17 months agoztest: ztest_dsl_prop_set_uint64() ENOSPC consistency
Brian Behlendorf [Thu, 5 Jan 2023 00:23:28 +0000 (16:23 -0800)]
ztest: ztest_dsl_prop_set_uint64() ENOSPC consistency

It is possible for ztest_dsl_prop_set_uint64() to fail with ENOSPC
and this needs to be handled consistently.

Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #14351

17 months agoztest: reduce `zpool split` frequency
Brian Behlendorf [Thu, 5 Jan 2023 00:04:02 +0000 (16:04 -0800)]
ztest: reduce `zpool split` frequency

There's no need to so aggressively test splitting a pool.  Reduce
the occurence of this test to once every 10 seconds.

Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #14351

17 months agoztest: update expectation for sparing a special device
Brian Behlendorf [Wed, 4 Jan 2023 23:59:50 +0000 (15:59 -0800)]
ztest: update expectation for sparing a special device

Commit c23738c70eb86a7f04f93292caef2ed977047608 modified the expected
behavior of attach to prevent hot spares from being used as special
vdev replacements.  We update ztest's expectations accordingly to
prevent it from failing when testing the updated behavior.

Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #14351

17 months agoztest fails assertion in zio_write_gang_member_ready()
Matthew Ahrens [Tue, 10 Jan 2023 00:43:45 +0000 (16:43 -0800)]
ztest fails assertion in zio_write_gang_member_ready()

Encrypted blocks can have up to 2 DVA's, as the third DVA is reserved
for the salt+IV.  However, dmu_write_policy() allows non-encrypted
blocks (e.g. DMU_OT_OBJSET) inside encrypted datasets to request and
allocate 3 DVA's, since they don't need a salt+IV (they are merely
authenicated).

However, if such a block becomes a gang block, the gang code incorrectly
limits the gang block header to 2 DVA's.  This leads to a "NDVAs
inversion", where a parent block (the gang block header) has less DVA's
than its children (the gang members), causing an assertion failure in
zio_write_gang_member_ready().

This commit addresses the problem by only restricting the gang block
header to 2 DVA's if the block is actually encrypted (and thus its gang
block members can have at most 2 DVA's).

Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes #14250
Closes #14356

17 months agolibzpool: fix ddi_strtoull to update nptr
Charles Suh [Mon, 9 Jan 2023 20:49:35 +0000 (12:49 -0800)]
libzpool: fix ddi_strtoull to update nptr

Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Charles Suh <charles.suh@gmail.com>
Closes #14360

17 months agozed: add hotplug support for spare vdevs
Ameer Hamza [Mon, 9 Jan 2023 20:43:03 +0000 (01:43 +0500)]
zed: add hotplug support for spare vdevs

This commit supports for spare vdev hotplug. The
spare vdev associated with all the pools will be
marked as "Removed" when the drive is physically
detached and will become "Available" when the
drive is reattached. Currently, the spare vdev
status does not change on the drive removal and
the same is the case with reattachment.

Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ameer Hamza <ahamza@ixsystems.com>
Closes #14295

17 months agoRemove some dead ARC code. (#14340)
Alexander Motin [Mon, 9 Jan 2023 18:45:17 +0000 (13:45 -0500)]
Remove some dead ARC code. (#14340)

Every ARC buffer holds a reference on the header. It means headers with
buffers are never evictable.  When we are evicting a header, there can
be no more buffers to free.  Just assert that.

b_evict_lock seems not protecting anything now.  Remove it.

Buffers checksum should also be freed with the last uncompressed buffer,
so it should not be there also when we are evicting the header.

Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by:   iXsystems, Inc.

17 months agolinux 6.2 compat: bio->bi_rw was renamed bio->bi_opf
Coleman Kane [Tue, 27 Dec 2022 03:59:13 +0000 (22:59 -0500)]
linux 6.2 compat: bio->bi_rw was renamed bio->bi_opf

The bi_rw member of struct bio was renamed to bi_opf in Linux 6.2.
As well, Linux's implementation of bio_set_op_attrs(...) has been
removed.

The HAVE_BIO_BI_OPF macro already appears to be defined, but the
removal of the bio_set_op_attrs(...) implementation makes the build
fall back on the locally-defined implementation, which isn't updated
for the bio->bi_opf change. This commit adds that update.

Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Coleman Kane <ckane@colemankane.org>
Closes #14324
Closes #14331

17 months agolinux 6.2 compat: get_acl() got moved to get_inode_acl() in 6.2
Coleman Kane [Tue, 27 Dec 2022 03:04:34 +0000 (22:04 -0500)]
linux 6.2 compat: get_acl() got moved to get_inode_acl() in 6.2

Linux 6.2 renamed the get_acl() operation to get_inode_acl() in
the inode_operations struct. This should fix Issue #14323.

Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Coleman Kane <ckane@colemankane.org>
Closes #14323
Closes #14331

17 months agoIntroduce ZFS_LINUX_REQUIRE_API autoconf macro
Antonio Russo [Thu, 5 Jan 2023 04:41:16 +0000 (21:41 -0700)]
Introduce ZFS_LINUX_REQUIRE_API autoconf macro

Currently, if API tests fail, we either ignore the failures, or
unconditionally halt the kernel build.  This leads to situations where
incompatibilities with existing APIs may develop, but not trip the
configure compatibility checks.

This introduces a new mechanism to require APIs for kernels above a
particular version.  While not perfect, this at least guarantees
mainline kernels do not break existing APIs without at least providing
some warning.

Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Antonio Russo <aerusso@aerusso.net>
Closes #14343

17 months agoLinux 6.1 compat: open inside tmpfile()
Antonio Russo [Sat, 31 Dec 2022 14:51:32 +0000 (07:51 -0700)]
Linux 6.1 compat: open inside tmpfile()

Linux 863f144 modified the .tmpfile interface to pass a struct file,
rather than a struct dentry, and expect the tmpfile implementation to
open inside of tmpfile().

This patch implements a configuration test that checks for this new API
and appropriately sets a HAVE_TMPFILE_DENTRY flag that tracks this old
API.  Contingent on this flag, the appropriate API is implemented.

Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Antonio Russo <aerusso@aerusso.net>
Closes #14301
Closes #14343

17 months agoZTS: close in mmapwrite.c
Antonio Russo [Fri, 6 Jan 2023 18:52:08 +0000 (11:52 -0700)]
ZTS: close in mmapwrite.c

mmapwrite is used during the ZTS to identify issues with mmap-ed files.
This helper program exercises this pathway by continuously writing to a
file.  ee6bf97c7 modified the writing threads to terminate after a set
amount of total data is written.  This change allows standard program
execution to reach the end of a writer thread without closing the file
descriptor, introducing a resource "leak."

This patch appeases resource leak analyses by close()-ing the file at
the end of the thread.

Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Antonio Russo <aerusso@aerusso.net>
Closes #14353

17 months agoZTS: limit mmapwrite file size
Antonio Russo [Thu, 5 Jan 2023 20:50:30 +0000 (13:50 -0700)]
ZTS: limit mmapwrite file size

mmapwrite spawns several threads, all of which perform writes on a file
for the purpose of testing the behavior of mmap(2)-ed files.  One
thread performs an mmap and a write to the beginning of that region,
while the others perform regular writes after lseek(2)-ing the end of
the file.

Because these regular writes are set in a while (1) loop, they will
write an unbounded amount of data to disk.  The mmap_write_001_pos test
script SIGKILLs them after 30 seconds, but on fast testbeds, this may
be enough time to exhaust the available space in the filesystem,
leading to spurious test failures.

Instead, limit the total file size by checking that the lseek return
value is no greater than 250 * 1024*1024 bytes, which is less than the
default minimum vdev size defined in includes/default.cfg .

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Antonio Russo <aerusso@aerusso.net>
Closes #14277
Closes #14345

17 months agocontrib: dracut: Do not timeout waiting for pw
Clemens Lang [Thu, 5 Jan 2023 20:07:43 +0000 (20:07 +0000)]
contrib: dracut: Do not timeout waiting for pw

systemd-ask-password has a default timeout of 90 seconds, which means
that dracut will fall back to the rescue shell 4.5 minutes after boot if
no password is entered.

This is undesirable when combined with, for example, unlocking remotely
using dracut-sshd and systemd-tty-ask-password-agent. See also
https://github.com/gsauthof/dracut-sshd#timeout and
https://bugzilla.redhat.com/show_bug.cgi?id=868421.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Clemens Lang <neverpanic@gmail.com>
Closes #14341

17 months agoIllumos #15286: do_composition() needs sign awareness
Richard Yao [Thu, 5 Jan 2023 19:16:21 +0000 (14:16 -0500)]
Illumos #15286: do_composition() needs sign awareness

Authored by: Dan McDonald <danmcd@mnx.io>
Reviewed by: Patrick Mooney <pmooney@pfmooney.com>
Reviewed by: Richard Lowe <richlowe@richlowe.net>
Approved by: Joshua M. Clulow <josh@sysmgr.org>
Ported-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Illumos-issue: https://www.illumos.org/issues/15286
Illumos-commit: https://github.com/illumos/illumos-gate/commit/f137b22e734e85642da3e56e8b94da3f5f027c73

Porting Notes:

The patch in illumos did not have much of a commit message, and did not
provide attribution to the reporter, while original patch proposed to
OpenZFS did, so I am listing the reporter (myself) and original patch
author (also myself) below while including the original commit message
with some minor corrections as part of the porting notes:

In do_composition(), we have:

size = u8_number_of_bytes[*p];
if (size <= 1 || (p + size) > oslast)
break;

There, we have type promotion from int8_t to size_t, which is unsigned.
C will sign extend the value as part of the widening before treating the
value as unsigned and the negative values we can counter are error
values from U8_ILLEGAL_CHAR and U8_OUT_OF_RANGE_CHAR, which are -1 and
-2 respectively. The unsigned versions of these under two's complement
are SIZE_MAX and SIZE_MAX-1 respectively.

The bounds check is written under the assumption that `size <= 1` does a
signed comparison. This is followed by a pointer comparison to see if
the string has the correct length, which is fine.

A little further down we have:

for (i = 0; i < size; i++)
tc[i] = *p++;

When an error condition is encountered, this will attempt to iterate at
least SIZE_MAX-1 times, which will massively overflow the buffer, which
is not fine.

The kernel will kill the loop as soon as it hits the kernel stack guard
on Linux systems built with CONFIG_VMAP_STACK=y, which should be just
about all of them. That prevents arbitrary code execution and just about
any other bad thing that a black hat attacker might attempt with
knowledge of this buffer overflow. Other systems' kernels have
mitigations for unbounded in-kernel buffer overflows that will catch
this too.

Also, the patch in illumos-gate made an effort to fix C style issues
that had been fixed in the OpenZFS/ZFSOnLinux repository. Those issues
had been mentioned in the email that I originally sent them about this
issue. One of the fixes had not been already done, so it is included.
Another to collect_a_seq()'s arguments was handled differently in
OpenZFS. For the sake of avoiding unnecessary differences, it has been
adopted. This has the interesting effect that if you correct the paths
in the illumos-gate patch to match the current OpenZFS repository, you
can reverse apply it cleanly.

Original-patch-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Reported-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Co-authored-by: Dan McDonald <danmcd@mnx.io>
Closes #14318
Closes #14342

17 months agoremoval of LegacyVersion broke ax_python_dev.m4
Matthew Ahrens [Thu, 5 Jan 2023 19:04:24 +0000 (11:04 -0800)]
removal of LegacyVersion broke ax_python_dev.m4

The 22.0 release of the python `packaging` package removed the
`LegacyVersion` trait, causing ZFS to no longer compile.

This commit replaces the sections of `ax_python_dev.m4` that rely on
`LegacyVersion` with updated implementations from the upstream
`autoconf-archive`.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes #14297

17 months agoFreeBSD: catch up to 1400077
Mateusz Guzik [Thu, 5 Jan 2023 18:56:40 +0000 (19:56 +0100)]
FreeBSD: catch up to 1400077

Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
Closes #14328

17 months agoFix shebang for helper script of deb-utils
Martin Rüegg [Thu, 29 Dec 2022 09:59:12 +0000 (10:59 +0100)]
Fix shebang for helper script of deb-utils

Shebang was missing the `!` between `#` and the actual path.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Signed-off-by: Martin Rüegg <martin.rueegg@metaworx.ch>
Closes #14339

17 months agoAdd quotation marks around `$PATH` for deb-utils
Martin Rüegg [Thu, 29 Dec 2022 09:56:39 +0000 (10:56 +0100)]
Add quotation marks around `$PATH` for deb-utils

Fix #14338, failing to build deb-utils if existing `$PATH` variable
would include a whitespace.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Signed-off-by: Martin Rüegg <martin.rueegg@metaworx.ch>
Closes #14339

17 months agoPack zrlock_t by 8 bytes
Alexander Motin [Thu, 5 Jan 2023 17:31:55 +0000 (12:31 -0500)]
Pack zrlock_t by 8 bytes

On FreeBSD this reduces this structure size from 64 to 56 bytes.
dnode_handle_t respectively reduces from 72 to 64 bytes. It sounds
like a waste to need 72 bytes to be able to relocate 808 bytes of
dnode_t, which relocation on FreeBSD is not even supported.

Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by:   iXsystems, Inc.
Closes #14317

17 months agoUpdate arc_summary and arcstat outputs
Alexander Motin [Thu, 5 Jan 2023 17:29:13 +0000 (12:29 -0500)]
Update arc_summary and arcstat outputs

Recent ARC commits added new statistic counters, such as iohits,
uncached state, etc.  Represent those.  Also some of previously
reported numbers were confusing or even made no sense.  Cleanup
and restructure existing reports.

Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by:   iXsystems, Inc.
Issue #14115
Issue #14123
Issue #14243
Closes #14320

17 months agoHide b_freeze_* under ZFS_DEBUG
Alexander Motin [Thu, 5 Jan 2023 17:15:31 +0000 (12:15 -0500)]
Hide b_freeze_* under ZFS_DEBUG

This saves 40 bytes per full ARC header, reducing it on FreeBSD from
240 to 200 bytes on production bits.

Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Closes #14315

17 months agoImplement uncached prefetch
Alexander Motin [Thu, 5 Jan 2023 00:29:54 +0000 (19:29 -0500)]
Implement uncached prefetch

Previously the primarycache property was handled only in the dbuf
layer. Since the speculative prefetcher is implemented in the ARC,
it had to be disabled for uncacheable buffers.

This change gives the ARC knowledge about uncacheable buffers
via  arc_read() and arc_write(). So when remove_reference() drops
the last reference on the ARC header, it can either immediately destroy
it, or if it is marked as prefetch, put it into a new arc_uncached state.
That state is scanned every second, evicting stale buffers that were
not demand read.

This change also tracks dbufs that were read from the beginning,
but not to the end.  It is assumed that such buffers may receive further
reads, and so they are stored in dbuf cache. If a following
reads reaches the end of the buffer, it is immediately evicted.
Otherwise it will follow regular dbuf cache eviction.  Since the dbuf
layer does not know actual file sizes, this logic is not applied to
the final buffer of a dnode.

Since uncacheable buffers should no longer stay in the ARC for long,
this patch also tries to optimize I/O by allocating ARC physical
buffers as linear to allow buffer sharing.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Wilson <george.wilson@delphix.com>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #14243

18 months agoarc_read()/arc_access() refactoring and cleanup
Alexander Motin [Thu, 22 Dec 2022 20:10:24 +0000 (15:10 -0500)]
arc_read()/arc_access() refactoring and cleanup

ARC code was many times significantly modified over the years, that
created significant amount of tangled and potentially broken code.
This should make arc_access()/arc_read() code some more readable.

 - Decouple prefetch status tracking from b_refcnt.  It made sense
originally, but became highly cryptic over the years.  Move all the
logic into arc_access().  While there, clean up and comment state
transitions in arc_access().  Some transitions were weird IMO.
 - Unify arc_access() calls to arc_read() instead of sometimes calling
it from arc_read_done().  To avoid extra state changes and checks add
one more b_refcnt for ARC_FLAG_IO_IN_PROGRESS.
 - Reimplement ARC_FLAG_WAIT in case of ARC_FLAG_IO_IN_PROGRESS with
the same callback mechanism to not falsely account them as hits. Count
those as "iohits", an intermediate between "hits" and "misses". While
there, call read callbacks in original request order, that should be
good for fairness and random speculations/allocations/aggregations.
 - Introduce additional statistic counters for prefetch, accounting
predictive vs prescient and hits vs iohits vs misses.
 - Remove hash_lock argument from functions not needing it.
 - Remove ARC_FLAG_PREDICTIVE_PREFETCH, since it should be opposite
to ARC_FLAG_PRESCIENT_PREFETCH if ARC_FLAG_PREFETCH is set.  We may
wish to add ARC_FLAG_PRESCIENT_PREFETCH to few more places.
 - Fix few false positive tests found in the process.

Reviewed-by: George Wilson <gwilson@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #14123

18 months agoFreeBSD: Fix potential boot panic with bad label
Ryan Moeller [Thu, 22 Dec 2022 19:50:09 +0000 (14:50 -0500)]
FreeBSD: Fix potential boot panic with bad label

vdev_geom_read_pool_label() can leave NULL in configs.  Check for it
and skip consistently when generating rootconf.

Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Ryan Moeller <ryan@iXsystems.com>
Closes #14291

18 months agodeadlock between spa_errlog_lock and dp_config_rwlock
Matthew Ahrens [Thu, 22 Dec 2022 19:48:49 +0000 (11:48 -0800)]
deadlock between spa_errlog_lock and dp_config_rwlock

There is a lock order inversion deadlock between `spa_errlog_lock` and
`dp_config_rwlock`:

A thread in `spa_delete_dataset_errlog()` is running from a sync task.
It is holding the `dp_config_rwlock` for writer (see
`dsl_sync_task_sync()`), and waiting for the `spa_errlog_lock`.

A thread in `dsl_pool_config_enter()` is holding the `spa_errlog_lock`
(see `spa_get_errlog_size()`) and waiting for the `dp_config_rwlock` (as
reader).

Note that this was introduced by #12812.

This commit address this by defining the lock ordering to be
dp_config_rwlock first, then spa_errlog_lock / spa_errlist_lock.
spa_get_errlog() and spa_get_errlog_size() can acquire the locks in this
order, and then process_error_block() and get_head_and_birth_txg() can
verify that the dp_config_rwlock is already held.

Additionally, a buffer overrun in `spa_get_errlog()` is corrected.  Many
code paths didn't check if `*count` got to zero, instead continuing to
overwrite past the beginning of the userspace buffer at `uaddr`.

Tested by having some errors in the pool (via `zinject -t data
/path/to/file`), one thread running `zpool iostat 0.001`, and another
thread runs `zfs destroy` (in a loop, although it hits the first time).
This reproduces the problem easily without the fix, and works with the
fix.

Reviewed-by: Mark Maybee <mark.maybee@delphix.com>
Reviewed-by: George Amanakis <gamanakis@gmail.com>
Reviewed-by: George Wilson <gwilson@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes #14239
Closes #14289

18 months agoDocumentation corrections
Brian Behlendorf [Thu, 22 Dec 2022 19:34:28 +0000 (11:34 -0800)]
Documentation corrections

- Update the link to the OpenZFS Code of Conduct.
- Remove extra "the" from contrib/initramfs/scripts/zfs

Reviewed-by: George Melikov <mail@gmelikov.ru>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #14298
Closes #14307

18 months agoRevert "zdb: zdb_ddt_leak_init() reads uninitialized memory..."
Brian Behlendorf [Wed, 21 Dec 2022 17:17:00 +0000 (09:17 -0800)]
Revert "zdb: zdb_ddt_leak_init() reads uninitialized memory..."

This reverts commit d30db519af44b905fc52b8c8ba34f6378aa03470.  With
this change applied zloop.sh fails reliably with the following ASSERT.

  zio_wait(zio_claim(NULL, zcb->zcb_spa, refcnt ? 0 : spa_min_claim_txg(
    zcb->zcb_spa), bp, NULL, NULL, ZIO_FLAG_CANFAIL)) == 0 (0x2 == 0x0)
  ASSERT at cmd/zdb/zdb.c:5452:zdb_count_block()

Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #14306

18 months agosystemd: set restart=always for zfs-zed.service
George Melikov [Tue, 20 Dec 2022 00:08:15 +0000 (03:08 +0300)]
systemd: set restart=always for zfs-zed.service

Reviewed-by: Richard Laager <rlaager@wiktel.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: George Melikov <mail@gmelikov.ru>
Co-authored-by: Attila Fülöp <attila@fueloep.org>
Closes #14294

18 months agoAdd color output to zfs diff.
Ethan Coe-Renner [Mon, 12 Dec 2022 23:30:51 +0000 (15:30 -0800)]
Add color output to zfs diff.

This adds support to color zfs diff (in the style of git diff)
conditional on the ZFS_COLOR environment variable.

Signed-off-by: Ethan Coe-Renner <coerenner1@llnl.gov>
18 months agoFreeBSD: Remove stray debug printf
Doug Rabson [Wed, 14 Dec 2022 01:35:07 +0000 (01:35 +0000)]
FreeBSD: Remove stray debug printf

Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Signed-off-by: Doug Rabson <dfr@rabson.org>
Closes #14286
Closes #14287

18 months agoAdd native-deb* targets to build native Debian packages
Umer Saleem [Wed, 14 Dec 2022 01:33:05 +0000 (06:33 +0500)]
Add native-deb* targets to build native Debian packages

In continuation of previous #13451, this commits adds native-deb*
targets for make to build native debian packages. Github workflows
are updated to build and test native Debian packages.

Native packages only build with pre-configured paths (see the
dh_auto_configure section in contrib/debian/rules.in). While
building native packages, paths should not be configured. Initial
config flags e.g. '--enable-debug' are replaced in
contrib/debian/rules.in.

Additional packages on top of existing zfs packages required to
build native packages include debhelper-compat, dh-python, dkms,
po-debconf, python3-all-dev, python3-sphinx.

Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Umer Saleem <usaleem@ixsystems.com>
Closes #14265

18 months agoZero end of embedded block buffer in dump_write_embedded()
Richard Yao [Wed, 14 Dec 2022 01:31:47 +0000 (20:31 -0500)]
Zero end of embedded block buffer in dump_write_embedded()

This fixes a kernel stack leak.

Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Tested-by: Nicholas Sherlock <n.sherlock@gmail.com>
Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Closes #13778
Closes #14255

18 months agoAllow receiver to override encryption properties in case of replication
Ameer Hamza [Wed, 14 Dec 2022 01:30:46 +0000 (06:30 +0500)]
Allow receiver to override encryption properties in case of replication

Currently, the receiver fails to override the encryption
property for the plain replicated dataset with the error:
"cannot receive incremental stream: encryption property
'encryption' cannot be set for incremental streams.". The
problem is resolved by allowing the receiver to override
the encryption property for plain replicated send.

Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ameer Hamza <ahamza@ixsystems.com>
Closes #14253
Closes #13533

18 months agoCache dbuf_hash() calculation
Richard Yao [Wed, 14 Dec 2022 01:29:21 +0000 (20:29 -0500)]
Cache dbuf_hash() calculation

We currently compute a 64-bit hash three times, which consumes 0.8% CPU
time on ARC eviction heavy workloads. Caching the 64-bit value in the
dbuf allows us to avoid that overhead.

Sponsored-By: Wasabi Technology, Inc.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Signed-off-by: Richard Yao <richard.yao@klarasystems.com>
Closes #14251

18 months agozfs list: Allow more fields in ZFS_ITER_SIMPLE mode
Allan Jude [Wed, 14 Dec 2022 01:27:54 +0000 (20:27 -0500)]
zfs list: Allow more fields in ZFS_ITER_SIMPLE mode

If the fields to be listed and sorted by are constrained to those
populated by dsl_dataset_fast_stat(), then zfs list is much faster,
as it does not need to open each objset and reads its properties.

A previous optimization by Pawel Dawidek
(0cee24064a79f9c01fc4521543c37acea538405f) took advantage
of this to make listing snapshot names sorted only by name much faster.

However, it was limited to `-o name -s name`, this work extends this
optimization to work with:
  - name
  - guid
  - createtxg
  - numclones
  - inconsistent
  - redacted
  - origin
and could be further extended to any other properties supported by
dsl_dataset_fast_stat() or similar, that do not require extra locking
or reading from disk.

This was committed before (9a9e2e343dfa2af28bf7910de77ae73aa006de62),
but was reverted due to a regression when used with an older kernel.

If the kernel does not populate zc->zc_objset_stats, we now fallback
to getting the properties via the slower interface, to avoid problems
with newer userland and older kernels.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Allan Jude <allan@klarasystems.com>
Closes #14110

18 months agoChange ZEVENT_POOL_GUID to ZEVENT_POOL to display pool names
Marcel Menzel [Wed, 14 Dec 2022 01:26:10 +0000 (02:26 +0100)]
Change ZEVENT_POOL_GUID to ZEVENT_POOL to display pool names

Outgoing mails for ZFS pool events include the pool GUID,
but not the actual pool name. Let's change this for better
readability, as it is already done in the mails for finished
pool resilvers.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by Richard Yao <richard.yao@alumni.stonybrook.edu>
Signed-off-by: Marcel Menzel <mail@mcl.gg>
Closes #14272

18 months agoAddress theoretical uninitialized variable usage in zstream
Richard Yao [Mon, 12 Dec 2022 18:40:05 +0000 (13:40 -0500)]
Address theoretical uninitialized variable usage in zstream

Coverity has long complained about the checksum being uninitialized if
an END record is processed before its BEGIN record. This should not
happen, but there was no code to check for it. I had left this unfixed
since it was a low priority issue, but then
9f4ede63d23be4f43ba8dd0ca42c6a773a8eaa8d added another instance of this.

I am making an effort to "hold the line" to keep new coverity defect
reports from going unaddressed, so I find myself forced to fix this much
earlier than I had originally planned to address it.

The solution is to maintain a counter and a flag. Then use VERIFY
statements to verify the following runtime constraints:

 * Every record either has a corresponding BEGIN record, is a BEGIN
   record or is the end of stream END record for replication streams.
 * BEGIN records cannot be nested. i.e. There must be an END record
   before another BEGIN record may be seen.

Failure to meet these constraints will cause the program to exit.

This is sufficient to ensure that the checksum is never accessed when
uninitialized.

Reported-by: Coverity (CID 1524578)
Reported-by: Coverity (CID 1524633)
Reported-by: Coverity (CID 1527295)
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Damian Szuberski <szuberskidamian@gmail.com>
Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Closes #14176

18 months agoinitramfs: Fix legacy mountpoint rootfs
Ryan Moeller [Mon, 12 Dec 2022 18:23:06 +0000 (13:23 -0500)]
initramfs: Fix legacy mountpoint rootfs

Legacy mountpoint datasets should not pass `-o zfsutil` to `mount.zfs`.
Fix the logic in `mount_fs()` to not forget we have a legacy mountpoint
when checking for an `org.zol:mountpoint` userprop.

Reviewed-by: Richard Yao <ryao@gentoo.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ryan Moeller <ryan@iXsystems.com>
Closes #14274

18 months agoSkip permission checks for extended attributes
Ameer Hamza [Mon, 12 Dec 2022 18:21:37 +0000 (23:21 +0500)]
Skip permission checks for extended attributes

zfs_zaccess_trivial() calls the generic_permission() to read
xattr attributes. This causes deadlock if called from
zpl_xattr_set_dir() context as xattr and the dent locks are
already held in this scenario. This commit skips the permissions
checks for extended attributes since the Linux VFS stack already
checks it before passing us the control.

Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Youzhong Yang <yyang@mathworks.com>
Signed-off-by: Ameer Hamza <ahamza@ixsystems.com>
Closes #14220

18 months agoRestrict visibility of per-dataset kstats inside FreeBSD jails
Allan Jude [Fri, 9 Dec 2022 19:04:29 +0000 (14:04 -0500)]
Restrict visibility of per-dataset kstats inside FreeBSD jails

When inside a jail, visibility on datasets not "jailed" to the
jail is restricted. However, it was possible to enumerate all
datasets in the pool by looking at the kstats sysctl MIB.

Only the kstats corresponding to datasets that the user has
visibility on are accessible now.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Signed-off-by: Allan Jude <allan@klarasystems.com>
Closes #14254

18 months agoLinux PPC: Fix build failures on kernels built without CONFIG_SPE
Richard Yao [Fri, 9 Dec 2022 18:51:23 +0000 (13:51 -0500)]
Linux PPC: Fix build failures on kernels built without CONFIG_SPE

We do a simple ifdef to avoid calling enable_kernel_spe()/
disable_kernel_spe() on PowerPC.

Reported-by: Rich Ercolani <Rincebrain@gmail.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de>
Tested-by: Georgy Yakovlev <gyakovlev@gentoo.org>
Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Closes #14233
Closes #14244

18 months agoBypass metaslab throttle for removal allocations
Serapheim Dimitropoulos [Fri, 9 Dec 2022 18:48:33 +0000 (10:48 -0800)]
Bypass metaslab throttle for removal allocations

Context:
We recently had a scenario where a customer with 2x10TB disks at 95+%
fragmentation and capacity, wanted to migrate their disks to a 2x20TB
setup. So they added the 2 new disks and submitted the removal of the
first 10TB disk.  The removal took a lot more than expected (order of
more than a week to 2 weeks vs a couple of days) and once it was done it
generated a huge indirect mappign table in RAM (~16GB vs expected ~1GB).

Root-Cause:
The removal code calls `metaslab_alloc_dva()` to allocate a new block
for each evacuating block in the removing device and it tries to batch
them into 16MB segments. If it can't find such a segment it tries for
8MBs, 4MBs, all the way down to 512 bytes.

In our scenario what would happen is that `metaslab_alloc_dva()` from
the removal thread pick the new devices initially but wouldn't allocate
from them because of throttling in their metaslab allocation queue's
depth (see `metaslab_group_allocatable()`) as these devices are new and
favored for most types of allocations because of their free space. So
then the removal thread would look at the old fragmented disk for
allocations and wouldn't find any contiguous space and finally retry
with a smaller allocation size until it would to the low KB range. This
caused a lot of small mappings to be generated blowing up the size of
the indirect table. It also wasted a lot of CPU while the removal was
active making everything slow.

This patch:
Make all allocations coming from the device removal thread bypass the
throttle checks. These allocations are not even counted in the metaslab
allocation queues anyway so why check them?

Side-Fix:
Allocations with METASLAB_DONT_THROTTLE in their flags would not be
accounted at the throttle queues but they'd still abide by the
throttling rules which seems wrong. This patch fixes this by checking
for that flag in `metaslab_group_allocatable()`. I did a quick check to
see where else this flag is used and it doesn't seem like this change
would cause issues.

Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Mark Maybee <mark.maybee@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Closes #14159

18 months agoDo not pass -1 to strerror() from zfs_send_cb_impl()
Richard Yao [Sun, 4 Dec 2022 22:03:14 +0000 (17:03 -0500)]
Do not pass -1 to strerror() from zfs_send_cb_impl()

`zfs_send_cb_impl()` calls `dump_filesystems()`, which calls
`dump_filesystem()`, which will return `-1` as an error when
`zfs_open()` returns `NULL`.

This will be passed to `zfs_standard_error()`, which passes it to
`zfs_standard_error_fmt()`, which passes it to `strerror()`.

To fix this, we modify zfs_open() to set `errno` whenever it returns
NULL. Most of the cases already have `errno` set (since they pass it to
`zfs_standard_error_fmt()`, which makes this easy. Then we modify
`dump_filesystem()` to pass `errno` instead of `-1`.

Reported-by: Coverity (CID-1524598)
Reviewed-by: Damian Szuberski <szuberskidamian@gmail.com>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Closes #14264

18 months agoFix dereference after null check in enqueue_range
Richard Yao [Sun, 4 Dec 2022 21:31:28 +0000 (16:31 -0500)]
Fix dereference after null check in enqueue_range

If the bp is NULL, we have a hole. However, when we build with
assertions, we will dereference bp when `blkid == DMU_SPILL_BLKID`. When
this happens on a hole, we will have a NULL pointer dereference.

Reported-by: Coverity (CID-1524670)
Reviewed-by: Damian Szuberski <szuberskidamian@gmail.com>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Closes #14264

18 months agozdb: Handle theoretical buffer overflow when printing float
Richard Yao [Sun, 4 Dec 2022 20:41:24 +0000 (15:41 -0500)]
zdb: Handle theoretical buffer overflow when printing float

CodeQL pointed out that for extreme floating point values, `sprintf()`
will overwrite a 32 character buffer. It cited 1e304 as an example,
which causes `sprintf()` to print 308 characters.

In practice, the numbers should never exceed 100, so this should not
happen. To silence the warning and also handle unexpected situations, we
change the code to use `snprintf()`.

This was missed during my audit of our use of `sprintf()`, since I did
not think to consider extreme floating point representations. It also
really should not happen, so this change is purely defensive
programming.

This was found by CodeQL's cpp/overrunning-write-with-float check.

Reviewed-by: Damian Szuberski <szuberskidamian@gmail.com>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Closes #14264

18 months agozdb: zdb_ddt_leak_init() reads uninitialized memory when birth == 0
Richard Yao [Sat, 3 Dec 2022 20:09:48 +0000 (15:09 -0500)]
zdb: zdb_ddt_leak_init() reads uninitialized memory when birth == 0

This was written by Jeff Bonick and was committed to OpenSolaris on
November 1, 2009. It appears that Jeff meant to continue the outer loop
iteration when `ddp->ddp_phys_birth == 0`, but put his check inside the
inner loop. This causes a pointer to uninitialized memory to be passed
to ddt_lookup() inside a VERIFY() statement whenever that condition is
true.

Reported-by: Coverity (CID 1524462)
Reviewed-by: Damian Szuberski <szuberskidamian@gmail.com>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Closes #14264

18 months agoztest: comparisons against errno should not assign to it
Richard Yao [Sat, 3 Dec 2022 23:08:55 +0000 (18:08 -0500)]
ztest: comparisons against errno should not assign to it

888914486e26d81c8dbfd7a9d8091c05f9fc98ba introduced this regression.

I used cscope to verify that there are no other instances of this in the
codebase. This is the one of the few bugs that are extremely easy to
identify using cscope.

Reviewed-by: Damian Szuberski <szuberskidamian@gmail.com>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Closes #14264

18 months agoFix potential buffer overflow in zpool command
Richard Yao [Sun, 4 Dec 2022 02:43:33 +0000 (21:43 -0500)]
Fix potential buffer overflow in zpool command

The ZPOOL_SCRIPTS_PATH environment variable can be passed here. This
allows for arbitrarily long strings to be passed to sprintf(), which can
overflow the buffer.

I missed this in my earlier audit of the codebase. CodeQL's
cpp/unbounded-write check caught this.

Reviewed-by: Damian Szuberski <szuberskidamian@gmail.com>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Closes #14264

18 months agozdb: Fix big parameter passed by value
Richard Yao [Sun, 4 Dec 2022 22:58:46 +0000 (17:58 -0500)]
zdb: Fix big parameter passed by value

This is not in performance critical code, but static analyzers will
complain about it, so lets switch to pass by pointer here.

Reported-by: Coverity (CID-1524384)
Reviewed-by: Damian Szuberski <szuberskidamian@gmail.com>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Closes #14263

18 months agoLinux: Cleanup unnecessary NULL check in __vdev_disk_physio()
Richard Yao [Sun, 4 Dec 2022 21:46:44 +0000 (16:46 -0500)]
Linux: Cleanup unnecessary NULL check in __vdev_disk_physio()

zio is never NULL when given to the vdev. Coverity complained saying:

"Either the check against null is unnecessary, or there may be a null
pointer dereference."

Reported-by: Coverity (CID-1466174)
Reviewed-by: Damian Szuberski <szuberskidamian@gmail.com>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Closes #14263

18 months agoRemove duplicate statically allocated variable
Richard Yao [Sun, 4 Dec 2022 03:09:58 +0000 (22:09 -0500)]
Remove duplicate statically allocated variable

dsl_dataset_snapshot_sync_impl() declares `static zil_header_t zero_zil
__maybe_unused;`, but this is also declared globally. This wastes
memory.

CodeQL's cpp/local-variable-hides-global-variable check caught this.

Reviewed-by: Damian Szuberski <szuberskidamian@gmail.com>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Closes #14263

18 months agoCleanup: zhack should not declare function prototypes in main()
Richard Yao [Sun, 4 Dec 2022 02:09:42 +0000 (21:09 -0500)]
Cleanup: zhack should not declare function prototypes in main()

Instead, it should include the proper header.

CodeQL caught this.

Reviewed-by: Damian Szuberski <szuberskidamian@gmail.com>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Closes #14263

18 months agoZTS: Add missing tests to Makefile.am
Brian Behlendorf [Thu, 8 Dec 2022 01:26:33 +0000 (17:26 -0800)]
ZTS: Add missing tests to Makefile.am

The send-c_zstream_recompress.ksh test case was being skipped
because it was not added to the Makefile.am, and was thus left
out of the package.

As for the renameat2 tests these were being skipped because
when the patch was rebased it was not updated to use the new
Makefile layout for the tests directory.  Correct this.

Add missing pre/post sections to sanity.run so the pyzfs tests
will run.

Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Damian Szuberski <szuberskidamian@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #14266

18 months agoMicro-optimize fletcher4 calculations
Richard Yao [Mon, 5 Dec 2022 19:00:34 +0000 (14:00 -0500)]
Micro-optimize fletcher4 calculations

When processing abds, we execute 1 `kfpu_begin()`/`kfpu_end()` pair on
every page in the abd. This is wasteful and slows down checksum
performance versus what the benchmark claimed. We correct this by moving
those calls to the init and fini functions.

Also, we always check the buffer length against 0 before calling the
non-scalar checksum functions. This means that we do not need to execute
the loop condition for the first loop iteration. That allows us to
micro-optimize the checksum calculations by switching to do-while loops.

Note that we do not apply that micro-optimization to the scalar
implementation because there is no check in
`fletcher_4_incremental_native()`/`fletcher_4_incremental_byteswap()`
against 0 sized buffers being passed.

Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Closes #14247

18 months agoFreeBSD: zfs_register_callbacks() must implement error check correctly
Richard Yao [Mon, 5 Dec 2022 18:16:50 +0000 (13:16 -0500)]
FreeBSD: zfs_register_callbacks() must implement error check correctly

I read the following article and noticed a couple of ZFS bugs mentioned:

https://pvs-studio.com/en/blog/posts/cpp/0377/

I decided to search for them in the modern OpenZFS codebase and then
found one that matched the description of the first one:

V593 Consider reviewing the expression of the 'A = B != C' kind. The
expression is calculated as following: 'A = (B != C)'. zfs_vfsops.c 498

The consequence of this is that the error value is replaced with `1`
when there is an error. When there is no error, 0 is correctly passed.
This is a very minor issue that is unlikely to cause any real problems.

The incorrect error code would either be returned to the mount command
on a failure or any of `zfs receive`, `zfs recv`, `zfs rollback` or `zfs
upgrade`.

The second one has already been fixed.

Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Damian Szuberski <szuberskidamian@gmail.com>
Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Closes #14261

19 months agozio can deadlock during device removal
George Wilson [Sat, 3 Dec 2022 01:46:29 +0000 (19:46 -0600)]
zio can deadlock during device removal

When doing a device removal on a pool with gang blocks, the zio pipeline
can deadlock when trying to free blocks from a device which is being
removed with a stack similar to this:

 0xffff8ab9a13a1740 UNINTERRUPTIBLE       4
                   __schedule+0x2e5
                   __schedule+0x2e5
                   schedule+0x33
                   schedule_preempt_disabled+0xe
                   __mutex_lock.isra.12+0x2a7
                   __mutex_lock.isra.12+0x2a7
                   __mutex_lock_slowpath+0x13
                   mutex_lock+0x2c
                   free_from_removing_vdev+0x61
                   metaslab_free_impl+0xd6
                   metaslab_free_dva+0x5e
                   metaslab_free+0x196
                   zio_free_sync+0xe4
                   zio_free_gang+0x38
                   zio_gang_tree_issue+0x42
                   zio_gang_tree_issue+0xa2
                   zio_gang_issue+0x6d
                   zio_execute+0x94
                   zio_execute+0x94
                   taskq_thread+0x23b
                   kthread+0x120
                   ret_from_fork+0x1f

Since there are gang blocks we have to read the gang members as part of
the free. This can be seen with a zio dependency tree that looks like
this:

sdb> echo 0xffff900c24f8a700 | zio -rc | zio
ADDRESS                       TYPE  STAGE            WAITER
0xffff900c24f8a700            NULL  CHECKSUM_VERIFY  0xffff900ddfd31740
0xffff900c24f8c920            FREE  GANG_ASSEMBLE    -
0xffff900d93d435a0            READ  DONE

In the illustration above we are processing frees but because of gang
block we have to read the constituents blocks. Once we finish the READ
in the zio pipeline we will execute the parent. In this case the parent
is a FREE but the zio taskq is a READ and we continue to process the
pipeline leading to the stack above. In the stack above, we are blocked
waiting for the svr_lock so as a result a READ interrupt taskq thread
is now consumed. Eventually, all of the READ taskq threads end up
blocked and we're unable to complete any read requests.

In zio_notify_parent there is an optimization to continue to use
the taskq thread to exectue the parent's pipeline. To resolve the
deadlock above, we only allow this optimization if the parent's
zio type matches the child which just completed.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Signed-off-by: George Wilson <gwilson@delphix.com>
External-issue: DLPX-80130
Closes #14236