]> CyberLeo.Net >> Repos - FreeBSD/FreeBSD.git/log
FreeBSD/FreeBSD.git
15 months agoImplementation of block cloning for ZFS
Pawel Jakub Dawidek [Fri, 10 Mar 2023 19:59:53 +0000 (20:59 +0100)]
Implementation of block cloning for ZFS

Block Cloning allows to manually clone a file (or a subset of its
blocks) into another (or the same) file by just creating additional
references to the data blocks without copying the data itself.
Those references are kept in the Block Reference Tables (BRTs).

The whole design of block cloning is documented in module/zfs/brt.c.

Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Christian Schwarz <christian.schwarz@nutanix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Rich Ercolani <rincebrain@gmail.com>
Signed-off-by: Pawel Jakub Dawidek <pawel@dawidek.net>
Closes #13392

15 months agoFix incremental receive silently failing for recursive sends
Paul Dagnelie [Fri, 10 Mar 2023 17:52:44 +0000 (09:52 -0800)]
Fix incremental receive silently failing for recursive sends

The problem occurs because dmu_recv_begin pulls in the payload and
next header from the input stream in order to use the contents of
the begin record's nvlist. However, the change to do that before the
other checks in dmu_recv_begin occur caused a regression where an
empty send stream in a recursive send could have its END record
consumed by this, which broke the logic of recv_skip. A test is
also included to protect against this case in the future.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Paul Dagnelie <pcd@delphix.com>
Closes #12661
Closes #14568

15 months agoWorkaround for Linux PowerPC GPL-only cpu_has_feature()
Low-power [Fri, 10 Mar 2023 17:35:00 +0000 (01:35 +0800)]
Workaround for Linux PowerPC GPL-only cpu_has_feature()

Linux since 4.7 makes interface 'cpu_has_feature' to use jump labels on
powerpc if CONFIG_JUMP_LABEL_FEATURE_CHECKS is enabled, in this case
however the inline function references GPL-only symbol
'cpu_feature_keys'.

ZFS currently uses 'cpu_has_feature' either directly or indirectly from
several places; while it is unknown how this issue didn't break ZFS on
64-bit little-endian powerpc, it is known to break ZFS with many Linux
versions on both 32-bit and 64-bit big-endian powerpc.

Until this issue is fixed in Linux, we have to workaround it by
overriding affected inline functions without depending on
'cpu_feature_keys'.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: WHR <msl0000023508@gmail.com>
Closes #14590

15 months agotxg_sync should handle write errors in ZIL
Richard Yao [Fri, 10 Mar 2023 17:34:00 +0000 (12:34 -0500)]
txg_sync should handle write errors in ZIL

The txg_sync thread will see certain buffers in a DR_IN_DMU_SYNC state
when ZIL is writing them out. Then it waits until the state changes, but
has an assertion to check that they were not DR_NOT_OVERRIDDEN. If the
data write failed with an error, ZIL will put it into the
DR_NOT_OVERRIDDEN state. It looks like the code will handle that state
without an issue, so we can just delete the assertion.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Yao <richard.yao@klarasystems.com>
Sponsored-By: Wasabi Technology, Inc.
Closes #14283

16 months agoSuppress clang static analyzer warning in vdev_stat_update()
Richard Yao [Mon, 6 Mar 2023 18:29:36 +0000 (13:29 -0500)]
Suppress clang static analyzer warning in vdev_stat_update()

63652e154643cfe596fe077c13de0e7be34dd863 added unnecessary branches in
`vdev_stat_update()` to suppress an ASAN false positive the breaks
ztest. This had the downside of causing false positive reports in both
Coverity and Clang's static analyzer. vd is never NULL, so we add a
preprocessor check to only apply the workaround when compiling with ASAN
support.

Reported-by: Coverity (CID-1524583)
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Closes #14575

16 months agoRefactor loop in dump_histogram()
Richard Yao [Mon, 6 Mar 2023 14:48:42 +0000 (09:48 -0500)]
Refactor loop in dump_histogram()

The current loop triggers a complaint that we are using an array offset
prior to a range check from cpp/offset-use-before-range-check when we
are actually calculating maximum and minimum values. I was about to file
a false positive report with CodeQL, but after looking at how the code
is structured, I really cannot blame CodeQL for mistaking this for a
range check.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Closes #14575

16 months agoSuppress static analyzer warning in dmu_objset_create_impl_dnstats()
Richard Yao [Mon, 6 Mar 2023 16:15:37 +0000 (11:15 -0500)]
Suppress static analyzer warning in dmu_objset_create_impl_dnstats()

ae7e7006500ca37c471dd625cd5cbfc590b49885 added an assertion to suppress
a complaint from Clang's static analyzer. Unfortunately, it missed
another way for Clang to complain about this function. This adds another
assertion to handle that.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Closes #14575

16 months agoLinux: Fix octal detection in define_ddi_strtox()
Richard Yao [Sun, 5 Mar 2023 07:01:58 +0000 (02:01 -0500)]
Linux: Fix octal detection in define_ddi_strtox()

Clang Tidy reported this as a misc-redundant-expression because writing
`8` instead of `'8'` meant that the condition could never be true.

The only place where we have a chance of this being a bug would be in
nvlist_lookup_nvpair_ei_sep(). I am not sure if we ever pass an octal to
that, but if we ever do, it should work properly now instead of failing.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Closes #14575

16 months agoLinux: Suppress clang static analyzer warning in zfs_remove()
Richard Yao [Sun, 5 Mar 2023 06:32:03 +0000 (01:32 -0500)]
Linux: Suppress clang static analyzer warning in zfs_remove()

Clang's static analyzer points out that if we fail to find an extended
attribute directory, but somehow find it when calculating delete_now and
delete_now is true, we will have a NULL pointer dereference when we try
to unlink the extended attribute directory.

I am not sure if this is possible, but if it is, I do not see a sane way
of handling this other than rolling back the transaction and retrying.
For now, let us do an VERIFY_IMPLY(). If this trips, it will stop the
transaction from committing, which will prevent an attribute directory
leak.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Closes #14575

16 months agoLinux: Silence static analyzer warning in crypto_create_ctx_template()
Richard Yao [Sun, 5 Mar 2023 04:48:06 +0000 (23:48 -0500)]
Linux: Silence static analyzer warning in crypto_create_ctx_template()

A CodeChecker report from Clang's CTU analysis indicated that we were
assigning uninitialized values in crypto_create_ctx_template() when we
call it from zio_crypt_key_init(). This occurs because the ->cm_param
and ->cm_param_len fields are uninitialized. Thankfully, the
uninitialized values are only used in the skein via
KCF_PROV_CREATE_CTX_TEMPLATE() -> skein_create_ctx_template() ->
skein_mac_ctx_build() -> skein_get_digest_bitlen(), but that should not
be called from here. We fix this to avoid a possible trap should this
code change in the future.

The FreeBSD version of zio_crypt_key_init() is unaffected.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Closes #14575

16 months agoSuppress Clang Static Analyzer warning in bpobj_enqueue()
Richard Yao [Sun, 5 Mar 2023 03:31:33 +0000 (22:31 -0500)]
Suppress Clang Static Analyzer warning in bpobj_enqueue()

scan-build does not do cross translation unit analysis to realize that
`dmu_buf_hold()` will always set `bpo->bpo_cached_dbuf` to a non-NULL
pointer, so we add an assertion to make it realize this.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Closes #14575

16 months agoSuppress Clang Static Analyzer warning in dsl_dir_rename_sync()
Richard Yao [Sun, 5 Mar 2023 03:23:48 +0000 (22:23 -0500)]
Suppress Clang Static Analyzer warning in dsl_dir_rename_sync()

Clang's static analyzer reports that if we try to rename a root dataset
in `dsl_dir_rename_sync()`, we will have a NULL pointer passed to
strlcpy(). This is impossible because `dsl_dir_rename_check()` will
prevent us from doing this. We add an assertion to silence this warning.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Closes #14575

16 months agoCleanup: Remove constant comparisons reported by CodeQL
Richard Yao [Sat, 4 Mar 2023 22:57:01 +0000 (17:57 -0500)]
Cleanup: Remove constant comparisons reported by CodeQL

CodeQL's cpp/constant-comparison query from its security-and-extended
query set reported 4 instances where we have comparions that always
evaluate the same way.

In `draid_config_by_type()`, we have an early `if (nparity == 0)` check
that returns `EINVAL`, making a later `if (nparity == 0 || nparity >
VDEV_DRAID_MAXPARITY)` partially redundant. The later check prints an
error message when parity is 0, but the early check does not. This is
not useful feedback, so we move the later check to the place where the
early check runs to replace the early check.

In `perform_thread_merge()`, we return when `num_threads == 0`. After
that block, we do `if (num_threads > 0) {`, which will always be true.
We remove the `if` statement.

In `sa_modify_attrs()`, we have a loop condition that is `k != 2`, but
at the end of the loop, we have `if (k == 0 && hdl->sa_spill)` followed
by an else that does a break. The result is that k != 2 will never be
evaluated when it is false. We drop the comparison.

In `zap_leaf_array_read()`, we have a for loop condition that is `i <
ZAP_LEAF_ARRAY_BYTES && len > 0`. However, that loop itself is in a loop
that is `while (len > 0)` and while the value of len is decremented
inside the loop, when `len == 0`, it will return, such that `len > 0`
inside the loop condition will always be true. We drop that part of the
condition.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Closes #14575

16 months agoLinux cleanup: zvol_discard() should only call blk_queue_io_stat() once
Richard Yao [Sat, 4 Mar 2023 21:48:29 +0000 (16:48 -0500)]
Linux cleanup: zvol_discard() should only call blk_queue_io_stat() once

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Closes #14575

16 months agoSuppress Clang Static Analyzer warning in dbuf_dnode_findbp()
Richard Yao [Sat, 4 Mar 2023 21:11:49 +0000 (16:11 -0500)]
Suppress Clang Static Analyzer warning in dbuf_dnode_findbp()

Clang's static analyzer reports that if a `blkid == DMU_SPILL_BLKID` is
passed, then we can have a NULL pointer dereference when either
->dn_have_spill or `DNODE_FLAG_SPILL_BLKPTR` is not set. This should not
happen. We add an `ASSERT()` to suppress reports about NULL pointer
dereferences.

Originally, I wanted to use one or two IMPLY statements on
pre-conditions before the call to `dbuf_findbp()`, but Clang's static
analyzer did not understand it.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Closes #14575

16 months agoSuppress Clang Static Analyzer warning in vdev_split()
Richard Yao [Sat, 4 Mar 2023 20:38:06 +0000 (15:38 -0500)]
Suppress Clang Static Analyzer warning in vdev_split()

Clang's static analyzer pointed out that we can have a NULL pointer
dereference if we ever attempt to split a vdev that has only 1 child. If
that happens, we are left with zero children, but then try to access a
non-existent child. Calling vdev_split() on a vdev with only 1 child
should be impossible due to how the code is structured. If this ever
happens, it would be best to stop execution immediately even in a
production environment to allow for the best possible chance of recovery
by an expert, so we use `VERIFY3U()` instead of `ASSERT3U()`.

Unfortunately, while that defensive assertion will prevent execution
from ever reaching the NULL pointer dereference, Clang's static analyzer
does not realize that, so we add an `ASSERT()` to inform it of this.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Closes #14575

16 months agoSuppress Clang Static Analyzer warning about SNPRINTF_BLKPTR()
Richard Yao [Sat, 4 Mar 2023 02:15:28 +0000 (21:15 -0500)]
Suppress Clang Static Analyzer warning about SNPRINTF_BLKPTR()

Clang's static analyzer pointed out that if we can pass a -1 array index
to copyname[copies] if there are no valid DVAs. This is an absurd
situation, but it suggests that we are missing an assertion, so we add
it.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Closes #14575

16 months agoSuppress Clang Static Analyzer false positive in the AVL tree code.
Richard Yao [Sun, 12 Feb 2023 21:38:27 +0000 (16:38 -0500)]
Suppress Clang Static Analyzer false positive in the AVL tree code.

This has been filed as llvm/llvm-project#60694. Switching from a copy
through a C pointer dereference to an explicit memcpy() is a workaround
that prevents a false positive.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Closes #14575

16 months agoSuppress Clang Static Analyzer defect report in abd_get_size()
Richard Yao [Tue, 7 Feb 2023 23:47:40 +0000 (18:47 -0500)]
Suppress Clang Static Analyzer defect report in abd_get_size()

Clang's static analyzer reports a possible NULL pointer dereference in
abd_get_size() when called from vdev_draid_map_alloc_write() called from
vdev_draid_map_alloc_row() and vdc->vdc_nparity == 0. This should be
impossible, so we add an assertion to silence the defect report.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Closes #14575

16 months agoFix TOCTOU race in zpool_do_labelclear()
Richard Yao [Sun, 4 Dec 2022 22:42:43 +0000 (17:42 -0500)]
Fix TOCTOU race in zpool_do_labelclear()

Coverity reported a TOCTOU race in `zpool_do_labelclear()`. This is not
believed to be a real security issue, but fixing it reduces the number
of syscalls we do and will prevent other static analyzers from
complaining about this.

The code is expected to be equivalent. However, under rare
circumstances, such as ELOOP, ENAMETOOLONG, ENOMEM, ENOTDIR and
EOVERFLOW, we will display the error message that we currently display
for the `open()` syscall rather than the one that we currently display
for the `stat()` syscall. This is considered to be an improvement.

Reported-by: Coverity (CID-1524188)
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Closes #14575

16 months agoMore adaptive ARC eviction
Alexander Motin [Wed, 8 Mar 2023 19:17:23 +0000 (14:17 -0500)]
More adaptive ARC eviction

Traditionally ARC adaptation was limited to MRU/MFU distribution.  But
for years people with metadata-centric workload demanded mechanisms to
also manage data/metadata distribution, that in original ZFS was just
a FIFO.  As result ZFS effectively got separate states for data and
metadata, minimum and maximum metadata limits etc, but it all required
manual tuning, was not adaptive and in its heart remained a bad FIFO.

This change removes most of existing eviction logic, rewriting it from
scratch.  This makes MRU/MFU adaptation individual for data and meta-
data, same as the distribution between data and metadata themselves.
Since most of required states separation was already done, it only
required to make arcs_size state field specific per data/metadata.

The adaptation logic is still based on previous concept of ghost hits,
just now it balances ARC capacity between 4 states: MRU data, MRU
metadata, MFU data and MFU metadata.  To simplify arc_c changes instead
of arc_p measured in bytes, this code uses 3 variable arc_meta, arc_pd
and arc_pm, representing ARC balance between metadata and data, MRU and
MFU for data, and MRU and MFU for metadata respectively as 32-bit fixed
point fractions.  Since we care about the math result only when need to
evict, this moves all the logic from arc_adapt() to arc_evict(), that
reduces per-block overhead, since per-block operations are limited to
stats collection, now moved from arc_adapt() to arc_access() and using
cheaper wmsums.  This also allows to remove ugly ARC_HDR_DO_ADAPT flag
from many places.

This change also removes number of metadata specific tunables, part of
which were actually not functioning correctly, since not all metadata
are equal and some (like L2ARC headers) are not really evictable.
Instead it introduced single opaque knob zfs_arc_meta_balance, tuning
ARC's reaction on ghost hits, allowing administrator give more or less
preference to metadata without setting strict limits.

Some of old code parts like arc_evict_meta() are just removed, because
since introduction of ABD ARC they really make no sense: only headers
referenced by small number of buffers are not evictable, and they are
really not evictable no matter what this code do.  Instead just call
arc_prune_async() if too much metadata appear not evictable.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Allan Jude <allan@klarasystems.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #14359

16 months agoICP: AES-GCM: Unify gcm_init_ctx() and gmac_init_ctx()
Attila Fülöp [Wed, 8 Mar 2023 19:12:15 +0000 (20:12 +0100)]
ICP: AES-GCM: Unify gcm_init_ctx() and gmac_init_ctx()

gmac_init_ctx() duplicates most of the code in gcm_int_ctx() while
it just needs to set its own IV length and AAD tag length.

Introduce gcm_init_ctx_impl() which handles the GCM and GMAC
differences while reusing the duplicated code.

While here, fix a flaw where the AVX implementation would accept a
context using a byte swapped key schedule which it could not
handle. Also constify the IV and AAD pointers passed to
gcm_init{,_avx}().

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Rob Norris <robn@despairlabs.com>
Signed-off-by: Attila Fülöp <attila@fueloep.org>
Closes #14529

16 months agoDo not hold spa_config in ZIL while blocked on IO
Richard Yao [Wed, 8 Mar 2023 00:12:28 +0000 (19:12 -0500)]
Do not hold spa_config in ZIL while blocked on IO

Otherwise, we can get a deadlock that looks like this:

1. fsync() grabs spa_config_enter(zilog->zl_spa, SCL_STATE, lwb,
RW_READER) as part of zil_lwb_write_issue() . It then blocks on the
txg_sync when a flush fails from a drive power cycling.

2. The txg_sync then blocks on the pool suspending due to the loss of
too many disks.

3. zpool clear then blocks on spa_config_enter(spa, SCL_STATE |
SCL_L2ARC | SCL_ZIO, spa, RW_WRITER)  because it is a writer.

The disks cannot be brought online due to fsync() holding that lock and
the user gets upset since fsync() is uninterruptibly blocked inside the
kernel.

We need to grab the lock for vdev_lookup_top(), but we do not need to
hold it while there is outstanding IO.

This fixes a regression introduced by
1ce23dcaff6c3d777cb0d9a4a2cf02b43f777d78.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Yao <richard.yao@klarasystems.com>
Sponsored-By: Wasabi Technology, Inc.
Closes #14519

16 months agoFix build for Linux/powerpc without CONFIG_ALTIVEC or CONFIG_VSX
Low-power [Tue, 7 Mar 2023 22:06:52 +0000 (06:06 +0800)]
Fix build for Linux/powerpc without CONFIG_ALTIVEC or CONFIG_VSX

This fixes building ZFS for Linux 4.7+ powerpc* architecture, where
Linux was configured without CONFIG_ALTIVEC or CONFIG_VSX.

Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: WHR <msl0000023508@gmail.com>
Closes #14591

16 months agoBetter handling for future crypto parameters
Rob N [Tue, 7 Mar 2023 22:05:14 +0000 (09:05 +1100)]
Better handling for future crypto parameters

The intent is that this is like ENOTSUP, but specifically for when
something can't be done because we have no support for the requested
crypto parameters; eg unlocking a dataset or receiving a stream
encrypted with a suite we don't support.

Its not intended to be recoverable without upgrading ZFS itself.
If the request could be made to work by enabling a feature or modifying
some other configuration item, then some other code should be used.

load-key: In the future we might have more crypto suites (ie new values
for the `encryption` property. Right now trying to load a key on such
a future crypto suite will look up suite parameters off the end of the
crypto table, resulting in misbehaviour and/or crashes (or, with debug
enabled, trip the assertion in `zio_crypt_key_unwrap`).

Instead, lets check the value we got from the dataset, and if we can't
handle it, abort early.

recv: When receiving a raw stream encrypted with an unknown crypto
suite, `zfs recv` would report a generic `invalid backup stream`
(EINVAL). While technically correct, its not super helpful, so lets
ship a more specific error code and message.

Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Signed-off-by: Rob Norris <robn@despairlabs.com>
Closes #14577

16 months agoFix a typo in ac2038a
George Amanakis [Tue, 7 Mar 2023 21:50:44 +0000 (22:50 +0100)]
Fix a typo in ac2038a

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Signed-off-by: George Amanakis <gamanakis@gmail.com>
Closes #14585
Closes #14592

16 months ago[FreeBSD] fix false assert in cache_vop_rmdir when replaying ZIL
Andriy Gapon [Tue, 7 Mar 2023 21:48:43 +0000 (23:48 +0200)]
[FreeBSD] fix false assert in cache_vop_rmdir when replaying ZIL

The assert is enabled when DEBUG_VFS_LOCKS kernel option is set.
The exact panic is:
    panic: condition seqc_in_modify(_vp->v_seqc) not met
It happens because seqc protocol is not followed for ZIL replay.

But we actually do not need to make any namecache calls at that stage,
because the namecache use is not enabled until after the replay is
completed.

Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Andriy Gapon <avg@FreeBSD.org>
Closes #14566

16 months agospl: Add cmn_err_once() to log a message only on the first call
Attila Fülöp [Tue, 7 Mar 2023 21:44:11 +0000 (22:44 +0100)]
spl: Add cmn_err_once() to log a message only on the first call

Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Brian Atkinson <batkinson@lanl.gov>
Signed-off-by: Attila Fülöp <attila@fueloep.org>
Closes #14567

16 months agoinitramfs: fix zpool get argument order
q66 [Tue, 7 Mar 2023 01:07:01 +0000 (02:07 +0100)]
initramfs: fix zpool get argument order

When using the zfs initramfs scripts on my system, I get various
errors at initramfs stage, such as:

cannot open '-o': name must begin with a letter

My zfs binaries are compiled with musl libc, which may be why
this happens. In any case, fix the argument order to make the
zpool binary happy, and to match its --help output.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Signed-off-by: Daniel Kolesa <daniel@octaforge.org>
Closes #14572

16 months agoFix detection of IBM Power8 machines (ISA 2.07)
Tino Reichardt [Tue, 7 Mar 2023 01:01:01 +0000 (02:01 +0100)]
Fix detection of IBM Power8 machines (ISA 2.07)

An IBM POWER7 system with Power ISA 2.06 tried to execute
zfs_sha256_power8() - which should only be run on ISA 2.07
machines.

The detection is implemented via the zfs_isa207_available() call,
but this check was not used.

This pull request will fix this.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Signed-off-by: Tino Reichardt <milky-zfs@mcmilk.de>
Signed-off-by: Low-power <msl0000023508@gmail.com>
Closes #14576

16 months ago[FreeBSD] zfs_znode_alloc: lock the vnode earlier
Andriy Gapon [Tue, 7 Mar 2023 00:30:54 +0000 (02:30 +0200)]
[FreeBSD] zfs_znode_alloc: lock the vnode earlier

This is needed because of a possible error path where zfs_vnode_forget()
is called.  That function calls vgone() and vput(), the former requires
the vnode to be exclusively locked and the latter expects it to be
locked.

It should be safe to lock the vnode as early as possible because it is
not yet visible, so there is no interaction with other locks.

While here, remove a tautological assignment to 'vp'.

Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Signed-off-by: Andriy Gapon <avg@FreeBSD.org>
Closes #14565

16 months agoOptimize the is_l2cacheable functions
George Amanakis [Tue, 7 Mar 2023 00:13:05 +0000 (01:13 +0100)]
Optimize the is_l2cacheable functions

by placing the most common use case (no special vdevs) first and avoid
allocating new variables.

Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: George Amanakis <gamanakis@gmail.com>
Closes #14494
Closes #14563

16 months agoFix memory leak in ztest
Richard Yao [Mon, 6 Mar 2023 23:30:29 +0000 (18:30 -0500)]
Fix memory leak in ztest

This is tripping LeakSanitizer, which causes zloop test failures on pull
requests.

Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Closes #14583

16 months agoAdd missing increment to dsl_deadlist_move_bpobj()
Richard Yao [Sat, 4 Mar 2023 23:42:01 +0000 (18:42 -0500)]
Add missing increment to dsl_deadlist_move_bpobj()

dc5c8006f684b1df3f2d4b6b8c121447d2db0017 was recently merged to prefetch
up to 128 deadlists. Unfortunately, a loop was missing an increment,
such that it will prefetch all deadlists. The performance properties of
that patch probably should be re-evaluated.

This was caught by CodeQL's cpp/constant-comparison check in an
experimental branch where I am testing the security-and-extended
queries. It complained about the `i < 128` part of the loop condition
always evaluating to the same thing. The standard CodeQL configuration
we use missed this because it does not include that check.

Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Closes #14573

16 months agoSHA2Init() should use signed assertions when checking an enum
Richard Yao [Sat, 4 Mar 2023 20:53:58 +0000 (15:53 -0500)]
SHA2Init() should use signed assertions when checking an enum

The recent 4c5fec01a48acc184614ab8735e6954961990235 commit caused
Coverity to report that ASSERT3U(algotype, >=, SHA256_MECH_INFO_TYPE);
is always true. That is because the signed algotype and signed
SHA256_MECH_INFO_TYPE values were cast to unsigned types. To fix this,
we switch the assertions to use ASSERT3S(), which retains the signedness
of the original values for the comparison.

Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Reported-by: Coverity (CID-1535300)
Closes #14573

16 months agoRestore ASMABI and other Unify work
Jorgen Lundman [Mon, 6 Mar 2023 23:24:05 +0000 (08:24 +0900)]
Restore ASMABI and other Unify work

Make sure all SHA2 transform function has wrappers

For ASMABI to work, it is required the calling convention
is consistent.

Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de>
Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Signed-off-by: Joergen Lundman <lundman@lundman.net>
Closes #14569

16 months agoUse SECTION_STATIC macro for sha2 x86_64 assembly
Tino Reichardt [Wed, 1 Mar 2023 19:44:49 +0000 (20:44 +0100)]
Use SECTION_STATIC macro for sha2 x86_64 assembly

- instead of ".section .rodata" we should use SECTION_STATIC

Tested-by: Rich Ercolani <rincebrain@gmail.com>
Tested-by: Sebastian Gottschall <s.gottschall@dd-wrt.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tino Reichardt <milky-zfs@mcmilk.de>
Closes #13741

16 months agoUpdate BLAKE3 for using the new impl handling
Tino Reichardt [Mon, 27 Feb 2023 15:14:37 +0000 (16:14 +0100)]
Update BLAKE3 for using the new impl handling

This commit changes the BLAKE3 implementation handling and
also the calls to it from the ztest command.

Tested-by: Rich Ercolani <rincebrain@gmail.com>
Tested-by: Sebastian Gottschall <s.gottschall@dd-wrt.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tino Reichardt <milky-zfs@mcmilk.de>
Closes #13741

16 months agoAdd generic implementation handling and SHA2 impl
Tino Reichardt [Wed, 1 Mar 2023 08:40:28 +0000 (09:40 +0100)]
Add generic implementation handling and SHA2 impl

The skeleton file module/icp/include/generic_impl.c can be used for
iterating over different implementations of algorithms.

It is used by SHA256, SHA512 and BLAKE3 currently.

The Solaris SHA2 implementation got replaced with a version which is
based on public domain code of cppcrypto v0.10.

These assembly files are taken from current openssl master:
- sha256-x86_64.S: x64, SSSE3, AVX, AVX2, SHA-NI (x86_64)
- sha512-x86_64.S: x64, AVX, AVX2 (x86_64)
- sha256-armv7.S: ARMv7, NEON, ARMv8-CE (arm)
- sha512-armv7.S: ARMv7, NEON (arm)
- sha256-armv8.S: ARMv7, NEON, ARMv8-CE (aarch64)
- sha512-armv8.S: ARMv7, ARMv8-CE (aarch64)
- sha256-ppc.S: Generic PPC64 LE/BE (ppc64)
- sha512-ppc.S: Generic PPC64 LE/BE (ppc64)
- sha256-p8.S: Power8 ISA Version 2.07 LE/BE (ppc64)
- sha512-p8.S: Power8 ISA Version 2.07 LE/BE (ppc64)

Tested-by: Rich Ercolani <rincebrain@gmail.com>
Tested-by: Sebastian Gottschall <s.gottschall@dd-wrt.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tino Reichardt <milky-zfs@mcmilk.de>
Closes #13741

16 months agoAdd SHA2 SIMD feature tests for libspl
Tino Reichardt [Wed, 28 Sep 2022 08:54:35 +0000 (10:54 +0200)]
Add SHA2 SIMD feature tests for libspl

These are added via HWCAP interface:
- zfs_neon_available() for arm and aarch64
- zfs_sha256_available() for arm and aarch64
- zfs_sha512_available() for aarch64

This one via cpuid() call:
- zfs_shani_available() for x86_64

Tested-by: Rich Ercolani <rincebrain@gmail.com>
Tested-by: Sebastian Gottschall <s.gottschall@dd-wrt.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tino Reichardt <milky-zfs@mcmilk.de>
Closes #13741

16 months agoAdd SHA2 SIMD feature tests for Linux
Tino Reichardt [Wed, 28 Sep 2022 08:54:02 +0000 (10:54 +0200)]
Add SHA2 SIMD feature tests for Linux

These are added:
- zfs_neon_available() for arm and aarch64
- zfs_sha256_available() for arm and aarch64
- zfs_sha512_available() for aarch64
- zfs_shani_available() for x86_64

Tested-by: Rich Ercolani <rincebrain@gmail.com>
Tested-by: Sebastian Gottschall <s.gottschall@dd-wrt.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tino Reichardt <milky-zfs@mcmilk.de>
Co-Authored-By: Sebastian Gottschall <s.gottschall@dd-wrt.com>
Closes #13741

16 months agoAdd SHA2 SIMD feature tests for FreeBSD
Tino Reichardt [Wed, 28 Sep 2022 08:53:18 +0000 (10:53 +0200)]
Add SHA2 SIMD feature tests for FreeBSD

These are added:
- zfs_neon_available() for arm and aarch64
- zfs_sha256_available() for arm and aarch64
- zfs_sha512_available() for aarch64
- zfs_shani_available() for x86_64

Changes:
- simd_powerpc.h: change license from CDDL to BSD

Tested-by: Rich Ercolani <rincebrain@gmail.com>
Tested-by: Sebastian Gottschall <s.gottschall@dd-wrt.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tino Reichardt <milky-zfs@mcmilk.de>
Closes #13741

16 months agoAdd ARM architecture to OpenZFS buildsystem
Tino Reichardt [Wed, 28 Sep 2022 08:55:04 +0000 (10:55 +0200)]
Add ARM architecture to OpenZFS buildsystem

Tested-by: Rich Ercolani <rincebrain@gmail.com>
Tested-by: Sebastian Gottschall <s.gottschall@dd-wrt.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tino Reichardt <milky-zfs@mcmilk.de>
Closes #13741

16 months agoRemove old or redundant SHA2 files
Tino Reichardt [Mon, 27 Feb 2023 15:11:51 +0000 (16:11 +0100)]
Remove old or redundant SHA2 files

We had three sha2.h headers in different places.
The FreeBSD version, the Linux version and the generic solaris version.

The only assembly used for acceleration was some old x86-64 openssl
implementation for sha256 within the icp module.

For FreeBSD the whole SHA2 files of FreeBSD were copied into OpenZFS,
these files got removed also.

Tested-by: Rich Ercolani <rincebrain@gmail.com>
Tested-by: Sebastian Gottschall <s.gottschall@dd-wrt.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tino Reichardt <milky-zfs@mcmilk.de>
Closes #13741

16 months agozdb: add decryption support
Rob N [Thu, 2 Mar 2023 21:39:09 +0000 (08:39 +1100)]
zdb: add decryption support

The approach is straightforward: for dataset ops, if a key was offered,
find the encryption root and the various encryption parameters, derive a
wrapping key if necessary, and then unlock the encryption root. After
that all the regular dataset ops will return unencrypted data, and
that's kinda the whole thing.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Rob Norris <robn@despairlabs.com>
Closes #11551
Closes #12707
Closes #14503

16 months agoSystem-wide speculative prefetch limit.
Alexander Motin [Wed, 1 Mar 2023 23:27:40 +0000 (18:27 -0500)]
System-wide speculative prefetch limit.

With some pathological access patterns it is possible to make ZFS
accumulate almost unlimited amount of speculative prefetch ZIOs.
Combined with linear ABD allocations in RAIDZ code, it appears to
be possible to exhaust system KVA, triggering kernel panic.

Address this by introducing a system-wide counter of active prefetch
requests and blocking prefetch distance doubling per stream hits if
the number of active requests is higher that ~6% of ARC size.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #14516

16 months agoAccommodate debug-kernel stack frame size
Brian Behlendorf [Wed, 1 Mar 2023 22:40:45 +0000 (14:40 -0800)]
Accommodate debug-kernel stack frame size

The blk_queue_discard() and blk_queue_sector_erase() functions
slightly exceed the allowed 4096 maximum stack frame size when
building with the RedHat debug kernel which causes their
configure checks to fail.

Add an exception for these two tests so the interfaces are
correctly detected.

Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de>
Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #14540

16 months agoFix data race between zil_commit() and zil_suspend()
Richard Yao [Wed, 1 Mar 2023 21:23:09 +0000 (16:23 -0500)]
Fix data race between zil_commit() and zil_suspend()

openzfsonwindows/openzfs#206 found that it is possible to trip
`VERIFY(list_is_empty(&lwb->lwb_itxs))` when a `zil_commit()` is delayed
by the scheduler long enough for a parallel `zil_suspend()` operation to
exit `zil_commit_impl()`. This is a data race. To prevent this, we
introduce a `zilog->zl_suspend_lock` rwlock to ensure that all
outstanding `zil_commit()` operations finish before `zil_suspend()`
begins and that subsequent operations fallback to `txg_wait_synced()`
after `zil_suspend()` has begun.

On `PREEMPT_RT` Linux kernels, the `rw_enter()` implementation suffers
from writer starvation. This means that a ZIL intensive system can delay
`zil_suspend()` indefinitely. This is a pre-existing problem that
affects everything that uses rw locks, so it needs to be addressed in
the SPL.  However, builds against `PREEMPT_RT` Linux kernels are
currently broken due to a GPL symbol issue (#11097), so we can safely
disregard that issue for now.

Reported-by: Arun KV <arun.kv@datacore.com>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Closes #14514

16 months agoRemove bad kmem_free() oversight from previous zfsdev_state_list patch
Richard Yao [Wed, 1 Mar 2023 21:20:53 +0000 (16:20 -0500)]
Remove bad kmem_free() oversight from previous zfsdev_state_list patch

I forgot to remove the corresponding kmem_free() from zfs_kmod_fini() in
9a14ce43c3d6a9939804215bbbe66de5115ace42. Clang's static analyzer did
not complain, but the Coverity scan that was run after the patch was
merged did.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de>
Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Reported-by: Coverity (CID-1535275)
Closes #14556

16 months agoLinux: zfs_fillpage() should handle partial pages from end of file
Richard Yao [Wed, 1 Mar 2023 21:19:47 +0000 (16:19 -0500)]
Linux: zfs_fillpage() should handle partial pages from end of file

After 89cd2197b94986d315b9b1be707b645baf59af4f was merged, Clang's
static analyzer began complaining about a dead assignment in
`zfs_fillpage()`. Upon inspection, I noticed that the dead assignment
was because we are not using the calculated io_len that we should use to
avoid asking the DMU to read past the end of a file. This should result
in `dmu_buf_hold_array_by_dnode()` calling `zfs_panic_recover()`.

This issue predates 89cd2197b94986d315b9b1be707b645baf59af4f, but its
simplification of zfs_fillpage() eliminated the only use of the
assignment to io_len, which made Clang's static analyzer complain about
the issue.

Also, as a precaution, we add an assertion that io_offset < i_size. If
this ever fails, bad things will happen. Otherwise, we are blindly
trusting the kernel not to give us invalid offsets. We continue to
blindly trust it on non-debug kernels.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Brian Atkinson <batkinson@lanl.gov>
Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Closes #14534

16 months agoUpdate reclaim disk space script
Tino Reichardt [Wed, 1 Mar 2023 20:25:09 +0000 (21:25 +0100)]
Update reclaim disk space script

The script uses systemd-run, which does the job in background.
We should take the the time and wait for the job to finish.

Maybe some functional tests suffer from not really freed disk
space and fail because of this.

We also add some trimming in the end of the script.

Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tino Reichardt <milky-zfs@mcmilk.de>
Closes #14554

16 months agoHandle unexpected errors in zil_lwb_commit() without ASSERT()
Richard Yao [Wed, 1 Mar 2023 17:39:41 +0000 (12:39 -0500)]
Handle unexpected errors in zil_lwb_commit() without ASSERT()

We tripped `ASSERT(error == ENOENT || error == EEXIST || error ==
EALREADY)` in `zil_lwb_commit()` at Klara when doing robustness testing
of ZIL against drive power cycles.

That assertion presumably exists because when this code was written, the
only errors expected from here were EIO, ENOENT, EEXIST and EALREADY,
with EIO having its own handling before the assertion. However, upon
doing a manual depth first search traversal of the source tree, it turns
out that a large number of unexpected errors are possible here. In
theory, EINVAL and ENOSPC can come from dnode_hold_impl(). However, most
unexpected errors originate in the block layer and come to us from
zio_wait() in various ways. One way is ->zl_get_data() -> dmu_buf_hold()
-> dbuf_read() -> zio_wait().

From vdev_disk.c on Linux alone, zio_wait() can return the unexpected
errors ENXIO, ENOTSUP, EOPNOTSUPP, ETIMEDOUT, ENOSPC, ENOLINK,
EREMOTEIO, EBADE, ENODATA, EILSEQ and ENOMEM

This was only observed after what have been likely over 1000 test
iterations, so we do not expect to reproduce this again to find out what
the error code was. However, circumstantial evidence suggests that the
error was ENXIO.

When ENXIO or any other unexpected error occurs, the `fsync()` or
equivalent operation that called zil_commit() will return success, when
in fact, dirty data has not been committed to stable storage. This is a
violation of the Single UNIX Specification.

The code should be able to handle this and any other unknown error by
calling `txg_wait_synced()`. In addition to changing the code to call
txg_wait_synced() on unexpected errors instead of returning, we modify
it to print information about unexpected errors to dmesg.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Yao <richard.yao@klarasystems.com>
Sponsored-By: Wasabi Technology, Inc.
Closes #14532

16 months agomancheck: exclude dotfiles in man dir
Rob N [Wed, 1 Mar 2023 17:38:31 +0000 (04:38 +1100)]
mancheck: exclude dotfiles in man dir

Its not uncommon for an editor to drop a hidden swap file in the dir
while editing a file there. mancheck would find it and run mandoc on it,
which would complain about its distinctly not-manpage format.

A more correct solution might be to reconfigure the editor to not put
swap files in the same dir, but its the default a lot of the time, and
this is a very small change that gives a very nice quality-of-life
improvement.

Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rob Norris <robn@despairlabs.com>
Closes #14549

16 months agoman: note that zdb operates directly on pool storage
Rob N [Wed, 1 Mar 2023 01:35:33 +0000 (12:35 +1100)]
man: note that zdb operates directly on pool storage

A frequent misunderstanding is that zdb accesses the pool through the
kernel or filesystem, leading to confusion particularly when it can't
access something that it seems like it should be able to.

I've seen this confusion recently when zdb couldn't access a pool because
the user didn't have permission to read directly from the block devices,
and when it couldn't show attributes of encrypted files even though the
dataset was unlocked.

The manpage already speaks to another symptom of this, namely that zdb
may "behave erratically" on an active pool; here I'm trying to make that
a little more explicit.

Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Signed-off-by: Rob Norris <robn@despairlabs.com>
Closes #14539

16 months agoMakefile.bsd: cleanup and sync with FreeBSD
Allan Jude [Wed, 1 Mar 2023 01:33:59 +0000 (20:33 -0500)]
Makefile.bsd: cleanup and sync with FreeBSD

xxhash.c was not being compiled, so when FreeBSD's kernel
switched to a newer version of ZSTD a few weeks ago, out-of-tree ZFS
failed to build

Sync module/Makefile.bsd with FreeBSD's sys/modules/zfs/Makefile

And restore the alphabetical sort in a number of places

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Richard Yao <richard.yao@klarasystems.com>
Signed-off-by: Allan Jude <allan@klarasystems.com>
Sponsored-by: Klara, Inc.
Closes #14508

16 months agoSuppress static analyzer warning in dmu_objset_create_impl_dnstats()
Richard Yao [Tue, 7 Feb 2023 10:43:05 +0000 (05:43 -0500)]
Suppress static analyzer warning in dmu_objset_create_impl_dnstats()

Clang's static analyzer claims that dereferencing ds in
dmu_objset_create_impl_dnstats() could cause a NULL pointer dereference
when a previous NULL check confirms that it is NULL. It is only NULL on
the MOS, for which dmu_objset_userused_enabled(os) should always return
false, so ds will never be dereferenced when it is NULL. We add an
assertion to suppress this warning.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Brian Atkinson <batkinson@lanl.gov>
Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Closes #14470

16 months agoSuppress static analyzer warning in dbuf_hold_copy()
Richard Yao [Tue, 7 Feb 2023 09:51:56 +0000 (04:51 -0500)]
Suppress static analyzer warning in dbuf_hold_copy()

Clang's static analyzer claims that dbuf_hold_copy() will have a NULL
pointer dereference in data->b_data when called by dbuf_hold_impl().
This is impossible because data is dr->dt.dl.dr_data, which is non-NULL
whenever db->db_level == 0, which is always the case whenever
dbuf_hold_impl() calls dbuf_hold_copy(). We add an assertion to suppress
the complaint.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Brian Atkinson <batkinson@lanl.gov>
Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Closes #14470

16 months agoStatically allocate first node of zfsdev_state_list
Richard Yao [Tue, 7 Feb 2023 08:23:45 +0000 (03:23 -0500)]
Statically allocate first node of zfsdev_state_list

This avoids a call to kmem_alloc() during module load. It also
suppresses a defect report from Clang's static analyzer that claims that
we will have a NULL pointer dereference in zfsdev_state_init() because
it does not understand that this has already been allocated in
zfs_kmod_init().

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Brian Atkinson <batkinson@lanl.gov>
Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Closes #14470

16 months agoSuppress static analyzer warning in sa_attr_iter()
Richard Yao [Tue, 7 Feb 2023 08:02:10 +0000 (03:02 -0500)]
Suppress static analyzer warning in sa_attr_iter()

Clang's static analyzer points out that when IS_SA_BONUSTYPE(type) is
true and .sa_length is 0 for an attribute, we have a NULL pointer
dereference. We suppress this with an IMPLY() statement.

This was also identified by Coverity.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Brian Atkinson <batkinson@lanl.gov>
Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Reported-by: Coverity (CID-1017954)
Closes #14470

16 months agoSuppress static analyzer warnings in zio_checksum_error_impl()
Richard Yao [Tue, 7 Feb 2023 07:38:45 +0000 (02:38 -0500)]
Suppress static analyzer warnings in zio_checksum_error_impl()

Clang's static analyzer informs us of multiple NULL pointer dereferences
involving zio_checksum_error_impl().

The first is a NULL pointer dereference if bp is NULL and ci->ci_flags &
ZCHECKSUM_FLAG_EMBEDDED is false, but bp is NULL implies that
ci->ci_flags & ZCHECKSUM_FLAG_EMBEDDED is true, so we add an IMPLY()
statement to suppress the report.

The second and third are identical, and are duplicated because while the
NULL pointer dereference occurs in zio_checksum_gang_verifier(), it is
called by zio_checksum_error_impl() and there is a report for each of
the two functions. The reports state that when bp is NULL, ci->ci_flags
& ZCHECKSUM_FLAG_EMBEDDED is true and checksum is not
ZIO_CHECKSUM_LABEL, we also have a NULL pointer dereference. bp is NULL
should imply that checksum == ZIO_CHECKSUM_LABEL, so we add an IMPLY()
statement to suppress the second report. The two reports are
functionally identical.

A fourth variation of this was also reported by Coverity. It occurs when
checksum == ZIO_CHECKSUM_ZILOG2.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Brian Atkinson <batkinson@lanl.gov>
Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Reported-by: Coverity (CID-1524672)
Closes #14470

16 months agoicp: Prevent compilers from optimizing away memset() in gcm_clear_ctx()
Richard Yao [Wed, 1 Mar 2023 01:28:50 +0000 (20:28 -0500)]
icp: Prevent compilers from optimizing away memset() in gcm_clear_ctx()

The recently merged f58e513f7408f353bf0151fdaf235d4e062e8950 was
intended to zero sensitive data before exit from encryption
functions to harden the code against theoretical information
leaks. Unfortunately, the method by which it did that is
optimized away by the compiler, so some information still leaks. This
was confirmed by counting function calls in disassembly.

After studying how the OpenBSD, FreeBSD and Linux kernels handle this,
and looking at our disassembly, I decided on a two-factor approach to
protect us from compiler dead store elimination passes.

The first factor is to stop trying to inline gcm_clear_ctx(). GCC does
not actually inline it in the first place, and testing suggests that
dead store elimination passes appear to become more powerful in a bad
way when inlining is forced, so we recognize that and move
gcm_clear_ctx() to a C file.

The second factor is to implement an explicit_memset() function based on
the technique used by `secure_zero_memory()` in FreeBSD's blake2
implementation, which coincidentally is functionally identical to the
one used by Linux. The source for this appears to be a LLVM bug:

https://llvm.org/bugs/show_bug.cgi?id=15495

Unlike both FreeBSD and Linux, we explicitly avoid the inline keyword,
based on my observations that GCC's dead store elimination pass becomes
more powerful when inlining is forced, under the assumption that it will
be equally powerful when the compiler does decide to inline function
calls.

Disassembly of GCC's output confirms that all 6 memset() calls are
executed with this patch applied.

Reviewed-by: Attila Fülöp <attila@fueloep.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Closes #14544

16 months agoLinux: Assert mutex is held in mutex_exit()
Richard Yao [Wed, 1 Mar 2023 01:27:20 +0000 (20:27 -0500)]
Linux: Assert mutex is held in mutex_exit()

A spurious mutex_exit() in a development branch caused weird issues
until I identified it. An assertion prior to mutex_exit() would have
caught it. Rather than adding assertions before invocations of
mutex_exit() in the code, let us simply add an assertion to
mutex_exit(). It is cheap and will likely improve developer
productivity.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Brian Atkinson <batkinson@lanl.gov>
Signed-off-by: Richard Yao <richard.yao@klarasystems.com>
Sponsored-By: Wasabi Technology, Inc.
Closes #14541

16 months agoRevert zfeature_active() to static
George Amanakis [Tue, 28 Feb 2023 22:03:52 +0000 (23:03 +0100)]
Revert zfeature_active() to static

Commit 34ce4c4 made zfeature_active() non-static. This is not required.

Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Brian Atkinson <batkinson@lanl.gov>
Signed-off-by: George Amanakis <gamanakis@gmail.com>
Closes #14546

16 months agoFix github build failures because of vdevprops.7
Tino Reichardt [Tue, 28 Feb 2023 22:02:48 +0000 (23:02 +0100)]
Fix github build failures because of vdevprops.7

This small fix adds the manpage vdevprops.7 to the file
contrib/debian/openzfs-zfsutils.install and the github
actions will work again.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Brian Atkinson <batkinson@lanl.gov>
Signed-off-by: Tino Reichardt <milky-zfs@mcmilk.de>
Closes #14553

16 months agoPrevent incorrect datasets being mounted
John Poduska [Tue, 28 Feb 2023 00:49:34 +0000 (19:49 -0500)]
Prevent incorrect datasets being mounted

During a mount, zpl_mount_impl(), uses sget() with the callback
zpl_test_super() to find a super_block with a matching objset,
stored in z_os. It does so without taking the teardown lock on
the zfsvfs.

The problem is that operations like rollback will replace the
z_os. And, there is a window where the objset in the rollback
is freed, but z_os still points to it. Then, a mount like
operation, for instance a clone, can reallocate that exact same
pointer and zpl_test_super() will then match the super_block
associated with the rollback as opposed to the clone.

This fix tests for a match and if so, takes the teardown lock
before doing the final match test.

Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: John Poduska <jpoduska@datto.com>
Closes #14518

16 months agoAdd vdevprops.7 to the Makefile
D. Ebdrup [Tue, 28 Feb 2023 00:38:09 +0000 (01:38 +0100)]
Add vdevprops.7 to the Makefile

Adding vdevprops.7 to the Makefile ensures that it gets installed
properly on FreeBSD.

Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Reviewed-by: Allan Jude <allan@klarasystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Signed-off-by: Daniel Ebdrup Jensen <debdrup@FreeBSD.org>
Closes #14527

16 months agoSkip memory allocation when compressing holes
Richard Yao [Mon, 27 Feb 2023 22:41:02 +0000 (17:41 -0500)]
Skip memory allocation when compressing holes

Hole detection in the zio compression code allows us to
opportunistically skip compression on holes. We can go a step further
by not doing memory allocations on holes either.

Reviewed-by: Brian Atkinson <batkinson@lanl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Richard Yao <richard.yao@klarasystems.com>
Sponsored-by: Wasabi Technology, Inc.
Closes #14500

16 months agoICP: AES-GCM: Refactor gcm_clear_ctx()
Attila Fülöp [Mon, 27 Feb 2023 22:38:12 +0000 (23:38 +0100)]
ICP: AES-GCM: Refactor gcm_clear_ctx()

Currently the temporary buffer in which decryption takes place
isn't cleared on context destruction. Further in some routines we
fail to call gcm_clear_ctx() on error exit. Both flaws may result
in leaking sensitive data.

We follow best practices and zero out the plaintext buffer before
freeing the memory holding it. Also move all cleanup into
gcm_clear_ctx() and call it on any context destruction.

The performance impact should be negligible.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Rob Norris <robn@despairlabs.com>
Signed-off-by: Attila Fülöp <attila@fueloep.org>
Closes #14528

16 months agoMove zap_attribute_t to the heap in dsl_deadlist_merge
Mariusz Zaborski [Mon, 27 Feb 2023 22:27:58 +0000 (23:27 +0100)]
Move zap_attribute_t to the heap in dsl_deadlist_merge

In the case of a regular compilation, the compiler
raises a warning for a dsl_deadlist_merge function, that
the stack size is to large. In debug build this can
generate an error.

Move large structures to heap.

Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Mariusz Zaborski <mariusz.zaborski@klarasystems.com>
Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Closes #14524

16 months agoWorkaround GitHub Action failure
Brian Behlendorf [Mon, 27 Feb 2023 17:19:25 +0000 (09:19 -0800)]
Workaround GitHub Action failure

Ubuntu 20.04 and 22.04 workflows are failing due to an error
which is hit when running `apt-get update`.  Until the
problematic package is fixed apply the suggested workaround
described here:

  https://github.com/orgs/community/discussions/47863

Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #14530

16 months agoUse .section .rodata instead of .rodata on FreeBSD
Dimitry Andric [Sat, 25 Feb 2023 00:45:48 +0000 (01:45 +0100)]
Use .section .rodata instead of .rodata on FreeBSD

In commit 0a5b942d4 the FreeBSD SECTION_STATIC macro was set to
".rodata". This assembler directive is supported by LLVM (as a
convenience alias for ".section .rodata") by not by GNU as.

This caused the FreeBSD builds that are done with gcc to fail.
Therefore, use ".section .rodata" instead, similar to the other
asm_linkage.h headers.

Reviewed-by: Mateusz Guzik <mjguzik@gmail.com>
Reviewed-by: Attila Fülöp <attila@fueloep.org>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Dimitry Andric <dimitry@andric.com>
Closes #14526

16 months agoMove dmu_buf_rele() after dsl_dataset_sync_done()
George Amanakis [Fri, 24 Feb 2023 01:14:52 +0000 (02:14 +0100)]
Move dmu_buf_rele() after dsl_dataset_sync_done()

Otherwise the dataset may be freed after the last dmu_buf_rele() leading
to a panic.

Reviewed-by: Mark Maybee <mark.maybee@delphix.com>
Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Signed-off-by: George Amanakis <gamanakis@gmail.com>
Closes #14522
Closes #14523

16 months agoZTS: Minor fixes
Brian Behlendorf [Fri, 24 Feb 2023 01:10:46 +0000 (17:10 -0800)]
ZTS: Minor fixes

- The migration_012_pos.ksh test case was failing because of a
  missing space after `log_must`.

- None of the tests listed in the runfiles should include the .ksh
  suffix.

Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Reviewed-by: Brian Atkinson <batkinson@lanl.gov>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #14515

16 months agoFix buffered/direct/mmap I/O race
Brian Behlendorf [Thu, 23 Feb 2023 18:57:24 +0000 (10:57 -0800)]
Fix buffered/direct/mmap I/O race

When a page is faulted in for memory mapped I/O the page lock
may be dropped before it has been read and marked up to date.
If a buffered read encounters such a page in mappedread() it
must wait until the page has been updated. Failure to do so
will result in a panic on debug builds and incorrect data on
production builds.

The critical part of this change is in mappedread() where pages
which are not up to date are now handled. Additionally, it
includes the following simplifications.

- zfs_getpage() and zfs_fillpage() could be passed an array of
  pages. This could be more efficient if it was used but in
  practice only a single page was ever provided. These
  interfaces were simplified to acknowledge that.

- update_pages() was modified to correctly set the PG_error bit
  on a page when it cannot be read by dmu_read().

- Setting PG_error and PG_uptodate was moved to zfs_fillpage()
  from zpl_readpage_common(). This is consistent with the
  handling in update_pages() and mappedread().

- Minor additional refactoring to comments and variable
  declarations to improve readability.

- Add a test case to exercise concurrent buffered, direct,
  and mmap IO to the same file.

- Reduce the mmap_sync test case default run time.

Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Reviewed-by: Brian Atkinson <batkinson@lanl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #13608
Closes #14498

16 months agoFix NULL pointer dereference in zio_ready()
Richard Yao [Thu, 23 Feb 2023 18:19:08 +0000 (13:19 -0500)]
Fix NULL pointer dereference in zio_ready()

Clang's static analyzer correctly identified a NULL pointer dereference
in zio_ready() when ZIO_FLAG_NODATA has been set on a zio that is
missing a block pointer. The NULL pointer dereference occurs because we
have logic intended to disable ZIO_FLAG_NODATA when it has been set on a
gang block.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Brian Atkinson <batkinson@lanl.gov>
Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Closes #14469

16 months agoUse rw_tryupgrade() in dmu_bonus_hold_by_dnode()
Richard Yao [Thu, 23 Feb 2023 00:33:23 +0000 (19:33 -0500)]
Use rw_tryupgrade() in dmu_bonus_hold_by_dnode()

When dn->dn_bonus == NULL, dmu_bonus_hold_by_dnode() will unlock its
read lock on dn->dn_struct_rwlock and grab a write lock. This can be
micro-optimized by calling rw_tryupgrade().

Linux will not benefit from this since it does not support rwlock
upgrades, but FreeBSD will.

Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Brian Atkinson <batkinson@lanl.gov>
Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Closes #14517

16 months agoImprove error message of zfs redact
Paul Dagnelie [Wed, 22 Feb 2023 01:30:05 +0000 (17:30 -0800)]
Improve error message of zfs redact

We improve the error message of zfs redact by checking if the target
snapshot exists, and if all the redaction snapshots exist. As a
future improvement we could iterate over every snapshot provided and
use that to determine which one specifically doesn't exist.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Paul Dagnelie <pcd@delphix.com>
Closes #11426
Closes #14496

16 months agoFreeBSD: don't verify recycled vnode for zfs control directory
rob-wing [Wed, 22 Feb 2023 01:26:33 +0000 (16:26 -0900)]
FreeBSD: don't verify recycled vnode for zfs control directory

Under certain loads, the following panic is hit:

    panic: page fault
    KDB: stack backtrace:
    #0 0xffffffff805db025 at kdb_backtrace+0x65
    #1 0xffffffff8058e86f at vpanic+0x17f
    #2 0xffffffff8058e6e3 at panic+0x43
    #3 0xffffffff808adc15 at trap_fatal+0x385
    #4 0xffffffff808adc6f at trap_pfault+0x4f
    #5 0xffffffff80886da8 at calltrap+0x8
    #6 0xffffffff80669186 at vgonel+0x186
    #7 0xffffffff80669841 at vgone+0x31
    #8 0xffffffff8065806d at vfs_hash_insert+0x26d
    #9 0xffffffff81a39069 at sfs_vgetx+0x149
    #10 0xffffffff81a39c54 at zfsctl_snapdir_lookup+0x1e4
    #11 0xffffffff8065a28c at lookup+0x45c
    #12 0xffffffff806594b9 at namei+0x259
    #13 0xffffffff80676a33 at kern_statat+0xf3
    #14 0xffffffff8067712f at sys_fstatat+0x2f
    #15 0xffffffff808ae50c at amd64_syscall+0x10c
    #16 0xffffffff808876bb at fast_syscall_common+0xf8

The page fault occurs because vgonel() will call VOP_CLOSE() for active
vnodes. For this reason, define vop_close for zfsctl_ops_snapshot. While
here, define vop_open for consistency.

After adding the necessary vop, the bug progresses to the following
panic:

    panic: VERIFY3(vrecycle(vp) == 1) failed (0 == 1)
    cpuid = 17
    KDB: stack backtrace:
    #0 0xffffffff805e29c5 at kdb_backtrace+0x65
    #1 0xffffffff8059620f at vpanic+0x17f
    #2 0xffffffff81a27f4a at spl_panic+0x3a
    #3 0xffffffff81a3a4d0 at zfsctl_snapshot_inactive+0x40
    #4 0xffffffff8066fdee at vinactivef+0xde
    #5 0xffffffff80670b8a at vgonel+0x1ea
    #6 0xffffffff806711e1 at vgone+0x31
    #7 0xffffffff8065fa0d at vfs_hash_insert+0x26d
    #8 0xffffffff81a39069 at sfs_vgetx+0x149
    #9 0xffffffff81a39c54 at zfsctl_snapdir_lookup+0x1e4
    #10 0xffffffff80661c2c at lookup+0x45c
    #11 0xffffffff80660e59 at namei+0x259
    #12 0xffffffff8067e3d3 at kern_statat+0xf3
    #13 0xffffffff8067eacf at sys_fstatat+0x2f
    #14 0xffffffff808b5ecc at amd64_syscall+0x10c
    #15 0xffffffff8088f07b at fast_syscall_common+0xf8

This is caused by a race condition that can occur when allocating a new
vnode and adding that vnode to the vfs hash. If the newly created vnode
loses the race when being inserted into the vfs hash, it will not be
recycled as its usecount is greater than zero, hitting the above
assertion.

Fix this by dropping the assertion.

FreeBSD-issue: https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=252700
Reviewed-by: Andriy Gapon <avg@FreeBSD.org>
Reviewed-by: Mateusz Guzik <mjguzik@gmail.com>
Reviewed-by: Alek Pinchuk <apinchuk@axcient.com>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Signed-off-by: Rob Wing <rob.wing@klarasystems.com>
Co-authored-by: Rob Wing <rob.wing@klarasystems.com>
Submitted-by: Klara, Inc.
Sponsored-by: rsync.net
Closes #14501

16 months agoFix per-jail zfs.mount_snapshot setting
Allan Jude [Wed, 22 Feb 2023 01:23:01 +0000 (20:23 -0500)]
Fix per-jail zfs.mount_snapshot setting

When jail.conf set the nopersist flag during startup, it was
incorrectly destroying the per-jail ZFS settings.

Reported-by: Martin Matuska <mm@FreeBSD.org>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Signed-off-by: Allan Jude <allan@klarasystems.com>
Sponsored-by: Modirum MDPay
Sponsored-by: Klara, Inc.
Closes #14509

16 months agoPartially revert eee9362a7
George Amanakis [Tue, 21 Feb 2023 17:36:22 +0000 (18:36 +0100)]
Partially revert eee9362a7

With commit 34ce4c42f applied, there is no need for eee9362a7.
Revert that aside from the test. All tests introduced in those commits
pass.

Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: George Amanakis <gamanakis@gmail.com>
Closes #14502

16 months agoSync thread should avoid holding the spa config write lock when possible
Richard Yao [Thu, 16 Feb 2023 22:10:52 +0000 (17:10 -0500)]
Sync thread should avoid holding the spa config write lock when possible

spa_sync() currently grabs the write lock due to an old hack that is
documented by a comment:

    We need the write lock here because, for aux vdevs,
    calling vdev_config_dirty() modifies sav_config.
    This is ugly and will become unnecessary when we
    eliminate the aux vdev wart by integrating all vdevs
    into the root vdev tree.

This has lead to deadlocks in rare edge cases from holding the write
lock. We can reduce incidence of these deadlocks by not grabbing the
write lock on pools without auxillary vdevs.

Sponsored-By: Wasabi Technology, Inc.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Yao <richard.yao@klarasystems.com>
Closes #14282

16 months agozfs redact fails when dnodesize=auto
Paul Dagnelie [Thu, 16 Feb 2023 17:23:39 +0000 (09:23 -0800)]
zfs redact fails when dnodesize=auto

Add handling to dmu_object_next for the case where *objectp == 0.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Paul Dagnelie <pcd@delphix.com>
Closes #14479

16 months agozdb: zero-pad checksum output follow up
Brian Behlendorf [Wed, 15 Feb 2023 17:06:29 +0000 (09:06 -0800)]
zdb: zero-pad checksum output follow up

Apply zero padding for checksums consistently.  The SNPRINTF_BLKPTR
macro was not updated in commit ac7648179c8 which results in the
`cli_root/zdb/zdb_checksum.ksh` test case reliably failing.

Reviewed-by: Igor Kozhukhov <igor@dilos.org>
Reviewed-by: Akash B <akash-b@hpe.com>
Reviewed-by: Brian Atkinson <batkinson@lanl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #14497

16 months agoSuppress Clang static analyzer complaint in zfs_replay_create()
Richard Yao [Tue, 14 Feb 2023 19:05:41 +0000 (14:05 -0500)]
Suppress Clang static analyzer complaint in zfs_replay_create()

Clang's static analyzer incorrectly complains about an undefined value
here when lr->lr_common.lrc_txtype == TX_SYMLINK and txtype ==
TX_CREATE. This is impossible, because of this line:

txtype = (lr->lr_common.lrc_txtype & ~TX_CI((uint64_t)0x1 << 63));

Changing the code to compare against txtype suppresses the report.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Brian Atkinson <batkinson@lanl.gov>
Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Closes #14472

16 months agoLinux: use filemap_range_has_page()
Brian Behlendorf [Tue, 14 Feb 2023 19:04:34 +0000 (11:04 -0800)]
Linux: use filemap_range_has_page()

As of the 4.13 kernel filemap_range_has_page() can be used to
check if there is a page mapped in a given file range.  When
available this interface should be used which eliminates the
need for the zp->z_is_mapped boolean.

Reviewed-by: Brian Atkinson <batkinson@lanl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #14493

16 months agoGive strlcat() full buffer lengths rather than smaller buffer lengths
Richard Yao [Tue, 14 Feb 2023 19:03:42 +0000 (14:03 -0500)]
Give strlcat() full buffer lengths rather than smaller buffer lengths

strlcat() is supposed to be given the length of the destination buffer,
including the existing contents. Unfortunately, I had been overzealous
when I wrote a51288aabbbc176a8a73a8b3cd56f79607db32cf, since I gave it
the length of the destination buffer, minus the existing contents. This
likely caused a regression on large strings.

On the topic of being overzealous, the use of strlcat() in
dmu_send_estimate_fast() was unnecessary because recv_clone_name is a
fixed length string. We continue using strlcat() mostly as defensive
programming, in case the string length is ever changed, even though it
is unnecessary.

Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Closes #14476

16 months agoquick fix for lingering snapdir unmount problems
Rich Ercolani [Tue, 14 Feb 2023 00:40:13 +0000 (19:40 -0500)]
quick fix for lingering snapdir unmount problems

Unfortunately, even after e79b6807, I still, much more rarely,
tripped asserts when playing with many ctldir mounts at once.

Since this appears to happen if we dispatched twice too fast, just
ignore it. We don't actually need to do anything if someone already
started doing it for us.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rich Ercolani <rincebrain@gmail.com>
Closes #14462

16 months agoFix a race condition in dsl_dataset_sync() when activating features
George Amanakis [Tue, 14 Feb 2023 00:37:46 +0000 (01:37 +0100)]
Fix a race condition in dsl_dataset_sync() when activating features

The zio returned from arc_write() in dmu_objset_sync() uses
zio_nowait(). However we may reach the end of dsl_dataset_sync()
which checks if we need to activate features in the filesystem
without knowing if that zio has even run through the ZIO pipeline yet.
In that case we will flag features to be activated in
dsl_dataset_block_born() but dsl_dataset_sync() has already
completed its run and those features will not actually be activated.
Mitigate this by moving the feature activation code in
dsl_dataset_sync_done(). Also add new ASSERTs in
dsl_scan_visitbp() checking if a block contradicts any filesystem
flags.

Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Reviewed-by: Brian Atkinson <batkinson@lanl.gov>
Signed-off-by: George Amanakis <gamanakis@gmail.com>
Closes #13816

16 months agoReduce need for contiguous memory for ioctls
Prakash Surya [Tue, 14 Feb 2023 00:35:59 +0000 (16:35 -0800)]
Reduce need for contiguous memory for ioctls

We've had cases where we trigger an OOM despite having memory freely
available on the system. For example, here, we had about 21GB free:

    kernel: Node 0 Normal: 2418758*4kB (UME) 1549533*8kB (UE) 0*16kB
    0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB =
    22071296kB

The problem being, all the memory is in 4K and 8K contiguous regions,
but the allocation request was for a 16K contiguous region:

    kernel: SafeExecutors-4 invoked oom-killer:
    gfp_mask=0x42dc0(GFP_KERNEL|__GFP_NOWARN|__GFP_COMP|__GFP_ZERO),
    order=2, oom_score_adj=0

The offending allocation came from this call trace:

    kernel: Call Trace:
    kernel:  dump_stack+0x57/0x7a
    kernel:  dump_header+0x4f/0x1e1
    kernel:  oom_kill_process.cold.33+0xb/0x10
    kernel:  out_of_memory+0x1ad/0x490
    kernel:  __alloc_pages_slowpath+0xd55/0xe40
    kernel:  __alloc_pages_nodemask+0x2df/0x330
    kernel:  kmalloc_large_node+0x42/0x90
    kernel:  __kmalloc_node+0x25a/0x320
    kernel:  ? spl_kmem_free_impl+0x21/0x30 [spl]
    kernel:  spl_kmem_alloc_impl+0xa5/0x100 [spl]
    kernel:  spl_kmem_zalloc+0x19/0x20 [spl]
    kernel:  zfsdev_ioctl+0x2b/0xe0 [zfs]
    kernel:  do_vfs_ioctl+0xa9/0x640
    kernel:  ? __audit_syscall_entry+0xdd/0x130
    kernel:  ksys_ioctl+0x67/0x90
    kernel:  __x64_sys_ioctl+0x1a/0x20
    kernel:  do_syscall_64+0x5e/0x200
    kernel:  entry_SYSCALL_64_after_hwframe+0x44/0xa9
    kernel: RIP: 0033:0x7fdca3674317

The problem is, for each ioctl that ZFS makes, it has to allocate a
zfs_cmd_t structure, which is 13744 bytes in size (on my system):

    sdb> sizeof zfs_cmd
    (size_t)13744

This size, coupled with the fact that we currently allocate it with
kmem_zalloc, means we need a 16K contiguous region of memory to satisfy
the request.

The solution taken by this change, is to use "vmem" instead of "kmem" to
do the allocation, such that we don't necessarily need a contiguous 16K
memory region to satisfy the allocation.

Arguably, a better solution would be not to require such a large
allocation to begin with (e.g. reduce the size of the zfs_cmd_t
structure), but that'd be a much larger change than this "one liner".
Thus, I've opted for this approach for now; we can always circle back
and attempt to reduce the size of the structure in the future.

Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Reviewed-by: Mark Maybee <mark.maybee@delphix.com>
Reviewed-by: Don Brady <don.brady@delphix.com>
Signed-off-by: Prakash Surya <prakash.surya@delphix.com>
Closes #14474

16 months agoImprove arc_read() error reporting
Alexander Motin [Mon, 13 Feb 2023 21:21:53 +0000 (16:21 -0500)]
Improve arc_read() error reporting

Debugging reported NULL de-reference panic in dnode_hold_impl() I found
that for certain types of errors arc_read() may only return error code,
but not properly report it via done and pio arguments.  Lack of done
calls may result in reference and/or memory leaks in higher level code.
Lack of error reporting via pio may result in unnoticed errors there.
For example, dbuf_read(), where dbuf_read_impl() ignores arc_read()
return, relies completely on the pio mechanism and missed the errors.

This patch makes arc_read() to always call done callback and always
propagate errors to parent zio, if either is provided.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Brian Atkinson <batkinson@lanl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #14454

16 months agorpm: Use libtirpc-devel and /usr/lib on SUSE
Andreas Vögele [Thu, 9 Feb 2023 19:57:50 +0000 (20:57 +0100)]
rpm: Use libtirpc-devel and /usr/lib on SUSE

SUSE Linux distributions require libtirpc-devel.  The dracut and udev
directories are /usr/lib/dracut and /usr/lib/udev.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Brian Atkinson <batkinson@lanl.gov>
Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Signed-off-by: Andreas Vögele <andreas@andreasvoegele.com>
Closes #14467
Closes #14468

16 months agozdb: zero-pad checksum output
Rob N ★ [Tue, 7 Feb 2023 21:48:22 +0000 (08:48 +1100)]
zdb: zero-pad checksum output

The leading zeroes are part of the checksum so we should show them.

Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rob Norris <robn@despairlabs.com>
Closes #14464

17 months agoinitramfs: Make mountpoint=none work
Ryan Moeller [Mon, 6 Feb 2023 19:16:01 +0000 (14:16 -0500)]
initramfs: Make mountpoint=none work

In initramfs, mount.zfs fails to mount a dataset with mountpoint=none,
but mount.zfs -o zfsutil works.  Use -o zfsutil when mountpoint=none.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Signed-off-by: Ryan Moeller <ryan@iXsystems.com>
Closes #14455

17 months agoAdd assertion and make variables unsigned in abd_alloc_chunks()
Richard Yao [Thu, 2 Feb 2023 22:30:39 +0000 (17:30 -0500)]
Add assertion and make variables unsigned in abd_alloc_chunks()

Clang's static analyzer pointed out that if alloc_pages >= nr_pages
before the loop, the value of page will be undefined and will be used
anyway. This should not be possible, but as cleanup, we add an
assertion. We also recognize that the local variables should be unsigned
in the first place, so we make them unsigned. This is not enough to
avoid the need for the assertion, since there is still the case that
alloc_pages == nr_pages and nr_pages == 0, which the assertion
implicitly checks.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Brian Atkinson <batkinson@lanl.gov>
Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Closes #14456

17 months agoCleanup: spa vdev processing should check NULL pointers
Richard Yao [Mon, 23 Jan 2023 20:03:21 +0000 (15:03 -0500)]
Cleanup: spa vdev processing should check NULL pointers

The PVS Studio 2016 FreeBSD kernel report stated:

\contrib\opensolaris\uts\common\fs\zfs\spa.c (1341): error V595: The 'spa->spa_spares.sav_vdevs' pointer was utilized before it was verified against nullptr. Check lines: 1341, 1342.
\sys\cddl\contrib\opensolaris\uts\common\fs\zfs\spa.c (1355): error V595: The 'spa->spa_l2cache.sav_vdevs' pointer was utilized before it was verified against nullptr. Check lines: 1355, 1357.
\sys\cddl\contrib\opensolaris\uts\common\fs\zfs\spa.c (1398): error V595: The 'spa->spa_spares.sav_vdevs' pointer was utilized before it was verified against nullptr. Check lines: 1398, 1408.
\sys\cddl\contrib\opensolaris\uts\common\fs\zfs\spa.c (1583): error V595: The 'oldvdevs' pointer was utilized before it was verified against nullptr. Check lines: 1583, 1595.

In practice, all of these uses were safe because a NULL pointer
implied a 0 vdev count, which kept us from iterating over vdevs.
However, rearranging the code to check the pointer first is not a
terrible micro-optimization and makes it more readable, so let us
do that.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Brian Atkinson <batkinson@lanl.gov>
Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Closes #14456

17 months agozfs_get_temporary_prop() should not pass NULL to strcpy()
Richard Yao [Fri, 23 Dec 2022 05:00:38 +0000 (00:00 -0500)]
zfs_get_temporary_prop() should not pass NULL to strcpy()

`dsl_dir_activity_in_progress()` can call `zfs_get_temporary_prop()` with
the forth value set to NULL, which will pass NULL to `strcpy()` when
there is a match

Clang's static analyzer caught this with the help of CodeChecker for
Cross Translation Unit analysis.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Brian Atkinson <batkinson@lanl.gov>
Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Closes #14456

17 months agoEIO caused by encryption + recursive gang
Matthew Ahrens [Mon, 6 Feb 2023 17:37:06 +0000 (09:37 -0800)]
EIO caused by encryption + recursive gang

Encrypted blocks can not have 3 DVAs, because they use the space of the
3rd DVA for the IV+salt.  zio_write_gang_block() takes this into
account, setting `gbh_copies` to no more than 2 in this case.  Gang
members BP's do not have the X (encrypted) bit set (nor do they have the
DMU level and type fields set), because encryption is not handled at
this level.  The gang block is reassembled, and then encryption (and
compression) are handled.

To check if this gang block is encrypted, the code in
zio_write_gang_block() checks `pio->io_bp`.  This is normally fine,
because the block that's being ganged is typically the encrypted BP.

The problem is that if there is "recursive ganging", where a gang member
is itself a gang block, then when zio_write_gang_block() is called to
create a gang block for a gang member, `pio->io_bp` is the gang member's
BP, which doesn't have the X bit set, so the number of DVA's is not
restricted to 2.  It should instead be looking at the the "gang leader",
i.e. the top-level gang block, to determine how many DVA's can be used,
to avoid a "NDVA's inversion" (where a child has more DVA's than its
parent).

gang leader BP: X (encrypted) bit set, 2 DVA's, IV+salt in 3rd DVA's
space:
```
DVA[0]=<1:...:100400> DVA[1]=<0:...:100400> salt=... iv=...
[L0 ZFS plain file] fletcher4 uncompressed encrypted LE
gang unique double size=100000L/100000P birth=... fill=1 cksum=...
```

leader's GBH contains a BP with gang bit set and 3 DVA's:
```
DVA[0]=<1:...:55600> DVA[1]=<0:...:55600>
[L0 unallocated] fletcher4 uncompressed unencrypted LE
contiguous unique double size=55600L/55600P birth=... fill=0 cksum=...

DVA[0]=<1:...:55600> DVA[1]=<0:...:55600>
[L0 unallocated] fletcher4 uncompressed unencrypted LE
contiguous unique double size=55600L/55600P birth=... fill=0 cksum=...

DVA[0]=<1:...:55600> DVA[1]=<0:...:55600> DVA[2]=<1:...:200>
[L0 unallocated] fletcher4 uncompressed unencrypted LE
gang unique double size=55400L/55400P birth=... fill=0 cksum=...
```

On nondebug bits, having the 3rd DVA in the gang block works for the
most part, because it's true that all 3 DVA's are available in the gang
member BP (in the GBH).  However, for accounting purposes, gang block
DVA's ASIZE include all the space allocated below them, i.e. the
512-byte gang block header (GBH) as well as the gang members below that.
We see that above where the gang leader BP is 1MB logical (and after
compression: 0x`100000P`), but the ASIZE of each DVA is 2 sectors (1KB)
more than 1MB (0x`100400`).

Since thre are 3 copies of a block below it, we increment the ATIME of
the 3rd DVA of the gang leader by the space used by the 3rd DVA of the
child (1 sector, in this case).  But there isn't really a 3rd DVA of the
parent; the salt is stored in place of the 3rd DVA's ASIZE.

So when zio_write_gang_member_ready() increments the parent's BP's
`DVA[2]`'s ASIZE, it's actually incrementing the parent's salt.  When we
later try to read the encrypted recursively-ganged block, the salt
doesn't match what we used to write it, so MAC verification fails and we
get an EIO.

```
zio_encrypt():  encrypted 515/2/0/403 salt: 25 25 bb 9d ad d6 cd 89
zio_decrypt(): decrypting 515/2/0/403 salt: 26 25 bb 9d ad d6 cd 89
```

This commit addresses the problem by not increasing the number of copies
of the GBH beyond 2 (even for non-encrypted blocks).  This simplifies
the logic while maintaining the ability to traverse all metadata
(including gang blocks) even if one copy is lost.  (Note that 3 copies
of the GBH will still be created if requested, e.g. for `copies=3` or
MOS blocks.)  Additionally, the code that increments the parent's DVA's
ASIZE is made to check the parent DVA's NDVAS even on nondebug bits.  So
if there's a similar bug in the future, it will cause a panic when
trying to write, rather than corrupting the parent BP and causing an
error when reading.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Co-authored-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Caused-by: #14356
Closes #14440
Closes #14413

17 months agoRestore FreeBSD to use .rodata
Jorgen Lundman [Mon, 6 Feb 2023 17:34:59 +0000 (02:34 +0900)]
Restore FreeBSD to use .rodata

In https://github.com/openzfs/zfs/pull/14228 the FreeBSD
SECTION_STATIC was set to ".data" instead of ".rodata". This
commit just restores it back to .rodata.

Reviewed-by: Attila Fülöp <attila@fueloep.org>
Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Jorgen Lundman <lundman@lundman.net>
Closes #14460

17 months agoUnify assembly files with macOS
Jorgen Lundman [Mon, 6 Feb 2023 17:27:55 +0000 (02:27 +0900)]
Unify assembly files with macOS

The remaining changes needed to make the assembly files work
with macOS.

Reviewed-by: Attila Fülöp <attila@fueloep.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Jorgen Lundman <lundman@lundman.net>
Closes #14451

17 months agoFix variable shadowing in libzfs_mount
Reno Reckling [Thu, 2 Feb 2023 23:22:12 +0000 (00:22 +0100)]
Fix variable shadowing in libzfs_mount

We accidentally reused variable name "i" for inner and outer loops.

Reviewed-by: Rich Ercolani <Rincebrain@gmail.com>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Signed-off-by: Reno Reckling <e-github@wthack.de>
Closes #14452
Closes #14445