Ed Maste [Wed, 25 Jan 2023 22:01:00 +0000 (17:01 -0500)]
lua: reduce diffs between luaconf.h copies
Upstream luaconf.h is contrib/lua/src/luaconf.h.dist, while userland lua
and loader lua have copies in lib/liblua/luaconf.h and
stand/liblua/luaconf.h.
Adjust whitespace, VCS tags, etc. to match upstream's version, for ease
of comparison.
Reviewed By: imp
Sponsored By: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D38206
During tcp session start, various mechanisms need to
track a few initial RTTs before becoming active.
Prevent overflows of the corresponding tracking counter
and reduce the size of tcpcb simultaneously.
Mark Johnston [Thu, 26 Jan 2023 15:46:19 +0000 (10:46 -0500)]
netlink: Zero-initialize writer structures allocated on the stack
The prevailing pattern seems to be to simply initialize all fields to
zero. Without this, it's possible to trigger a branch on uninitialized
memory, specifically, when testing nw->ignore_limit in
nlmsg_refill_buffer().
Initialize the writer structure in a couple of functions where this is
necessary.
Reported by: KMSAN
Reviewed by: melifaro
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D38213
This is the same as `debugfs_create_file()` but takes the initial size
of the file. In FreeBSD, the given size is ignored and
`debugfs_create_file()` is called.
Reviewed by: emaste, manu
Approved by: manu
Differential Revision: https://reviews.freebsd.org/D37914
By (ab)using `atomic_long_add_return()`, `atomic_long_sub()` was making
the atomic long overflow. Indeed the underlying FreeBSD atomic is based
on an unsigned long.
Reviewed by: manu
Approved by: manu
Differential Revision: https://reviews.freebsd.org/D38090
However, these commits were later reverted (also in Linux 5.13) so the
i915 driver doesn't use these functions. But perhaps it will help in the
future.
To sum up, the code comes from the i915 DRM driver but it doesn't use it
(i.e. it continues to use its internal implementation).
Reviewed by: manu
Approved by: manu
Differential Revision: https://reviews.freebsd.org/D38088
Notable upstream pull request merges:
#13805 Configure zed's diagnosis engine with vdev properties
#14110 zfs list: Allow more fields in ZFS_ITER_SIMPLE mode
#14121 Batch enqueue/dequeue for bqueue
#14123 arc_read()/arc_access() refactoring and cleanup
#14159 Bypass metaslab throttle for removal allocations
#14243 Implement uncached prefetch
#14251 Cache dbuf_hash() calculation
#14253 Allow reciever to override encryption property in case of replication
#14254 Restrict visibility of per-dataset kstats inside FreeBSD jails
#14255 Zero end of embedded block buffer in dump_write_embedded()
#14263 Cleanups identified by CodeQL and Coverity
#14264 Miscellaneous fixes
#14272 Change ZEVENT_POOL_GUID to ZEVENT_POOL to display pool names
#14287 FreeBSD: Remove stray debug printf
#14288 Colorize zfs diff output
#14289 deadlock between spa_errlog_lock and dp_config_rwlock
#14291 FreeBSD: Fix potential boot panic with bad label
#14292 Add tunable to allow changing micro ZAP's max size
#14293 Turn default_bs and default_ibs into ZFS_MODULE_PARAMs
#14295 zed: add hotplug support for spare vdevs
#14304 Activate filesystem features only in syncing context
#14311 zpool: do guid-based comparison in is_vdev_cb()
#14317 Pack zrlock_t by 8 bytes
#14320 Update arc_summary and arcstat outputs
#14328 FreeBSD: catch up to 1400077
#14376 Use setproctitle to report progress of zfs send
#14340 Remove some dead ARC code
#14358 Wait for txg sync if the last DRR_FREEOBJECTS might result in a hole
#14360 libzpool: fix ddi_strtoull to update nptr
#14364 Fix unprotected zfs_znode_dmu_fini
#14379 zfs_receive_one: Check for the more likely error first
#14380 Cleanup of dead code suggested by Clang Static Analyzer
#14397 Avoid passing an uninitialized index to dsl_prop_known_index
#14404 Fix reading uninitialized variable in receive_read
#14407 free_blocks(): Fix reports from 2016 PVS Studio FreeBSD report
#14418 Introduce minimal ZIL block commit delay
#14422 x86 assembly: fix .size placement and replace .align with .balign
Previously, zic and tzsetup were both listed as install tools and basic
bootstrap tools. Actually, tzsetup is an install tool while zic is a
non-basic bootstrap tool.
These aren't just needed for compatibility with i386 binaries (which need
the 32-bit section), but potentially also for compatibility with older
binaries on all platforms.
Sponsored by: Klara, Inc.
Reviewed by: emaste
Differential Revision: https://reviews.freebsd.org/D38194
Emmanuel Vadot [Wed, 25 Jan 2023 07:28:22 +0000 (08:28 +0100)]
arm64: Move device scmi to std.arm
The scmi driver in its current form requires the arm_doorbell
driver to communicate with the firmware.
The arm_doorbell is only found in ARM Juno reference board (and
apparently on Morello too).
If we want to use scmi on other platform (like some rockchip or imx
soc), the driver needs to be updated to support svc/shmem communication
with the firmware.
For now since it can be only used with arm_doorbell move the device to
std.arm otherwise kernel configs like ALLWINNER or ROCKCHIP fails to build.
Ever since gtaskqueue_drain() was added to iflib_stop(), a kernel panic
occurs when the ice(4) driver is in recovery mode. Queues are not
initialized in this mode, so gt_taskqueue is not initialized, and
gtaskqueue_drain() will panic.
Fix this by only doing a drain if an RX queue's gt_taskqueue is
initialized.
Justin Hibbits [Fri, 13 Jan 2023 16:30:58 +0000 (11:30 -0500)]
debugnet: Add ifnet accessor to set debugnet methods
As part of the effort to hide the internals of the ifnet struct, convert
the DEBUGNET_SET() macro to use an accessor instead of directly touching
the methods member.
Justin Hibbits [Fri, 13 Jan 2023 15:43:51 +0000 (10:43 -0500)]
ifnet/API: Move struct ifnet definition to a <net/if_private.h>
Hide the ifnet structure definition, no user serviceable parts inside,
it's a netstack implementation detail. Include it temporarily in
<net/if_var.h> until all drivers are updated to use the accessors
exclusively.
Justin Hibbits [Thu, 12 Jan 2023 18:33:30 +0000 (13:33 -0500)]
ifnet API: Change if_init() to take context argument
Some drivers, like iflib drivers, take a 'context' argument instead of a
ifnet argument, as a single interface may have multiple contexts.
Follow this scheme by passing the context argument down. Most drivers
will likely pass 'ifp' as the context.
Coleman Kane [Tue, 24 Jan 2023 19:20:50 +0000 (14:20 -0500)]
linux 6.2 compat: zpl_set_acl arg2 is now struct dentry
Linux 6.2 changes the second argument of the set_acl operation to be a
"struct dentry *" rather than a "struct inode *". The inode* parameter
is still available as dentry->d_inode, so adjust the call to the _impl
function call to dereference and pass that pointer to it.
Also document that the get_acl -> get_inode_acl member name change from
commit 884a693 was an API change also introduced in Linux 6.2.
Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Signed-off-by: Coleman Kane <ckane@colemankane.org>
Closes #14415
Alexander Motin [Tue, 24 Jan 2023 17:20:32 +0000 (12:20 -0500)]
Introduce minimal ZIL block commit delay
Despite all optimizations, tests on actual hardware show that FreeBSD
kernel can't sleep for less then ~2us. Similar tests on Linux show
~50us delay at least from nanosleep() (haven't tested inside kernel).
It means that on very fast log device ZIL may not be able to satisfy
zfs_commit_timeout_pct block commit timeout, increasing log latency
more than desired.
Handle that by introduction of zil_min_commit_timeout parameter,
specifying minimal timeout value where additional delays to aggregate
writes may be skipped. Also skip delays if the LWB is more than 7/8
full, that often happens if I/O sizes are constant and match one of
LWB sizes. Both things are applied only if there were no already
outstanding log blocks, that may indicate single-threaded workload,
that by definition can not benefit from the commit delays.
While there, add short time moving average to zl_last_lwb_latency to
make it more stable.
Tests of single-threaded 4KB writes to NVDIMM SLOG on FreeBSD show IOPS
increase by 9% instead of expected 5%. For zfs_commit_timeout_pct of
1 there IOPS increase by 5.5% instead of expected 1%.
Reviewed-by: Allan Jude <allan@klarasystems.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Prakash Surya <prakash.surya@delphix.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #14418
Attila Fülöp [Mon, 23 Jan 2023 19:25:21 +0000 (20:25 +0100)]
x86 asm: Replace .align with .balign
The .align directive used to align storage locations is
ambiguous. On some platforms and assemblers it takes a byte count,
on others the argument is interpreted as a shift value. The current
usage expects the first interpretation.
Replace it with the unambiguous .balign directive which always
expects a byte count, regardless of platform and assembler.
Reviewed-by: Jorgen Lundman <lundman@lundman.net> Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de> Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Signed-off-by: Attila Fülöp <attila@fueloep.org>
Closes #14422
Attila Fülöp [Mon, 23 Jan 2023 18:48:39 +0000 (19:48 +0100)]
IPC: blake3 x86 asm: fix placement of .size directives
The .size directive used by the SET_SIZE C macro uses the special
dot symbol to calculate the size of a function. The dot symbol
refers to the current address, so for the calculation to be
meaningful the SET_SIZE macro must be placed immediately after the
end of the function the size is calculated for.
Reviewed-by: Jorgen Lundman <lundman@lundman.net> Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de> Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Signed-off-by: Attila Fülöp <attila@fueloep.org>
Closes #14422
Michal Gulbicki [Tue, 24 Jan 2023 14:31:38 +0000 (09:31 -0500)]
qat: Add Intel® 4xxx Series platform support
Overview:
Intel(R) QuickAssist Technology (Intel(R) QAT) provides hardware
acceleration for offloading security, authentication and compression
services from the CPU, thus significantly increasing the performance and
efficiency of standard platform solutions.
This commit introduces:
- Intel® 4xxx Series platform support.
- QuickAssist kernel API implementation update for Generation 4 device.
Enabled services: symmetric cryptography and data compression.
- Increased default number of crypto instances in static configuration
for performance purposes.
OCF backend changes:
- changed GCM/CCM MAC validation policy to generate MAC by HW
and validate by SW due to the QAT HW limitations.
Patch co-authored by: Krzysztof Zdziarski <krzysztofx.zdziarski@intel.com>
Patch co-authored by: Michal Jaraczewski <michalx.jaraczewski@intel.com>
Patch co-authored by: Michal Gulbicki <michalx.gulbicki@intel.com>
Patch co-authored by: Julian Grajkowski <julianx.grajkowski@intel.com>
Patch co-authored by: Piotr Kasierski <piotrx.kasierski@intel.com>
Patch co-authored by: Adam Czupryna <adamx.czupryna@intel.com>
Patch co-authored by: Konrad Zelazny <konradx.zelazny@intel.com>
Patch co-authored by: Katarzyna Rucinska <katarzynax.kargol@intel.com>
Patch co-authored by: Lukasz Kolodzinski <lukaszx.kolodzinski@intel.com>
Patch co-authored by: Zbigniew Jedlinski <zbigniewx.jedlinski@intel.com>
Andrew Gallatin [Tue, 24 Jan 2023 01:27:17 +0000 (20:27 -0500)]
dtrace: conditionally load the systrace_linux klds when loading dtrace.
When dtrace starts, it tries to detect if the dtrace klds are loaded,
and if not, it loads them by loading the dtraceall kld. This module
depends on most dtrace modules, including systrace for the native
freebsd and freebsd32 ABIs. However, it does not depend on the
systrace_linux klds, as they in turn depend on the linux ABI klds, and
we don't want to load an ABI module that the user has not explicitly
requested. This can leave a naive user in a state where they think all
syscall providers have been loaded, yet linux ABI syscalls are
"invisible" to dtrace.
To fix this, check to see if the linux ABI modules are loaded. If they
are, then load their systrace klds.
libcxx: Implement atomic::wait/notify using _umtx_op(2) for 64bit arches
Only 64bit architectures can be supported this way, because libcxx
defines __cxx_contention_t to be int64_t for FreeBSD, and 32bit arches
do not have a kind of UMTX_OP_WAIT_INT64_PRIVATE operation.
David Hedberg [Mon, 23 Jan 2023 21:19:43 +0000 (22:19 +0100)]
Wait for txg sync if the last DRR_FREEOBJECTS might result in a hole
If we receive a DRR_FREEOBJECTS as the first entry in an object range,
this might end up producing a hole if the freed objects were the
only existing objects in the block.
If the txg starts syncing before we've processed any following
DRR_OBJECT records, this leads to a possible race where the backing
arc_buf_t gets its psize set to 0 in the arc_write_ready() callback
while still being referenced from a dirty record in the open txg.
To prevent this, we insert a txg_wait_synced call if the first
record in the range was a DRR_FREEOBJECTS that actually
resulted in one or more freed objects.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: David Hedberg <david.hedberg@findity.com>
Sponsored by: Findity AB
Closes #11893
Closes #14358
Richard Yao [Mon, 23 Jan 2023 21:16:22 +0000 (16:16 -0500)]
Reject streams that set ->drr_payloadlen to unreasonably large values
In the zstream code, Coverity reported:
"The argument could be controlled by an attacker, who could invoke the
function with arbitrary values (for example, a very high or negative
buffer size)."
It did not report this in the kernel. This is likely because the
userspace code stored this in an int before passing it into the
allocator, while the kernel code stored it in a uint32_t.
However, this did reveal a potentially real problem. On 32-bit systems
and systems with only 4GB of physical memory or less in general, it is
possible to pass a large enough value that the system will hang. Even
worse, on Linux systems, the kernel memory allocator is not able to
support allocations up to the maximum 4GB allocation size that this
allows.
This had already been limited in userspace to 64MB by
`ZFS_SENDRECV_MAX_NVLIST`, but we need a hard limit in the kernel to
protect systems. After some discussion, we settle on 256MB as a hard
upper limit. Attempting to receive a stream that requires more memory
than that will result in E2BIG being returned to user space.
Reported-by: Coverity (CID-1529836) Reported-by: Coverity (CID-1529837) Reported-by: Coverity (CID-1529838) Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Closes #14285
rob-wing [Mon, 23 Jan 2023 21:14:25 +0000 (12:14 -0900)]
Configure zed's diagnosis engine with vdev properties
Introduce four new vdev properties:
checksum_n
checksum_t
io_n
io_t
These properties can be used for configuring the thresholds of zed's
diagnosis engine and are interpeted as <N> events in T <seconds>.
When this property is set to a non-default value on a top-level vdev,
those thresholds will also apply to its leaf vdevs. This behavior can be
overridden by explicitly setting the property on the leaf vdev.
Note that, these properties do not persist across vdev replacement. For
this reason, it is advisable to set the property on the top-level vdev
instead of the leaf vdev.
The default values for zed's diagnosis engine (10 events, 600 seconds)
remains unchanged.
Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Reviewed-by: Allan Jude <allan@klarasystems.com> Signed-off-by: Rob Wing <rob.wing@klarasystems.com> Sponsored-by: Seagate Technology LLC
Closes #13805
Richard Yao [Mon, 23 Jan 2023 21:12:37 +0000 (16:12 -0500)]
free_blocks(): Fix reports from 2016 PVS Studio FreeBSD report
In 2016, the authors of PVS Studio ran it on the FreeBSD kernel, which
identified a number of bugs / cleanup opportunities in the FreeBSD ZFS kernel
code. A few of them persist to the present day:
In particular, we have the following in free_blocks():
\sys\cddl\contrib\opensolaris\uts\common\fs\zfs\dnode_sync.c (174): error V547: Expression '__left >= __right' is always true. Unsigned type value is always >= 0.
\sys\cddl\contrib\opensolaris\uts\common\fs\zfs\dnode_sync.c (171): error V634: The priority of the '*' operation is higher than that of the '<<' operation. It's possible that parentheses should be used in the expression.
\sys\cddl\contrib\opensolaris\uts\common\fs\zfs\dnode_sync.c (175): error V547: Expression '__left >= __right' is always true. Unsigned type value is always >= 0.
A couple of assertions accidentally typecast the arguments they check to
unsigned in such a way that the result is always true. Also, parentheses
are missing around `1<<epbs` in `(db->db_blkid * 1<<epbs)`. This works
out to be okay due to multiplication not caring what order of operations
we use, but it is better to fix it to be `(db->db_blkid << epbs)`.
A few of the function local variables probably never should have been
32-bit in the first place, so we make them 64-bit. We also replace the
existing assertions with additional assertions to ensure that 64-bit
unsigned arithmetic is safe.
Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Closes #14407
Mark Johnston [Mon, 23 Jan 2023 19:42:41 +0000 (14:42 -0500)]
netmap: Try to count packet drops in emulated mode
Right now we have little visibility into packet drops within netmap.
Start trying to make packet loss issues more visible by counting queue
drops in the transmit path, and in the input path for interfaces running
in emulated mode, where we place received packets in a bounded software
queue that is processed by rxsync.
Mark Johnston [Mon, 23 Jan 2023 19:41:05 +0000 (14:41 -0500)]
netmap: Tell the compiler to avoid reloading ring indices
Per the removed comments these fields should be loaded only once, since
they can in principle be modified concurrently, though this would be a
violation of the userspace contract with netmap.
Mitchell Horne [Mon, 9 Jan 2023 17:14:19 +0000 (13:14 -0400)]
vtblk: secondary fix for dumping
The code paths while dumping do not got through busdma. As such,
safeguard against calling bus_dmamap_sync() with a NULL map. The x86
implementation tolerates this but others do not, resulting in a
NULL dereference panic when dumping to a vtblk device on arm64, riscv,
etc.
Fixes: 782105f7c898 ("vtblk: Use busdma")
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D37990
Mitchell Horne [Sat, 7 Jan 2023 18:09:28 +0000 (14:09 -0400)]
ddb: have 'reset' command use normal reboot path
This conditionally gives all registered shutdown handlers a chance to
perform the reboot, with cpu_reset() being the fallback. The '\s'
modifier can be used with the command to get the previous behaviour.
The motivation is that some platforms may not be able do anything
meaningful via cpu_reset(), due to a lack of standardized reset
mechanism and/or firmware shortcomings. However, they may have a
separate device driver attached that normally performs the reboot. Such
is the case for some versions of the Raspberry Pi, where reset via PSCI
fails, but the BCM2835 watchdog driver has a shutdown hook.
Currently shutdown_reset() is registered as the final entry of the
shutdown_final event handler. However, if a panic occurs early in boot
before the event is registered (SI_SUB_INTRINSIC), we may end up
spinning in the subsequent infinite for loop and failing to reset
altogether. Instead we can simply call this function unconditionally.
Jiajie Chen [Mon, 23 Jan 2023 16:36:59 +0000 (00:36 +0800)]
Add kf_file_nlink field to kf_file and populate it
This will allow user-space programs (e.g. lsof) to locate deleted files
whose nlink equals zero. Prior to this commit, programs has to use
stat(kf_path) to get nlink, but that will fail if the file is deleted.
netinet6: honor blackhole/unreach routes in the non-fastforwading code.
Currently, under the conditions specified below, IPv6 ingress packet
processing can ignore blackhole/reject flag on the prefix. The packet
will instead be looped locally till TTL expiration and a single ICMPv6
unreachable message will be send to the source even in case of
RTF_BLACKHOLE.
The following conditions needs hold to make the scenario happen:
* IPv6 forwarding is enabled
* Packet is not fast-forwarded
* Destination prefix has either RTF_BLACKHOLE or RTF_REJECT flag
Fix this behavior by checking for the blackhole/reject flags in
ip6_forward().
* Automatically use IPv6 when IPv6 addresses are used, --ip6 is not needed.
* Building of ping requests and parsing of ping replies is done layer by
layer. This way most arguments are available both for IPv6 and IPv4,
for ICMP and TCP.
* Use argument groups for improved readability.
* Change ToS and TTL argument name to TC and HL to reflect the modern
IPv6 nomenclature. The argument still set related IPv4 header fields
properly.
* Instead of sniffing for the very specific case of duplicated packets,
allow for sniffing on multiple interfaces.
* Report which sniffer has failed by setting bits of error code.
* Raise meaningful exceptions when irrecoverable errors happen.
* Make IPv4 fragmentation flags configurable.
* Make IPv6 HL / IPv4 TTL configurable.
* Make TCP MSS configurable.
* Make TCP sequence number configurable.
* Make ICMP payload size configurable.
* Add debug output.
* Move command line argument parsing out of network functions.
* Make the code somehow PEP-8 compliant.
* Remove ambiguity of configuring recvif, it must be now explicitly specified.
* Don't catch exceptions around creating the sniffer, let it properly
fail and display the whole stack trace.
* Count correct packets so that duplicates can be found.