Mark Johnston [Tue, 8 Jun 2021 13:40:30 +0000 (09:40 -0400)]
hyperv: Fix vmbus after the i386 4/4 split
The vmbus ISR needs to live in a trampoline. Dynamically allocating a
trampoline at driver initialization time poses some difficulties due to
the fact that the KENTER macro assumes that the offset relative to
tramp_idleptd is fixed at static link time. Another problem is that
native_lapic_ipi_alloc() uses setidt(), which assumes a fixed trampoline
offset.
Rather than fight this, move the Hyper-V ISR to i386/exception.s. Add a
new HYPERV kernel option to make this optional, and configure it by
default on i386. This is sufficient to make use of vmbus(4) after the
4/4 split. Note that vmbus cannot be loaded dynamically and both the
HYPERV option and device must be configured together. I think this is
not too onerous a requirement, since vmbus(4) was previously
non-functional.
Reported by: Harry Schmalzbauer <freebsd@omnilan.de>
Tested by: Harry Schmalzbauer <freebsd@omnilan.de>
Reviewed by: whu, kib
Sponsored by: The FreeBSD Foundation
Dimitry Andric [Mon, 14 Jun 2021 19:17:05 +0000 (21:17 +0200)]
Export various 128 bit long double functions from libgcc_s.so.1
These were already compiled for some time on aarch64 and riscv, by
including lib/libcompiler_rt/Makefile.inc, but never exported in the
shared library. Since gcc exports these under version GCC_4.6.0, we do
the same.
This review should replace D11482 for now. For e.g. amd64 more work is
still to be done, as compiler-rt does not seem to support 128 bit long
double math for that architecture.
Reviewed by: cem
Differential Revision: https://reviews.freebsd.org/D28690
Mark Johnston [Mon, 14 Jun 2021 21:32:27 +0000 (17:32 -0400)]
Consistently use the SOLISTENING() macro
Some code was using it already, but in many places we were testing
SO_ACCEPTCONN directly. As a small step towards fixing some bugs
involving synchronization with listen(2), make the kernel consistently
use SOLISTENING(). No functional change intended.
Mark Johnston [Mon, 14 Jun 2021 21:32:18 +0000 (17:32 -0400)]
amd64: Fix propagation of LDT updates
When a process has used sysarch(2) to specify descriptors for its
private LDT, upon rfork(RFMEM) descriptors are copied into the new child
process. Any updates to the descriptors are thus reflected to all other
processes sharing the vmspace. However, this is incorrect in the rather
obscure case where the child process was created before the LDT was
modified. Fix this by only modifying other processes which already
share the LDT.
Reported by: syzkaller
Reviewed by: kib
Sponsored by: The FreeBSD Foundation
Marko Zec [Thu, 17 Jun 2021 06:49:09 +0000 (08:49 +0200)]
tests: Revise FIB lookups per second benchmarking routines
Fix a bug in the LPM SEQ benchmark (missing break inside a switch block)
by restructuring the test loop, while introducing additional two
synthetic test options:
ANN: scan only the address space announced in current RIB
REP: repeat lookups over several keys in a sliding window scheme
The total of eight combinations of test options are now available
through dedicated sysctl hooks.
Differential Revision: <https://reviews.freebsd.org/D30311>
Reviewed by: melifaro
MFC after: 3 days
Marko Zec [Wed, 5 May 2021 10:28:17 +0000 (12:28 +0200)]
Revise FIB lookups per second benchmarking routines.
Add a LPS benchmark variant which introduces artificial dependencies
between successive lookups. While here, instead of writing the results
from the lookups to a huge array, add them to an accumulator, in a more
lightweight attempt at preventing the CPU's OOO machinery from
discarding the lookup results if they would be completely unused.
net.route.test.run_lps_rnd measures LPS throughput with independent
uniformly random keys
net.route.test.run_lps_seq measures LPS throughput with uniformly
random keys with artificial interdependencies
Reviewed by: melifaro
MFC after: 7 days
Differential Revision: https://reviews.freebsd.org/D30096
Mark Johnston [Wed, 16 Jun 2021 13:46:56 +0000 (09:46 -0400)]
ipfw: Update the pfil mbuf pointer in ipfw_check_frame()
ipfw_chk() might call m_pullup() and thus can change the mbuf chain
head. In this case, the new chain head has to be returned to the pfil
hook caller, otherwise the pfil hook caller is left with a dangling
pointer.
Note that this affects only the link-layer hooks installed when the
net.link.ether.ipfw sysctl is set to 1.
PR: 256439, 254015, 255069, 255104
Fixes: f355cb3e6
Reviewed by: ae
Sponsored by: The FreeBSD Foundation
George Amanakis [Thu, 17 Jun 2021 00:17:42 +0000 (03:17 +0300)]
Avoid deadlock when removing L2ARC devices under I/O
In case we have I/O and try to remove an L2ARC device a deadlock might
occur. arc_read()->zio_read()->zfs_blkptr_verify() waits for SCL_VDEV
to be dropped while holding the hash_lock. However, spa_l2cache_load()
holds SCL_ALL and waits for the hash_lock in l2arc_evict().
Fix this by moving zfs_blkptr_verify() to the top top arc_read() before
the hash_lock is taken. Verify the block pointer and return a checksum
error if damaged rather than halting the system, by using
BLK_VERIFY_LOG instead of BLK_VERIFY_HALT.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Mark Maybee <mark.maybee@delphix.com> Signed-off-by: George Amanakis <gamanakis@gmail.com>
Closes #12054
Alan Somers [Thu, 22 Apr 2021 21:09:03 +0000 (15:09 -0600)]
gmultipath: make physpath distinct from the underlying providers'
zfsd uses a device's physical path attribute to automatically replace a
missing ZFS disk when a blank disk is inserted into the same physical
slot. Currently gmultipath passes through its underlying providers'
physical path attribute. That may cause zfsd to replace a missing
gmultipath provider with a newly arrived, single-path disk. That would
be bad.
This commit fixes that problem by simply appending "/mp" to the
underlying providers' physical path, in a manner similar to what geli
already does.
Kristof Provost [Fri, 21 May 2021 12:26:49 +0000 (14:26 +0200)]
dummynet: Fix schedlist and aqmlist locking
These are global (i.e. shared across vnets) structures, so we need
global lock to protect them. However, we look up entries in these lists
(find_aqm_type(), find_sched_type()) and return them. We must ensure
that the returned structures cannot go away while we are using them.
Resolve this by using NET_EPOCH(). The structures can be safely accessed
under it, and we postpone their cleanup until we're sure they're no
longer used.
Marko Zec [Wed, 5 May 2021 11:45:52 +0000 (13:45 +0200)]
Introduce DXR as an IPv4 longest prefix matching / FIB module
DXR maintains compressed lookup structures with a trivial search
procedure. A two-stage trie is indexed by the more significant bits of
the search key (IPv4 address), while the remaining bits are used for
finding the next hop in a sorted array. The tradeoff between memory
footprint and search speed depends on the split between the trie and
the remaining binary search. The default of 20 bits of the key being
used for trie indexing yields good performance (see below) with
footprints of around 2.5 Bytes per prefix with current BGP snapshots.
Rebuilding lookup structures takes some time, which is compensated for by
batching several RIB change requests into a single FIB update, i.e. FIB
synchronization with the RIB may be delayed for a fraction of a second.
RIB to FIB synchronization, next-hop table housekeeping, and lockless
lookup capability is provided by the FIB_ALGO infrastructure.
DXR works well on modern CPUs with several MBytes of caches, especially
in VMs, where is outperforms other currently available IPv4 FIB
algorithms by a large margin.
Test functionality of ng_vlan_rotate(4):
- Rotate 1 to 9 stagged vlans in any possible direction and length
- Rotate random combinations of ethertypes (8100, 88a8, 9100)
- Automatic reverse rotating for backward data flow
- Test too many and too few vlans
Test functionality of ng_hub(4):
- replicting traffic to anything but the sending hook
- persistence
- an unrestricted loop
- implementation limits with many hooks.
Turns out $ZPOOL_IMPORT_OPTS expands in a shell-like fashion,
yielding 'import' '-aN' '-o' 'cachefile=none' for an unset variable,
and 'import' '-aN' '-o' 'cachefile=none' 'word1' 'word2' for a
white-spaced one, but ${ZPOOL_IMPORT_OPTS} expands like "${Z_I_O}"
would in a shell, yielding 'import' '-aN' '-o' 'cachefile=none' ''
(empty) and 'import' '-aN' '-o' 'cachefile=none' 'word1 word2' (spaced)
Matthew Ahrens [Sun, 13 Jun 2021 17:48:53 +0000 (10:48 -0700)]
vdev_draid_min_asize() ignores reserved space
vdev_draid_min_asize() returns the minimum size of a child vdev. This
is used when determining if a disk is big enough to replace a child.
It's also used by zdb to determine how big of a child to make to test
replacement.
vdev_draid_min_asize() says that the child’s asize has to be at least
1/Nth of the entire draid’s asize, which is the same logic as raidz.
However, this contradicts the code in vdev_draid_open(), which
calculates the draid’s asize based on a reduced child size:
An additional 32MB of scratch space is reserved at the end of each
child for use by the dRAID expansion feature
So the problem is that you can replace a draid disk with one that’s
vdev_draid_min_asize(), but it actually needs to be larger to accommodate
the additional 32MB. The replacement is allowed and everything works at
first (since the reserved space is at the end, and we don’t try to use
it yet), but when you try to close and reopen the pool,
vdev_draid_open() calculates a smaller asize for the draid, because of
the smaller leaf, which is not allowed.
I think the confusion is that vdev_draid_min_asize() is correctly
returning the amount of required *allocatable* space in a leaf, but the
actual *size* of the leaf needs to be at least 32MB more than that.
ztest_vdev_attach_detach() assumes that it can attach that size of
device, and it actually can (the kernel/libzpool accepts it), but it
then later causes zdb to not be able to open the pool.
This commit changes vdev_draid_min_asize() to return the required size
of the leaf, not the size that draid will make available to the metaslab
allocator.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Mark Maybee <mark.maybee@delphix.com> Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes #11459
Closes #12221
Paul Zuchowski [Sat, 12 Jun 2021 00:00:33 +0000 (20:00 -0400)]
Do not hash unlinked inodes
In zfs_znode_alloc we always hash inodes. If the
znode is unlinked, we do not need to hash it. This
fixes the problem where zfs_suspend_fs is doing zrele
(iput) in an async fashion, and zfs_resume_fs unlinked
drain processing will try to hash an inode that could
still be hashed, resulting in a panic.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alan Somers <asomers@gmail.com> Signed-off-by: Paul Zuchowski <pzuchowski@datto.com>
Closes #9741
Closes #11223
Closes #11648
Closes #12210
Brian Behlendorf [Fri, 11 Jun 2021 15:21:36 +0000 (08:21 -0700)]
ZTS: Add zfs_clone_livelist_dedup.ksh to Makefile.am
Commit 86b5f4c12 added a new zfs_clone_livelist_dedup.ksh test case
but didn't include it in the Makefile.am. This results in the test
not being included in the dist tarball so it's never run by the CI.
Reviewed-by: John Kennedy <john.kennedy@delphix.com> Reviewed-by: Serapheim Dimitropoulos <serapheim@delphix.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes: #12224
Pedro F. Giffuni [Mon, 31 May 2021 01:48:38 +0000 (20:48 -0500)]
fread: improve performance for unbuffered reads
We can use the buffer passed to fread(3) directly in the FILE *.
The buffer needs to be reset before each call to __srefill().
This preserves the expected behavior in all cases.
The change was found originally in OpenBSD and later adopted by NetBSD.
Casper services expect that the first 3 descriptors (stdin/stdout/stderr)
will point to /dev/null. Which Casper will ensure later. The Casper
services are forked from the original process. If the initial process
closes one of those descriptors, Casper may reuse one of them for it on
purpose. If this is the case, then renumarate the descriptors used by
Casper to higher numbers. This is done already after the fork, so it
doesn't break the parent process.
Test functionality of ng_bridge(4):
- replicating traffic to anything but the sending hook
- persistence
- detect loops
- unicast to only one link of many
- stretch to implementation limits on broadcast
tests/netgraph: Inital framework for testing libnetgraph
Provide a framework of functions to test various netgraph modules.
Tests contain:
- creating, renaming, and destroying nodes
- connecting and removing hooks
- sending and receiving data
- sending ASCII messages and receiving binary responses
- errors can be passed for indiviual inspection or fail the test
Also contains some fixups:
- indent all files correctly
- finish factoring out
- remove debugging code
- check for renaming issues reported in PR241954
Randall Stewart [Fri, 11 Jun 2021 15:38:08 +0000 (11:38 -0400)]
tcp: Missing mfree in rack and bbr
Recently (Nov) we added logic that protects against a peer negotiating a timestamp, and
then not including a timestamp. This involved in the input path doing a goto done_with_input
label. Now I suspect the code was cribbed from one in Rack that has to do with the SYN.
This had a bug, i.e. it should have a m_freem(m) before going to the label (bbr had this
missing m_freem() but rack did not). This then caused the missing m_freem to show
up in both BBR and Rack. Also looking at the code referencing m->m_pkthdr.lro_nsegs
later (after processing) is not a good idea, even though its only for logging. Best to
copy that off before any frees can take place.
Randall Stewart [Thu, 10 Jun 2021 12:33:57 +0000 (08:33 -0400)]
tcp: Mbuf leak while holding a socket buffer lock.
When running at NF the current Rack and BBR changes with the recent
commits from Richard that cause the socket buffer lock to be held over
the ip_output() call and then finally culminating in a call to tcp_handle_wakeup()
we get a lot of leaked mbufs. I don't think that this leak is actually caused
by holding the lock or what Richard has done, but is exposing some other
bug that has probably been lying dormant for a long time. I will continue to
look (using his changes) at what is going on to try to root cause out the issue.
In the meantime I can't leave the leaks out for everyone else. So this commit
will revert all of Richards changes and move both Rack and BBR back to just
doing the old sorwakeup_locked() calls after messing with the so_rcv buffer.
We may want to look at adding back in Richards changes after I have pinpointed
the root cause of the mbuf leak and fixed it.
Randall Stewart [Wed, 9 Jun 2021 17:58:54 +0000 (13:58 -0400)]
tcp: LRO timestamps have lost their previous precision
Recently we had a rewrite to tcp_lro.c that was tested but one subtle change
was the move to a less precise timestamp. This causes all kinds of chaos
in tcp's that do pacing and needs to be fixed to use the more precise
time that was there before.
Mark Johnston [Sun, 6 Jun 2021 20:40:45 +0000 (16:40 -0400)]
arm64: Fix pmap_copy()'s handling of 2MB mappings
When copying mappings from parent to child, we clear the accessed and
dirty bits. This is done for both 4KB and 2MB PTEs. However,
pmap_demote_l2() asserts that writable superpages must be dirty. This
is to avoid races with the MMU setting the dirty bit during promotion
and demotion. pmap_copy() can create clean, writable superpage
mappings, so it violates this assertion.
Modify pmap_copy() to preserve the accessed and dirty bits when copying
2MB mappings, like we do on amd64.
Fixes: ca2cae0b4dd
Reported by: Jenkins via mhorne
Reviewed by: alc, kib
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D30643
Mark Johnston [Sun, 6 Jun 2021 20:41:35 +0000 (16:41 -0400)]
riscv: Handle hardware-managed dirty bit updates in pmap_promote_l2()
pmap_promote_l2() failed to handle implementations which set the
accessed and dirty flags. In particular, when comparing the attributes
of a run of 512 PTEs, we must handle the possibility that the hardware
will set PTE_D on a clean, writable mapping.
Following the example of amd64 and arm64, change riscv's
pmap_promote_l2() to downgrade clean, writable mappings to read-only, so
that updates are synchronized by the pmap lock.
Mark Johnston [Sun, 6 Jun 2021 20:40:19 +0000 (16:40 -0400)]
Suppress D_NEEDGIANT warnings for some drivers
During boot we warn that the kbd and openfirm drivers are Giant-locked
and may be deleted. Generally, the warning helps signal that certain
old drivers are not being maintained and are subject to removal, but
this doesn't really apply to certain drivers which are harder to
detangle from Giant.
Add a flag, D_GIANTOK, that devices can specify to suppress the
misleading warning. Use it in the kbd and openfirm drivers.
Reviewed by: imp, jhb
Sponsored by: The FreeBSD Foundation
Mark Johnston [Sun, 6 Jun 2021 20:40:29 +0000 (16:40 -0400)]
arm64: Use the right PTE when downgrading perms in pmap_promote_l2()
When promoting a run of small mappings to a superpage, we have to
downgrade clean, writable mappings to read-only, to handle the
possibility that the MMU will concurrently mark one of the mappings as
dirty.
The code which performed this operation for the first PTE in the run
used the wrong PTE pointer. As a result, the comparison would always
fail, aborting the promotion. This only occurs when promoting writable,
clean mappings.
Fixes: ca2cae0b4dd
Reviewed by: alc, kib
Sponsored by: The FreeBSD Foundation
Kristof Provost [Sat, 15 May 2021 11:45:55 +0000 (13:45 +0200)]
pf: Convenience function for optional (numeric) arguments
Add _opt() variants for the uint* functions. These functions set the
provided default value if the nvlist doesn't contain the relevant value.
This is helpful for optional values (e.g. when the API is extended to
add new fields).
While here simplify the header by also using macros to create the
prototypes for the macro-generated function implementations.
- Allow firmware downloading for hw_variant #8;
- Enter manufacturer mode for setting of event mask;
- Handle multi-event response on HCI commands for 7260;
This allows to remove kludge with skipping of 0xfc2f opcode.
- Disable patch and exit manufacturer mode on downloading failure;
- Use default firmware if correct firmware file is not found;
Marcin Wojtas [Tue, 6 Apr 2021 15:00:05 +0000 (17:00 +0200)]
pciconf: Use VM_MEMATTR_DEVICE on supported architectures
Some architectures - armv7, armv8 and riscv use VM_MEMATTR_DEVICE
when mapping device registers in kernel. Do the same in pciconf.
On armada8k SoC all reads from BARs mapped with hitherto attribute
(VM_MEMATTR_UNCACHEABLE) return 0xff's.
Andrew Turner [Thu, 20 May 2021 06:52:15 +0000 (06:52 +0000)]
Clean up early arm64 pmap code
Early in the arm64 pmap code we need to translate between a virtual
address and a physical address. Rather than manually walking the page
table we can ask the hardware to do it for us.
Andrew Turner [Sun, 11 Apr 2021 09:00:00 +0000 (09:00 +0000)]
Use if ... else when printing memory attributes
In vmstat there is a switch statement that converts these attributes to
a string. As some values can be duplicate we have to hide these from
userspace.
Replace this switch statement with an if ... else macro that lets us
repeat values without a compiler error.
Rick Macklem [Fri, 28 May 2021 02:08:36 +0000 (19:08 -0700)]
nfscl: Use hash lists to improve expected search performance for opens
A problem was reported via email, where a large (130000+) accumulation
of NFSv4 opens on an NFSv4 mount caused significant lock contention
on the mutex used to protect the client mount's open/lock state.
Although the root cause for the accumulation of opens was not
resolved, it is obvious that the NFSv4 client is not designed to
handle 100000+ opens efficiently. When searching for an open,
usually for a match by file handle, a linear search of all opens
is done.
Commit 3f7e14ad9345 added a hash table of lists hashed on file handle
for the opens. This patch uses the hash lists for searching for
a matching open based of file handle instead of an exhaustive
linear search of all opens.
This change appears to be performance neutral for a small number
of opens, but should improve expected performance for a large
number of opens.
This commit should not affect the high level semantics of open
handling.