tests/netgraph: Inital framework for testing libnetgraph
Provide a framework of functions to test various netgraph modules.
Tests contain:
- creating, renaming, and destroying nodes
- connecting and removing hooks
- sending and receiving data
- sending ASCII messages and receiving binary responses
- errors can be passed for indiviual inspection or fail the test
Also contains some fixups:
- indent all files correctly
- finish factoring out
- remove debugging code
- check for renaming issues reported in PR241954
Randall Stewart [Fri, 11 Jun 2021 15:38:08 +0000 (11:38 -0400)]
tcp: Missing mfree in rack and bbr
Recently (Nov) we added logic that protects against a peer negotiating a timestamp, and
then not including a timestamp. This involved in the input path doing a goto done_with_input
label. Now I suspect the code was cribbed from one in Rack that has to do with the SYN.
This had a bug, i.e. it should have a m_freem(m) before going to the label (bbr had this
missing m_freem() but rack did not). This then caused the missing m_freem to show
up in both BBR and Rack. Also looking at the code referencing m->m_pkthdr.lro_nsegs
later (after processing) is not a good idea, even though its only for logging. Best to
copy that off before any frees can take place.
Randall Stewart [Thu, 10 Jun 2021 12:33:57 +0000 (08:33 -0400)]
tcp: Mbuf leak while holding a socket buffer lock.
When running at NF the current Rack and BBR changes with the recent
commits from Richard that cause the socket buffer lock to be held over
the ip_output() call and then finally culminating in a call to tcp_handle_wakeup()
we get a lot of leaked mbufs. I don't think that this leak is actually caused
by holding the lock or what Richard has done, but is exposing some other
bug that has probably been lying dormant for a long time. I will continue to
look (using his changes) at what is going on to try to root cause out the issue.
In the meantime I can't leave the leaks out for everyone else. So this commit
will revert all of Richards changes and move both Rack and BBR back to just
doing the old sorwakeup_locked() calls after messing with the so_rcv buffer.
We may want to look at adding back in Richards changes after I have pinpointed
the root cause of the mbuf leak and fixed it.
Randall Stewart [Wed, 9 Jun 2021 17:58:54 +0000 (13:58 -0400)]
tcp: LRO timestamps have lost their previous precision
Recently we had a rewrite to tcp_lro.c that was tested but one subtle change
was the move to a less precise timestamp. This causes all kinds of chaos
in tcp's that do pacing and needs to be fixed to use the more precise
time that was there before.
Mark Johnston [Sun, 6 Jun 2021 20:40:45 +0000 (16:40 -0400)]
arm64: Fix pmap_copy()'s handling of 2MB mappings
When copying mappings from parent to child, we clear the accessed and
dirty bits. This is done for both 4KB and 2MB PTEs. However,
pmap_demote_l2() asserts that writable superpages must be dirty. This
is to avoid races with the MMU setting the dirty bit during promotion
and demotion. pmap_copy() can create clean, writable superpage
mappings, so it violates this assertion.
Modify pmap_copy() to preserve the accessed and dirty bits when copying
2MB mappings, like we do on amd64.
Fixes: ca2cae0b4dd
Reported by: Jenkins via mhorne
Reviewed by: alc, kib
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D30643
Mark Johnston [Sun, 6 Jun 2021 20:41:35 +0000 (16:41 -0400)]
riscv: Handle hardware-managed dirty bit updates in pmap_promote_l2()
pmap_promote_l2() failed to handle implementations which set the
accessed and dirty flags. In particular, when comparing the attributes
of a run of 512 PTEs, we must handle the possibility that the hardware
will set PTE_D on a clean, writable mapping.
Following the example of amd64 and arm64, change riscv's
pmap_promote_l2() to downgrade clean, writable mappings to read-only, so
that updates are synchronized by the pmap lock.
Mark Johnston [Sun, 6 Jun 2021 20:40:19 +0000 (16:40 -0400)]
Suppress D_NEEDGIANT warnings for some drivers
During boot we warn that the kbd and openfirm drivers are Giant-locked
and may be deleted. Generally, the warning helps signal that certain
old drivers are not being maintained and are subject to removal, but
this doesn't really apply to certain drivers which are harder to
detangle from Giant.
Add a flag, D_GIANTOK, that devices can specify to suppress the
misleading warning. Use it in the kbd and openfirm drivers.
Reviewed by: imp, jhb
Sponsored by: The FreeBSD Foundation
Mark Johnston [Sun, 6 Jun 2021 20:40:29 +0000 (16:40 -0400)]
arm64: Use the right PTE when downgrading perms in pmap_promote_l2()
When promoting a run of small mappings to a superpage, we have to
downgrade clean, writable mappings to read-only, to handle the
possibility that the MMU will concurrently mark one of the mappings as
dirty.
The code which performed this operation for the first PTE in the run
used the wrong PTE pointer. As a result, the comparison would always
fail, aborting the promotion. This only occurs when promoting writable,
clean mappings.
Fixes: ca2cae0b4dd
Reviewed by: alc, kib
Sponsored by: The FreeBSD Foundation
Kristof Provost [Sat, 15 May 2021 11:45:55 +0000 (13:45 +0200)]
pf: Convenience function for optional (numeric) arguments
Add _opt() variants for the uint* functions. These functions set the
provided default value if the nvlist doesn't contain the relevant value.
This is helpful for optional values (e.g. when the API is extended to
add new fields).
While here simplify the header by also using macros to create the
prototypes for the macro-generated function implementations.
- Allow firmware downloading for hw_variant #8;
- Enter manufacturer mode for setting of event mask;
- Handle multi-event response on HCI commands for 7260;
This allows to remove kludge with skipping of 0xfc2f opcode.
- Disable patch and exit manufacturer mode on downloading failure;
- Use default firmware if correct firmware file is not found;
Marcin Wojtas [Tue, 6 Apr 2021 15:00:05 +0000 (17:00 +0200)]
pciconf: Use VM_MEMATTR_DEVICE on supported architectures
Some architectures - armv7, armv8 and riscv use VM_MEMATTR_DEVICE
when mapping device registers in kernel. Do the same in pciconf.
On armada8k SoC all reads from BARs mapped with hitherto attribute
(VM_MEMATTR_UNCACHEABLE) return 0xff's.
Andrew Turner [Thu, 20 May 2021 06:52:15 +0000 (06:52 +0000)]
Clean up early arm64 pmap code
Early in the arm64 pmap code we need to translate between a virtual
address and a physical address. Rather than manually walking the page
table we can ask the hardware to do it for us.
Andrew Turner [Sun, 11 Apr 2021 09:00:00 +0000 (09:00 +0000)]
Use if ... else when printing memory attributes
In vmstat there is a switch statement that converts these attributes to
a string. As some values can be duplicate we have to hide these from
userspace.
Replace this switch statement with an if ... else macro that lets us
repeat values without a compiler error.
Rick Macklem [Fri, 28 May 2021 02:08:36 +0000 (19:08 -0700)]
nfscl: Use hash lists to improve expected search performance for opens
A problem was reported via email, where a large (130000+) accumulation
of NFSv4 opens on an NFSv4 mount caused significant lock contention
on the mutex used to protect the client mount's open/lock state.
Although the root cause for the accumulation of opens was not
resolved, it is obvious that the NFSv4 client is not designed to
handle 100000+ opens efficiently. When searching for an open,
usually for a match by file handle, a linear search of all opens
is done.
Commit 3f7e14ad9345 added a hash table of lists hashed on file handle
for the opens. This patch uses the hash lists for searching for
a matching open based of file handle instead of an exhaustive
linear search of all opens.
This change appears to be performance neutral for a small number
of opens, but should improve expected performance for a large
number of opens.
This commit should not affect the high level semantics of open
handling.
Rick Macklem [Tue, 25 May 2021 21:19:29 +0000 (14:19 -0700)]
nfscl: Use hash lists to improve expected search performance for opens
A problem was reported via email, where a large (130000+) accumulation
of NFSv4 opens on an NFSv4 mount caused significant lock contention
on the mutex used to protect the client mount's open/lock state.
Although the root cause for the accumulation of opens was not
resolved, it is obvious that the NFSv4 client is not designed to
handle 100000+ opens efficiently. When searching for an open,
usually for a match by file handle, a linear search of all opens
is done.
Commit 3f7e14ad9345 added a hash table of lists hashed on file handle
for the opens. This patch uses the hash lists for searching for
a matching open based of file handle instead of an exhaustive
linear search of all opens.
This change appears to be performance neutral for a small number
of opens, but should improve expected performance for a large
number of opens. This patch also moves any found match to the front
of the hash list, to try and maintain the hash lists in recently
used ordering (least recently used at the end of the list).
This commit should not affect the high level semantics of open
handling.
Neel Chauhan [Wed, 9 Jun 2021 21:38:52 +0000 (14:38 -0700)]
linuxkpi: Add macros for might_lock_nested() and lockdep_(re/un/)pin_lock()
In Linux, these are macros to locks in the kernel for scheduling purposes.
But as with other macros in this header, we aren't doing anything with them
so we are doing `do {} while (0)` for now.
Andrew Turner [Sun, 2 May 2021 07:43:34 +0000 (07:43 +0000)]
Enable IPIs on CPU 0 on arm and arm64
Not all interrupt controllers enable IPIs by default as the Arm
GIC specs make it an implementation defined option. As at least two
hypervisors have also previously masked the IPIs on boot.
As we already enable these IPIs on the non-boot CPUs it is expected
this is a safe operation.
Robert Wing [Thu, 3 Jun 2021 01:41:31 +0000 (17:41 -0800)]
fsck_ufs: fix segfault with gjournal
The segfault was being hit in ckfini() (sbin/fsck_ffs/fsutil.c) while
attempting to traverse the buffer cache. The tail queue used for the
buffer cache was not initialized before dropping into gjournal_check().
Initialize the buffer cache before calling gjournal_check().
Robert Wing [Thu, 20 May 2021 20:53:52 +0000 (12:53 -0800)]
fsck_ffs(8): fix divide by zero when debug messages are enabled
Only print buffer cache debug message when a cache lookup has been done.
When running `fsck_ffs -d` on a gjournal'ed filesystem, it's possible
that totalreads is greater than zero when no cache lookup has been
done - causing a divide by zero. This commit fixes the following error:
Alexander Motin [Thu, 10 Jun 2021 16:42:31 +0000 (12:42 -0400)]
Re-embed multilist_t storage
This commit partially reverts changes to multilists in PR 7968
(multi-threaded spa-sync()) and adds some cache line alignments to
separate read-only multilists and heavily modified refcount's to
different cache lines.
Reviewed-by: Matthew Ahrens <mahrens@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored-by: iXsystems, Inc.
Closes #12158
Alexander Motin [Thu, 10 Jun 2021 15:27:33 +0000 (11:27 -0400)]
Remove pool io kstats
This mostly reverts "3537 want pool io kstats" commit of 8 years ago.
From one side this code using pool-wide locks became pretty bad for
performance, creating significant lock contention in I/O pipeline.
From another, there are more efficient ways now to obtain detailed
statistics, while this statistics is illumos-specific and much less
usable on Linux and FreeBSD, reported only via procfs/sysctls.
This commit does not remove KSTAT_TYPE_IO implementation, that may
be removed later together with already unused KSTAT_TYPE_INTR and
KSTAT_TYPE_TIMER.
Reviewed-by: Matthew Ahrens <mahrens@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored-By: iXsystems, Inc.
Closes #12212
Rich Ercolani [Thu, 10 Jun 2021 00:57:57 +0000 (20:57 -0400)]
Added error for writing to /dev/ on Linux
Starting in Linux 5.10, trying to write to /dev/{null,zero} errors out.
Prefer to inform people when this happens rather than hoping they guess
what's wrong.
Reviewed-by: Antonio Russo <aerusso@aerusso.net> Reviewed-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: John Kennedy <john.kennedy@delphix.com> Signed-off-by: Rich Ercolani <rincebrain@gmail.com> Closes: #11991
The first warning of a misspelling is a false positive, so we annotate
the script accordingly. As for the x-prefix warnings update the check
to use the conventional '[ -z <string> ]' syntax.
all-syslog.sh:46:47: warning: Possible misspelling: ZEVENT_ZIO_OBJECT
may not be assigned, but ZEVENT_ZIO_OBJSET is. [SC2153]
make_gitrev.sh:53:6: note: Avoid x-prefix in comparisons as it no
longer serves a purpose [SC2268]
man-dates.sh:10:7: note: Avoid x-prefix in comparisons as it no
longer serves a purpose [SC2268]
Reviewed-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz> Reviewed-by: John Kennedy <john.kennedy@delphix.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #12208
Dmitry Chagin [Wed, 26 May 2021 05:34:32 +0000 (08:34 +0300)]
linux_common: retire extra module version.
The second 'linuxcommon' line was added by c66f5b079d2a259c3a65b1efe0f2143cd030dc52
but Linuxulator's modules dependend on 'linux_common'.
To avoid such mistakes in the future rename moduledata name and module
name to 'linux_common' and retire 'linuxcommon' line.
netgraph/ng_base: Renaming a node to the same name is a noop
Detailed analysis in https://github.com/genneko/freebsd-vimage-jails/issues/2
brought the problem down to a double call of ng_node_name() before and
after a vnet move. Because the name of the node is already known
(occupied by itself), the second call fails.
PR: 241954
Reported by: Paul Armstrong
Differential Revision: https://reviews.freebsd.org/D30110
John Baldwin [Tue, 20 Apr 2021 20:22:11 +0000 (13:22 -0700)]
etcupdate: Gracefully handle SIGINT when building trees.
Run the 'build_tree' function inside of a subshell and trap SIGINT to
return an error to the caller. This allows callers to gracefully
cleanup a partially created tree.
While here, redirect stdout/stderr of the subshell to the log file
instead of applying redirections individually to each command executed
while building the tree.
John Baldwin [Tue, 20 Apr 2021 20:21:42 +0000 (13:21 -0700)]
etcupdate: Always extract to a temporary tree.
etcupdate has had a somewhat nasty race condition since its creation
in that its state machine can get very confused if it is interrupted
while building the tree to compare against. This is exacerbated by
the fact that etcupdate doesn't emit any output while building the
tree which can take several seconds (especially in recent years with
the addition of the tree-wide buildconfig/installconfig passes).
To mitigate this, always install a new tree into a temporary directory
created via mktemp as was previously done only for dry-runs via -n.
The existing trees are only rotated and the new tree installed as
/var/db/etcupdate/current after the update command has completed.
This:
(a) improves the error log message,
(b) locks per pool instead of globally,
(c) locks the actual output file instead of /var/lock/zfs-list,
which would otherwise linger there forever (well, still will,
but you can remove it and it won't come back), and
(d) preserves attributes of the output file
instead of reverting them to 0:0 644
It is imperative that the previous commit
("zed-functions.sh: zed_lock(): don't truncate lock")
be included in any series that contains this one
Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz>
Closes #12042
наб [Fri, 14 May 2021 02:18:20 +0000 (04:18 +0200)]
zed.d/all-debug.sh: simplify
By locking the log file itself, we can omit arduous rebinding and
explicit umask setting, but, perhaps more importantly, avoid permanently
littering /var/lock/ with zed.debug.log.lock we will never delete
It is imperative that the previous commit
("zed-functions.sh: zed_lock(): don't truncate lock")
be included in any series that contains this one
Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz>
Closes #12042
наб [Fri, 14 May 2021 02:17:31 +0000 (04:17 +0200)]
zed-functions.sh: zed_lock(): don't truncate lock
By appending instead of truncating, we can lock on any file (with write
permissions) instead of only dedicated lock files, since the locking
process itself no longer alters the file in any way
Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz>
Closes #12042
Alan Somers [Tue, 8 Jun 2021 13:36:43 +0000 (07:36 -0600)]
libzfs: On FreeBSD, use MNT_NOWAIT with getfsstat
`getfsstat(2)` is used to retrieve the list of mounted file systems,
which libzfs uses when fetching properties like mountpoint, atime,
setuid, etc. The `mode` parameter may be `MNT_NOWAIT`, which uses
information in the VFS's cache, or `MNT_WAIT`, which effectively does a
`statfs` on every single mounted file system in order to fetch the most
up-to-date information. As far as I can tell, the only fields that
libzfs cares about are the filesystem's name, mountpoint, fstypename,
and mount flags. Those things are always updated on mount and unmount,
so they will always be accurate in the VFS's mount cache except in two
circumstances:
1) When a file system is busy unmounting
2) When a ZFS file system changes the value of a mount-overridable
property like atime or setuid, but doesn't remount the file system.
Right now that only happens when the property is changed by an
unprivileged user who has delegated authority to change the property
but not to mount the dataset. But perhaps libzfs could choose to do
it for other reasons in the future.
Switching to `MNT_NOWAIT` will greatly improve speed with no downside,
as long as we explicitly update the mount cache whenever we change a
mount-overridable property.
For comparison, Illumos gets this information using the native
`getmntany` and `getmntent` functions, which also use cached
information. The illumos function that would refresh the cache,
`resetmnttab`, is never called by libzfs.
And on GNU/Linux, `getmntany` and `getmntent` don't even communicate
with the kernel directly. They simply parse the file they are given,
which is usually /etc/mtab or /proc/mounts. Perhaps the implementation
of /proc/mounts is synchronous, ala MNT_WAIT; I don't know.
Sponsored-by: Axcient Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alan Somers <asomers@gmail.com> Closes: #12091
Rich Ercolani [Mon, 7 Jun 2021 19:29:27 +0000 (15:29 -0400)]
Force --enable-debug on FreeBSD if INVARIANTS is set
There's already logic to force INVARIANTS on for building if it's
present in the running kernel; however, not having DEBUG enabled
when DEBUG and INVARIANTS are can cause strange panics.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Signed-off-by: Rich Ercolani <rincebrain@gmail.com>
Closes #12185
Closes #12163
Update the logic to handle the dedup-case of consecutive
FREEs in the livelist code. The logic still ensures that
all the FREE entries are matched up with a respective
ALLOC by keeping a refcount for each FREE blkptr that we
encounter and ensuring that this refcount gets to zero
by the time we are done processing the livelist.
zdb -y no longer panics when encountering double frees
Reviewed-by: Matthew Ahrens <mahrens@delphix.com> Reviewed-by: John Kennedy <john.kennedy@delphix.com> Reviewed-by: Don Brady <don.brady@delphix.com> Signed-off-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Closes #11480
Closes #12177