Warner Losh [Fri, 8 Jul 2022 17:07:39 +0000 (11:07 -0600)]
test: Update boot universe build architectures
We build lua by default, so we don't need another build to build it
enabled w/o FORTH. That gives little value over the without forth
builds. Remove all mips, they are no longer relevant. Build aarch64
everywhere we build amd64 (except firewire which is x86 only). Build a
few more architectures once so we have at least one of every arch we
support in at least the default build. This should increase coverage
and still take less time than before.
Warner Losh [Fri, 8 Jul 2022 15:42:34 +0000 (09:42 -0600)]
stans: Narrow the scope of includes and other flags
CFLAGS+= here affects *ALL* libsa files being built. However, this is
only needed for zfs.c, so define it only for this. Also, use the defines
from defs.mk. Move all the zfs.c include hacks together. Also, move the
-Wformat -Wall warnings that were added to CFLAGS+= to the individual
files instead for the same reason.
The original interface created by wtap is named "wlan%d", which is the
same as the name of the vap created by wlan(4) and cause ifconfig(8)
may output like this:
wlan0:
parent interface: wlan0
Rename the interface created by wtap(4) to "wtap%d" to avoid confusing.
Reviewed by: adrian
Sponsored by: Google, Inc. (GSoC 2022)
Differential Revision: https://reviews.freebsd.org/D35751
Rick Macklem [Fri, 8 Jul 2022 14:37:36 +0000 (07:37 -0700)]
nfscl: Fix setting of nfsess_defunct for nfscl_hasexpired()
Commit a7bb120f8b87 added a printf for the case where recovery
has not marked the session defunct by setting nfsess_defunct
to 1. It turns out that nfscl_hasexpired() calls
nfsrpc_setclient() directly, without setting nfsess_defunct.
This patch replaces the printf with code that sets
nfsess_defunct to 1 to handle this case.
If SIGTERM is issued to a process when it is doing I/O on
an "intr" mount, the NFSv4 server may reply NFSERR_BADSTATEID,
due to the Open being prematurely closed.
This can result in a call to nfscl_hasexpired() to do a
recovery.
This would explain at least one hang described in the PR.
Effectively selectroute() addresses two different cases:
providing interface info for multicast destinations and providing
nexthop data for unicast ones. Current implementation intertwines
handling of both cases, especially in the error handling part.
Factor out all route lookup logic in a separate function,
lookup_route() to simplify the code.
Ensure consistent KPI: no error means *retifp is set and otherwise.
netinet6: factor out cached route lookups from selectroute().
Currently selectroute() contains two nearly-identical versions of
the route lookup logic - one for original destination and another
for the case when IPV6_NEXTHOP option was set on the socket.
Factor out handling these route lookups in a separation function to
improve readability.
This change also fixes handling of link-local IPV6_NEXTHOPs.
Mike Walker [Thu, 7 Jul 2022 20:28:37 +0000 (22:28 +0200)]
rc.subr: Make sure oomprotect protects existing children
The rc(8) framework support protecting services from OOM killer.
The current implementation applies the protection after the service has
already started. This works fine if only the main process is to be
protected (*_oomprotect=yes). However, the current implementation fails
to protect existing children when children are also to be protected
(*_oomprotect=all). This patch fixes that.
Note: it is not easy to apply the protectoin earlier because we want to
support both the services which use the "command" variable and those
that use the "start_cmd" variable.
PR: 256148
Approved by: adrian, osogbo
Tested by: Jamie Landeg-Jones <jamie@catflap.org>
Fixes: 3bead71e959d - Add a global option where we can protect
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D35747
Notable upstream pull request merges:
#12438 Avoid panic with recordsize > 128k, raw sending and no large_blocks
#13015 Fix dnode byteswapping
#13256 Add a "zstream decompress" subcommand
#13555 Scrub mirror children without BPs
#13576 Several sorted scrub optimizations
#13579 Fix and disable blocks statistics during scrub
#13582 Several B-tree optimizations
#13591 Avoid two 64-bit divisions per scanned block
#13606 Avoid memory copies during mirror scrub
#13613 Avoid memory copy when verifying raidz/draid parity
CSU tests build fails with '/usr/lib/libgcc_s.so: undefined reference to
fma' when built with LLVM 14 for powerpcspe, so '-lm' is being added
explicitly.
It may be linked to https://reviews.llvm.org/D77558
Reviewed by: imp (earlier version)
MFC after: 2 days
Sponsored by: Instituto de Pesquisas Eldorado (eldorado.org.br)
Differential Revision: https://reviews.freebsd.org/D35691
Fixes a small kernel memory leak which would occur if a pool failed
to import because the `DMU_POOL_VDEV_ZAP_MAP` key can't be read from
a presumably damaged MOS config. In the case of a missing key there
was no leak.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Signed-off-by: Finix1979 <yancw@info2soft.com>
Closes #13629
rc.conf.5: Improve documentation of <name>_oomprotect
Apart from improving readability, this commit mentions that
<name>_oomprotect is ignored in a jail environment. Also, replace
${name}_cmd with the correct ${argument}_cmd and point the reader to
rc.subr(8).
Albert Jakiela [Wed, 6 Jul 2022 14:40:20 +0000 (16:40 +0200)]
e6000sw: add readphy and writephy wrappers
New functions take lock and give lock after operation.
Removed locking and unlocking within other methods,
to prevent from recursive locking on non recursive lock.
Cy Schubert [Thu, 17 Mar 2022 18:05:05 +0000 (11:05 -0700)]
ipfilter: Support only jails in VNET
Jails without VNET have complete access to the ipfilter rules, NAT,
pools and logs. This is insecure. Only allow jails to manipulate
ipfilter rules, NAT tables and ippools if the jail has its own VNET.
Otherwise a jail can affect the global system.
This patch brings ipfilter in line with ipfw's support of VNET jails and
non-support of non-VNET jails.
testing: add ability to specify multi-vnet topologies in the pytest framework.
Notable amount of tests related to the packet IO require two VNET jails
for proper testing and avoiding side effects for the host system.
Additionally, it is often required to run actions in the jails seme-sequentially
- waiting for the listener initialisation can be an example of such
dependency.
This change extends pytest vnet framework to allow defining multi-vnet
multi-epair topologies in declarative style, without any need to bother
about jail or repair names. All jail creation/teardown, interface
creation/teardown and address assignments are handled automatically.
The definitions above will create 2 vnets ("jail_test_output6_base",
"jail_test_output6_base_2"), 3 epairs, attached to both first and
second jails, set up the IP addresses for each epair, spawn another
process for vnet2_handler and pass control to vnet2_handler and
test_output6_base. Both processes can pass objects between each
other using pre-created pipes.
VNET_FOREACH() is a LIST_FOREACH if VIMAGE is set, but empty if it's
not. This means that users of the macro couldn't use 'continue' or
'break' as one would expect of a loop.
Change VNET_FOREACH() to be a loop in all cases (although one that is
fixed to one iteration if VIMAGE is not set).
This allows for easy copy-and-paste of a unix(4) peer name for lookup.
With current implementation it is guaranteed that a peer listed could be
found in the output.
sockstat(1): print out full connection graph for unix(4) sockets
Kernel provides us with enough information to display all possible
connections between UNIX sockets.
o Store unp_conn, xu_firstref and xu_nextref in the faddr of a UNIX sock.
o Build tree of file descriptors, indexed by the socket pointer.
o In displaysock() print out all possible information:
1) if socket is bound, print name of this socket
2) if socket has connected to a peer with a name, print peers name
3) if socket has connected to a peer without a name, print [pid fd]
4) if a bound socket has received connections, print list of them
as [pid fd]
Previously, only 1) either 2) were printed.
o Use tree to lookup by socket kvaddr. The size of hash is too big for a
small virtual machine and at the same time too little for a large
production server. A tree would better fit here.
o For those pcbs, that don't have a socket associated, use a list.
o Provide a second tree to lookup by pcb kvaddr. These removes full hash
traversal when printing every unix(4) socket.
tcp: remove a condition in tcp_usr_detach() that never happens
The comment from Robert Watson doubts that this condition ever happens.
Our analysis confirm that. Also, we found that if you manage to create
such a connection with help of some other bug, then after the "second
case" code is executed, the kernel will panic in other part of the stack.
Bug fix to UFS/FFS superblock integrity checks when reading a superblock.
Older versions of growfs(8) failed to correctly update fs_dsize.
Filesystems that have been grown fail the test for fs_dsize's correct
value. For now we exclude the fs_dsize test from the requirements.
Reported by: Edward Tomasz Napiera
Tested by: Edward Tomasz Napiera
Tested by: Peter Holm
MFC after: 1 month (with 076002f24d35)
Differential Revision: https://reviews.freebsd.org/D35219
Bug fix to UFS/FFS superblock integrity checks when reading a superblock.
The original check verified that if an alternate superblock has not
been selected that the superblock is located in its standard location.
For UFS1 the with a 65536 block size, the first backup superblock
is at the same location as the UFS2 superblock. Since SBLOCK_UFS2
is the first location checked, the first backup is the superblock
that will be used for a UFS1 filesystems with a 65536 block size.
This patch allows the use of the first backup superblock in that
situation.
Reported by: Peter Holm
Tested by: Peter Holm
MFC after: 1 month (with 076002f24d35)
Differential Revision: https://reviews.freebsd.org/D35219
Bug fix to UFS/FFS superblock integrity checks when reading a superblock.
The tests for number of cylinder groups (fs_ncg), inodes per cylinder
group (fs_ipg), and the size and layout of the cylinder group summary
information (fs_csaddr and fs_cssize) were overly restrictive and
would exclude some valid filesystems. These updates avoid precluding
valid fiesystems while still detecting rogue values that can crash or
hang the kernel.
testing: provide meaningful error when pytest is not available
atf format does not provide any way of signalling any error message
back to the atf runner when listing tests. Work this around by
reporting "__test_cases_list_pytest_binary_not_found__" test instead.
These are all "standard microarchitectural events", which in theory are
supported by every ARMv8 processor. In practice, it depends on the
pmu-event definitions being complete and accurate, which they are not
for every processor. Still, these aliases should be functional on the
majority of systems.
PR: 254532
Reported by: emaste
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D35550
Thanks to the recently updated import of the jevents utility by mav@, we
can now compile the latest version of these event definitions. This
should support a wider set of common ARMv8 processors, for example, the
Cortex-A72 in the Raspberry Pi 4.
This brings this folder in sync with Linux commit 62e6eb8d5454.
Reviewed by: emaste
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D35549
Michal Krawczyk [Tue, 5 Jul 2022 10:59:25 +0000 (12:59 +0200)]
ena: Fix LLQ descriptor reconfiguration
After the device reset, the LLQ configuration descriptor wasn't passed
to the hardware. On a 6-generation AWS instances (like C6gn), it is
required to pass the LLQ descriptor after the device reset, otherwise
the hardware will be missing the LLQ configuration resulting in
performance degradation.
This patch reconfigures the LLQ each time the ena_device_init() is
called. This means that the LLQ descriptor will be passed during the
initial configuration and after a reset.
The ena_map_llq_mem_bar() function call was moved before the
ena_device_init() call, to make sure that the mem bar is available.
Michal Krawczyk [Mon, 4 Jul 2022 07:03:54 +0000 (09:03 +0200)]
ena: Align req_id and qid print order
In most places, the req_id is printed first, and the qid is printed as a
second. To align the driver, one printout was reworked and the print
order of those variables was changed.
Brooks Davis [Wed, 6 Jul 2022 13:03:48 +0000 (14:03 +0100)]
cddl/*: add a WITH(OUT)_DTRACE option
Add an option to enable/disable DTrace without disabling ZFS. New
architectures such as CHERI may support ZFS before they support DTrace
and the old model of WITHOUT_CDDL disabling both wasn't helpful.
For compatiblity, the CDDL option remains and WITHOUT_CDDL implies
WITHOUT_DTRACE. WITHOUT_DTRACE also implies WITHOUT_CTF.
As part of this change, largely convert cddl/*/Makefile to using the
more compact SUBDIR.${MK_<FOO>}+= form rather than using intermediate
variables.
Mike Karels [Sat, 2 Jul 2022 16:03:36 +0000 (11:03 -0500)]
netstat -i: do not truncate interface names
The field for interface names for netstat -i was 5 characters by
default, which is no longer sufficient with names like "vlan1234"
and "vtnet0". netstat -iW computed the necessary field width, but
also enlarged the address field by a lot (especially with IPv6 enabled).
Make netstat -i compute the field width for interface names with or
without -W. Note that the existing default output does not fit in
80 columns in any case. Update the man page accordingly, documenting
the remaining effect of -W with -i. Also add -W to the list of
General Options, as there are numerous pointers to this.
Reported by: Chris Ross
Reviewed by: melifaro, rgrimes, cy
Differential Revision: https://reviews.freebsd.org/D35703
MFC after: 1 week
Split the current FDT-only implementation up into an FDT and an
ACPI part reusing and sharing as much code as possible (thanks mw!).
This makes the Synopsis XHCI root hubs attach correctly on SolidRun's
HoenyComb instead of just the generic XHCI root and this means we
are also doing proper chip setup and applying the quirk needed there [1].
There is one problem with ACPI attachment in that it uses the generic
XHCI PNP ID. So we need to do extra checks in order to not claim
all xhci, which means we check for a known quirk to be present
in acpi_probe. Long term this isn't scaling and this was discussed
in SolidRun's Discord Channel in 2021 with the intend that "jnettlet"
will take this to a steering committee. Since then ACPI has kind-of
become a technology non grata (due to not getting changes into Linux
timely) so it is unclear if this will ever happen. If there will be
further hardware with dwc3/ACPI we should go and make sure this problem
gets solved.
Alexander Motin [Tue, 5 Jul 2022 23:27:29 +0000 (19:27 -0400)]
Avoid memory copy when verifying raidz/draid parity
Before this change for every valid parity column raidz_parity_verify()
allocated new buffer and copied there existing data, then recalculated
the parity and compared the result with the copy. This patch removes
the memory copy, simply swapping original buffer pointers with newly
allocated empty ones for parity recalculation and comparison. Original
buffers with potentially incorrect parity data are then just freed,
while new recalculated ones are used for repair.
On a pool of 12 4-wide raidz vdevs, storing 1.5TB of 16MB blocks, this
change reduces memory traffic during scrub by 17% and total unhalted
CPU time by 25%.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored-By: iXsystems, Inc.
Closes #13613
Alexander Motin [Tue, 5 Jul 2022 23:26:20 +0000 (19:26 -0400)]
Avoid memory copies during mirror scrub
Issuing several scrub reads for a block we may use the parent ZIO
buffer for one of child ZIOs. If that read complete successfully,
then we won't need to copy the data explicitly. If block has only
one copy (typical for root vdev, which is also a mirror inside),
then we never need to copy -- succeed or fail as-is. Previous
code also copied data from buffer of every successfully completed
child ZIO, but that just does not make any sense.
On healthy N-wide mirror this saves all N+1 (or even more in case
of ditto blocks) memory copies for each scrubbed block, allowing
CPU to focus mostly on check-summing. For other vdev types it
should save one memory copy per block copy at root vdev.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Mark Maybee <mark.maybee@delphix.com> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored-By: iXsystems, Inc.
Closes #13606
If we receive a UDP packet (directed towards an active OpenVPN socket)
which is too short to contain an OpenVPN header ('struct
ovpn_wire_header') we wound up making m_copydata() read outside the
mbuf, and panicking the machine.
Explicitly check that the packet is long enough to copy the data we're
interested in. If it's not we will pass the packet to userspace, just
like we'd do for an unknown peer.
If dummynet_task() is run on a vnet where dummynet is still initialising
(i.e. still running ip_dn_vnet_init()) we can attempt to use an
uninitialised mutex.
We can use the existing init_done field to check if the per-vnet
V_dn_cfg is fully set up, if we ensure that it's only set to 1 when
we've done all of the init work.
DB_COMMAND(9): update to mention additional macros
Document the existing alias definitions, and augment the example with
one of these. Also, describe the purpose of the newly added _FLAGS
variations of these command definitions.
Make some small style improvements to appease mandoc -Tlint.
Reviewed by: markj
MFC after: 3 days
Sponsored by: Juniper Networks, Inc.
Sponsored by: Klara, Inc.
Differential Revision: https://reviews.freebsd.org/D35664
Mitchell Horne [Thu, 30 Jun 2022 15:58:22 +0000 (12:58 -0300)]
ddb: add _FLAGS command variants
Provide _FLAGS variants of the various command definition macros, in
anticipation of adding a new flag. This can also be used for some
existing commands which require special flag values.
Reviewed by: markj
MFC after: 3 days
Sponsored by: Juniper Networks, Inc.
Sponsored by: Klara, Inc.
Differential Revision: https://reviews.freebsd.org/D35581
Provide separate helper macros for regular commands and next-level table
commands as they are mutually exclusive. This ensures proper
initialization of each element and allows us to exclude some redundant
fields, such as specifying .more = NULL for every regular command.
Reviewed by: markj, jhb
MFC after: 3 days
Sponsored by: Juniper Networks, Inc.
Sponsored by: Klara, Inc.
Differential Revision: https://reviews.freebsd.org/D35580
netinet6: perform out-of-bounds check for loX multicast statistics
Currently, some per-mbuf multicast statistics is stored in
the per-interface ip6stat.ip6s_m2m[] array of size 32 (IP6S_M2MMAX).
Check that loopback ifindex falls within 0.. IP6S_M2MMAX-1 range to
avoid silent data corruption. The latter cat happen with large
number of VNETs.
arp(8): use getifaddrs(3) instead of ioctl(SIOCGIFCONF)
The original code had used a fixed-size buffer for ioctl(SIOCGIFCONF),
that might cause the target ifreq spilled from the buffer. Use the handy
getifaddrs(3) to fix the problem.
During the review of 09cdf4878c621be4cd229fa88cdccdcdc8c101f7 we
switched from cached registers to reading them as needed.
One read of the two reads was moved after the softreset got triggered
and as a result returned 0 rather than the proper register value.
Moving the read before the softreset gets initiated seems to make
things work again and xhci.c no longer complains about
"Controller does not support 4K page size.".
sockets: use only soref()/sorele() as socket reference count
o Retire SS_FDREF as it is basically a debug flag on top of already
existing soref()/sorele().
o Convert SS_PROTOREF into soref()/sorele().
o Change reference model for the listen queues, see below.
o Make sofree() private. The correct KPI to use is only sorele().
o Make soabort() respect the model and sorele() instead of sofree().
Note on listening queues. Until now the sockets on a queue had zero
reference count. And the reference were given only upon accept(2). The
assumption was that there is no way to see the queued socket from anywhere
except its head. This is not true, since queued sockets already have pcbs,
which are linked at least into the global pcb lists. With this change we
put the reference right in the sonewconn() and on accept(2) path we just
hand the existing reference to the file descriptor.
sockets: use positive flag for file descriptor socket reference
Rename SS_NOFDREF to SS_FDREF and flip all bitwise operations.
Mark sockets created by socreate() with SS_FDREF.
This change is mostly illustrative. With it we see that SS_FDREF
is a debugging flag, since:
* socreate() takes a reference with soref().
* on accept path solisten_dequeue() takes a reference
with soref() and then soaccept() sets SS_FDREF.
* soclose() checks SS_FDREF, removes it and does sorele().
pca954x: harmonize pca9547 and pca954x and add pca9540 support
The two implementations for the pca9548 switch and the pca9547 mux
seemed close enough so we can put them together and with a bit more
abstraction add pca9540 support.
While here apply a bit of consistency in variable and driver naming and
use device_has_property instead of the FDT-only OF_ variant.
This disconnects pca9547 from the build but does not yet delete it.
Doug Moore [Mon, 4 Jul 2022 17:28:35 +0000 (12:28 -0500)]
rb_tree: fine-tune rebalancing code
Change parts of RB_INSERT_COLOR and RB_REMOVE_COLOR to reduce the
number of operations, by, in some cases, flipping two color bits at a
time instead of flipping each individually, in separate operations,
and by using a switch statement to replace a sequence of if-elses.
Rewrite RB_SET_PARENT to generate fewer instructions. These changes
reduce the code size by over 100 bytes on some architectures.
Also, allow RB users to define a preprocessor symbol to generate
RB_REMOVE_COLOR code that matches the implementation described in the
original weak-AVL paper in one particular case, instead of the still
correct, but slightly more efficient implementation of that case
currently implemented.
Andrew Gallatin [Mon, 4 Jul 2022 16:40:35 +0000 (12:40 -0400)]
pmcstat: fix log analysis
pmcstat has been broken for analyzing logs since D35342 / b6e28991bf3aadb.
This is because the pmc for the first CPU is not added when reading logs
because unlike its clones, its event id is not invalid. That causes us
to fail the assertion at lib/libpmcstat/libpmcstat_logging.c:293
when encountering samples from cpu0.
Fix this by removing the check that the PMC is invalid
When accessing a register directly from etherswitchcfg one must specify
a register group(e.g. registers of portN) and the register offset within
the group. The latter is passed as the 5 least significant bits.
Extract the former by dividing the register address by 32, not by 5.
lockstat: Fix construction of comparision predicates
Passing "0x%p" to sprintf results in double "0x" being printed.
This causes a dtrace script compilation failure when "-d" flag
is specified.
Fix that by removing the extraneous "0x".
Mike Karels [Sun, 3 Jul 2022 23:04:41 +0000 (18:04 -0500)]
mountd startup: enable NFSv4 if needed on restart
The mountd script in rc.d sets vfs.nfsd.server_max_nfsvers correctly
when it is run at system startup, relying on the kernel default.
However, if NFSv4 was enabled in /etc/rc.conf later, and the script
was re-run to restart mountd, the sysctl was still set to 3.
Set the sysctl to the right value in all cases.
Some systems put sensors on non-0 lun, so we should not omit it. This
was the only difference with the Linux driver, where DIMM sensors could
be queried, but not on FreeBSD.
See this report[1] on the FreeBSD forums:
https://forums.freebsd.org/threads/freebsd-cannot-get-dimm-temperature-sensor-value.85166/
Handle IPMB requests using SEND_MSG (sent as driver request as we do not
need to return anything back to userland for this) and GET_MSG (sent as
usual request so we can return the data for RECEIVE_MSG ioctl) pair.
This fixes fetching complete sensor data from boards (e.g. HP ProLiant
DL380 Gen10).
Reviewed by: philip
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D35605