Eric van Gyzen [Thu, 24 Feb 2022 22:53:03 +0000 (16:53 -0600)]
sendfile_test: fix copy-paste bug
Require the newly opened file descriptor to be good, instead of
re-requiring the one that was required three lines earlier.
Thankfully, opening /dev/null is really unlikely to fail.
Eric van Gyzen [Thu, 17 Feb 2022 15:53:48 +0000 (09:53 -0600)]
elfdump: handle small files more gracefully
elfdump -E on an empty file would complain "Invalid argument" because
it tried to mmap zero bytes. With the -E flag, elfdump should
simply exit non-zero. For tiny files, the code would reference off
the end of the mapped region.
Ensure the file is large enough to contain an ELF header before mapping it.
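The size check described above can be sketched roughly as follows. This is an illustration only, not the elfdump code: map_elf_candidate and MIN_ELF_HEADER_SIZE are hypothetical names, and the constant assumes that the 52-byte 32-bit ELF header is the smallest header worth accepting.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Assumption: 52 bytes, the size of a 32-bit ELF header. */
#define MIN_ELF_HEADER_SIZE 52

/*
 * Hypothetical helper: map fd read-only, but only if the file is large
 * enough to hold an ELF header.  Returns NULL without calling mmap()
 * for empty or tiny files, so the caller can simply exit non-zero.
 */
static void *
map_elf_candidate(int fd, size_t *lenp)
{
	struct stat sb;
	void *p;

	if (fstat(fd, &sb) != 0)
		return (NULL);
	/* Too small to be an ELF file: never reaches mmap(). */
	if (sb.st_size < MIN_ELF_HEADER_SIZE)
		return (NULL);
	p = mmap(NULL, (size_t)sb.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
	if (p == MAP_FAILED)
		return (NULL);
	*lenp = (size_t)sb.st_size;
	return (p);
}
```

Checking the size first avoids both failure modes from the message: mmap() is never asked for zero bytes, and later header accesses cannot run past the end of the mapping.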
Eric van Gyzen [Fri, 14 Jan 2022 14:12:57 +0000 (08:12 -0600)]
Allow downstream projects to easily add private and internal libs
Allow projects based on the FreeBSD tree to append to _PRIVATELIBS
and _INTERNALLIBS by simply maintaining their own lists of
LOCAL_PRIVATELIBS and LOCAL_INTERNALLIBS, respectively.
Eric van Gyzen [Fri, 1 Oct 2021 11:25:48 +0000 (06:25 -0500)]
sem_clockwait_np test: fix usage of ATF API
ATF_REQUIRE_ERRNO requires the given errno iff the given expression is
true. These test cases used it incorrectly, potentially allowing
sem_clockwait_np to succeed when it was expected to fail. Use separate
ATF calls to require failure and the expected errno.
Eric van Gyzen [Fri, 1 Oct 2021 10:37:17 +0000 (05:37 -0500)]
sem_clockwait_np test: relax time constraint on VMs
In a guest on a busy hypervisor, the time remaining after an
interrupted sleep could be much lower than in other environments.
Relax the lower bound on VMs.
Eric van Gyzen [Fri, 23 Jul 2021 13:49:55 +0000 (08:49 -0500)]
aio_md_test: NUL-terminate result of readlink
readlink does not NUL-terminate the output buffer. This led to spurious
failures to destroy the md device because the unit number was garbage.
NUL-terminate the output buffer.
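A minimal sketch of the pattern, assuming a hypothetical wrapper named readlink_str (the test's real code may be structured differently): reserve one byte of the buffer and write the terminator at the length readlink(2) returns.

```c
#include <assert.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: read the target of a symlink into buf and
 * NUL-terminate it.  Returns the target length, or -1 on error.
 */
static ssize_t
readlink_str(const char *path, char *buf, size_t bufsize)
{
	ssize_t n;

	if (bufsize == 0)
		return (-1);
	/* Leave room for the terminator; readlink() does not add one. */
	n = readlink(path, buf, bufsize - 1);
	if (n < 0)
		return (-1);
	buf[n] = '\0';
	return (n);
}
```

Without the explicit terminator, anything that later parses the buffer as a string (for example, to extract a unit number) reads whatever bytes happened to follow the link target.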
Eric van Gyzen [Fri, 23 Jul 2021 13:24:52 +0000 (08:24 -0500)]
aio_md_test: fix cleanup
ATF cleanup functions cannot use functions such as ATF_REQUIRE
and atf_tc_fail. These functions assert that a test case is
currently running, which is not true during cleanup, so the
process aborts. Change the cleanup function to simply print
to stderr and return.
Eric van Gyzen [Thu, 27 May 2021 16:33:22 +0000 (11:33 -0500)]
libprocstat kstack: fix race with thread creation
When collecting kernel stacks for a target process, if the process
adds a thread between the two calls to sysctl, ignore the additional
threads. Previously, procstat would print only a useless error
message. Now, it prints a consistent snapshot of the stacks.
We know that snapshot is already stale, but it could still be stale
even with a more complex fix to reallocate and retry, so such a fix
is hardly worth the effort.
Eric van Gyzen [Mon, 26 Apr 2021 15:01:17 +0000 (10:01 -0500)]
Wait longer for a previous IPI to be sent
When sending an IPI, if a previous IPI is still pending delivery,
native_lapic_ipi_vectored() waits for the previous IPI to be sent.
We've seen a few inexplicable panics with the current timeout of 50 ms.
Increase the timeout to 1 second and make it tunable.
No hardware specification mentions a timeout in this case; I checked
the Intel SDM, Intel MP spec, and Intel x2APIC spec. Linux and illumos
wait forever. In Linux, see __default_send_IPI_shortcut() in
arch/x86/kernel/apic/ipi.c. In illumos, see apic_send_ipi() in
usr/src/uts/i86pc/io/pcplusmp/apic_common.c. However, misbehaving hardware
could hang the system if we wait forever.
Eric van Gyzen [Tue, 6 Apr 2021 14:36:52 +0000 (09:36 -0500)]
uefisign: fix handling of errors from child proc
Close the unused pipe file descriptors so the parent will notice if
the child exits prematurely. Previously, the parent would block
forever on a read from the pipe.
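The effect of closing the unused descriptors can be sketched as follows. This is a stand-alone illustration, not the uefisign code, and read_from_silent_child is a hypothetical name: the child exits without writing, and the parent's read() returns 0 (EOF) only because the parent has closed its own copy of the write end.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Fork a child that exits prematurely without writing to the pipe and
 * return what the parent's read() reports: 0 means EOF was seen.
 */
static ssize_t
read_from_silent_child(void)
{
	int fds[2];
	char c;
	ssize_t n;
	pid_t pid;

	if (pipe(fds) != 0)
		return (-1);
	pid = fork();
	if (pid < 0) {
		close(fds[0]);
		close(fds[1]);
		return (-1);
	}
	if (pid == 0) {
		/* Child: would only write, so close the read end. */
		close(fds[0]);
		_exit(1);	/* exit before writing anything */
	}
	/*
	 * Parent: only reads.  If this write end stayed open, the pipe
	 * would still have a writer and read() would block forever.
	 */
	close(fds[1]);
	n = read(fds[0], &c, 1);
	close(fds[0]);
	(void)waitpid(pid, NULL, 0);
	return (n);
}
```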
Mark Johnston [Tue, 22 Feb 2022 14:26:33 +0000 (09:26 -0500)]
riscv: Fix another race in pmap_pinit()
Commit c862d5f2a789 ("riscv: Fix a race in pmap_pinit()") did not really
fix the race. Alan writes,
Suppose that N entries in the L1 tables are in use, and we are in the
middle of the memcpy(). Specifically, we have read the zero-filled
(N+1)st entry from the kernel L1 table. Then, we are preempted. Now,
another core/thread does pmap_growkernel(), which fills the (N+1)st
entry. Finally, we return to the original core/thread, and overwrite
the valid entry with the zero that we earlier read.
Try to fix the race properly, by copying kernel L1 entries while holding
the allpmaps lock. To avoid doing unnecessary work while holding this
global lock, copy only the entries that we expect to be valid.
Fixes: c862d5f2a789 ("riscv: Fix a race in pmap_pinit()")
Reported by: alc, jrtc27
Reviewed by: alc
Sponsored by: The FreeBSD Foundation
Mark Johnston [Tue, 8 Feb 2022 18:15:54 +0000 (13:15 -0500)]
riscv: Fix a race in pmap_pinit()
All pmaps share the top half of the address space. With 3-level page
tables, the top-level kernel map entries are not static: they might
change if the kernel map is extended (via pmap_growkernel()) or a 1GB
mapping in the direct map is demoted (not implemented yet). Thus the
riscv pmap maintains the allpmaps list to synchronize updates to
top-level entries.
When a pmap is created, it is inserted into this list after copying
top-level entries from the kernel pmap. The copying is done without
holding the allpmaps lock, and it is possible for pmap_pinit() to race
with kernel map updates. In particular, if a thread is modifying L1
entries, and a concurrent pmap_pinit() copies the old version of the
entries, it might not receive the update.
Fix the problem by copying the kernel map entries after inserting the
pmap into the list. This ensures that the nascent pmap always receives
updates, though pmap_distribute_l1() may race with the page copy.
Reviewed by: mhorne, jhb
Sponsored by: The FreeBSD Foundation
Kristof Provost [Sat, 19 Feb 2022 15:34:31 +0000 (16:34 +0100)]
bridge: Don't share broadcast packets
if_bridge duplicates broadcast packets with m_copypacket(), which
creates shared packets. In certain circumstances these packets can be
processed by udp_usrreq.c:udp_input() first, which modifies the mbuf as
part of the checksum verification. That may lead to incorrect packets
being transmitted.
Kristof Provost [Tue, 15 Feb 2022 10:49:39 +0000 (11:49 +0100)]
netinet: allow UDP tunnels to be removed
udp_set_kernel_tunneling() rejects new callbacks if one is already set.
Allow callbacks to be cleared. The use case for this is OpenVPN DCO,
where the socket is opened by userspace and then adopted by the kernel
to run the tunnel. If the DCO interface is removed but userspace does
not close the socket (something the kernel cannot prevent) the installed
callbacks could be called with an invalidated context.
Allow new functions to be set, but only if they're NULL (i.e. allow the
callback functions to be cleared).
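The rule can be sketched as follows. All names here (tunnel_state, tunnel_set_cb, noop_cb) are hypothetical stand-ins rather than the kernel's structures, and returning EBUSY for the rejected case is an assumption made for the example.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

typedef void (*tunnel_cb_t)(void *ctx);

/* Hypothetical stand-in for the per-socket tunneling state. */
struct tunnel_state {
	tunnel_cb_t	 cb;
	void		*ctx;
};

/*
 * Install or clear a callback.  A non-NULL callback is rejected while
 * another one is installed; NULL is always accepted and clears the
 * slot together with its context pointer.
 */
static int
tunnel_set_cb(struct tunnel_state *ts, tunnel_cb_t cb, void *ctx)
{
	if (cb != NULL && ts->cb != NULL)
		return (EBUSY);
	ts->cb = cb;
	ts->ctx = (cb != NULL) ? ctx : NULL;
	return (0);
}

/* Placeholder callback used only to exercise the setter. */
static void
noop_cb(void *ctx)
{
	(void)ctx;
}
```

Clearing the slot when the tunnel interface goes away means a socket that userspace keeps open can no longer reach the invalidated context.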
Mark Johnston [Fri, 14 Jan 2022 20:03:53 +0000 (15:03 -0500)]
vm_pageout: Print a more accurate message to the console before an OOM kill
Previously we'd always print "out of swap space." This can be
misleading, as there are other reasons an OOM kill can be triggered. In
particular, it's entirely possible to trigger an OOM kill on a system
with plenty of free swap space.
Reviewed by: kib
Sponsored by: The FreeBSD Foundation
Navdeep Parhar [Sat, 12 Feb 2022 00:58:46 +0000 (16:58 -0800)]
cxgbe(4): Fix illegal hardware access in cxgbe_refresh_stats.
cxgbe_refresh_stats takes into account VI_SKIP_STATS but not
VI_INIT_DONE when deciding whether to read the hardware stats. But
before this change VI_SKIP_STATS was set only for VIs with VI_INIT_DONE.
That meant that cxgbe_refresh_stats always accessed the hardware for
uninitialized VIs, and this is a problem if the adapter is suspended or
in the middle of a reset.
Fix this by setting VI_SKIP_STATS on all VIs during suspend. While
here, ignore VI_INIT_DONE in vi_refresh_stats too to be consistent with
cxgbe_refresh_stats.
Navdeep Parhar [Thu, 13 Jan 2022 22:21:49 +0000 (14:21 -0800)]
cxgbe(4): Fix bad races between sysctl and driver detach.
The default sysctl context setup by newbus for a device is eventually
freed by device_sysctl_fini, which runs after the device driver's detach
routine. sysctl nodes associated with this context must not use any
resources (like driver locks, hardware access, counters, etc.) that are
released by driver detach.
There are a lot of sysctl nodes like this in cxgbe(4) and the fix is to
hang them off a context that is explicitly freed by the driver before it
releases any resource that might be used by a sysctl.
This fixes panics when running "sysctl dev.t6nex dev.cc" in a tight loop
and loading/unloading the driver in parallel.
Navdeep Parhar [Wed, 5 Jan 2022 18:45:06 +0000 (10:45 -0800)]
cxgbe(4): Do not request an FEC that is invalid for the requested speed.
This eliminates error messages like this from the driver when running at
50Gbps with 100G cables:
[3726] cc0: l1cfg failed: 71
[4407] cc0: l1cfg failed: 71
Note that link comes up anyway with or without this change.
Navdeep Parhar [Mon, 3 Jan 2022 22:35:45 +0000 (14:35 -0800)]
cxgbe(4): Update firmwares to 1.26.6.0.
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
CHANGES
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Version : 1.26.6.0
Date : 01/03/2022
================================================================================
Fixes
-----
BASE:
- Fixed one module eeprom read failure.
- Fixed an issue with speed selection when 40G and 25G are advertised and
supported.
- Fixed a random traffic hang when T5 receives invalid ets BW in dcbx
messages from a switch.
- Fixed very long link up time with few switches.
================================================================================
Navdeep Parhar [Mon, 3 Jan 2022 21:31:46 +0000 (13:31 -0800)]
cxgbe(4): Fix stats collection for ports with port_id != tx_chan
This fixes a driver panic during stats collection when a port's id does
not match its tx channel. The bug affected only the T580 card running
with a non-default VPD.
Navdeep Parhar [Thu, 9 Dec 2021 19:10:48 +0000 (11:10 -0800)]
cxgbe(4): Update firmwares to 1.26.4.0
(Rest is from the README that came with the firmware)
Version : 1.26.4.0
Date : 12/02/2021
Fixes
-----
BASE:
- Fixed error on setting 25G speed on 100G copper with multiple FEC set in
firmware commands.
- Handle link of unknown optics modules by enabling module tx unconditionally.
- Fixed link not coming up for 25G CRS phys. Firmware incorrectly tried to
bring up the link in RS-FEC but as per IEEE spec, it must be BASER FEC.
- Fixed an issue where firmware doesn't automatically retry next FEC if driver
asks to bring up the link using RS-FEC and link doesn't come up.
Navdeep Parhar [Mon, 15 Nov 2021 18:55:04 +0000 (10:55 -0800)]
cxgbe(4): Change the way t4_shutdown_adapter brings the link(s) down.
Modify the GPIO pins only on the Base-T cards and even there drive all
of them low instead of putting them in hi-z state. For the rest (this
is the common case), directly power off the PLLs of the high speed
serdes. This is the simplest method that does not involve or conflict
with the firmware but still works with all T4-T6 cards regardless of
what's plugged into the port.
This fixes a problem where the peer wouldn't always see a link down if
it is connected to the device using a -CR4 copper cable.
Navdeep Parhar [Wed, 10 Nov 2021 19:38:54 +0000 (11:38 -0800)]
cxgbe(4): internal knob for flexible control over FEC selection.
Recent firmwares have support for autonomous FEC selection and a "force"
knob to let the driver control this behavior (or not) in a fine grained
manner. This change adds a driver knob so that all the different ways of
configuring the link FEC can be exercised. Note that this controls the
internal driver/firmware interaction for link configuration and is not
meant for general use.
Navdeep Parhar [Wed, 10 Nov 2021 18:54:53 +0000 (10:54 -0800)]
cxgbe(4): separate sysctls for user-requested and in-use FEC.
Recent firmwares have more leeway in FEC selection and there is a need
to track the FECs requested by the driver separately from the FEC in use
on the link. The existing dev.<port>.<inst>.fec sysctl can read both but
its behavior depends on the link state and it is sometimes hard to find
out what was requested when the link is up.
Split the fec sysctl into two (requested_fec and link_fec) to get access
to both pieces of information regardless of the link state.
Bjoern A. Zeeb [Thu, 24 Feb 2022 21:38:27 +0000 (21:38 +0000)]
iwlwifi: enhance debug information
Add a string of the debug type to the output of the debug message so it
is easier to search for specific events in a trace with lots of debugging
on. While here remove superfluous ().
Bjoern A. Zeeb [Tue, 22 Feb 2022 22:48:08 +0000 (22:48 +0000)]
LinuxKPI: change DECLARE_FLEX_ARRAY()
DECLARE_FLEX_ARRAY can be used inside a structure. On FreeBSD due to
-Wgnu-variable-sized-type-not-at-end this yields an error. Use [0]
instead of [] to overcome this.
Sponsored by: The FreeBSD Foundation
Reviewed by: hselasky
Differential Revision: https://reviews.freebsd.org/D34350
Chuck Tuffli [Wed, 23 Feb 2022 15:18:54 +0000 (07:18 -0800)]
bhyve nvme: Advertise Namespace changed AEN
Advertise Namespace Attribute Notices events in the Optional
Asynchronous Events Supported (OAES) field of the Identify Controller
data structure. Additionally, rename the enums and macros to clarify
these are AENs related to Notices and not generic information.
Chuck Tuffli [Mon, 21 Feb 2022 18:34:14 +0000 (10:34 -0800)]
nvme: Add OAES bit-field definitions
Create definitions for the Optional Asynchronous Events Supported (OAES)
values. Also adds a helper macro for the common use case of "mask and
shift". E.g.
value = NVME_CTRLR_DATA_OAES_NS_ATTR_MASK << NVME_CTRLR_DATA_OAES_NS_ATTR_SHIFT;
becomes
value = NVMEB(NVME_CTRLR_DATA_OAES_NS_ATTR);
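One way such a "mask and shift" helper could be written is with preprocessor token pasting. The sketch below uses hypothetical names (FIELD_BITS and the EXAMPLE_ definitions) and illustrative values; it is not necessarily the definition added by this commit.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative field definitions; names and values are assumptions. */
#define EXAMPLE_OAES_NS_ATTR_SHIFT	8
#define EXAMPLE_OAES_NS_ATTR_MASK	0x1

/* Paste the _MASK and _SHIFT suffixes onto a common field prefix. */
#define FIELD_BITS(name)	((name##_MASK) << (name##_SHIFT))

/* The short form expands to exactly the long form. */
static_assert(FIELD_BITS(EXAMPLE_OAES_NS_ATTR) ==
    (EXAMPLE_OAES_NS_ATTR_MASK << EXAMPLE_OAES_NS_ATTR_SHIFT),
    "FIELD_BITS must match the explicit mask-and-shift");

static uint32_t
example_oaes(void)
{
	return (FIELD_BITS(EXAMPLE_OAES_NS_ATTR));
}
```

The macro only works when every field follows the same PREFIX_MASK / PREFIX_SHIFT naming convention, which is the convention the example in the message relies on.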
1. A delayed_work struct in the WORK_ST_TIMER state.
2. Thread A calls mod_delayed_work()
3. Thread B (a callout thread) simultaneously calls
linux_delayed_work_timer_fn()
The following sequence of events is possible:
A: Call linux_cancel_delayed_work()
A: Change state from TIMER to CANCEL
B: Change state from CANCEL to TASK
B: taskqueue_enqueue() the task
A: taskqueue_cancel() the task
A: Call linux_queue_delayed_work_on(). This is a no-op because the
state is WORK_ST_TASK.
As a result, the delayed_work struct will never be invoked. This is
causing address resolution in ib_addr.c to stop permanently, as it
never tries to reschedule a task that it thinks is already scheduled.
Fix this by introducing locking into the cancel path (which
corresponds with the lock held while the callout runs). This will
prevent the callout from changing the state of the task until the
cancel is complete, preventing the race.
ip6_setpktopts() can look up ifnets via ifnet_by_index(), which
is only safe in the net epoch. Ensure that callers are in the net
epoch before calling this function.
If the UCL ctld parser encountered a port that used the CTL
ioctl device, it fell into a special case that had an erroneous
early return. This caused all configuration in the target
following the port attribute to be skipped. Fix this by replacing
the return with a continue so that the rest of the config is
parsed correctly.
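The difference between the early return and the continue can be sketched with a simplified loop. The types and names below (attr, parse_result, parse_target) are hypothetical, and the "ioctl/" prefix test is only an assumed way of recognizing the special case; this is not the ctld parser.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical key/value attribute of a target. */
struct attr {
	const char *key;
	const char *value;
};

struct parse_result {
	int ioctl_ports;	/* attributes handled by the special case */
	int other;		/* attributes handled by the generic path */
};

static struct parse_result
parse_target(const struct attr *attrs, size_t n)
{
	struct parse_result r = { 0, 0 };
	size_t i;

	for (i = 0; i < n; i++) {
		if (strcmp(attrs[i].key, "port") == 0 &&
		    strncmp(attrs[i].value, "ioctl/", 6) == 0) {
			r.ioctl_ports++;
			/*
			 * A "return" here would silently drop every
			 * attribute after the port; "continue" moves on
			 * to the next one.
			 */
			continue;
		}
		r.other++;
	}
	return (r);
}
```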
Ed Maste [Mon, 21 Feb 2022 04:09:36 +0000 (23:09 -0500)]
vt: fix double-click word selection for first/last word on line
Previously when double-clicking on the first word on a line we would
select from the cursor position to the end of the word, not from the
beginning of the line. Similarly, when double-clicking on the last word
on a line we would select from the beginning of the word to the cursor
position rather than the end of the line.
This is because we searched backward or forward for a space character to
mark the beginning or end of a word. Now, use the beginning or end of
the line if we do not find a space.
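The search described above can be sketched as follows. This works on a plain character array rather than the vt buffer, word_bounds is a hypothetical name, and the sketch assumes the clicked position is on a non-space character inside the line.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Find the word around position pos in line[0..len).  On return,
 * *begin is the index of the first character of the word and *end is
 * one past its last character.  Assumes pos < len and line[pos] != ' '.
 */
static void
word_bounds(const char *line, size_t len, size_t pos, size_t *begin,
    size_t *end)
{
	size_t b = pos, e = pos;

	/* Search backward for a space; fall back to the start of line. */
	while (b > 0 && line[b - 1] != ' ')
		b--;
	/* Search forward for a space; fall back to the end of line. */
	while (e < len && line[e] != ' ')
		e++;
	*begin = b;
	*end = e;
}
```

Because both loops stop at the line boundary when no space is found, the first and last words on a line are selected in full, just like words in the middle.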
- Do not set Os to FreeBSD explicitly. We don't do it in other manual
pages.
- Remove macros from the -width specifier.
- Use Xr instead of Cm to refer to the freebsd-update command.
- Address some mandoc lint warnings and use \(em instead of --.
- Wordsmith some paragraphs.
- Add a missing El macro.
Michal Krawczyk [Mon, 3 Jan 2022 13:51:59 +0000 (14:51 +0100)]
ena: update ENA version to v2.5.0
Some of the changes in this release:
- IPv6 L4 checksum offload fixes.
- Optimization of the Tx req_id validation.
- Timer service adjustments.
- NUMA awareness for the kernel RSS mode.
Submitted by: Michal Krawczyk <mk@semihalf.com>
Obtained from: Semihalf
MFC after: 2 weeks
Sponsored by: Amazon, Inc.
Dawid Gorecki [Mon, 3 Jan 2022 13:50:29 +0000 (14:50 +0100)]
ena: do not call reset if device is unresponsive
If the device becomes unresponsive, the driver will not be able to
finish the reset process correctly. Timeout during version validation
indicates that the device is currently not responding. In that case
do not perform the reset and instead reschedule timer service. Because
of that the driver will continue trying to reset the device until it
succeeds or is detached.
Dawid Gorecki [Mon, 3 Jan 2022 13:50:13 +0000 (14:50 +0100)]
ena: start timer service on attach
The timer service was started when the interface was brought up and it
was stopped when it was brought down. Since ena_up requires the device
to be responsive, triggering the reset would become impossible if the
device became unresponsive with the interface down.
Since most of the functions in timer service already perform the check
to see if the device is running, this only requires starting the callout
in attach and stopping it when bringing the interface up or down to
avoid race between different admin queue calls.
Since callout functions for timer service are always called with the
same arguments, replace callout_{init,reset,drain} calls with
ENA_TIMER_{INIT,RESET,DRAIN} macros.
Artur Rojek [Mon, 3 Jan 2022 13:50:06 +0000 (14:50 +0100)]
ena: rework tx req_id validation logic
Since `ena_com_tx_comp_req_id_get` already checks for `req_id` validity,
the logic was exiting early, never giving `validate_tx_req_id` a chance
to trigger device reset.
Rewrite the logic so that device reset is called based on return value
of `ena_com_tx_comp_req_id_get` instead.
Submitted by: Artur Rojek <ar@semihalf.com>
Obtained from: Semihalf
MFC after: 2 weeks
Sponsored by: Amazon, Inc.
Dawid Gorecki [Mon, 3 Jan 2022 13:49:58 +0000 (14:49 +0100)]
ena: properly handle IPv6 L4 checksum offload
The ena_tx_csum function did not check whether IPv6 checksum offload was
requested; it only checked the checksum offloading flags for IPv4 packets.
Because of that, when encountering CSUM_IP6_* flags, the function simply
returned without actually setting checksum offloading in ena_ctx.
Check CSUM_IP6_* flags to enable IPv6 checksum offload.
Additionally, only the IPv4 header was being parsed regardless of the
EtherType field. Because of that, the L4 protocol value read when actually
trying to send IPv6 packets was wrong. Use the ip6_lasthdr function to get
the length of all IPv6 headers and the payload protocol.
Set the DF flag to 1 in order to allow the device to offload the IPv6
checksum calculation and achieve optimal performance.
Add CSUM6_OFFLOAD and CSUM_OFFLOAD definitions into ena_datapath.h.
Adjust the driver to the upgraded ena-com part twofold:
First update is related to the driver's NUMA awareness.
Allocate I/O queue memory in NUMA domain local to the CPU bound to the
given queue, improving data access time. Since this can result in a
performance hit for unaware users, this is done only when the RSS
option is enabled; in other cases the driver relies on the kernel to
allocate memory by itself.
Information about first CPU bound is saved in adapter structure, so
the binding persists after bringing the interface down and up again.
If there are more buckets than interface queues, the driver will try to
bind different interfaces to different CPUs using a round-robin algorithm
(but it will not bind queues to CPUs which do not have any RSS buckets
associated with them). This is done to better utilize hardware
resources by spreading the load.
Add (read-only) per-queue sysctls in order to provide the following
information:
- queueN.domain: NUMA domain associated with the queue
- queueN.cpu: CPU affinity of the queue
The second change is for the CSUM_OFFLOAD constant, as ENA platform
file has removed its definition. To align to that change, it has been
added to the ena_datapath.h file.
These zones are cache zones used to allocate TLS offload contexts from
firmware. Releasing items from the cache is a sleepable operation due
to the need to await a response from the firmware command freeing the
tag, so items cannot be reclaimed from the zone in non-sleepable
contexts. Since the cache size is limited by firmware limits, avoid
this by setting UMA_ZONE_UNMANAGED to avoid reclamation by uma_timeout()
and the low memory handler.
Reviewed by: hselasky, kib
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D34142
Allow a zone to opt out of cache size management. In particular,
uma_reclaim() and uma_reclaim_domain() will not reclaim any memory from
the zone, nor will uma_timeout() purge cached items if the zone is idle.
This effectively means that the zone consumer has control over when
items are reclaimed from the cache. In particular, uma_zone_reclaim()
will still reclaim cached items from an unmanaged zone.
Reviewed by: hselasky, kib
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D34142
mlx5en: Use a UMA cache zone for managing TLS send tags
Instead of allocating directly from a normal zone. This way
import and release are guaranteed to process all allocated and then
deallocated items. Also, the release occurs in a sleepable context when
caller of uma_zfree() or uma_zdestroy() can sleep itself.
Use the send tag refcounting mechanism to refcount the RX- and TX- TLS
send tags. Then it is no longer needed to wait for refcounts to reach
zero when destroying RX- and TX- TLS send tags as a result of pending
data or WQE commands.
This also ensures that when TX-TLS and rate limiting are used at the same
time, the underlying SQ is not prematurely destroyed.
Robert Wing [Mon, 20 Dec 2021 20:30:24 +0000 (11:30 -0900)]
tcp_twrespond: send signed segment when connection is TCP-MD5
When a connection is established to use TCP-MD5, tcp_twrespond() doesn't
respond with a signed segment. This leaves the host performing the
active close stuck in the TIME_WAIT state and the other host stuck in the
LAST_ACK state. Fix this by sending a signed segment when the connection
is established to use TCP-MD5.