Warner Losh [Mon, 6 Dec 2021 17:23:14 +0000 (10:23 -0700)]
nvd: For AHCI attached devices, report ahci bridge
When an NVME device is attached via a AHCI controller, we have no access
to its config space. So instead of information about the nvme drive
itself, return info about the AHCI controller as the next best
thing. Since the Intel Hardware RAID support looks at these values, this
likely is best.
Gleb Smirnoff [Sun, 19 Dec 2021 01:19:26 +0000 (17:19 -0800)]
carp: fix send error demotion recovery
The problem is that carp(4) would clear the error counter on first
successful send, and stop counting successes after that. Fix this
logic and document it in human language.
This fixes a problem where ctld(8) would refuse to start on boot
with a specific IP address to listen on configured in ctl.conf(5).
It also fixes a problem where ctld(8) would fail to start with
some network interfaces which require a sysctl.conf(5) tweak
to configure them, eg to switch them from InfiniBand to IP mode.
Alexander Motin [Fri, 7 Jan 2022 17:54:49 +0000 (12:54 -0500)]
nvme: Do not rearm timeout for commands without one.
Admin queues almost always have several ASYNC_EVENT_REQUEST outstanding.
They have no timeouts, but their presence in qpair->outstanding_tr caused
useless timeout callout rearming twice a second.
While there, relax timeout callout period from 0.5s to 0.5-1s to improve
aggregation. Command timeouts are measured in seconds, so we don't need
to be precise here.
Warner Losh [Mon, 6 Dec 2021 17:23:23 +0000 (10:23 -0700)]
nvme_sim: Only report PCI related stats when we can
For AHCI attached devices, we report the location and identification
information of the AHCI controller that we're attached to. We also
don't reprot link speed in that case, since we can't get to the PCIe
config space registers to find that out.
Warner Losh [Mon, 6 Dec 2021 17:23:06 +0000 (10:23 -0700)]
nvme_ahci: Mark AHCI devices as such in the controller
Add a quirk to flag AHCI attachment to the controller. This is for any
of the strategies for attaching nvme devices as children of the AHCI
device for Intel's RAID devices. This also has a side effect of cleaning
up resource allocation from failed nvme_attach calls now.
Warner Losh [Mon, 6 Dec 2021 17:22:50 +0000 (10:22 -0700)]
nvme: Move to a quirk for the Intel alignment data
Prior to NVMe 1.3, Intel produced a series of drives that had
performance alignment data in the vendor specific space since no
standard had been defined. Move testing the versions to a quick so the
NVMe NS code doesn't know about PCI device info.
Warner Losh [Thu, 14 Oct 2021 14:44:37 +0000 (08:44 -0600)]
nvme: Reduce traffic to the doorbell register
Reduce traffic to doorbell register when processing multiple completion
events at once. Only write it at the end of the loop after we've
processed everything (assuming we found at least one completion,
even if that completion wasn't valid).
Warner Losh [Fri, 1 Oct 2021 17:32:48 +0000 (11:32 -0600)]
nvme: Use adaptive spinning when polling for completion or state change
We only use nvme_completion_poll in the initialization path. The
commands they queue and wait for finish quickly as they involve no I/O
to the drive's media. These command take about 20-200 microsecnds
each. Set the wait time to 1us and then increase it by 1.5 each
successive iteration (max 1ms). This reduces initialization time by
80ms in cpervica's tests.
Use this same technique waiting for RDY state transitions. This saves
another 20ms. In total we're down from ~330ms to ~2ms.
Warner Losh [Fri, 1 Oct 2021 17:09:34 +0000 (11:09 -0600)]
nvme: Only reset once on attach.
The FreeBSD nvme driver has reset the nvme controller twice on attach to
address a theoretical issue assuring the hardware is in a known
state. However, exierence has shown the second reset is unnecessary and
increases the time to boot. Eliminate the second reset. Should there be
a situation when you need a second reset (for buggy or at least somewhat
out of the mainstream hardware), the hardware option NVME_2X_RESET will
restore the old behavior. Document this in nvme(4).
If there's any trouble at all with this, I'll add a sysctl tunable to
control it.
Warner Losh [Fri, 1 Oct 2021 17:09:05 +0000 (11:09 -0600)]
nvme: Remove pause while resetting
After some study of the code and the standard, I think we can just drop
the pause(), unconditionally. If we're not initialized, then there's
nothing to wait for from a software perspective. If we are initialized,
then there might be outstanding I/O. If so, then the qpair 'recovery
state' will transition to WAITING in nvme_ctrlr_disable_qpairs, which
will ignore any interrupts for items that complete before we complete
the reset by setting cc.en=0.
If we go on to fail the controller, we'll cancel the outstanding I/O
transactions. If we reset the controller, the hardware throws away
pending transactions and we retry all the pending I/O transactions. Any
transactions that happend to complete before cc.en=0 will have the same
effect in the end (doing the same transaction twice is just inefficient,
it won't affect the state of the device any differently than having done
it once).
The standard imposes no wait times here, so it isn't needed from that
perspective.
Unanswered Question: Do we may need to disable interrupts while we
disable in legacy mode since those are level-sensitive.
Warner Losh [Fri, 1 Oct 2021 16:51:24 +0000 (10:51 -0600)]
nvme_ctrlr_enable: Small style nits
Rewrite the nested if's using the preferred FreeBSD style for branches
of ifs that return. NFC. Minor tweaks to the comments to better fit new
code layout.
Warner Losh [Fri, 1 Oct 2021 16:47:27 +0000 (10:47 -0600)]
nvme_ctrlr_enable: Remove unnecessary 5ms delays
Remove the 5ms delays after writing the administrative queue
registers. These delays are from the very earliest days of the driver
(they are in the first commit) and were most likely vestiges of the
Chatham NVMe prototype card that was used to create this driver. Many of
the workarounds necessary for it aren't necessary for standards
compliant cards. The original driver had other areas marked for Chatham,
but these were not. They are unneeded. There's three lines of supporting
evidence.
First, the NVMe standards make no mention of a delay time after these
registers are written. Second, the Linux driver doesn't have them, even
as an option. Third, all my nvme cards work w/o them.
To be safe, add a write barrier between setting up the admin queue and
enabling the controller.
Warner Losh [Wed, 29 Sep 2021 03:21:50 +0000 (21:21 -0600)]
nvme: Sanity check completion id
Make sure the completion ID is in the range of [0..num_trackers) since
the values past the end of the act_tr array are never going to be valid
trackers and will lead to pain and suffering if we try to dereference
them to get the tracker or to set the tracker back to NULL as we
complete the I/O.
Warner Losh [Wed, 29 Sep 2021 03:11:17 +0000 (21:11 -0600)]
nvme: Add sanity check for phase on startup.
The proper phase for the qpiar right after reset in the first interrupt
is 1. For it, make sure that we're not still in phase 0. This is an
illegal state to be processing interrupts and indicates that we've
failed to properly protect against a race between initializing our state
and processing interrupts. Modify stat resetting code so it resets the
number of interrpts to 1 instead of 0 so we don't trigger a false
positive panic.
Warner Losh [Wed, 29 Sep 2021 03:09:45 +0000 (21:09 -0600)]
nvme: start qpair in state RECOVERY_WAITING
An interrupt happens on the admin queue right away after the reset, so
as soon as we enable interrupts, we'll get a call to our interrupt
handler. It is safe to ignore this interrupt if we're not yet
initialized, or to process it if we are. If we are initialized, we'll
see there's no completion records and return. If we're not, we'll
process no completion records and return. Either way, nothing is
processed and nothing is lost.
Until we've completely setup the qpair, we need to avoid processing
completion records. Start the qpair in the waiting recovery state so we
return immediately when we try to process completions. The code already
sets it to 'NONE' when we're initialization is complete. It's safe to
defer completion processing here because we don't send any commands
before the initialization of the software state of the qpair is
complete. And even if we were to somehow send a command prior to that
completing, the completion record for that command would be processed
when we send commands to the admin qpair after we've setup the software
state. There's no good central point to add an assert for this last
condition.
This fixes an KASSERT "received completion for unknown cmd" panic on
boot.
Warner Losh [Thu, 23 Sep 2021 22:31:32 +0000 (16:31 -0600)]
nvme: Use shared timeout rather than timeout per transaction
Keep track of the approximate time commands are 'due' and the next
deadline for a command. twice a second, wake up to see if any commands
have entered timeout. If so, quiessce and then enter a recovery mode
half the timeout further in the future to allow the ISR to
complete. Once we exit recovery mode, we go back to operations as
normal.
Warner Losh [Fri, 17 Sep 2021 20:56:58 +0000 (14:56 -0600)]
nvme/nda: Fail all nvme I/Os after controller fails
Once the controller has failed, fail all I/O w/o sending it to the
device. The reset of the nvme driver won't schedule any I/O to the
failed device, and the controller is in an indeterminate state and can't
accept I/O. Fail both at the top end of the sim and the bottom
end. Don't bother queueing up the I/O for failure in a different task.
Alexander Motin [Fri, 7 Jan 2022 17:49:21 +0000 (12:49 -0500)]
cam: Relax callouts precisions.
On large systems even relatively rare callouts may fire many times
per second. This should allow them to aggregate better, since we do
not require any precision when polling for media change, etc.
Mark Johnston [Fri, 14 Jan 2022 20:00:01 +0000 (15:00 -0500)]
netbsd-tests: Fix the libc stat_socket test
The test tries to connect a socket to a closed port at 127.0.0.1. It
sets O_NONBLOCK on the socket first and expects to get EINPROGRESS from
connect(2), but this is not guaranteed, ECONNREFUSED is possible.
Handle both cases, and re-enable the test.
Kenneth D. Merry [Thu, 13 Jan 2022 15:50:40 +0000 (10:50 -0500)]
Free UMA zones when a pass(4) instance goes away.
If the UMA zones are not freed, we get warnings about re-using the
sysctl variables associated with the UMA zones, and we're leaking
the other memory associated with the zone structures. e.g.:
sysctl_warn_reuse: can't re-use a leaf (vm.uma.pass44.size)!
sysctl_warn_reuse: can't re-use a leaf (vm.uma.pass44.flags)!
sysctl_warn_reuse: can't re-use a leaf (vm.uma.pass44.bucket_size)!
sysctl_warn_reuse: can't re-use a leaf (vm.uma.pass44.bucket_size_max)!
sysctl_warn_reuse: can't re-use a leaf (vm.uma.pass44.keg.name)!
sysctl_warn_reuse: can't re-use a leaf (vm.uma.pass44.keg.rsize)!
sysctl_warn_reuse: can't re-use a leaf (vm.uma.pass44.keg.ppera)!
sysctl_warn_reuse: can't re-use a leaf (vm.uma.pass44.keg.ipers)!
Also, correctly clear the PASS_FLAG_ZONE_INPROG flag in
passcreatezone(). The way it was previously done, it would have
had set the flag and cleared all other flags that were set at
that point.
Bjoern A. Zeeb [Sun, 16 Jan 2022 23:52:39 +0000 (23:52 +0000)]
net80211: ieee80211_dump_node() cosmetics
Printing %p does not need the 0x prefix and while here mark the
ieee80211_node_table argument unused given we do not need it in the
current incarnation of the function.
enum ieee80211_channel_flags are used as bit fields and not as 1..n.
Correct the values using BIT(n).
This is also hoped to fix problems with 7260 cards which come up and
panic due to an empty channel list as all channels are set disabled [1][2].
It will hopefully also fix the one or other oddity.
Reported by: ambrisko, Mike Tancsa (mike sentex.net) [1]
Confirmed to fix by: ambrisko, Mike Tancsa (mike sentex.net) [2]
Sponsored by: The FreeBSD Foundation
Bjoern A. Zeeb [Sat, 15 Jan 2022 22:22:30 +0000 (22:22 +0000)]
LinuxKPI: 802.11 Refine/add DTIM/TSF handling
Correct data types related to delivery traffic indication map (DTIM)/
timing synchronization function (TSF) and implement/refine their
handling. This information is used/needed by iwlwifi to set a station
as associated. This will hopefully avoid more "no beacon heard"
time event failures.
The recording of the Linux specific sync_device_ts is done in the
receive path for now in case we do have the right information
available. I need to investigate as to how-much it may make sense
to also migrate it into net80211 in the future depending on the
usage in other drivers (or how we did handle this in the past in
natively ported versions, e.g. iwm).
Bjoern A. Zeeb [Sat, 15 Jan 2022 22:18:58 +0000 (22:18 +0000)]
LinuxKPI: 802.11 handle connection loss differently
Rather than just bouncing back to SCAN bounce to INIT on connection
loss. This is should be refined in the future as the comment already
indicates but we need to tie two different worlds together.
Michal Meloun [Wed, 27 Jan 2021 11:45:32 +0000 (12:45 +0100)]
pci_dw_mv: Don't enable unhandled interrupts.
Mainly link errors interrupts should only be activated on fully linked port,
otherwise noise on lanes can cause livelock. But we don't have error
counters yet, so leave these interrupts disabled.
Marcin Wojtas [Wed, 24 Feb 2021 17:02:40 +0000 (18:02 +0100)]
mvebu_gpio: fix interrupt cause register configuration
According to Armada 8k documentation, the interrupt cause register
(at offset 0x14) is RW0C. Update the configuration in attach and
the mvebu_gpio_isrc_eoi() to follow the description.
Michal Meloun [Sun, 17 Oct 2021 17:36:33 +0000 (19:36 +0200)]
arm: Fix handling of undefined instruction aborts in THUMB2 mode.
Correctly recognize NEON/SIMD and VFP instructions in THUMB2 mode and pass
these to the appropriate handler. Note that it is not necessary to filter
all undefined instruction variant or register combinations, this is a job
for given handler.
Michal Meloun [Thu, 7 Oct 2021 18:42:56 +0000 (20:42 +0200)]
dwmmc: Calculate the maximum transaction length correctly.
We should reserve two descriptors (not MMC_SECTORS) for potentially
unaligned (so bounced) buffer fragments, one for the starting fragment
and one for the ending fragment.
Michal Meloun [Fri, 2 Jul 2021 18:17:36 +0000 (20:17 +0200)]
intrng: Releasing interrupt source should clear interrupt table full state.
The first release of an interrupt in a situation where the interrupt table
is full should schedule a full table check the next time an interrupt is
allocated. A full check is necessary to ensure maximum separation between
the order of allocation and the order of release.
Joerg Wunsch [Mon, 20 Dec 2021 09:17:57 +0000 (10:17 +0100)]
usbconfig: documentation fixes, mainly for -i option
* in usage(), clearly mark -i interface as optional
* both, -u busnum and -a devaddr are optional as well
* various minor man page fixes
* clearly mark those two commands that actually use -i ifaceidx
* remove unused bitfield tag got_iface
* fix indentation level according to review comment
Joerg Wunsch [Sun, 19 Dec 2021 22:49:23 +0000 (23:49 +0100)]
usbconfig: use getopt(3) for option handling
This makes option handling consistent with other utilities as well as
Posix rules. By that, it's no longer important whether option name and
its argument are separated by a space or not, so -d5.3 works the same
as -d 5.3.
Also, recognize either /dev/ugen or ugen as prefix to the -d argument.
Note that this removes the undocumented feature that allowed to
specify multiple -d n.m options interleaved with commands referring to
that particular device in a single run.
Alan Somers [Thu, 2 Dec 2021 02:38:04 +0000 (19:38 -0700)]
fusefs: in the tests, always assume debug.try_reclaim_vnode is available
In an earlier version of the revision that created that sysctl (D20519)
the sysctl was gated by INVARIANTS, so the test had to check for it.
But in the committed version it is always available.
fusefs: fix .. lookups when the parent has been reclaimed.
By default, FUSE file systems are assumed not to support lookups for "."
and "..". They must opt-in to that. To cope with this limitation, the
fusefs kernel module caches every fuse vnode's parent's inode number,
and uses that during VOP_LOOKUP for "..". But if the parent's vnode has
been reclaimed that won't be possible. Previously we paniced in this
situation. Now, we'll return ESTALE instead. Or, if the file system
has opted into ".." lookups, we'll just do that instead.
This commit also fixes VOP_LOOKUP to respect the cache timeout for ".."
lookups, if the FUSE file system specified a finite timeout.
Alan Somers [Mon, 29 Nov 2021 02:17:34 +0000 (19:17 -0700)]
Fix a race in fusefs that can corrupt a file's size.
VOPs like VOP_SETATTR can change a file's size, with the vnode
exclusively locked. But VOPs like VOP_LOOKUP look up the file size from
the server without the vnode locked. So a race is possible. For
example:
1) One thread calls VOP_SETATTR to truncate a file. It locks the vnode
and sends FUSE_SETATTR to the server.
2) A second thread calls VOP_LOOKUP and fetches the file's attributes from
the server. Then it blocks trying to acquire the vnode lock.
3) FUSE_SETATTR returns and the first thread releases the vnode lock.
4) The second thread acquires the vnode lock and caches the file's
attributes, which are now out-of-date.
Fix this race by recording a timestamp in the vnode of the last time
that its filesize was modified. Check that timestamp during VOP_LOOKUP
and VFS_VGET. If it's newer than the time at which FUSE_LOOKUP was
issued to the server, ignore the attributes returned by FUSE_LOOKUP.
Corvin Köhne [Mon, 3 Jan 2022 13:19:39 +0000 (14:19 +0100)]
bhyve: add more slop to 64 bit BARs
Bhyve allocates small 64 bit BARs below 4 GB and generates ACPI tables
based on this allocation. If the guest decides to relocate those BARs
above 4 GB, it could lead to mismatching ACPI tables. Especially
when using OVMF with enabled bus enumeration it could cause
issues. OVMF relocates all 64 bit BARs above 4 GB. The guest OS
may be unable to recover from this situation and disables some PCI
devices because their BARs are located outside of the MMIO space
reported by ACPI. Avoid this situation by giving the guest more
space for relocating BARs.
Let's be paranoid. The available space for BARs below 4 GB is 512 MB
large. Use a slop of 512 MB. It'll allow the guest to relocate all
BARs below 4 GB to an address above 4 GB. We could run into issues
when we exceeding the memlimit above 4 GB. However, this space has
a size of 32 GB. Even when using many PCI device with large BARs
like framebuffer or when using multiple PCI busses, it's very
unlikely that we run out of space due to the large slop.
Additionally, this situation will occur on startup and not at runtime
which is much better.
Corvin Köhne [Mon, 3 Jan 2022 13:18:31 +0000 (14:18 +0100)]
bhyve: allow reading of fwctl signature multiple times
At the moment, you only have one single chance to read the fwctl
signature. At boot bhyve is in the state IDENT_WAIT. It's then
possible to switch to IDENT_SEND. After bhyve sends the signature,
it switches to REQ. From now on it's impossible to switch back to
IDENT_SEND to read the signature. For that reason, only a single
driver can read the signature. A guest can't use two drivers to
identify that fwctl is present. It gets even worse when using
OVMF. OVMF uses a library to access fwctl. Therefore, every single
OVMF driver would try to read the signature. Currently, only a
single OVMF driver accesses the fwctl. So, there's no issue with
it yet. However, no OS driver would have a chance to detect fwctl when
using OVMF because it's signature was already consumed by OVMF.
Corvin Köhne [Mon, 3 Jan 2022 13:16:59 +0000 (14:16 +0100)]
bhyve: enumerate BARs by size
E.g. Framebuffers can require large space and BARs need to be aligned
by their size. If BARs aren't allocated by size, it'll cause much
fragmentation of the MMIO space. Reduce fragmentation by ordering
the BAR allocation on their size to reduce the risk of
OUT_OF_MMIO_SPACE issues.
Bjoern A. Zeeb [Mon, 10 Jan 2022 22:12:53 +0000 (22:12 +0000)]
LinuxKPI: 802.11 fix locking in lkpi_stop_hw_scan()
In lkpi_stop_hw_scan() we have to unlock around cancelling the
hardware scan and an msleep to wait for the confirmation that the
scan ended. Otherwise we are sleeping with the non-sleepable
net80211 com lock held. At the same time we need to hold the lhw
lock for the msleep().
This lock change got lost in the refactoring of lkpi_iv_newstate().
Reported by: ambrisko, delphij
PR: 261075
Sponsored by: The FreeBSD Foundation
Bjoern A. Zeeb [Sun, 9 Jan 2022 02:21:05 +0000 (02:21 +0000)]
LinuxKPI / iwlwifi: fix spelling of constants
Fix the spelling of IEEE80211_HE_PHY_CAP9_NOMINAL_PKT_PADDING_*
(was "NOMIMAL"). The original version came from iwlwifi
in iwlwifi-next. Other drivers (from wireless-testing) already
use the correct spelling and need this change in LinuxKPI.
We never initialized hw->conf.flags for IEEE80211_CONF_IDLE but
on set_channel we would clear it and announce a change.
This lead to a problem that drivers may do some work every time
which was not needed and may lead to unexpected behaviour (for no
better driver code).
Properly initialize conf.flags with IEEE80211_CONF_IDLE.
Factor out the toggling into a function and clear IDLE while
sw scanning and when associated and set again when scan ends
or we are bouncing out of assoc.
Bjoern A. Zeeb [Sun, 9 Jan 2022 01:12:05 +0000 (01:12 +0000)]
LinuxKPI: bitfields add more *replace_bits()
Add or extend the already existing *_replace_bits() implementations
using macros as we do for the other parts in the file for
le<n>p_replace_bits(), u<n>p_replace_bits(), and _u<n>_replace_bits().