Alexander Motin [Thu, 2 Dec 2021 23:01:02 +0000 (18:01 -0500)]
APEI: Improve multiple error sources handling.
Some AMD systems I have report 8 NMI and 3591 polled error sources.
Previous code could handle only one NMI source and used separate
callout for each polled source. New code can handle multiple NMIs
and groups polled sources by power of 2 of the polling period.
procstat_getfiles_sysctl: do not require non-null ki_fd
ki_fd is legitimately NULL when 32bit process requests process data
from 64bit host kernel. The field is not used by the code for sysctl
case; procstat_getfiles_kvm() checks ki_fd.
PR: 260174
Reported by: Damjan Jovanovic <damjan.jov@gmail.com>
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Warner Losh [Thu, 2 Dec 2021 20:53:44 +0000 (13:53 -0700)]
mps(4): Fix unmatched devq release.
Port 9781c28c6d63 and a8837c77efd0 to the mps driver. Before this
change devq was frozen only if some command was sent to the target after
reset started, but release was called always. This change freezes the
devq immediately, leaving mprsas_action_scsiio() check only to cover
race condition due to different lock devq use.
This should also avoid unnecessary requeue of the commands, creating
additional log noise and confusing some broken apps. It also avoids a
'busy' requeue of I/Os failing when we're doing recovery that takes
longer than the normal busy timeout. These I/Os failing can lead to
filesystems being unmounted in the force unmount case for I/O errors.
Gleb Smirnoff [Thu, 2 Dec 2021 18:59:43 +0000 (10:59 -0800)]
epoch: with EPOCH_TRACE add epoch_where_report()
which will report where the epoch was entered and also
mark the tracker, so that exit will also be reported.
Helps to understand epoch entrance/exit scenarios in
complex cases, like network stack. As everything else
under EPOCH_TRACE it is a developer only tool.
Gleb Smirnoff [Thu, 2 Dec 2021 18:48:49 +0000 (10:48 -0800)]
tcp_hpts: rewrite inpcb synchronization
Just trust the pcb database, that if we did in_pcbref(), no way
an inpcb can go away. And if we never put a dropped inpcb on
our queue, and tcp_discardcb() always removes an inpcb to be
dropped from the queue, then any inpcb on the queue is valid.
Now, to solve LOR between inpcb lock and HPTS queue lock do the
following trick. When we are about to process a certain time
slot, take the full queue of the head list into on stack list,
drop the HPTS lock and work on our queue. This of course opens
a race when an inpcb is being removed from the on stack queue,
which was already mentioned in comments. To address this race
introduce generation count into queues. If we want to remove
an inpcb with generation count mismatch, we can't do that, we
can only mark it with desired new time slot or -1 for remove.
Gleb Smirnoff [Thu, 2 Dec 2021 18:48:48 +0000 (10:48 -0800)]
tcp_hpts: rename input queue to drop queue and trim dead code
The HPTS input queue is in reality used only for "delayed drops".
When a TCP stack decides to drop a connection on the output path
it can't do that due to locking protocol between main tcp_output()
and stacks. So, rack/bbr utilize HPTS to drop the connection in
a different context.
In the past the queue could also process input packets in context
of HPTS thread, but now no stack uses this, so remove this
functionality.
Gleb Smirnoff [Thu, 2 Dec 2021 18:48:48 +0000 (10:48 -0800)]
SMR protection for inpcbs
With introduction of epoch(9) synchronization to network stack the
inpcb database became protected by the network epoch together with
static network data (interfaces, addresses, etc). However, inpcb
aren't static in nature, they are created and destroyed all the
time, which creates some traffic on the epoch(9) garbage collector.
Fairly new feature of uma(9) - Safe Memory Reclamation allows to
safely free memory in page-sized batches, with virtually zero
overhead compared to uma_zfree(). However, unlike epoch(9), it
puts stricter requirement on the access to the protected memory,
needing the critical(9) section to access it. Details:
- The database is already build on CK lists, thanks to epoch(9).
- For write access nothing is changed.
- For a lookup in the database SMR section is now required.
Once the desired inpcb is found we need to transition from SMR
section to r/w lock on the inpcb itself, with a check that inpcb
isn't yet freed. This requires some compexity, since SMR section
itself is a critical(9) section. The complexity is hidden from
KPI users in inp_smr_lock().
- For a inpcb list traversal (a pcblist sysctl, or broadcast
notification) also a new KPI is provided, that hides internals of
the database - inp_next(struct inp_iterator *).
Gleb Smirnoff [Thu, 2 Dec 2021 18:48:48 +0000 (10:48 -0800)]
Remove "options PCBGROUP"
With upcoming changes to the inpcb synchronisation it is going to be
broken. Even its current status after the move of PCB synchronization
to the network epoch is very questionable.
This experimental feature was sponsored by Juniper but ended never to
be used in Juniper and doesn't exist in their source tree [sjg@, stevek@,
jtl@]. In the past (AFAIK, pre-epoch times) it was tried out at Netflix
[gallatin@, rrs@] with no positive result and at Yandex [ae@, melifaro@].
I'm up to resurrecting it back if there is any interest from anybody.
Randall Stewart [Thu, 2 Dec 2021 11:12:16 +0000 (06:12 -0500)]
tcp: unloading a module that is set to default should error.
I just discovered that the return of the EBUSY error was incorrectly
rigged so that you could unload a CC module that was set to default.
Its supposed to be an EBUSY error. Make it so.
Reviewed by: Michael Tuexen
Sponsored by: Netflix Inc.
Differential Revision: https://reviews.freebsd.org/D33229
Ed Maste [Wed, 1 Dec 2021 21:49:16 +0000 (16:49 -0500)]
OptionalObsoleteFiles.inc: remove MK_CXX rule for usr/bin/c++
In fact MK_CXX does not control whether /usr/bin/c++ is built -- it is
installed as a link to Clang (which is always a C/C++ compiler), and it
already exists in OptionalObsoleteFiles under MK_TOOLCHAIN.
Rick Macklem [Wed, 1 Dec 2021 21:55:17 +0000 (13:55 -0800)]
nfsd: Sanity check the ACL attribute
When an ACL is presented to the NFSv4 server in
Setattr or Verify, parsing of the ACL assumed a
sane acecnt and sane sizes for the "who" strings.
This patch adds sanity checks for these.
The patch also fixes handling of an error
return from nfsrv_dissectacl() for one broken
case.
Rick Macklem [Wed, 1 Dec 2021 21:46:41 +0000 (13:46 -0800)]
nfsd: Do not try to cache a reply for NFSERR_BADSLOT
When nfsrv_checksequence() replies NFSERR_BADSLOT,
the value of nd_slotid is not valid. As such, the
reply cannot be cached in the session.
Do not set ND_HASSEQUENCE for this case.
Ed Maste [Wed, 1 Dec 2021 21:38:10 +0000 (16:38 -0500)]
OptionalObsoleteFiles: move /usr/bin/CC to MK_TOOLCHAIN section
/usr/bin/CC is installed by usr.bin/clang/clang/Makefile, as with
/usr/bin/cc, /usr/bin/cpp, etc., and is not controlled by MK_CXX.
Move it to the same section as those tools.
(It may be that these should all be under
MK_TOOLCHAIN == no || MK_CLANG_IS_CC == no, but that seems like
unnecessary complexity.)
The ifp (struct ifnet) backpointer in the e1000 private ifnet
data is not used anymore since the iflib transition.
Remove it so that developers are not tempted to use it and
get a NULL pointer dereference.
Michael Tuexen [Wed, 1 Dec 2021 15:20:17 +0000 (16:20 +0100)]
libc sctp: fix sctp_getladdrs() when reporting no addresses
Section 9.5 of RFC 6458 (SCTP Socket API) requires that
sctp_getladdrs() returns 0 in case the socket is unbound. This
is the cause of reporting 0 addresses. So don't indicate an
error, just report this case as required.
Michael Tuexen [Wed, 1 Dec 2021 09:13:20 +0000 (10:13 +0100)]
libc sctp: fix sctp_getladdrs() for 64-bit BE platforms
When calling getsockopt() with SCTP_GET_LOCAL_ADDR_SIZE, use a
pointer to a 32-bit variable, since this is what the kernel
expects.
While there, do some cleanups.
Warner Losh [Tue, 30 Nov 2021 22:03:26 +0000 (15:03 -0700)]
Make device_busy/unbusy work w/o Giant held
The vast majority of the busy/unbusy users in the tree don't acquire
Giant before calling device_busy/unbusy. However, if multiple threads
are opening a file, say, that causes the device to busy/unbusy, then we
can race to the root marking things busy. Move to using a reference
count to keep track of how many times a device_t has been made busy. Use
that count to make the same decisions that we'd make with the old device
state.
Note: gpiopps.c uses D_TRACKCLOSE. Others do as well. However, there's a
known race with closes that will be corrected for all the drivers that
do this in a future commit.
Commit message was for a very old version of the patch. Will re-commit
with the right one since it's so bad. There's no locked versions of
it...that code was reworked to use refcnt APIs.
Warner Losh [Tue, 30 Nov 2021 22:03:26 +0000 (15:03 -0700)]
Make device_busy/unbusy work w/o Giant held
The vast majority of the busy/unbusy users in the tree don't acquire Giant
before calling device_busy/unbusy. However, if multiple threads are opening a
file, say, that causes the device to busy/unbusy, then we can race to the root
marking things busy. Create a new device_busy_locked and device_unbusy_locked
that are the current implemntations of device_busy and device_unbusy. Make
device_busy and unbusy acquire Giant before calling the _locked versrions. Since
we never sleep in the busy/unbusy path, Giant's single threaded semantics
suffice to keep this safe.
Chuck Tuffli [Wed, 1 Dec 2021 05:07:32 +0000 (21:07 -0800)]
bhyve blockif: fix blockif_candelete with Capsicum
NVMe conformance tests for the Format command failed if the
backing-storage for the bhyve device was a file instead of a Zvol. The
tests (and the specification) expect a Format to destroy all previously
written data. The bhyve NVMe emulation implements this by trimming /
deallocating all data from the backing-storage.
The blockif_candelete() function indicated the file did not support
deallocation (i.e. fpathconf(..., _PC_DEALLOC_PRESENT) returned FALSE)
even though the kernel supported file hole punching. This occurs on
builds with Capsicum enabled because blockif did not allow the
fpathconf(2) right.
Fix is to add CAP_FPATHCONF to the cap_rights_init(3) call.
This was found while looking for driver_filter_t functions which got the
trap frame from the argument. This particular instance it isn't even
used, so remove now lest someone else get to it first.
* It sets SO_SNDBUF to be as large as MAXLINE. But for unix domain
sockets, the send buffer is bypassed. Packets go directly to the
peer's receive buffer, so setting and querying SO_SNDBUF is
ineffective. To ensure that the socket can accept messages of a
certain size, it would be necessary to add a SO_PEERRCVBUF socket
option that could query the connected peer's receive buffer size.
* It sets MAXLINE to 8 kB, which is larger than the default sockbuf size
of 4 kB. That's ok for the builtin syslogd, which sets its recvbuf
to 80 kB, but not ok for alternative sysloggers, like rsyslogd, which
use the default size.
As a consequence, writing messages of more than 4 kB with syslog() as a
non-root user while running rsyslogd would cause the logging application
to spin indefinitely within syslog().
Stefan Eßer [Tue, 30 Nov 2021 17:33:40 +0000 (18:33 +0100)]
vendor/bc: import release 5.2.1
This release fixes two parse bugs when in POSIX standard mode. One of
these bugs was due to a quirk of the POSIX grammar, and the other was
because bc was too strict.
Kristof Provost [Tue, 30 Nov 2021 15:30:22 +0000 (16:30 +0100)]
if_stf: KASAN fix
In in_stf_input() we grabbed a pointer to the IPv4 header and later did
an m_pullup() before we look at the IPv6 header. However, m_pullup()
could rearrange the mbuf chain and potentially invalidate the pointer to
the IPv4 header.
Avoid this issue by copying the IP header rather than getting a pointer
to it.
Mitchell Horne [Thu, 25 Nov 2021 16:01:11 +0000 (12:01 -0400)]
Implement GET_STACK_USAGE on remaining archs
This definition enables callers to estimate remaining space on the
kstack, and take action on it. Notably, it enables optimizations in the
GEOM and netgraph subsystems to directly dispatch work items when there
is sufficient stack space, rather than queuing them for a worker thread.
Implement it for riscv, arm, and mips. Remove the #ifdefs, so it will
not go unimplemented elsewhere.
Mitchell Horne [Tue, 30 Nov 2021 15:15:44 +0000 (11:15 -0400)]
arm64, powerpc: fix calculation of 'used' in GET_STACK_USAGE
We do not consider the space reserved for the pcb to be part of the
total kstack size, so it should not be included in the calculation of
the used stack size.
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
Mitchell Horne [Thu, 25 Nov 2021 15:54:33 +0000 (11:54 -0400)]
i386: take pcb and fpu area into account in GET_STACK_USAGE
On this platform, the pcb and FPU save area are allocated from the top
of each kernel stack, so they should be excluded from the calculation of
the total and used stack sizes.
Reviewed by: kib
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D32581
Bjoern A. Zeeb [Tue, 30 Nov 2021 14:23:18 +0000 (14:23 +0000)]
fw_stub: fix -Wunused-but-set-variable for firmware files
In case we are only embedding a single firmware image the variable
"parent" gets set but never used. Add checks for the number of files
for it and only print it out if we are exceeding the single file count.
This fixes -Wunused-but-set-variable warnings for the majority of
firmware files in the tree.
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
Andriy Gapon [Tue, 30 Nov 2021 13:23:23 +0000 (15:23 +0200)]
kern_tc: unify timecounter to bintime delta conversion
There are two places where we convert from a timecounter delta to
a bintime delta: tc_windup and bintime_off.
Both functions use the same calculations when the timecounter delta is
small. But for a large delta (greater than approximately an equivalent
of 1 second) the calculations were different. Both functions use
approximate calculations based on th_scale that avoid division. Both
produce values slightly greater than a true value, calculated with
division by tc_frequency, would be. tc_windup is slightly more
accurate, so its result is closer to the true value and, thus, smaller
than bintime_off result.
As a consequence there can be a jump back in time when time hands are
switched after a long period of time (a large delta). Just before the
switch the time would be calculated with a large delta from
th_offset_count in bintime_off. tc_windup does the switch using its own
calculations of a new th_offset using the large delta. As explained
earlier, the new th_offset may end up being less than the previously
produced binuptime. So, for a period of time new binuptime values may
be "back in time" comparing to values just before the switch.
Such a jump must never happen. All the code assumes that the uptime is
monotonically nondecreasing and some code works incorrectly when that
assumption is broken. For example, we have observed sleepq_timeout()
ignoring a timeout when the sbinuptime value obtained by the callout
code was greater than the expiration value, but the sbinuptime obtained
in sleepq_timeout() was less than it. In that case the target thread
would never get woken up.
The unified calculations should ensure the monotonic property of the
uptime.
The problem is quite rare as normally tc_windup should be called HZ
times per second (typically 1000 or 100). But it may happen in VMs on
very busy hypervisors where a VM's virtual CPU may not get an execution
time slot for a second or more.
Wojciech Macek [Thu, 25 Nov 2021 09:36:55 +0000 (10:36 +0100)]
flex_spi: Support for FlexSPI Flash controller.
NXP FlexSPI is a complex SPI controller which provides
full offload for accessing NOR Flash.
Create a Flash driver which attaches to existing FreeBSD
infrastructure and exports generic READ and WRITE disk commands.
The Flash has to be identified first to configure controller
internals. For now, only one NOR Flash chip is supported.
Future commits shall either increase number of known chips
or implement SFDP mechanism which can be used by other Flash
drivers.