Warner Losh [Thu, 23 Sep 2021 22:31:32 +0000 (16:31 -0600)]
nvme: Use shared timeout rather than timeout per transaction
Keep track of the approximate time commands are 'due' and the next
deadline for a command. Twice a second, wake up to see if any commands
have entered timeout. If so, quiesce and then enter a recovery mode
half the timeout further in the future to allow the ISR to
complete. Once we exit recovery mode, we go back to operations as
normal.
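A minimal sketch of the shared-timeout idea, not the driver's code: each command only records when it is due, and one periodic check (run about twice a second) looks for late commands and enters a recovery window half a timeout long. The `tracker` and `queue` structures, the tick units, and both function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-command tracker; only the deadline matters here. */
struct tracker {
	bool	active;
	int64_t	deadline;	/* tick at which the command counts as late */
};

/* Hypothetical queue state: one shared timer instead of one per command. */
struct queue {
	struct tracker	*trackers;
	size_t		ntrackers;
	bool		recovery;
	int64_t		recovery_until;
};

/* Record the approximate due time when a command is submitted. */
static void
queue_submit(struct queue *q, size_t idx, int64_t now, int64_t timeout)
{
	assert(idx < q->ntrackers);
	q->trackers[idx].active = true;
	q->trackers[idx].deadline = now + timeout;
}

/*
 * Periodic check.  Returns the number of commands past their deadline
 * and, if there are any, puts the queue into recovery until half a
 * timeout later so that the ISR has a chance to complete them.
 */
static size_t
queue_timer_tick(struct queue *q, int64_t now, int64_t timeout)
{
	size_t late = 0;

	if (q->recovery) {
		if (now < q->recovery_until)
			return (0);	/* still waiting for completions */
		q->recovery = false;	/* back to normal operation */
	}
	for (size_t i = 0; i < q->ntrackers; i++) {
		if (q->trackers[i].active && now >= q->trackers[i].deadline)
			late++;
	}
	if (late != 0) {
		q->recovery = true;
		q->recovery_until = now + timeout / 2;
	}
	return (late);
}
```

The sketch leaves out what a real driver does with commands that are still late once recovery ends; here they would simply trigger another recovery window.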
We can't copyout() while holding a lock, in case it triggers a page
fault.
Release the lock before copyout, which is safe because we've already
copied all the data into the nvlist.
Wenzhuo Lu [Fri, 16 Oct 2015 02:51:09 +0000 (10:51 +0800)]
e1000: fix K1 configuration
This patch is for the following updates to the K1 configurations:
Tx idle period for entering K1 should be 128 ns.
Minimum Tx idle period in K1 should be 256 ns.
From jilles: POSIX requires that a script set `OPTIND=1` before using
different sets of parameters with `getopts`, or the results will be
unspecified.
The specific problem observed here is that we would execute `man -f` or
`man -k` without cleaning up state from man_parse_args()'s `getopts`
loop. FreeBSD's /bin/sh seems to reset OPTIND to 1 after we hit the
second getopts loop, rendering the following shift harmless; other
/bin/sh implementations will leave it at what we came into the loop at
(e.g., bash as /bin/sh), shifting off any keywords that we had.
Alexander Motin [Thu, 23 Sep 2021 17:41:02 +0000 (13:41 -0400)]
x86: Add NUMA nodes into CPU topology.
Depending on hardware, NUMA nodes may match last level caches, or
they may be above them (AMD Zen 2/3) or below (Intel Xeon w/ SNC).
This information is provided by ACPI instead of CPUID, and it is
provided for each CPU individually instead of mask widths, but
this code should be able to properly handle all the above cases.
This change should immediately allow idle stealing in sched_ule(4)
to prefer load from NUMA-local CPUs to remote ones when the node
does not match LLC. Later we may think of how to better handle it
on sched_pickcpu() side.
Randall Stewart [Thu, 23 Sep 2021 15:43:29 +0000 (11:43 -0400)]
tcp: Rack compressed ack path updates the recv window too easily
The compressed ack path of rack is not following proper procedures in updating
the peer's window. It should be checking the seq and ack values before updating and
instead it is blindly updating the values. This could in theory get the wrong window
in the connection for some length of time.
Reviewed by: tuexen
Sponsored by: Netflix Inc.
Differential Revision: https://reviews.freebsd.org/D32082
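For the compressed ack fix above, the kind of seq and ack check meant can be sketched with the classic BSD-style window update rule. This is an illustration of that general rule, not the rack code; `struct peer_window` and `window_update()` are hypothetical names, and the `snd_wl1`/`snd_wl2` fields follow the usual convention of recording the seq and ack of the segment that last updated the window.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Sequence-space comparison that tolerates 32-bit wraparound. */
#define	SEQ_LT(a, b)	((int32_t)((a) - (b)) < 0)

/* Hypothetical subset of the send-side state. */
struct peer_window {
	uint32_t snd_wnd;	/* peer's advertised window */
	uint32_t snd_wl1;	/* seq of the segment used for last update */
	uint32_t snd_wl2;	/* ack of the segment used for last update */
};

/*
 * Accept a window from a segment only if the segment is newer than the
 * one used for the last update, or is the same one advertising a larger
 * window.  Returns true if the window was updated.
 */
static bool
window_update(struct peer_window *pw, uint32_t seq, uint32_t ack,
    uint32_t win)
{
	assert(pw != NULL);
	if (SEQ_LT(pw->snd_wl1, seq) ||
	    (pw->snd_wl1 == seq && (SEQ_LT(pw->snd_wl2, ack) ||
	    (pw->snd_wl2 == ack && win > pw->snd_wnd)))) {
		pw->snd_wnd = win;
		pw->snd_wl1 = seq;
		pw->snd_wl2 = ack;
		return (true);
	}
	return (false);
}
```

Without such a check, an old or reordered segment can overwrite a newer window value, which is the "wrong window for some length of time" case the commit describes.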
Randall Stewart [Thu, 23 Sep 2021 14:54:23 +0000 (10:54 -0400)]
tcp: Two bugs in rack one of which can lead to a panic.
In extensive testing in NF we have found two issues inside
the rack stack.
1) An incorrect offset is being generated by the fast send path when a fast send is initiated on
the end of the socket buffer and before the fast send runs, the sb_compress macro adds data to the trailing socket.
This fools the fast send code into thinking the sb offset changed and it miscalculates an "updated offset".
It should only do that when the mbuf in question got smaller, i.e. an ack was processed. This can lead to
a panic deref'ing a NULL mbuf if that packet is ever retransmitted. In the best case it leads to invalid data being
sent to the client which usually terminates the connection. The fix is to have the proper logic (that is in the rsm fast path)
to make sure we only update the offset when the mbuf shrinks.
2) The other issue is more bothersome. The timestamp check in rack needs to use the msec timestamp when
comparing the timestamp echo to now. It was using a microsecond timestamp which ends up giving error
prone results but causes only small harm in trying to identify which send to use in RTT calculations if it's a retransmit.
Reviewed by: tuexen
Sponsored by: Netflix Inc.
Differential Revision: https://reviews.freebsd.org/D32062
kern: random: collect ~16x less from fast-entropy sources
Previously, we were collecting at a base rate of:
64 bits x 32 pools x 10 Hz = 2.5 kB/s
This change drops it to closer to 64-ish bits per pool per second, to
work a little better with entropy providers in virtualized environments
without compromising the security goals of Fortuna.
kern: random: drop read_rate and associated functionality
Refer to discussion in PR 230808 for a less incomplete discussion, but
the gist of this change is that we currently collect orders of magnitude
more entropy than we need.
The excess comes from bytes being read out of /dev/*random. The default
rate at which we collect entropy without the read_rate increase is
already more than we need to recover from a compromise of an internal
state.
Avoid using atomics as it_wait is guarded by td_lock.
Report threshold calculation is done only if at least one PMC hook
is installed
Fixes:
* avoid unnecessary branching (if frame != null ...)
by having PMC_HOOK_INSTALLED_ANY
condition on the top of them, which should hint
the core not to execute speculatively anything
which is underneath;
* access intr_hwpmc_waiting_report_threshold cacheline
only if at least one hook is loaded;
truss: Decode correctly 64bits arguments on 32bits arm.
Mostly revert ebbc3140ca0d7eee154f7a67ccdae7d3d88d13fd.
We don't need to special-case anything for arm64, the check for the pointer
size is already done for us, just keep the bits about arm and arm64
having to add padding for 32bits binaries.
Eliminate an unnecessary rerun request in fsck_ffs.
When fsck_ffs is running in preen mode and finds a zero-length directory,
it deletes that directory. In doing this operation, it unnecessarily set
its internal flag saying that fsck_ffs needed to be rerun. This patch
deletes the rerun request for this case.
Reported by: Mark Johnson
PR: 246962
MFC after: 1 week
Sponsored by: Netflix
truss: Decode correctly 64bits arguments on 32bits arm.
When decoding 32bits arm syscall, make sure we account for the padding when
decoding 64bits args. Do it too when using a 64bits truss on a 32bits binary.
Add aarch64 to the list of architectures that can run 32bits FreeBSD binaries,
so that truss works correctly with an arm32 binary.
The same should probably be done with mips.
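The padding in question comes from the 32-bit ARM calling convention, where a 64-bit argument starts on an even 32-bit slot. A minimal sketch of how a decoder can account for it follows; `arg_slot()` and `fetch_arg64()` are hypothetical helpers rather than truss code, and the low-word-first order assumes a little-endian target.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Return the index of the first 32-bit slot holding argument `which`,
 * given which arguments are 64 bits wide.  A 64-bit argument that would
 * start on an odd slot is preceded by one padding slot.
 */
static size_t
arg_slot(const bool *is64, size_t nargs, size_t which)
{
	size_t slot = 0;

	assert(which < nargs);
	for (size_t i = 0; ; i++) {
		if (is64[i] && (slot & 1) != 0)
			slot++;			/* skip the padding slot */
		if (i == which)
			return (slot);
		slot += is64[i] ? 2 : 1;
	}
}

/* Reassemble a 64-bit argument from two slots (low word first). */
static uint64_t
fetch_arg64(const uint32_t *slots, size_t slot)
{
	return (((uint64_t)slots[slot + 1] << 32) | slots[slot]);
}
```

For a call shaped like pread(fd, buf, nbytes, offset), the 64-bit offset would start at slot 3, so it moves to slot 4 and slot 3 is padding. A decoder that ignores this reads the offset from the wrong pair of slots.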
Revert "linux32: add a hack to avoid redefining the type of the savefpu tag"
This reverts commit 0f6829488ef32142b9ea1c0806fb5ecfe0872c02.
Also it changes the type of md_usr_fpu_save struct mdthread member
to void *, which is what uncovered this trouble. Now the save area
is untyped, but since it is hidden behind accessors, it is not too
significant. Since apparently there are consumers affected outside
the tree, this hack is better than one from the reverted revision.
Stefan Eßer [Wed, 22 Sep 2021 11:59:01 +0000 (13:59 +0200)]
ObsoleteFiles.inc: Add sponge(1) command and man-page
The sponge command was imported on 2017-12-05, but the import was
reverted the next day.
A script failed and I found that it was due to the left-over broken
sponge binary in base being preferred over the port version. To prevent
a known non-working binary from persisting in /usr/bin, I'm adding sponge
to the obsolete files list even though it could only be installed on
a single day in 2017.
I do not plan to MFC this change since the issue will only exist on
systems installed from -CURRENT sources in 2017, and I do assume that
such systems are not running -STABLE today.
Until this change, any bindings set in histedit() were lost on calls to
bindcmd().
Only bind -e and bind -v call libedit's keymacro_reset(). Currently you
cannot fool libedit/map.c:map_bind() by trying something like bind -le
as when p[0] == '-', it does a switch statement on p[1].
Alexander Motin [Tue, 21 Sep 2021 22:14:22 +0000 (18:14 -0400)]
sched_ule(4): Improve long-term load balancer.
Before this change the long-term load balancer was unable to migrate
running threads, only ones waiting on run queues. But with the growing
number of CPU cores it is now quite typical for a system not to have
many waiting threads. At the same time, if by some coincidence two
long-running CPU-bound threads ended up sharing the same physical CPU
core, they could suffer from the SMT penalty indefinitely, and the
load balancer couldn't help.
Improve that by teaching the load balancer to hint running threads
to migrate by marking them with TDF_NEEDRESCHED and the new TDF_PICKCPU
flag, making sched_pickcpu() search for a better CPU later, when
it is convenient.
Fix CPU search logic when balancing to limit round-robin migrations
in case of almost equal load to the group of physical cores. The
previous code bounced threads across the whole system, which should be
pretty bad for caches and NUMA affinity, while the additional fairness
was almost invisible, diminishing with the number of cores in the group.
According to https://github.com/NuxiNL/cloudlibc:
CloudABI is no longer being maintained. It was an awesome experiment,
but it never got enough traction to be sustainable.
There is no reason to keep it in FreeBSD.
Approved by: ed (private mail)
Reviewed by: emaste
Sponsored by: The FreeBSD Foundation
Differential revision: https://reviews.freebsd.org/D31923
After b4cb3fe0e39a, loader started crashing on PowerPC64, with a
Program Exception (700) error. The problem was that archsw was
used before being initialized, with the new mount feature. This
change fixes the issue by initializing archsw earlier, before
setting currdev, that triggers the mount.
Reviewed by: tsoome
MFC after: 1 month
X-MFC-With: b4cb3fe0e39a
Sponsored by: Instituto de Pesquisas Eldorado (eldorado.org.br)
Differential Revision: https://reviews.freebsd.org/D32027
Alan Somers [Thu, 16 Sep 2021 19:19:21 +0000 (13:19 -0600)]
fusefs: don't panic if FUSE_GETATTR fails during VOP_GETPAGES
During VOP_GETPAGES, fusefs needs to determine the file's length, which
could require a FUSE_GETATTR operation. If that fails, it's better to
SIGBUS than panic.
For signal send, copyout from the user FPU save area directly.
For sigreturn, we are in sleepable context and can do temporary
allocation of the transient save area. We cannot copy from userspace
directly to the user save area because XSAVE state needs to be validated;
also, partial copyins can corrupt it.
amd64: stop using top of the thread's kernel stack for FPU user save area
Instead do one more allocation at the thread creation time. This frees
a lot of space on the stack.
Also do not use alloca() for temporary storage in the signal delivery
function sendsig() and the signal return syscall sys_sigreturn(). This
saves an equal amount of space, again at the cost of one more allocation
at thread creation time.
A useful experiment now would be to reduce KSTACK_PAGES.
Reviewed by: jhb, markj
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D31954
amd64: move signal handling and register structures manipulations into exec_machdep.c
from machdep.c, which is too large a pile of unrelated things.
Some ptrace functions are moved from machdep.c to ptrace_machdep.c.
Now machdep.c contains code mostly related to the low level initialization
and regular low level operation of the architecture, while signal MD code
and registers handling is placed in exec_machdep.c.
Write to the PWREN register should be done in update_ios based
on the power_mode value in the ios struct.
Also, neither the manuals (RockChip and Altera) nor Linux mention
the need for an inverted PWREN value, so just remove this.
This fixes eMMC (and possibly SD) when u-boot didn't set up the controller.
Mark Johnston [Tue, 21 Sep 2021 15:32:23 +0000 (11:32 -0400)]
bitset(9): Introduce BIT_FOREACH_ISSET and BIT_FOREACH_ISCLR
These allow one to non-destructively iterate over the set or clear bits
in a bitset. The motivation is that we have several code fragments
which iterate over a CPU set by calling CPU_FFS in a loop and clearing
each bit found.
This is slow since CPU_FFS begins the search at the beginning of the
bitset each time. On amd64 and arm64, CPU sets have size 256, so there
are four limbs in the bitset and we do a lot of unnecessary scanning.
A second problem is that this is destructive, so code which needs to
preserve the original set has to make a copy. In particular, we have
quite a few functions which take a cpuset_t parameter by value, meaning
that each call has to copy the 32 byte cpuset_t.
The new macros address both problems.
Reviewed by: cem, kib
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D32028
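For the bitset(9) change above, the two iteration styles can be contrasted in a small userspace sketch. It uses a hypothetical 256-bit `struct set256` with four 64-bit limbs, not the kernel's bitset macros, and the helper names are invented for illustration; it is not how BIT_FOREACH_ISSET is implemented.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define	SET_BITS	256
#define	WORD_BITS	64
#define	SET_WORDS	(SET_BITS / WORD_BITS)

/* Hypothetical 256-bit set made of four 64-bit limbs. */
struct set256 {
	uint64_t w[SET_WORDS];
};

static void
set256_set(struct set256 *s, int bit)
{
	assert(bit >= 0 && bit < SET_BITS);
	s->w[bit / WORD_BITS] |= (uint64_t)1 << (bit % WORD_BITS);
}

/* Count trailing zero bits of a nonzero word (portable loop). */
static int
ctz64(uint64_t x)
{
	int n = 0;

	while ((x & 1) == 0) {
		x >>= 1;
		n++;
	}
	return (n);
}

/* 1-based index of the first set bit, or 0; always scans from limb 0. */
static int
set256_ffs(const struct set256 *s)
{
	for (int i = 0; i < SET_WORDS; i++) {
		if (s->w[i] != 0)
			return (i * WORD_BITS + ctz64(s->w[i]) + 1);
	}
	return (0);
}

/* Old pattern: find-first-set plus clear.  Rescans and empties the set. */
static int
visit_destructive(struct set256 *s, int *out)
{
	int bit, n = 0;

	while ((bit = set256_ffs(s)) != 0) {
		bit--;
		s->w[bit / WORD_BITS] &= ~((uint64_t)1 << (bit % WORD_BITS));
		out[n++] = bit;
	}
	return (n);
}

/* New pattern: walk each limb once, consuming only a local copy of it. */
static int
visit_foreach(const struct set256 *s, int *out)
{
	int n = 0;

	for (int i = 0; i < SET_WORDS; i++) {
		for (uint64_t w = s->w[i]; w != 0; w &= w - 1)
			out[n++] = i * WORD_BITS + ctz64(w);
	}
	return (n);
}
```

Both functions report the same bits in the same order, but only the second leaves the caller's set intact and avoids restarting the scan at limb 0 for every bit.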
Michael Tuexen [Tue, 21 Sep 2021 15:13:57 +0000 (17:13 +0200)]
sctp: Simplify stream scheduler usage
Callers are getting the stcb send lock, so just KASSERT that.
No need to signal this when calling stream scheduler functions.
No functional change intended.
A different exception is raised when we hit a 32bits breakpoint, rather than
a 64bits one, so handle those as well when COMPAT_FREEBSD32 is defined.
This should fix SIGBUS at least when using breakpoints with thumb2 code.
When handling a data irq, the sdhci driver calls the
sdhci_platform_will_handle() method, to determine if it should allow the
platform driver to handle the transfer or fall back to programmed I/O.
While dumping, the data irq path may be invoked directly (not from an
interrupt context), which the bcm2835_sdhci DMA code is not prepared to
handle. Return early in this case, to force the fallback to PIO.
Otherwise, the KASSERT that follows will be triggered, and the dump will
fail. On non-INVARIANTS kernels, the system will hang, waiting for a DMA
interrupt that will never arrive.
Reviewed by: kevans
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D31893
Warner Losh [Tue, 21 Sep 2021 04:02:35 +0000 (22:02 -0600)]
endian.h: Use the __bswap* versions
Make it possible to have all these macros work without bswap* being
defined. bswap* is part of the application namespace and applications
are free to redefine those functions.
Warner Losh [Fri, 17 Sep 2021 22:30:06 +0000 (16:30 -0600)]
camcontrol: depop command
Implement and document the new depop command. This command manages drive elements
for drives that support it. Storage elements are typically heads. Element status
can be discovered. Elements may be removed or restored. And the status of any
current depop operation can be assessed.
depop -d elm will remove element elm and truncate available capacity.
depop -l will list the current drive elements and their current status.
depop -r elm will try to restore all retired elements and rebuild capacity.
Changing storage elements may reinitialize the drive. This operation will lose
data and may take hours to complete. Use the drive provided timeout for
operations by default.
Warner Losh [Fri, 17 Sep 2021 22:29:22 +0000 (16:29 -0600)]
libcam: Define depop structures and introduce scsi_wrap
Define structures related to the depop set of commands (GET PHYSICAL ELEMENT
STATUS, REMOVE ELEMENT AND TRUNCATE, and RESTORE ELEMENT AND REBUILD) as
well as the CDB construction routines.
Also create scsi_wrap.c. This will have convenience routines that will do all
the elements of allocating the ccb, generating the CDB, sending the command
(looping as necessary for cases where data is returned, but its size isn't
known up front), etc. As this functionality is fleshed out, calling many
camcontrol commands programmatically gets much easier.
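The "looping as necessary" part can be sketched as a retry loop: issue the command with a guessed buffer size, read the total length from the reply header, and reissue with a larger buffer if the reply did not fit. Everything below is hypothetical and is not the libcam API: the `issue_fn` callback, the `fake_dev` test device, and the host-endian 4-byte length header are stand-ins (real SCSI headers are big-endian and command specific).

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical command callback: transfers at most buflen bytes of a
 * reply whose first 4 bytes hold the total reply length.
 */
typedef int (*issue_fn)(void *arg, uint8_t *buf, size_t buflen);

/* Fetch a whole reply of unknown size; caller frees the result. */
static uint8_t *
fetch_all(issue_fn issue, void *arg, size_t *lenp)
{
	size_t buflen = 32;		/* initial guess */
	uint8_t *buf = NULL;

	for (;;) {
		uint8_t *nbuf = realloc(buf, buflen);
		uint32_t total;

		if (nbuf == NULL) {
			free(buf);
			return (NULL);
		}
		buf = nbuf;
		memset(buf, 0, buflen);
		if (issue(arg, buf, buflen) != 0) {
			free(buf);
			return (NULL);
		}
		memcpy(&total, buf, sizeof(total));
		if (total < sizeof(total)) {
			free(buf);	/* malformed header */
			return (NULL);
		}
		if (total <= buflen) {
			*lenp = total;
			return (buf);
		}
		buflen = total;		/* too small: retry with exact size */
	}
}

/* Hypothetical test device that serves a fixed reply. */
struct fake_dev {
	const uint8_t	*data;
	size_t		len;
	int		calls;
};

static int
fake_issue(void *arg, uint8_t *buf, size_t buflen)
{
	struct fake_dev *d = arg;
	size_t n = d->len < buflen ? d->len : buflen;

	assert(d->len >= sizeof(uint32_t));
	d->calls++;
	memcpy(buf, d->data, n);
	return (0);
}
```

With a 100-byte reply, the first call sees only 32 bytes, learns the real size from the header, and the second call retrieves everything.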
Greg V [Sat, 24 Apr 2021 11:53:34 +0000 (14:53 +0300)]
vt: call driver's postswitch when panicking on ttyv0
In vt_kms, the postswitch callback restores fbdev mode when
panicking or entering the debugger. This ensures that even when
a graphical application was running on the first tty, simple framebuffer
mode would be restored and the panic would be visible instead
of the frozen GUI. But vt wouldn't call the postswitch callback
when we're already on the first tty, so running a GUI on it
would prevent you from reading any panics.
Add generic mmc_helper which uses newly introduced device_*_property
api. Thanks to this change the sd/mmc drivers will be capable
of parsing both DT and ACPI description.
Ensure backward compatibility for all mmc_fdt_helper users.
Andrew Turner [Mon, 20 Sep 2021 08:55:44 +0000 (08:55 +0000)]
Add ELF macros found in the aaelf64 spec
The arm64 aaelf64 spec [0] has DT_AARCH64_ that could be used with
dynamic linking. It also adds GNU_PROPERTY_AARCH64_FEATURE_1_AND used
to tell the kernel which CPU features the binary is compatible with,
but does not require in order to execute correctly.
Add these values so the kernel and elf tools can make use of them.
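As an illustration of how a consumer might use the property, the sketch below tests feature bits in a GNU_PROPERTY_AARCH64_FEATURE_1_AND entry. The numeric values are quoted from the published AArch64 ELF ABI as best recalled and should be checked against the spec; `note_has_feature()` is a hypothetical helper, and parsing of the surrounding note section is left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Values as given in the AArch64 ELF ABI (verify against the spec). */
#ifndef GNU_PROPERTY_AARCH64_FEATURE_1_AND
#define	GNU_PROPERTY_AARCH64_FEATURE_1_AND	0xc0000000
#endif
#ifndef GNU_PROPERTY_AARCH64_FEATURE_1_BTI
#define	GNU_PROPERTY_AARCH64_FEATURE_1_BTI	0x00000001
#endif
#ifndef GNU_PROPERTY_AARCH64_FEATURE_1_PAC
#define	GNU_PROPERTY_AARCH64_FEATURE_1_PAC	0x00000002
#endif

/*
 * Hypothetical helper: given one already-parsed property (type and
 * 32-bit data word), report whether the binary declares itself
 * compatible with the given feature.
 */
static bool
note_has_feature(uint32_t pr_type, uint32_t pr_data, uint32_t feature)
{
	assert(feature != 0);
	if (pr_type != GNU_PROPERTY_AARCH64_FEATURE_1_AND)
		return (false);
	return ((pr_data & feature) != 0);
}
```

A set bit only says the binary is compatible with the feature; as the commit notes, it does not mean the binary needs the feature in order to run.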
Xin LI [Mon, 20 Sep 2021 05:25:23 +0000 (22:25 -0700)]
The linux rc.d script mounts several filesystems related to Linux ABI
compatibility layer. When /compat is located on a ZFS filesystem other
than /, the mounts would fail because that filesystem was not yet mounted.
Solve this by moving `linux` to depend on `zfs` which mounts all ZFS
filesystems.
Mark Johnston [Sun, 19 Sep 2021 17:45:09 +0000 (13:45 -0400)]
freebsd32: Fix a double copyin in sendmsg() and recvmsg()
freebsd32_sendmsg() and freebsd32_recvmsg() both copyin the message
header twice, once directly and once in freebsd32_copyinmsghdr(). The
iovec length from the former is used when copying in msg_iov, but the
rest of the kernel uses the iovec length from the latter. When
kern_sendit() and kern_recvit() iterate over the iovec to compute the
residual for I/O, they can therefore end up walking past the end of the
copied-in iovec, resulting in a system call error, userspace
memory corruption from uiomove() with invalid iovecs, or a kernel page
fault if the copied-in iovec is followed by an unmapped KVA region.
Reported by: syzbot+7cc64cd0c49605acd421@syzkaller.appspotmail.com
Reviewed by: kib, emaste
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D32010
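The double copyin above is a classic double-fetch problem: the same user memory is read twice and may change in between. A minimal userspace simulation follows; `fake_copyin()` stands in for copyin(9) and deliberately lets a "racing" writer change the value after each fetch, and `struct umsg`, `struct lens`, and the two `sendmsg_*` functions are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical user-visible message header; only the iovec count matters. */
struct umsg {
	uint32_t iovlen;
};

/* Lengths used by the two halves of the operation. */
struct lens {
	uint32_t copied;	/* length used to copy in the iovec */
	uint32_t walked;	/* length used later to walk the iovec */
};

/*
 * Stand-in for copyin(9): fetch from "user" memory, then simulate a
 * concurrent user thread overwriting the field.
 */
static void
fake_copyin(struct umsg *uaddr, struct umsg *kaddr, uint32_t racer_value)
{
	assert(uaddr != NULL && kaddr != NULL);
	*kaddr = *uaddr;
	uaddr->iovlen = racer_value;
}

/* Buggy pattern: two independent fetches can disagree. */
static struct lens
sendmsg_double_fetch(struct umsg *u, uint32_t racer_value)
{
	struct umsg first, second;
	struct lens l;

	fake_copyin(u, &first, racer_value);
	l.copied = first.iovlen;
	fake_copyin(u, &second, racer_value);
	l.walked = second.iovlen;
	return (l);
}

/* Fixed pattern: fetch once and use that copy everywhere. */
static struct lens
sendmsg_single_fetch(struct umsg *u, uint32_t racer_value)
{
	struct umsg hdr;
	struct lens l;

	fake_copyin(u, &hdr, racer_value);
	l.copied = hdr.iovlen;
	l.walked = hdr.iovlen;
	return (l);
}
```

In the buggy variant the walker can be told there are far more iovec entries than were actually copied in, which corresponds to walking past the end of the copied-in iovec as described in the commit.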