amd64: stop using top of the thread' kernel stack for FPU user save area
Instead do one more allocation at the thread creation time. This frees
a lot of space on the stack.
Also do not use alloca() for temporal storage in signal delivery sendsig()
function and signal return syscall sys_sigreturn(). This saves equal
amount of space, again by the cost of one more allocation at the thread
creation time.
A useful experiment now would be to reduce KSTACK_PAGES.
Reviewed by: jhb, markj
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D31954
amd64: move signal handling and register structures manipulations into exec_machdep.c
from machdep.c which is too large pile of unrelated things.
Some ptrace functions are moved from machdep.c to ptrace_machdep.c.
Now machdep.c contains code mostly related to the low level initialization
and regular low level operation of the architecture, while signal MD code
and registers handling is placed in exec_machdep.c.
Write to the PWREN register should be done in update_ios based
on the power_mode value in the ios struct.
Also none of the manual (RockChip and Altera) and Linux talks about
the needed for an inverted PWREN value so just remove this.
This fixes eMMC (and possibly SD) when u-boot didn't setup the controller.
Mark Johnston [Tue, 21 Sep 2021 15:32:23 +0000 (11:32 -0400)]
bitset(9): Introduce BIT_FOREACH_ISSET and BIT_FOREACH_ISCLR
These allow one to non-destructively iterate over the set or clear bits
in a bitset. The motivation is that we have several code fragments
which iterate over a CPU set like this:
This is slow since CPU_FFS begins the search at the beginning of the
bitset each time. On amd64 and arm64, CPU sets have size 256, so there
are four limbs in the bitset and we do a lot of unnecessary scanning.
A second problem is that this is destructive, so code which needs to
preserve the original set has to make a copy. In particular, we have
quite a few functions which take a cpuset_t parameter by value, meaning
that each call has to copy the 32 byte cpuset_t.
The new macros address both problems.
Reviewed by: cem, kib
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D32028
Michael Tuexen [Tue, 21 Sep 2021 15:13:57 +0000 (17:13 +0200)]
sctp: Simplify stream scheduler usage
Callers are getting the stcb send lock, so just KASSERT that.
No need to signal this when calling stream scheduler functions.
No functional change intended.
A different exception is raised when we hit a 32bits breakpoint, rather than
a 64bits one, so handle those as well when COMPAT_FREEBSD32 is defined.
This should fix SIGBUS at least when using breakpoints with thumb2 code.
When handling a data irq, the sdhci driver calls the
sdhci_platform_will_handle() method, to determine if it should allow the
platform driver to handle the transfer or fall back to programmed I/O.
While dumping, the data irq path may be invoked directly (not from an
interrupt context), which the bcm2835_sdhci DMA code is not prepared to
handle. Return early in this case, to force the fallback to PIO.
Otherwise, the KASSERT that follows will be triggered, and the dump will
fail. On non-INVARIANTS kernels, the system will hang, waiting for a DMA
interrupt that will never arrive.
Reviewed by: kevans
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D31893
Warner Losh [Tue, 21 Sep 2021 04:02:35 +0000 (22:02 -0600)]
endian.h: Use the __bswap* versions
Make it possible to have all these macros work without bswap* being
defined. bswap* is part of the application namespace and applications
are free to redefine those functions.
Warner Losh [Fri, 17 Sep 2021 22:30:06 +0000 (16:30 -0600)]
camcontrol: depop command
Implement and document the new depop command. This command manages drive elements
for drives that support it. Storage elements are typically heads. Element status
can be discovered. Elements may be removed or restored. And the status of any
current depop operation can be assessed.
depop -d elm will remove element elm and truncate available capacity.
depop -l will list the current drive elements and their current status.
depop -r elm will try to restore all retired elements and rebuild capacity.
Changing storage elements may reinitialize the drive. This operation will lose
data and may take hours to complete. Use the drive provided timeout for
operations by default.
Warner Losh [Fri, 17 Sep 2021 22:29:22 +0000 (16:29 -0600)]
libcam: Define depop structures and introduce scsi_wrap
Define structures related to the depop set of commands (GET PHYSICAL ELEMENT
STATUS, REMOVE ELEMENT AND TRUNCATE, and RESTORE ELEMENT AND REBUILD) as
well as the CDB construction routines.
Also create scsi_wrap.c. This will have convenience routines that will do all
the elements of allocating the ccb, generating the CDB, sending the command
(looping as necessary for cases where data is returned, but it's size isn't
known up front), etc. As this functionality is fleshed out, calling many
camcontrol commands programatically gets much easier.
Greg V [Sat, 24 Apr 2021 11:53:34 +0000 (14:53 +0300)]
vt: call driver's postswitch when panicking on ttyv0
In vt_kms, the postswitch callback restores fbdev mode when
panicking or entering the debugger. This ensures that even when
a graphical applicatino was running on the first tty, simple framebuffer
mode would be restored and the panic would be visible instead
of the frozen GUI. But vt wouldn't call the postswitch callback
when we're already on the first tty, so running a GUI on it
would prevent you from reading any panics.
Add generic mmc_helper which uses newly introduced device_*_property
api. Thanks to this change the sd/mmc drivers will be capable
of parsing both DT and ACPI description.
Ensure backward compatibility for all mmc_fdt_helper users.
Andrew Turner [Mon, 20 Sep 2021 08:55:44 +0000 (08:55 +0000)]
Add ELF macros found in the aaelf64 spec
The arm64 aaelf64 spec [0] has DT_AARCH64_ that could be used with
dynamic linking. It also adds GNU_PROPERTY_AARCH64_FEATURE_1_AND used
to tell the kernel which CPU features the binary is compatible with,
but does not require to execute correctly.
Add these values so the kernel and elf tools can make use of them.
Xin LI [Mon, 20 Sep 2021 05:25:23 +0000 (22:25 -0700)]
The linux rc.d script mounts several filesystems related to Linux ABI
compatibility layer. When /compat is located on a ZFS other than /,
mount would fail because they were not mounted.
Solve this by moving `linux` to depend on `zfs` which mounts all ZFS
filesystems.
Mark Johnston [Sun, 19 Sep 2021 17:45:09 +0000 (13:45 -0400)]
freebsd32: Fix a double copyin in sendmsg() and recvmsg()
freebsd32_sendmsg() and freebsd32_recvmsg() both copyin the message
header twice, once directly and once in freebsd32_copyinmsghdr(). The
iovec length from the former is used when copying in msg_iov, but the
rest of the kernel uses the iovec length from the latter. When
kern_sendit() and kern_recvit() iterate over the iovec to compute the
residual for I/O, they can therefore end up walking past the end of the
copied in iovec, either resulting in a system call error, userspace
memory corruption from uiomove() with invalid iovecs, or a kernel page
fault if the copied-in iovec is followed by an unmapped KVA region.
Reported by: syzbot+7cc64cd0c49605acd421@syzkaller.appspotmail.com
Reviewed by: kib, emaste
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D32010
When multiple matches are found, we keep the provided string on the
input line and print unique matches as suggestions.
But the multiple matches might be the same command found in different
directories, so we should deduplicate the matches first and then decide
whether to autocomplete the command or not, based on the number of
unique matches.
Eliminate snaplk / bufwait LOR when creating UFS snapshots
Each vnode has an embedded lock that controls access to its contents.
However vnodes describing a UFS snapshot all share a single snapshot
lock to coordinate their access and update. As part of creating a
new UFS snapshot, it has to have its individual vnode lock replaced
with the filesystem's snapshot lock.
The lock order for regular vnodes with respect to buffer locks is that
they must first acquire the vnode lock, then a buffer lock. The order
for the snapshot lock is reversed: a buffer lock must be acquired before
the snapshot lock.
When creating a new snapshot, the snapshot file must retain its vnode
lock until it has allocated all the blocks that it needs before
switching to the snapshot lock. This update moves one final piece of
the initial snapshot block allocation so that it is done before the
newly created snapshot is switched to use the snapshot lock.
Rick Macklem [Sat, 18 Sep 2021 21:38:43 +0000 (14:38 -0700)]
nfscl: Use vfs.nfs.maxalloclen to limit Deallocate RPC RTT
Unlike Copy, the NFSv4.2 Allocate and Deallocate operations do not
allow a reply with partial completion. As such, the only way to
limit the time the operation takes to provide a reasonable RPC RTT
is to limit the size of the allocation/deallocation in the NFSv4.2
client.
This patch uses the sysctl vfs.nfs.maxalloclen to set
the limit on the size of the Deallocate operation.
There is no way to know how long a server will take to do an
deallocate operation, but 64Mbytes results in a reasonable
RPC RTT for the slow hardware I test on.
For an 8Gbyte deallocation, the elapsed time for doing it in 64Mbyte
chunks was the same (within margin of variability) as the
elapsed time taken for a single large deallocation
operation for a FreeBSD server with a UFS file system.
Mark Johnston [Sat, 18 Sep 2021 14:38:39 +0000 (10:38 -0400)]
unix: Fix a use-after-free in unp_drop()
We need to load the socket pointer after locking the PCB, otherwise
the socket may have been detached and freed by the time that unp_drop()
sets so_error.
This previously went unnoticed as the socket zone was _NOFREE.
Warner Losh [Fri, 17 Sep 2021 20:56:58 +0000 (14:56 -0600)]
nvme/nda: Fail all nvme I/Os after controller fails
Once the controller has failed, fail all I/O w/o sending it to the
device. The reset of the nvme driver won't schedule any I/O to the
failed device, and the controller is in an indeterminate state and can't
accept I/O. Fail both at the top end of the sim and the bottom
end. Don't bother queueing up the I/O for failure in a different task.
Wenzhuo Lu [Fri, 16 Oct 2015 02:51:03 +0000 (10:51 +0800)]
e1000: prevent ULP flow if cable connected
Enabling ulp on link down when cable is connect caused an infinite
loop of linkup/down indications in the NDIS driver.
After discussed, correct flow is to enable ULP only when cable is
disconnected.
During LTO build compiler reports some 'false positive' warnings about
variables being possibly used uninitialized. This patch silences these
warnings.
Exemplary compiler warning to suppress (with LTO enabled):
error: 'link' may be used uninitialized in this function
[-Werror=maybe-uninitialized]
if (link) {
Yong Wang [Tue, 21 Feb 2017 09:33:23 +0000 (04:33 -0500)]
e1000: fix multicast setting in VF
In function e1000_update_mc_addr_list_vf(), "msgbuf[0]" is used prior
to initialization at "msgbuf[0] |= E1000_VF_SET_MULTICAST_OVERFLOW".
And "msgbuf[0]" is overwritten at "msgbuf[0] = E1000_VF_SET_MULTICAST".
Fix it by moving the second line prior to the first one that mentioned
above.
Fixes: dffbaf7880a8 ("e1000: revert fix for multicast in VF") Cc: stable@dpdk.org Signed-off-by: Yong Wang <wang.yong19@zte.com.cn> Acked-by: Wenzhuo Lu <wenzhuo.lu@intel.com>
Approved by: imp
Obtained from: DPDK (f58ca2f9ef6)
MFC after: 1 week
Guinan Sun [Mon, 6 Jul 2020 08:12:09 +0000 (08:12 +0000)]
e1000: increase timeout for ME ULP exit
Due timing issues in WHL and since recovery by host is
not always supported, increased timeout for Manageability Engine(ME)
to finish Ultra Low Power(ULP) exit flow for Nahum before timer expiration.
Guinan Sun [Mon, 6 Jul 2020 08:12:08 +0000 (08:12 +0000)]
e1000: add missing register defines
Added defines for the EEC, SHADOWINF and FLFWUPDATE registers needed for
the nvmupd_validate_offset function to correctly validate the NVM update
offset.
Guinan Sun [Mon, 6 Jul 2020 08:12:05 +0000 (08:12 +0000)]
e1000: remove duplicated phy codes
Add two files base.c and base.h to reduce the redundancy
in the silicon family code.
Remove the code duplication from e1000_82575 files.
Clean family specific functions from base.
Fix up a stray and duplicate function declaration.
Guinan Sun [Mon, 6 Jul 2020 08:12:03 +0000 (08:12 +0000)]
e1000: fix minor issues and improve code style
Fix typo in piece of code of NVM access for SPT.
And cleans up the remaining instances in the shared code
where it was not adhering to the Linux code standard.
Wrong description was found in the mentioned file, so fix them.
Remove shadowing variable declarations.
Relating to operands in bitwise operations having different sizes.
Unreachable code since *clock_in_i2c_* always return success.
Don't return unused s32 and don't check for constants.
Mark Johnston [Fri, 17 Sep 2021 16:34:21 +0000 (12:34 -0400)]
vfs: Permit unix sockets to be opened with O_PATH
As with FIFOs, a path descriptor for a unix socket cannot be used with
kevent().
In principle connectat(2) and bindat(2) could be modified to support an
AT_EMPTY_PATH-like mode which operates on the socket referenced by an
O_PATH fd referencing a unix socket. That would eliminate the path
length limit imposed by sockaddr_un.
Update O_PATH tests.
Reviewed by: kib
MFC after: 1 month
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D31970
Mark Johnston [Fri, 17 Sep 2021 16:29:28 +0000 (12:29 -0400)]
socket: Synchronize soshutdown() with listen(2) and AIO
To handle shutdown(SHUT_RD) we flush the receive buffer of the socket.
This may involve searching for control messages of type SCM_RIGHTS,
since we need to close the file references. Closing arbitrary files
with socket buffer locks held is undesirable, mainly due to lock
ordering issues, so we instead make a copy of the socket buffer and
operate on that without any locks. Fields in the original buffer are
cleared.
This behaviour clobbered the AIO job queue associated with a receive
buffer. It could also cause us to leak a KTLS session reference.
Reorder socket buffer fields to address this.
An alternate solution would be to remove the hack in sorflush(), but
this is not quite feasible (yet). In particular, though sorflush()
flags the sockbuf with SBS_CANTRCVMORE, it is possible for more data to
be queued - the flag just prevents userspace from reading more data. I
suspect we should fix this; SBS_CANTRCVMORE represents a terminal state
and protocols can likely just drop any data destined for such a buffer.
Many of them already do, but in some cases the check is racy, and some
KPI churn will be needed to fix everything. This approach is more
straightforward for now.
Reported by: syzbot+104d8ee3430361cb2795@syzkaller.appspotmail.com
Reported by: syzbot+5bd2e7d05f84a59d0d1b@syzkaller.appspotmail.com
Reviewed by: jhb
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D31976
Mark Johnston [Fri, 17 Sep 2021 16:27:26 +0000 (12:27 -0400)]
socket: Remove NOFREE from the socket zone
This flag was added during the transition away from the legacy zone
allocator, commit c897b81311792ccf6a93feff2a405e2ae53f664e. The old
zone allocator effectively provided _NOFREE semantics, but it seems that
they are not required for sockets. In particular, we use reference
counting to keep sockets live.
One somewhat dangerous case is sonewconn(), which returns a pointer to a
socket with reference count 0. This socket is still effectively owned
by the listening socket. Protocols must therefore be careful to
synchronize sonewconn() calls with their pru_close implementations,
since for listening sockets soclose() will abort the child sockets. For
example, TCP holds the listening socket's PCB read locked across the
sonewconn() call, which blocks tcp_usr_close(), and sofree()
synchronizes with a concurrent soabort() of the nascent socket.
However, _NOFREE semantics are not required here.
Eliminating _NOFREE has several benefits: it enables use-after-free
detection (e.g., by KASAN) and lets the system reclaim memory from the
socket zone under memory pressure. No functional change intended.
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D31975
Mark Johnston [Fri, 17 Sep 2021 16:26:56 +0000 (12:26 -0400)]
socket: Add assertions around naked refcount decrements
Sockets in a listen queue hold a reference to the parent listening
socket. Several code paths release this reference manually when moving
a child socket out of the queue.
Replace comments about the expected post-decrement refcount value with
assertions. Use refcount_load() instead of a plain load. No functional
change intended.
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D31974
Mark Johnston [Fri, 17 Sep 2021 16:26:06 +0000 (12:26 -0400)]
socket: Fix a use-after-free in soclose()
After releasing the fd reference to a socket "so", we should avoid
testing SOLISTENING(so) since the socket may have been freed. Instead,
directly test whether the list of unaccepted sockets is empty.
Fixes: f4bb1869ddd2 ("Consistently use the SOLISTENING() macro")
Pointy hat: markj
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D31973
Mark Johnston [Fri, 17 Sep 2021 16:14:29 +0000 (12:14 -0400)]
ktls: Fix error/mode confusion in TCP_*TLS_MODE getsockopt handlers
ktls_get_(rx|tx)_mode() can return an errno value or a TLS mode, so
errors are effectively hidden. Fix this by using a separate output
parameter. Convert to the new socket buffer locking macros while here.
Note that the socket buffer lock is not needed to synchronize the
SOLISTENING check here, we can rely on the PCB lock.
Reviewed by: jhb
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D31977
Mark Johnston [Fri, 17 Sep 2021 14:44:23 +0000 (10:44 -0400)]
libc/locale: Fix races between localeconv(3) and setlocale(3)
Each locale embeds a lazily initialized lconv which is populated by
localeconv(3) and localeconv_l(3). When setlocale(3) updates the global
locale, the lconv needs to be (lazily) reinitialized. To signal this,
we set flag variables in the locale structure. There are two problems:
- The flags are set before the locale is fully updated, so a concurrent
localeconv() call can observe partially initialized locale data.
- No barriers ensure that localeconv() observes a fully initialized
locale if a flag is set.
So, move the flag update appropriately, and use acq/rel barriers to
provide some synchronization. Note that this is inadequate in the face
of multiple concurrent calls to setlocale(3), but this is not expected
to work regardless.
Thanks to Henry Hu <henry.hu.sh@gmail.com> for providing a test case
demonstrating the race.
PR: 258360
MFC after: 3 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D31899
Ed Maste [Fri, 17 Sep 2021 13:59:41 +0000 (09:59 -0400)]
readelf: document that -u / --unwind is not yet implemented
ELF tool chain readelf accepts -u / --unwind but just ignores the
option. This was previously undocumented, which could be confusing for
someone encountering `readelf -u` (in a script or GNU readelf example).
Reported by: markj (in D32003)
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
It allows to override kern.elf{32,64}.allow_wx on per-process basis.
In particular, it makes it possible to run binaries without PT_GNU_STACK
and without elfctl note while allow_wx = 0.
Reviewed by: brooks, emaste, markj
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D31779