Bjoern A. Zeeb [Wed, 25 Nov 2020 20:58:01 +0000 (20:58 +0000)]
IPv6: set ifdisabled in the kernel rather than in rc
Enable ND6_IFF_IFDISABLED when the interface is created in the
kernel before return to user space.
This avoids a race when an interface is created by a program which
also calls ifconfig IF inet6 -ifdisabled and races with the
devd -> /etc/pccard_ether -> .. netif start IF -> ifdisabled
calls (the devd/rc framework disabling IPv6 again after the program
had enabled it already).
In case the global net.inet6.ip6.accept_rtadv was turned on,
we also default to enabling IPv6 on the interfaces, rather than
disabling them.
Ian Lepore [Wed, 25 Nov 2020 20:05:05 +0000 (20:05 +0000)]
Extend the imx6 gpc->gic interrupt controller fixup of fdt data at runtime
to work with the pmu and tempmon nodes as well as the soc node. This allows
interrupts to work on the pmu and tempmon devices even though we don't have
a driver for the low-power gpc interrupt controller (which is not a problem
because we also don't have support for entering deep power-down modes where
it gets used).
Ian Lepore [Wed, 25 Nov 2020 19:08:22 +0000 (19:08 +0000)]
Add the standard extres pseudo devices to the IMX6 kernel config.
Some imx6 drivers are being converted to use features that weren't available
when they were first written (such as accessing shared device registers via
the syscon pseudo-device), so imx6 custom kernels that reference those
devices will now need this infrastructure in place.
Ian Lepore [Wed, 25 Nov 2020 18:09:01 +0000 (18:09 +0000)]
A couple small fixes for the imx6_sdma driver...
Attach after interrupt controllers, since the attach function tries to
set up an interrupt handler.
Check for the availability of the required firmware early in the attach
code (before allocating resources). If the firmware is not available, set
a static var to remember that, so that if the device is re-probed on later
passes it won't repeatedly try to attach and then complain again about
missing firmware.
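The "remember the failure in a static var" idea can be sketched as follows. This is a minimal illustration in plain C, not the driver code: sketch_firmware_available(), sketch_attach() and the lookup counter are hypothetical names, and the firmware lookup is hardwired to fail.

```c
#include <assert.h>
#include <stdbool.h>

int sketch_lookups;	/* counts firmware lookups, for illustration only */

/* Hypothetical firmware lookup that always fails in this sketch. */
bool
sketch_firmware_available(void)
{
	sketch_lookups++;
	return (false);
}

/* Returns 0 on success, -1 if the device cannot attach. */
int
sketch_attach(void)
{
	static bool firmware_missing;	/* remembered across re-probes */

	if (firmware_missing)
		return (-1);		/* later passes fail quietly */
	if (!sketch_firmware_available()) {
		firmware_missing = true;
		return (-1);		/* checked before allocating anything */
	}
	/* ... allocate resources and set up the interrupt handler ... */
	return (0);
}
```

Calling sketch_attach() repeatedly performs the firmware lookup only once.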
Pawel Biernacki [Wed, 25 Nov 2020 16:30:57 +0000 (16:30 +0000)]
libsysdecode: correctly decode mmap flags
r352913 added decoding of mmap PROT_MAX()'d flags but didn’t account for the
case where different values were specified for PROT_MAX and regular flags.
Fix it.
Kristof Provost [Wed, 25 Nov 2020 15:07:22 +0000 (15:07 +0000)]
if: Protect V_ifnet in vnet_if_return()
When we terminate a vnet (i.e. jail) we move interfaces back to their home
vnet. We need to protect our access to the V_ifnet CK_LIST.
We could enter NET_EPOCH, but if_detach_internal() (called from if_vmove())
waits for net epoch callback completion. That's not possible from NET_EPOCH.
Instead, we take the IFNET_WLOCK, build a list of the interfaces that need to
move and, once we've released the lock, move them back to their home vnet.
We cannot hold the IFNET_WLOCK() during if_vmove(), because that results in a
LOR between ifnet_sx, in_multi_sx and iflib ctx lock.
Separate out moving the ifp into or out of V_ifnet, so we can hold the lock as
we do the list manipulation, but do not hold it as we if_vmove().
Kyle Evans [Wed, 25 Nov 2020 03:14:25 +0000 (03:14 +0000)]
kern: cpuset: properly rebase when attaching to a jail
The current logic is a fine choice for a system administrator modifying
process cpusets or a process creating a new cpuset(2), but not ideal for
processes attaching to a jail.
Currently, when a process attaches to a jail, it does exactly what any other
process does and loses any mask it might have applied in the process of
doing so because cpuset_setproc() is entirely based around the assumption
that non-anonymous cpusets in the process can be replaced with the new
parent set.
This approach slightly improves the jail attach integration by modifying
cpuset_setproc() callers to indicate if they should rebase their cpuset to
the indicated set or not (i.e. cpuset_setproc_update_set).
If we're rebasing and the process currently has a cpuset assigned that is
not the containing jail's root set, then we will now create a new base set
for it hanging off the jail's root with the existing mask applied instead of
using the jail's root set as the new base set.
Note that the common case will be that the process doesn't have a cpuset
within the jail root, but the system root can freely assign a cpuset from
a jail to a process outside of the jail with no restriction. We assume that
that may have happened or that it could happen due to a race when we drop
the proc lock, so we must recheck both within the loop to gather up
sufficient freed cpusets and after the loop.
To recap, here's how it worked before in all cases:
0     4 <-- jail              0     4 <-- jail / process
|                             |
1                     ->      1
|
3 <-- process

0     4 <-- jail              0     4 <-- jail / process
|                             |
1 <-- process         ->      1
More importantly, in both cases, the attaching process still retains the
mask it had prior to attaching or the attach fails with EDEADLK if it's
left with no CPUs to run on or the domain policy is incompatible. The
author of this patch considers this almost a security feature, because a MAC
policy could grant PRIV_JAIL_ATTACH to an unprivileged user that's
restricted to some subset of available CPUs the ability to attach to a jail,
which might lift the user's restrictions if they attach to a jail with a
wider mask.
In most cases, it's anticipated that admins will use this to be able to,
for example, `cpuset -c -l 1 jail -c path=/ command=/long/running/cmd`,
and avoid the need for contortions to spawn a command inside a jail with a
more limited cpuset than the jail.
Kyle Evans [Wed, 25 Nov 2020 02:12:24 +0000 (02:12 +0000)]
kern: cpuset: rename _cpuset_create() to cpuset_init()
cpuset_init() is a better descriptor for what the function actually does. The
name was previously taken by a sysinit that set up cpuset_zero's mask
from all_cpus; it was removed in r331698 before stable/12 branched.
A comment referencing the removed sysinit has now also been removed, since
the setup previously done was moved into cpuset_thread0().
Kyle Evans [Wed, 25 Nov 2020 01:42:32 +0000 (01:42 +0000)]
kern: cpuset: allow cpuset_create() to take an allocated *setp
Currently, it must always allocate a new set to be used for passing to
_cpuset_create, but it doesn't have to. This is purely kern_cpuset.c
internal and it's sparsely used, so just change it to use *setp if it's
non-NULL and modify the two consumers to pass in the address of a NULL
cpuset.
This paves the way for consumers that want the unr allocation without the
possibility of sleeping as long as they've done their due diligence to
ensure that the mask will properly apply atop the supplied parent
(i.e. avoiding the free_unr() in the last failure path).
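The "use *setp if it is non-NULL, otherwise allocate" convention can be sketched like this. It is a minimal illustration in plain C under stated assumptions: struct sketch_set, sketch_create() and the negative-id failure are all hypothetical and do not mirror the kern_cpuset.c code.

```c
#include <assert.h>
#include <stdlib.h>

struct sketch_set {
	int id;
};

/*
 * Use the caller's preallocated object if *setp is non-NULL; otherwise
 * allocate one here. On failure, free only what this function allocated.
 */
int
sketch_create(struct sketch_set **setp, int id)
{
	struct sketch_set *set;
	int allocated = 0;

	assert(setp != NULL);
	set = *setp;
	if (set == NULL) {
		set = malloc(sizeof(*set));	/* the step a caller may want to avoid */
		if (set == NULL)
			return (-1);
		allocated = 1;
	}
	if (id < 0) {			/* hypothetical validation failure */
		if (allocated)
			free(set);
		return (-1);
	}
	set->id = id;
	*setp = set;
	return (0);
}
```

A caller that must not sleep passes the address of a pointer to storage it already owns; existing callers pass the address of a NULL pointer and get the old behavior.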
Kyle Evans [Wed, 25 Nov 2020 01:08:57 +0000 (01:08 +0000)]
kern: never restart syscalls calling closefp(), e.g. close(2)
All paths leading into closefp() will either replace or remove the fd from
the filedesc table, and closefp() will call fo_close methods that can and do
currently sleep without regard for the possibility of an ERESTART. This can
be dangerous in multithreaded applications as another thread could have
opened another file in its place that is subsequently operated on upon
restart.
The following are seemingly the only ones that will pass back ERESTART
in-tree:
- sockets (SO_LINGER)
- fusefs
- nfsclient
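One way to guarantee that such a syscall is never restarted is to translate a restart request into an interruption before returning. A minimal sketch in plain C, assuming a hypothetical SKETCH_ERESTART constant in place of the kernel-internal ERESTART and a hypothetical helper name; it is not the committed code:

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical stand-in for the kernel-internal ERESTART value. */
#define SKETCH_ERESTART	(-1)

/*
 * Map the result of a close-style operation so that the caller never
 * sees a restart request: the fd is already gone from the table, so
 * re-running the syscall could hit an unrelated file.
 */
int
sketch_close_result(int error)
{
	if (error == SKETCH_ERESTART)
		error = EINTR;
	assert(error != SKETCH_ERESTART);	/* invariant: never restart */
	return (error);
}
```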
John Baldwin [Wed, 25 Nov 2020 00:10:54 +0000 (00:10 +0000)]
Remove the cloned file descriptors for /dev/crypto.
Crypto file descriptors were added in the original OCF import as a way
to provide per-open data (specifically the list of symmetric
sessions). However, this gives a bit of a confusing API where one has
to open /dev/crypto and then invoke an ioctl to obtain a second file
descriptor. This also does not match the API used with /dev/crypto on
other BSDs or with Linux's /dev/crypto driver.
Character devices have gained support for per-open data via cdevpriv
since OCF was imported, so use cdevpriv to simplify the userland API
by permitting ioctls directly on /dev/crypto descriptors.
To provide backwards compatibility, CRIOGET now opens another
/dev/crypto descriptor via kern_openat() rather than dup'ing the
existing file descriptor. This preserves prior semantics in case
CRIOGET is invoked multiple times on a single file descriptor.
John Baldwin [Tue, 24 Nov 2020 23:56:33 +0000 (23:56 +0000)]
Pull the check for VM ownership into ppt_find().
This reduces some code duplication. One behavior change is that
ppt_assign_device() will now only succeed if the device is unowned.
Previously, a device could be assigned to the same VM multiple times,
but each time it was assigned, the device's state was reset.
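The behavior change can be illustrated with a small sketch in plain C. All names, the integer device id and the EBUSY/ENOENT error codes are assumptions made for the example, not the ppt code itself: the lookup helper checks ownership, and assignment looks the device up as unowned.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

struct sketch_ppt {
	int	    id;		/* stand-in for bus/slot/function */
	const void *vm;		/* owning VM, or NULL if unowned */
};

/* Find a device by id, but only if it is owned by 'vm'. */
int
sketch_find(struct sketch_ppt *devs, size_t n, int id, const void *vm,
    struct sketch_ppt **out)
{
	for (size_t i = 0; i < n; i++) {
		if (devs[i].id != id)
			continue;
		if (devs[i].vm != vm)
			return (EBUSY);		/* hypothetical error code */
		*out = &devs[i];
		return (0);
	}
	return (ENOENT);
}

/* Assignment only succeeds for a device that is currently unowned. */
int
sketch_assign(struct sketch_ppt *devs, size_t n, int id, const void *vm)
{
	struct sketch_ppt *d;
	int error;

	assert(vm != NULL);
	error = sketch_find(devs, n, id, NULL, &d);
	if (error != 0)
		return (error);
	d->vm = vm;
	return (0);
}
```

In this sketch a second assignment of the same device fails, even for the VM that already owns it.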
John Baldwin [Tue, 24 Nov 2020 23:18:52 +0000 (23:18 +0000)]
Honor the disabled setting for MSI-X interrupts for passthrough devices.
Add a new ioctl to disable all MSI-X interrupts for a PCI passthrough
device and invoke it if a write to the MSI-X capability registers
disables MSI-X. This avoids leaving MSI-X interrupts enabled on the
host if a guest device driver has disabled them (e.g. as part of
detaching a guest device driver).
This was found by Chelsio QA when testing that a Linux guest could
switch from MSI-X to MSI interrupts when using the cxgb4vf driver.
While here, explicitly fail requests to enable MSI on a passthrough
device if MSI-X is enabled and vice versa.
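The MSI/MSI-X exclusion and the new disable path can be sketched as a tiny state machine. This is a minimal illustration in plain C; the type names, helper names and the EBUSY error code are assumptions for the example, not the passthrough code or its ioctl interface.

```c
#include <assert.h>
#include <errno.h>

enum sketch_mode { SKETCH_INTR_NONE, SKETCH_INTR_MSI, SKETCH_INTR_MSIX };

struct sketch_ptdev {
	enum sketch_mode mode;	/* what is currently enabled on the host */
};

/* Enabling one interrupt type fails while the other one is enabled. */
int
sketch_enable(struct sketch_ptdev *d, enum sketch_mode want)
{
	assert(want != SKETCH_INTR_NONE);
	if (d->mode != SKETCH_INTR_NONE && d->mode != want)
		return (EBUSY);		/* hypothetical error code */
	d->mode = want;
	return (0);
}

/* Guest cleared the MSI-X enable bit: drop the host-side state too. */
void
sketch_disable_msix(struct sketch_ptdev *d)
{
	if (d->mode == SKETCH_INTR_MSIX)
		d->mode = SKETCH_INTR_NONE;
}
```

With the disable step in place, a guest that turns MSI-X off can then switch to MSI, which is the sequence the cxgb4vf test exercised.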
Jung-uk Kim [Tue, 24 Nov 2020 21:28:44 +0000 (21:28 +0000)]
Port rtsx(4) driver for Realtek SD card reader from OpenBSD.
This driver provides support for Realtek PCI SD card readers. It attaches
mmc(4) bus on card insertion and detaches it on card removal. It has been
tested with RTS5209, RTS5227, RTS5229, RTS522A, RTS525A and RTL8411B. It
should also work with RTS5249, RTL8402 and RTL8411.
PR: 204521
Submitted by: Henri Hennebert (hlh at restart dot be)
Reviewed by: imp, jkim
Differential Revision: https://reviews.freebsd.org/D26435
Emmanuel Vadot [Tue, 24 Nov 2020 17:53:13 +0000 (17:53 +0000)]
release: Merge the RPI2 and BEAGLEBONE image with the GENERICSD one
Both RPI2 and BEAGLEBONE are still popular and used arm boards.
Both u-boots can coexist as they are named differently and live in the
fat partition.
This leaves us with only one image that can be used for both of those
boards and all the other ones supported by FreeBSD provided that you
install the correct u-boot on it.
Emmanuel Vadot [Tue, 24 Nov 2020 17:51:10 +0000 (17:51 +0000)]
arm: Remove old amlogic support
Remove the port for aml8726.
Kernel config was removed in r346096 and this port was never migrated
to GENERIC.
It is also impossible to obtain such hardware nowadays.
Emmanuel Vadot [Tue, 24 Nov 2020 17:50:22 +0000 (17:50 +0000)]
arm: Remove old rockchip support
Remove the port for rk30xx.
Kernel config was removed in r346096 and this port was never migrated
to GENERIC.
It is also impossible to obtain such hardware nowadays and this code
doesn't provide anything besides booting.
Mark Johnston [Tue, 24 Nov 2020 16:18:47 +0000 (16:18 +0000)]
pf: Make tag hashing more robust
tagname2tag() hashes the tag name before truncating it to 63 characters.
tag_unref() removes the tag from the name hash by computing the hash
over the truncated name. Ensure that both operations compute the same
hash for a given tag.
The larger issue is a lack of string validation in pf(4) ioctl handlers.
This is intended to be fixed with some future work, but an extra safety
belt in tagname2hashindex() is worthwhile regardless.
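The invariant being restored is that hashing a name and hashing its truncated form give the same bucket. A minimal sketch in plain C, assuming a hypothetical bucket count and an FNV-1a hash chosen only for illustration (pf's real hash function is not reproduced here):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_TAG_NAME_MAX	63	/* stored names are cut to 63 chars */
#define SKETCH_HASH_BUCKETS	128	/* hypothetical table size */

/* Hash at most SKETCH_TAG_NAME_MAX bytes so long and cut names agree. */
uint32_t
sketch_tag_hashindex(const char *name)
{
	uint32_t h = 2166136261u;	/* FNV-1a offset basis */
	size_t len = 0;

	assert(name != NULL);
	while (len < SKETCH_TAG_NAME_MAX && name[len] != '\0')
		len++;
	for (size_t i = 0; i < len; i++) {
		h ^= (unsigned char)name[i];
		h *= 16777619u;		/* FNV-1a prime */
	}
	return (h & (SKETCH_HASH_BUCKETS - 1));
}
```

Because the length is clamped inside the hash helper, insertion and removal cannot disagree regardless of which form of the name each caller holds.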
Reported by: syzbot+a0988828aafb00de7d68@syzkaller.appspotmail.com
Reviewed by: kp
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D27346
Alexander Motin [Tue, 24 Nov 2020 15:32:25 +0000 (15:32 +0000)]
Remove concept of mbox_sleep_ok.
It was broken by design and unused for years due to conflicts between
different threads, fighting for the same set of mailbox registers, not
designed for multiple requests at a time. So a request either has to be
synchronous and spin under the lock, or it should be sent asynchronously
through the queues as Mailbox Command IOCB or some other way.
This removes any OS specifics from the wait code, so it can be inlined.
Remove erratic assert after SVN r367149 in mlx5en(4).
The ratelimit tags may be shared, especially for unlimited TLS
traffic, and then the refcount is allowed to be greater than one
when freeing the send tag.
Alexander Motin [Tue, 24 Nov 2020 04:16:49 +0000 (04:16 +0000)]
Implement request queue overflow protection.
Before this change, on request queue overflow the driver just froze the
device queue for 100ms before retrying. That was pretty bad for performance.
This change introduces SIM queue freezing when free space on the request
queue drops below 255 entries (worst case of maximum I/O size S/G list),
checking for a chance to release it on I/O completion. If the queue still
overflows somehow, the old mechanism is still in place, just with the
delay reduced to 10ms.
With the earlier queue length increase, overflows should not happen often,
but they are still easily reachable in synthetic tests.
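The freeze/release rule can be sketched as a small piece of bookkeeping. This is a simplified illustration in plain C: the structure and function names are hypothetical, and free space is modeled as being returned directly at completion time, which is an assumption of the sketch rather than a description of the driver.

```c
#include <assert.h>
#include <stdbool.h>

#define SKETCH_RESERVE	255	/* worst-case entries for one maximum-size I/O */

struct sketch_reqq {
	unsigned free_entries;	/* free slots on the request queue */
	bool	 frozen;	/* SIM queue frozen by the driver */
};

/* Called after queueing a request that consumed 'used' entries. */
void
sketch_after_submit(struct sketch_reqq *q, unsigned used)
{
	assert(used <= q->free_entries);
	q->free_entries -= used;
	if (!q->frozen && q->free_entries < SKETCH_RESERVE)
		q->frozen = true;	/* stop new I/O before overflow */
}

/* Called on I/O completion; 'freed' entries became available again. */
void
sketch_on_completion(struct sketch_reqq *q, unsigned freed)
{
	q->free_entries += freed;
	if (q->frozen && q->free_entries >= SKETCH_RESERVE)
		q->frozen = false;	/* room for a worst-case request */
}
```

Freezing at the reserve threshold means a worst-case request accepted just before the freeze still fits, so the timed retry remains only a fallback.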
Provide ABI modules hooks for process exec/exit and thread exit.
Exec and exit are the same as the corresponding eventhandler hooks.
Thread exit hook is called somewhat earlier, while thread is still
owned by the process and enough context is available. Note that the
process lock is owned when the hook is called.
Reviewed by: markj
Sponsored by: The FreeBSD Foundation
Differential revision: https://reviews.freebsd.org/D27309
Gleb Popov [Mon, 23 Nov 2020 17:00:06 +0000 (17:00 +0000)]
bin/setfacl: Little refactoring, no functional change.
The acl_from_stat function accepts a stat_t * argument, but only uses its
st_mode field. There is no reason to pass the whole struct, so make it accept
a mode_t and rename the function to acl_from_mode.
Linux has a non-standard acl_from_mode function in its libacl, so naming the
function this way may help with discovering it during porting efforts.
Eitan Adler [Mon, 23 Nov 2020 04:39:29 +0000 (04:39 +0000)]
arcconfig: add callsign again
Problem
When using git-svn or other non-pure-svn tooling the original subversion
URL is not present. This causes arcanist/phabricator to be unable to
determine which repository is being modified.
Solution
Restore callsign to .arcconfig to enable exact repository matching even
with git-svn.
Kyle Evans [Mon, 23 Nov 2020 02:49:53 +0000 (02:49 +0000)]
cpuset_setproc: use the appropriate parent for new anonymous sets
As far as I can tell, this has been the case since initially committed in
2008. cpuset_setproc is the executor of cpuset reassignment; note this
excerpt from the description:
* 1) Set is non-null. This reparents all anonymous sets to the provided
* set and replaces all non-anonymous td_cpusets with the provided set.
However, reviewing cpuset_setproc_setthread() for some jail related work
unearthed the error: if tdset was not anonymous, we were replacing it with
`set`. If it was anonymous, then we'd rebase it onto `set` (i.e. copy the
thread's mask over and AND it with `set`) but give the new anonymous set
the original tdset as the parent (i.e. the base of the set we're supposed to
be leaving behind).
The primary visible consequences were that:
1.) cpuset_getid() following such assignment returns the wrong result, the
setid that we left behind rather than the one we joined.
2.) When a process attached to the jail, the base set of any anonymous
threads was a set outside of the jail.
This was initially bundled in D27298, but it's a minor fix that's fairly
easy to verify the correctness of.
A test is included in D27307 ("badparent"), which demonstrates the issue
with, effectively:
osetid = cpuset_getid()
newsetid = cpuset()
cpuset_setaffinity(thread)
cpuset_setid(osetid)
cpuset_getid(thread) -> observe that it matches newsetid instead of osetid.
Kyle Evans [Mon, 23 Nov 2020 00:58:14 +0000 (00:58 +0000)]
freebsd32: take the _umtx_op struct definitions back
Providing these in freebsd32.h facilitates local testing/measuring of the
structs rather than forcing one to locally recreate them. Sanity checking
offsets/sizes remains in kern_umtx.c where these are typically used.
Kyle Evans [Mon, 23 Nov 2020 00:33:06 +0000 (00:33 +0000)]
kern: dup: do not assume oldfde is valid
oldfde may be invalidated if the table has grown due to the operation that
we're performing, either via fdalloc() or a direct fdgrowtable_exp().
This was technically OK before rS367927 because the old table remained valid
until the filedesc became unused, but now it may be freed immediately if
it's an unshared table in a single-threaded process, so it is no longer a
good assumption to make.
This fixes dup/dup2 invocations that grow the file table; in the initial
report, it manifested as a kernel panic in devel/gmake's configure script.
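The underlying hazard is the usual one with growable arrays: a pointer taken before the growth may point into freed memory afterwards. A minimal userland sketch in plain C, assuming a hypothetical table type and helper names (this is not the kernel's filedesc code):

```c
#include <assert.h>
#include <stdlib.h>

struct sketch_fdtable {
	int	*entries;	/* stand-in for the per-fd entry array */
	size_t	 nentries;
};

/* Grow the table; any pointer into the old array is invalid afterwards. */
int
sketch_grow(struct sketch_fdtable *t, size_t want)
{
	int *n;

	if (want <= t->nentries)
		return (0);
	n = realloc(t->entries, want * sizeof(*n));
	if (n == NULL)
		return (-1);
	for (size_t i = t->nentries; i < want; i++)
		n[i] = 0;
	t->entries = n;
	t->nentries = want;
	return (0);
}

/* Copy slot 'oldfd' into slot 'newfd', growing the table if needed. */
int
sketch_dup(struct sketch_fdtable *t, size_t oldfd, size_t newfd)
{
	assert(oldfd < t->nentries);
	if (sketch_grow(t, newfd + 1) != 0)
		return (-1);
	/* Look the old entry up only after the growth, never before it. */
	int *oldfde = &t->entries[oldfd];
	t->entries[newfd] = *oldfde;
	return (0);
}
```

Fetching oldfde before sketch_grow() would have the same shape as the bug described above once the old array can be released immediately.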
* Make rib_walk() order of arguments consistent with the rest of RIB api
* Add rib_walk_ext() allowing to exec callback before/after iteration.
* Rename rt_foreach_fib_walk_del -> rib_foreach_table_walk_del
* Rename rt_foreach_fib_walk -> rib_foreach_table_walk
* Move rib_foreach_table_walk{_del} to route/route_helpers.c
* Slightly refactor rib_foreach_table_walk{_del} to make the implementation
consistent and prepare for upcoming iterator optimizations.
Emmanuel Vadot [Sun, 22 Nov 2020 20:16:46 +0000 (20:16 +0000)]
if_dwc: Correctly configure the DMA engine based on the fdt properties
Do not hardcode the DMA engine configuration; look up the fdt
properties and configure the engine accordingly.
Use a default burst DMA length of 8 for both TX and RX; this is
what we used for TX before.
Kyle Evans [Sun, 22 Nov 2020 05:47:45 +0000 (05:47 +0000)]
[2/2] _umtx_op: introduce 32-bit/i386 flags for operations
This patch takes advantage of the consolidation that happened to provide two
flags that can be used with the native _umtx_op(2): UMTX_OP__32BIT and
UMTX_OP__I386.
UMTX_OP__32BIT indicates that we are being provided with 32-bit structures.
Note that this flag alone indicates a 64-bit time_t, since this is the
majority case.
UMTX_OP__I386 has been provided so that we can emulate i386 as well,
regardless of whether the host is amd64 or not.
Both imply a different set of copyops in sysumtx_op. freebsd32__umtx_op
simply ignores the flags, since it's already doing a 32-bit operation and
it's unlikely we'll be running an emulator under compat32. Future work
could consider it, but the author sees little benefit.
This will be used by qemu-bsd-user to pass on all _umtx_op calls to the
native interface as long as the host/target endianness matches, effectively
eliminating most if not all of the remaining unresolved deadlocks for most.
This version changed a fair amount from what was under review, mostly in
response to refactoring of the prereq reorganization and battle-testing
it with qemu-bsd-user. The main changes are as follows:
1.) The i386 flag got renamed to omit '32BIT' since this is redundant.
2.) The flags are now properly handled on 32-bit platforms to emulate other
32-bit platforms.
3.) Robust list handling was fixed, and the 32-bit functionality that was
previously gated by COMPAT_FREEBSD32 is now unconditional.
4.) Robust list handling was also improved, including the error reported
when a process has already registered 32-bit ABI lists and also
detecting if native robust lists have already been registered. Both
scenarios now return EBUSY rather than EINVAL, because the input is
technically valid but we're too busy with another ABI's lists.
libsysdecode/kdump/truss support will go into review soon-ish, along with
the associated manpage update.
Robert Wing [Sun, 22 Nov 2020 05:00:28 +0000 (05:00 +0000)]
fd: free old file descriptor tables when not shared
During the life of a process, new file descriptor tables may be allocated. When
a new table is allocated, the old table is placed in a free list and held onto
until all processes referencing them exit.
When a new file descriptor table is allocated, the old file descriptor table
can be freed when the current process is single-threaded and the file
descriptor table is not being shared with any other processes.
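The condition for freeing the old table immediately can be written as a single predicate. A minimal sketch in plain C, assuming a hypothetical summary structure; the field names do not correspond to the kernel's own:

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical summary of the state consulted when a table is replaced. */
struct sketch_fdstate {
	int nthreads;	/* threads in the current process */
	int refcnt;	/* processes sharing the descriptor table */
};

/* The old table can be freed right away only if nobody else can see it. */
bool
sketch_can_free_old_table(const struct sketch_fdstate *s)
{
	assert(s->nthreads >= 1 && s->refcnt >= 1);
	return (s->nthreads == 1 && s->refcnt == 1);
}
```

In every other case the old table would still go on the free list, as before.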
Alexander Motin [Sun, 22 Nov 2020 04:29:55 +0000 (04:29 +0000)]
Make handlers and atpds overflows unlikely.
- Allocate 256 handlers more than payload commands for management purposes.
- Increase maximum number of handlers from 8K to 16K by tuning the format.
- Just to be safe limit the number of payload commands to 16K - 256.
- Limit number of target exchanges in mixed mode to the number of atpds.
- If we still somehow run out of atpds -- return BUSY, since we really are.
Stop using eventhandlers for itimers subsystem exec and exit hooks.
While there, do some minor cleanup for kclocks. They are only
registered from kern_time.c, make registration function static.
Remove event hooks; they are not used by either of the registered kclocks.
Add some consts.
Perhaps we can stop registering kclocks at all and statically
initialize them.
Reviewed by: mjg
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D27305
Navdeep Parhar [Sat, 21 Nov 2020 03:27:32 +0000 (03:27 +0000)]
cxgbe(4): Catch up with in-flight netmap rx before destroying queues.
The netmap application using the driver is responsible for replenishing
the receive freelists and they may be totally depleted when the
application exits. Packets in flight, if any, might block the pipeline
in case there aren't enough buffers left in the freelist. Avoid this by
filling up the freelists with a driver allocated buffer.
Rick Macklem [Fri, 20 Nov 2020 22:29:38 +0000 (22:29 +0000)]
Document the new "tls" NFS mount option.
Recent commits to head have added support for NFS over TLS
to the FreeBSD kernel.
To enable use of this for an NFS mount, the "tls" mount_nfs
option has been added.
Once the IETF has assigned an RFC number, I will replace "NNNN"
with the number.
Rick Macklem [Fri, 20 Nov 2020 22:14:51 +0000 (22:14 +0000)]
Update man page for new TLS export options.
NFS over TLS uses three new export options, added by r364979.
This patch updates the exports.5 man page for these new options.
Once assigned by IETF, "NNNN" will be replaced with the RFC number.