avg [Thu, 17 Oct 2019 06:32:34 +0000 (06:32 +0000)]
provide a way to assign taskqueue threads to a kernel process
This can be used to group all threads belonging to a single logical
entity under a common kernel process.
I am planning to use the new interface for ZFS threads.
avg [Thu, 17 Oct 2019 06:21:09 +0000 (06:21 +0000)]
wbwd: small clean-ups and improvements
This change applies some suggestions by delphij from D21979.
A write-only variable is removed.
There is a diagnostic message if the driver does not recognize the chip.
A chained if-statement is converted to a switch.
markj [Wed, 16 Oct 2019 22:12:34 +0000 (22:12 +0000)]
Introduce pmap_change_prot() for amd64.
This updates the protection attributes of subranges of the kernel map.
Unlike pmap_protect(), which is typically used for user mappings,
pmap_change_prot() does not perform lazy upgrades of protections.
pmap_change_prot() also updates the aliasing range of the direct map.
markj [Wed, 16 Oct 2019 22:03:27 +0000 (22:03 +0000)]
Remove page locking from pmap_mincore().
After r352110 the page lock no longer protects a page's identity, so
there is no purpose in locking the page in pmap_mincore(). Instead,
if vm.mincore_mapped is set to the non-default value of 0, re-lookup
the page after acquiring its object lock, which holds the page's
identity stable.
The change removes the last callers of vm_page_pa_tryrelock(), so
remove it.
chs [Wed, 16 Oct 2019 21:49:44 +0000 (21:49 +0000)]
Make all the gnop parameters optional in the request from userland,
filling in the same defaults that the current userland module uses.
This allows an old geom_nop.so userland module to work with a new kernel.
chs [Wed, 16 Oct 2019 21:49:39 +0000 (21:49 +0000)]
Add a new gctl_get_paraml_opt() interface to extract optional parameters from
the request. It is the same as gctl_get_paraml() except that the request
is not marked with an error if the parameter is not present.
kevans [Wed, 16 Oct 2019 18:33:31 +0000 (18:33 +0000)]
libbe(3): Fix destroy of imported BE w/ AUTOORIGIN
Imported BE, much like the activated BE, will not have an origin that we can
fetch/examine for destruction. be_destroy should not return BE_ERR_NOORIGIN
for failure to get the origin property for BE_DESTROY_AUTOORIGIN, because
we don't really know going into it that there's even an origin to be
destroyed.
BE_DESTROY_NEEDORIGIN has been renamed to BE_DESTROY_WANTORIGIN because only
a subset of it *needs* the origin, so 'need' is too strong of verbiage.
This was caught by jenkins and the bectl tests, but kevans failed to run the
bectl tests prior to commit.
erj [Wed, 16 Oct 2019 17:19:17 +0000 (17:19 +0000)]
ixl: report whether device received pause frames
From Jake:
When updating the device statistics, report whether or not we have
received any pause frames to the iflib stack. This allows the iflib
stack to avoid generating a Tx hang message while the device is paused.
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Submitted by: Jacob Keller <jacob.e.keller@intel.com>
Reviewed by: gallatin@
Sponsored by: Intel Corporation
Differential Revision: https://reviews.freebsd.org/D21870
erj [Wed, 16 Oct 2019 17:16:32 +0000 (17:16 +0000)]
ix: report isc_pause_frames during stat update
From Jake:
Notify the iflib stack of whether we received any pause frames during
the timer window. This allows the stack to avoid reporting a Tx hang due
to the device being paused.
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Submitted by: Jacob Keller <jacob.e.keller@intel.com>
Reviewed by: gallatin@
Sponsored by: Intel Corporation
Differential Revision: https://reviews.freebsd.org/D21869
erj [Wed, 16 Oct 2019 17:13:46 +0000 (17:13 +0000)]
e1000: correctly set isc_pause_frames only when XOFF increases
From Jake:
The e1000 driver sets the iflib shared context isc_pause_frames value to
the number of received xoff frames. This is done so that the iflib
watchdog timer won't trigger a Tx Hang due to pause frames.
Unfortunately, the function simply sets it to the value of the xoffrxc
counter. Once the device has received a single XOFF packet, the driver
always reports that we received pause frames. This will prevent the Tx
hang detection entirely from that point on.
Fix this by assigning isc_pause_frames to a non-zero value if we
received any XOFF packets in the last timer interval.
We could attempt to calculate the total number of received packets by
doing a subtraction, but the iflib stack only seems to check if
isc_pause_frames is non-zero.
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Submitted by: Jacob Keller <jacob.e.keller@intel.com>
Reviewed by: gallatin@
Sponsored by: Intel Corporation
Differential Revision: https://reviews.freebsd.org/D21868
dim [Wed, 16 Oct 2019 17:11:18 +0000 (17:11 +0000)]
Ensure lld respects the WITH/WITHOUT_SHARED_TOOLCHAIN option
Traditionally, toolchain components such as cc, as, and ld have been
built as static executables. The WITH_SHARED_TOOLCHAIN option from
src.conf(5) is meant to link these as regular executables, e.g. using
shared libraries.
The build of ld.lld did not yet check this option. Fix the Makefile so
it will do so now.
Reported by: Mike Cui <cuicui@gmail.com>
PR: 241257
MFC after: 3 days
glebius [Wed, 16 Oct 2019 16:32:58 +0000 (16:32 +0000)]
do_link_state_change() is executed in taskqueue context and in
general is allowed to sleep. Don't enter the epoch for the
whole duration. If some event handlers need the epoch, they
should handle that theirselves.
ian [Wed, 16 Oct 2019 16:26:35 +0000 (16:26 +0000)]
Update some comments; no functional changes. Some historical old comments
in this driver indicate that the SD_CAPA register is write-once and after
being set one time the values in it cannot be changed. That turns out not
to be the case -- the values written to it survive a reset, but they can
be rewritten/changed at any time.
ian [Wed, 16 Oct 2019 16:19:21 +0000 (16:19 +0000)]
Revert r351218 (by manu). While the changes in r351218 appear to be (and
should be) correct, they lead to the eMMC on a Beaglebone failing to work
in some situations.
The TI sdhci hardware is kind of strange. The first device inherently
supports 1.8v and 3.3v and the abililty to switch between them, and the
other two devices must be set to 1.8v in the sdhci power control register to
operate correctly, but doing so actually makes them run at 3.3v (unless an
external level-shifter is present in the signal path). Even the 1.8v on the
first device may actually be 3.3v (or any other value), depending on what
voltage is fed to the VDDS1-VDDS7 power supply pins on the am335x chip.
Another strange quirk is that the convention for am335x sdhci drivers in
linux and uboot and the am335x boot ROM seems to be to set the voltage in
the sdhci capabilities register to 3.0v even though the actual voltage is
3.3v. Why this is done is a complete mystery to me, but it seems to be
required for correct operation.
If we had complete modern support for the am335x chip we could get the
actual voltages from the FDT data and the regulator framework. But our
am335x code currently doesn't have any regulator framework support.
Reverting to the prior code will get the popular Beaglebone boards working
again.
This is part of the fix for PR 241301, but also requires r353651 for a
complete fix.
ian [Wed, 16 Oct 2019 16:03:19 +0000 (16:03 +0000)]
Relax the sdhci(4) check that filters out the 1.8v voltage option unless
the slot is flagged as 'embedded'.
The features related to embedded and shared slots were added in v3.0 of
the sdhci spec. Hardware prior to v3 sometimes supported 1.8v on non-
removable devices in embedded systems, but had no way to indicate that
via the standard sdhci registers (instead they use out of band metadata
such as FDT data).
This change adds the controller specification version to the check for
whether to filter out the 1.8v selection. On older hardware, the 1.8v
option is allowed to remain. On 3.0 or later it still requires the
embedded-slot flag to remain.
This is part of the fix for PR 241301 (eMMC not detected on Beaglebone).
Changes to the sdhci_ti driver are also needed for a full fix.
markj [Wed, 16 Oct 2019 15:50:12 +0000 (15:50 +0000)]
Clear PGA_WRITEABLE in moea_pvo_remove().
moea_pvo_remove() might remove the last mapping of a page, in which case
it is clearly no longer writeable. This can happen via pmap_remove(),
or when a CoW fault removes the last mapping of the old page.
Reported and tested by: bdragon
Reviewed by: alc, bdragon, kib
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D22044
kevans [Wed, 16 Oct 2019 14:55:56 +0000 (14:55 +0000)]
bectl(8): destroy: use BE_DESTROY_AUTOORIGIN if -o is not specified
-o will force the origin to be destroyed unconditionally.
BE_DESTROY_AUTOORIGIN, on the other hand, will only destroy the origin if it
matches the format used by be_snapshot. This lets us clean up the snapshots
that are clearly not user-managed (because we're creating them) while
leaving user-created snapshots in place and warning that they're still
around when the BE created goes away.
avg [Wed, 16 Oct 2019 14:46:04 +0000 (14:46 +0000)]
wbwd: move to superio(4) bus
This allows to remove a bunch of low level code.
Also, superio(4) provides safer interaction with other drivers
that work with Super I/O configuration registers.
Tested only on PCengines APU2:
superio0: <Nuvoton NCT5104D/NCT6102D/NCT6106D (rev. B+)> at port 0x2e-0x2f on isa0
wbwd0: <Nuvoton NCT6102 (0xc4/0x53) Watchdog Timer> at WDT ldn 0x08 on superio0
The watchdog output is incorrectly wired on that system and the watchdog
does not really do it its job, but the pulse can be seen with a signal
analyzer.
kevans [Wed, 16 Oct 2019 14:43:05 +0000 (14:43 +0000)]
libbe(3): add needed bits for be_destroy to auto-destroy some origins
New BEs can be created from either an existing snapshot or an existing BE.
If an existing BE is chosen (either implicitly via 'bectl create' or
explicitly via 'bectl create -e foo bar', for instance), then bectl will
create a snapshot of the current BE or "foo" with be_snapshot, with a name
formatted like: strftime("%F-%T") and a serial added to it.
This commit adds the needed bits for libbe or consumers to determine if a
snapshot names matches one of these auto-created snapshots (with some light
validation of the date/time/serial), and also a be_destroy flag to specify
that the origin should be automatically destroyed if possible.
A future commit to bectl will specify BE_DESTROY_AUTOORIGIN by default so we
clean up the origin in the most common case, non-user-managed snapshots.
andrew [Wed, 16 Oct 2019 13:30:28 +0000 (13:30 +0000)]
Use tables to store the information to decode the arm64 ID registers.
Arm updates these with each new architecture revision. To help keep them
updated use a collection of tables to hold the needed information to
decode these registers.
andrew [Wed, 16 Oct 2019 13:21:01 +0000 (13:21 +0000)]
Stop leaking information from the kernel through timespec
The timespec struct holds a seconds value in a time_t and a nanoseconds
value in a long. On most architectures these are the same size, however
on 32-bit architectures other than i386 time_t is 8 bytes and long is
4 bytes.
Most ABIs will then pad a struct holding an 8 byte and 4 byte value to
16 bytes with 4 bytes of padding. When copying one of these structs the
compiler is free to copy the padding if it wishes.
In this case the padding may contain kernel data that is then leaked to
userspace. Fix this by copying the timespec elements rather than the
entire struct.
This doesn't affect Tier-1 architectures so no SA is expected.
imp [Wed, 16 Oct 2019 13:20:36 +0000 (13:20 +0000)]
bsd.compat.mk isn't setup to be included outside of Makefile.inc so comment it
out here until that's sorted out. Otherwise the build is broken. when
TARGET_ARCH isn't defined.
https://www.illumos.org/issues/10809
Port ZoL ee36c709c3d Performance optimization of AVL tree comparator functions
This is a followup to r337567 that imported the ZoL commit directly into
FreeBSD. It seems that at the time we did not have some of the earlier
changes, so some pieces of the ZoL change were not applicable. Also,
the illumos version got a few style cleanups. Some changes were missed
or incorrectly merged (e.g., vdev_cache_lastused_compare and
metaslab_rangesize_compare).
hselasky [Wed, 16 Oct 2019 09:11:49 +0000 (09:11 +0000)]
Fix panic in network stack due to use after free when receiving
partial fragmented packets before a network interface is detached.
When sending IPv4 or IPv6 fragmented packets and a fragment is lost
before the network device is freed, the mbuf making up the fragment
will remain in the temporary hashed fragment list and cause a panic
when it times out due to accessing a freed network interface
structure.
1) Make sure the m_pkthdr.rcvif always points to a valid network
interface. Else the rcvif field should be set to NULL.
2) Use the rcvif of the last received fragment as m_pkthdr.rcvif for
the fully defragged packet, instead of the first received fragment.
hselasky [Wed, 16 Oct 2019 08:55:29 +0000 (08:55 +0000)]
Replace rdma_is_upper_dev_rcu() with rdma_vlan_dev_real_dev() in ibcore.
This reduces the number of references to VLAN_TRUNKDEV() in ibcore.
Currently only VLAN is supported as a child interface in FreeBSD.
Remove superfluous RCU locking.
kib [Wed, 16 Oct 2019 07:09:15 +0000 (07:09 +0000)]
Fix assert in PowerPC pmaps after introduction of object busy.
The VM_PAGE_OBJECT_BUSY_ASSERT() in pmap_enter() implementation should
be only asserted when the code is executed as result of pmap_enter(),
not when the same code is entered from e.g. pmap_enter_quick(). This
is relevant for all PowerPC pmap variants, because mmu_*_enter() is
used as the backend, and assert is located there.
Add a PowerPC private pmap_enter() PMAP_ENTER_QUICK_LOCKED flag to
indicate that the call is not from pmap_enter(). For non-quick-locked
calls, assert that the object is locked.
Reported and tested by: bdragon
Reviewed by: alc, bdragon, markj
Sponsored by: The FreeBSD Foundation
Differential revision: https://reviews.freebsd.org/D22041
https://www.illumos.org/issues/9691
When iterating over a ZAP object, we're almost always certain to
iterate over the entire object. If there are multiple leaf blocks, we
can realize a performance win by issuing reads for all the leaf blocks
in parallel when the iteration begins.
For example, if we have 10,000 snapshots, "zfs destroy -nv
pool/fs@1%9999" can take 30 minutes when the cache is cold. This
change provides a >3x performance improvement, by issuing the reads
for all ~64 blocks of each ZAP object in parallel.
https://www.illumos.org/issues/9425
Problem Statement
ZFS Channel program scripts currently require a timeout, so that hung
or long-running scripts return a timeout error instead of causing ZFS
to get wedged. This limit can currently be set up to 100 million Lua
instructions. Even with a limit in place, it would be desirable to
have a sys admin (support engineer) be able to cancel a script that is
taking a long time.
Proposed Solution
Make it possible to abort a channel program by sending an interrupt
signal.In the underlying txg_wait_sync function, switch the cv_wait to
a cv_wait_sig to catch the signal. Once a signal is encountered, the
dsl_sync_task function can install a Lua hook that will get called
before the Lua interpreter executes a new line of code. The
dsl_sync_task can resume with a standard txg_wait_sync call and wait
for the txg to complete. Meanwhile, the hook will abort the script and
indicate that the channel program was canceled. The kernel returns a
EINTR to indicate that the channel program run was canceled.
FreeBSD note: the return value of cv_wait_sig() has inverted meaning
between us and illumos.
https://www.illumos.org/issues/10330
3 recent ZoL changes in the vdev and metaslab code which we can pull over:
PR 8324 c853f382db 8324 Change target size of metaslabs from 256GB to 16GB
PR 8290 b194fab0fb 8290 Factor metaslab_load_wait() in metaslab_load()
PR 8286 419ba59145 8286 Update vdev_is_spacemap_addressable() for new spacemap
encoding
jhibbits [Wed, 16 Oct 2019 00:38:50 +0000 (00:38 +0000)]
powerpc: Add AmigaOne platform, a subclass of MPC85xx
Summary:
The AmigaOne platform, encompassing the X5000 and A1222 at this time, is
based on the mpc85xx platform, but includes some things not listed in
the device tree. Some custom devices, like CPLD, could be added to the
device tree with an overlay, or other means. However, some cannot
easily be done, such as the power button interrupt.
The directory will also become a location to add AmigaOne platform drivers,
such as the aforementioned CPLD, and its children.
erj [Tue, 15 Oct 2019 21:56:19 +0000 (21:56 +0000)]
ixgbe: Disable EEE for backplane X550EM_X
From Zach:
Intel documentation indicates that backplane X550EM_X KR devices do not
support Energy Efficient Ethernet. Prior to this patch, X552 devices
(device ID 0x15AB) will crash the system when transitioning EEE state
via sysctl.
brooks [Tue, 15 Oct 2019 21:27:06 +0000 (21:27 +0000)]
Add the ability to link programs against a compat ABI.
Linkage is controlled by two make knobs:
WANT_COMPAT - Prefer to link against the compat ABI.
NEED_COMPAT - Link against the compat ABI or fail to build.
Supported values are "32", "soft", and "any". The latter meaning pick
the first[0] supported compat ABI.
This can be used to provide test binaries for compat ABIs or to link
ABI-specific programs.
[0] We currently support only one compat ABI at a time, but this may
change in the future and some code in this commit is structured to ease
that change.
jhb [Tue, 15 Oct 2019 19:04:39 +0000 (19:04 +0000)]
Support hot insertion and removal of PCI devices on EC2.
Install ACPI notify handlers on PCI devices with an _EJ0 method. This
handler is invoked when devices are added or removed.
- When an ACPI_NOTIFY_DEVICE_CHECK event posts, rescan the parent bus
device. Note that strictly speaking we only need to rescan the
specified device, but BUS_RESCAN is what is available, so we rescan
the entire bus.
- When an ACPI_NOTIFY_EJECT_REQUEST event posts, detach the device
associated with the ACPI handle, invoke the _EJ0 method, and then
delete the device.
Eventually this might be changed to vector notify events to devd in
userspace where devctl can be used instead to permit more complex
actions such as graceful unmounting of filesystems.
jhb [Tue, 15 Oct 2019 18:16:10 +0000 (18:16 +0000)]
Use __FreeBSD_version to determine if gets() has been removed.
GCC compilers set __FreeBSD__ statically to a build-time determined
targeted version (which in ports always matches the build host's
version). This means that when building any version (12 or 13, etc.)
of riscv or some other architecture via GCC on a 12.x host,
__FreeBSD__ will always be set to 12. As a result, __FreeBSD__ cannot
be used to reliably detect the target FreeBSD version being built.
Instead, __FreeBSD_version from either <sys/param.h> (in the kernel)
or <osreldate.h> (in userland) should be used.
This changes the gets() test in libc++ to use __FreeBSD_version from
<osreldate.h>.
jhb [Tue, 15 Oct 2019 17:11:42 +0000 (17:11 +0000)]
Update MIPS kernel builds to work with mips-gcc.
- Use a default -march of mips64 on N64 and N32 kernels.
- Set the endianness (via MIPS_ENDIAN) and ABI (via MIPS_ABI) in
CFLAGS from MACHINE_ARCH. ARCH_FLAGS now only sets a different
-march value if needed.
- TRAMP_ARCH_FLAGS inherits MIPS_ENDIAN from MACHINE_ARCH but does
not set the ABI since XLPN32 needs an N64 ABI for the trampoline
loader. When TRAMP_ARCH_FLAGS is used it must set both -march
and -mabi.
https://www.illumos.org/issues/10343
On the openzfs feature/porting matrix, this is listed as:
prefix to refcount funcs/types
Having these changes will make it easier to share other work across the
different ZFS operating systems.
PR 7963 424fd7c3e Prefix all refcount functions with zfs_
PR 7885 & 7932 c13060e47 Linux 4.19-rc3+ compat: Remove refcount_t compat
PR 5823 & 5842 4859fe796 Linux 4.11 compat: avoid refcount_t name conflict
Author: Tim Schumacher <timschumi@gmx.de>
Obtained from: illumos, ZoL
MFC after: 3 weeks
10572 Fix race in dnode_check_slots_free()
https://www.illumos.org/issues/10572
The Fix from ZoL:
Currently, dnode_check_slots_free() works by checking dn->dn_type
in the dnode to determine if the dnode is reclaimable. However,
there is a small window of time between dnode_free_sync() in the
first call to dsl_dataset_sync() and when the useraccounting code
is run when the type is set DMU_OT_NONE, but the dnode is not yet
evictable, leading to crashes. This patch adds the ability for
dnodes to track which txg they were last dirtied in and adds a
check for this before performing the reclaim.
This patch also corrects several instances when dn_dirty_link was
treated as a list_node_t when it is technically a multilist_node_t.
10579 Don't allow dnode allocation if dn_holds != 0
https://www.illumos.org/issues/10579
The fix from ZoL:
This patch simply fixes a small bug where dnode_hold_impl() could
attempt to allocate a dnode that was in the process of being freed,
but which still had active references. This patch simply adds the
required check.
hselasky [Tue, 15 Oct 2019 12:08:09 +0000 (12:08 +0000)]
The two functions ifnet_byindex() and ifnet_byindex_locked() are exactly the
same after the network stack was epochified. Merge the two into one function
and cleanup all uses of ifnet_byindex_locked().
While at it:
- Add branch prediction macros.
- Make sure the ifnet pointer is only deferred once,
also when code optimisation is disabled.
hselasky [Tue, 15 Oct 2019 11:20:16 +0000 (11:20 +0000)]
Exclude the network link eventhandler from epochification after r353292.
This fixes the following assert when "options RATELIMIT" is used:
panic()
malloc()
sysctl_add_oid()
tcp_rl_ifnet_link()
do_link_state_change()
taskqueue_run_locked()
ae [Tue, 15 Oct 2019 09:50:02 +0000 (09:50 +0000)]
Explicitly initialize the memory buffer to store O_ICMP6TYPE opcode.
By default next_cmd() initializes only first u32 of opcode. O_ICMP6TYPE
opcode has array of bit masks to store corresponding ICMPv6 types.
An opcode that precedes O_ICMP6TYPE, e.g. O_IP6_DST, can have variable
length and during opcode filling it can modify memory that will be used
by O_ICMP6TYPE opcode. Without explicit initialization this leads to
creation of wrong opcode.
Reported by: Boris N. Lytochkin
Obtained from: Yandex LLC
MFC after: 3 days
jeff [Tue, 15 Oct 2019 03:45:41 +0000 (03:45 +0000)]
(4/6) Protect page valid with the busy lock.
Atomics are used for page busy and valid state when the shared busy is
held. The details of the locking protocol and valid and dirty
synchronization are in the updated vm_page.h comments.
jeff [Tue, 15 Oct 2019 03:41:36 +0000 (03:41 +0000)]
(3/6) Add a shared object busy synchronization mechanism that blocks new page
busy acquires while held.
This allows code that would need to acquire and release a very large number
of page busy locks to use the old mechanism where busy is only checked and
not held. This comes at the cost of false positives but never false
negatives which the single consumer, vm_fault_soft_fast(), handles.