Before UMA CCBs, all CCBs were of the same size, and could
be trivially copied using bcopy(9). Now we have to preserve
alloc_flags, otherwise we might end up attempting to free
stack-allocated CCB to UMA; we also need to take CCB size
into account.
This fixes kernel panic which would occur when trying to access
a stopped (as in, SCSI START STOP, also "ctladm stop") SCSI device.
Reported By: Gary Jennejohn <gljennjohn@gmail.com>
Tested By: Gary Jennejohn <gljennjohn@gmail.com>
Reviewed By: imp
Sponsored by: NetApp, Inc.
Sponsored by: Klara, Inc.
Differential Revision: https://reviews.freebsd.org/D31054
Alexander Motin [Tue, 6 Jul 2021 02:19:48 +0000 (22:19 -0400)]
nvme(4): Report NPWA before NPWG as stripesize.
New Samsung 980 SSDs report Namespace Preferred Write Alignment of
8 (4KB) and Namespace Preferred Write Granularity of 32 (16KB).
My quick tests show that 16KB is a minimal sequential write size
when the SSD reaches peak IOPS, so writing much less is very slow.
But writing slightly less or slightly more does not change much,
so it seems not so much a size granularity as minimum I/O size.
Thinking about different stripesize consumers:
- Partition alignment should be based on NPWA by definition.
- ZFS ashift in part of forcing alignment of all I/Os should also
be based on NPWA. In part of forcing size granularity, if really
needed, it may be set to NPWG, but too big value can make ZFS too
space-inefficient, and the 16KB is actually the biggest supported
value there now.
- ZFS recordsize/volblocksize could potentially be tuned up toward
NPWG to work as I/O size granularity, but enabled compression makes
it too fuzzy. And those are normally user-configurable things.
- ZFS I/O aggregation code could definitely use Optimal Write Size
value and may be NPWG, but we don't have fields in GEOM now to report
the minimal and optimal I/O sizes, and even maximal is not reported
outside GEOM DISK to be used by ZFS.
Alan Cox [Sun, 4 Jul 2021 05:20:42 +0000 (00:20 -0500)]
On a failed fcmpset don't pointlessly repeat tests
In a few places, on a failed compare-and-set, both the amd64 pmap and
the arm64 pmap repeat tests on bits that won't change state while the
pmap is locked. Eliminate some of these unnecessary tests.
geom_label: Remove an old sysinstall(8) workaround
We removed sysinstall(8) back in 2011, so this workaround should be long
since unnecessary. This workaround can end up breaking cases that are
hit in the real world, such as dd'ing a small pre-built disk image to a
large partition that you intend to grow on first boot and uses a UFS
disk label for / in its /etc/fstab (as the only reliable thing a raw UFS
image can reference).
rman: Remove an outdated comment that no longer applies
Since commit 2dd1bdf1834c in 2016 the r_start and r_end fields have been
rman_res_t, which was briefly unsigned long, but commit da1b038af9f9
changed the typedef to be uintmax_t instead. C99 is also something we
assume these days.
When calling file_findfile with only a type it returns
the first file matching the type. But in fdt_apply_overlays we
then iterate on the next files and try loading them as dtb overlays.
Fix this by checking the type one more time.
Sponsored by: Diablotin Systems
Reported by: Mark Millard <marklmi@yahoo.com>
pf: allow table stats clearing and reading with ruleset rlock
Instead serialize against these operations with a dedicated lock.
Prior to the change, When pushing 17 mln pps of traffic, calling
DIOCRGETTSTATS in a loop would restrict throughput to about 7 mln. With
the change there is no slowdown.
Creating tables and zeroing their counters induces excessive IPIs (14
per table), which in turns kills single- and multi-threaded performance.
Work around the problem by extending per-CPU counters with a general
counter populated on "zeroing" requests -- it stores the currently found
sum. Then requests to report the current value are the sum of per-CPU
counters subtracted by the saved value.
Sample timings when loading a config with 100k tables on a 104-way box:
stock:
pfctl -f tables100000.conf 0.39s user 69.37s system 99% cpu 1:09.76 total
pfctl -f tables100000.conf 0.40s user 68.14s system 99% cpu 1:08.54 total
patched:
pfctl -f tables100000.conf 0.35s user 6.41s system 99% cpu 6.771 total
pfctl -f tables100000.conf 0.48s user 6.47s system 99% cpu 6.949 total
strscpy copies the src string, or as much of it as fits, into the dst
buffer. The dst buffer is always NUL terminated, unless it's zero-sized.
strscpy returns the number of characters copied (not including the
trailing NUL) or -E2BIG if len is 0 or src was truncated.
Currently drm-kmod replaces strscpy with strncpy that is not quite
correct as strncpy does not NUL-terminate truncated strings and returns
different values on exit.
LinuxKPI: Use macro for implementation of some dma_map_* functions
This allows to remove unimplemented attrs parameter which type differs
between Linux kernel versions and to compile both drm-kmod and ofed
callers unmodified.
Also convert it to 'unsigned long' type to match modern Linuxes.
LinuxKPI: Do not wait for a grace period in rcu_barrier()
Linux docs explicitly state that this is not required [1]:
"Important note: The rcu_barrier() function is not, repeat, not,
obligated to wait for a grace period. It is instead only required to
wait for RCU callbacks that have already been posted. Therefore, if
there are no RCU callbacks posted anywhere in the system, rcu_barrier()
is within its rights to return immediately. Even if there are
callbacks posted, rcu_barrier() does not necessarily need to wait for
a grace period."
LinuxKPI: Add compiler barriers to list_for_each_entry_lockless macro
so this list-traversal primitive may safely run concurrently with the
_rcu list-mutation primitives such as list_add_rcu() as long as the
traversal is guarded by rcu_read_lock().
Do it by reusing the "list_for_each_entry_rcu" macro which does the same.
On Linux it implements some additional lockdep stuff which we skip.
Also move the macro to linux/rculist.h where it resides on Linux.
LinuxKPI: Change flags parameter type of atomic_dec_and_lock_irqsave
On Linux atomic_dec_and_lock_irqsave is a wrapper macro which provides
a reference to third parameter rather than parameter value itself to
implementation routine called _atomic_dec_and_lock_irqsave [1].
LinuxKPI: Allow kmem_cache_free() to be called from critical sections
as it is required by i915kms driver from Linux kernel v 5.5.
This is done with asynchronous freeing of requested memory areas from
taskqueue thread. As memory to be freed is reused to store linked list
entry, backing UMA zone item size is rounded up to pointer size.
While here, make struct linux_kmem_cache private to LKPI to reduce amount
of BSD headers included by linux/slab.h and switch RCU code to usage of
LKPI's linux_irq_work_tq taskqueue to avoid injection of current into
system-wide taskqueue_fast thread context.
Submitted by: nc (initial version for drm-kmod)
Reviewed by: manu, nc
Differential revision: https://reviews.freebsd.org/D30760
The kernel part of ipfw(8) does initialize LibAlias uncondistionally
with an zeroized port range (allowed ports from 0 to 0). During
restucturing of libalias, port ranges are used everytime and are
therefor initialized with different values than zero. The secondary
initialization from ipfw (and probably others) overrides the new
default values and leave the instance in an unfunctional state. The
obvious solution is to detect such reinitializations and use the new
default value instead.
Pavel Balaev [Thu, 1 Jul 2021 16:27:25 +0000 (19:27 +0300)]
EFI RT: resurrect EFIIOC_GET_TABLE
Make it work, but change the interface to be safe for non-root users. In
particular, right now interface only works for the tables which can be
minimally parsed by kernel to determine the table size. Then, userspace can
query the table size, after that it provides a buffer of needed size
and kernel copies out just table to userspace.
Main advantage is that user no longer need to be able to read /dev/mem,
the disadvantage is the need to have minimal parsers aware of the table
types. Right now the parsers are implemented for ESRT and PROP tables.
Future extension of the present interface might be a return of only
the table physical address, in case kernel does not have suitable
parser yet. Then, a privileged user could read the table from /dev/mem.
This extension, which logically equivalent to the old (non-worked)
EFIIOC_GET_TABLE variant, is not implemented until needed.
K Staring [Sat, 3 Jul 2021 06:15:49 +0000 (00:15 -0600)]
hdaa: update pin patch configurations
A number of structural changes:
- Use decimal nid numbers instead of hex
- updated the branch to incoorporate the suggestions made in the
ALC280 pull request github thread
- Convert magic pin values into strings.
- Also update hdaa_patches to use clearer enums..
- made pin patch type enum clearer, add macro for 'string' type
patches
- Added pin_patch structures to separate data from logic.
- Integrated Realtek patches into new structure.
These incorporate fixes for ALC255, ALC256, ALC260, ALC262, ALC268,
ALC269, ALC280, ALC282, ALC283, ALC286, ALC290, ALC293, ALC296, ALC2880
And have definitions for a number of Dell and HP laptops.
Much of this data has been mined fromt he tables in the Linux driver.
imp squashed these into one commit because the changes from the github
pull requests no longer cleanly apply individually and made light style
changes after feedback from jhb.
Warner Losh [Fri, 2 Jul 2021 23:09:19 +0000 (17:09 -0600)]
hardclock.9: Refine some details
Refine mistakes from adaptaton of NetBSD's hardclock man page to
FreeBSD:
o clarify what usermode means
o clarify how often hardclock is called
o remove Xr callout(9) since that's done elsewhere
The expiration time of direct address mappings is explicitly
uninitialized. Expire times are always compared during housekeeping.
Despite the uninitialized value does not harm, it's simpler to just
set it to a reasonable default. This was detected during valgrinding
the test suite.
Revert libunwind change to fix backtrace segfault on aarch64
Revert commit 22b615a96593 from llvm git (by Daniel Kiss):
[libunwind] Support for leaf function unwinding.
Unwinding leaf function is useful in cases when the backtrace finds a
leaf function for example when it caused a signal.
This patch also add the support for the DW_CFA_undefined because it marks
the end of the frames.
Reland with limit the test to the x86_64-linux target.
Bisection has shown that this particular upstream commit causes programs
using backtrace(3) on aarch64 to segfault. This affects the lang/rust
port, for instance. Until we can upstream to fix this problem, revert
the commit for now.
Comparing elements in a tree requires transitiviy. If a < b and b < c
then a must be smaller than c. This way the tree elements are always
pairwise comparable.
Tristate comparsion functions returning values lower, equal, or
greater than zero, are usually implemented by a simple subtraction of
the operands. If the size of the operands are equal to the size of
the result, integer modular arithmetics kick in and violates the
transitivity.
Example:
Working on byte with 0, 120, and 240. Now computing the differences:
120 - 0 = 120
240 - 120 = 120
240 - 0 = -16
Warner Losh [Fri, 2 Jul 2021 22:00:42 +0000 (16:00 -0600)]
nvme: coherently read status of completion records
Coherently read the phase bit of the status completion record. We loop
over the completion record array, looking for all the transactions in
the same phase that have been completed. In doing that, we have to be
careful to read the status field first, and if it indicates a complete
record, we need to read and process that record. Otherwise, the host
might be overtaken by device when reading this completion record,
leading to a mistaken belief that the record is in phase. This leads to
the code using old values and looking at an already completed entry, which
has no current tracker.
To work around this problem, we read the status and make sure it is in
phase, we then re-read the entire completion record guaranteeing it's
complete, valid, and consistent . In addition we resync the dmatag to
reflect changes since the prior loop for the bouncing dma case.
Reviewed by: jrtc27@, chuck@
Found by: jrtc27 (this fix is based in part on her D30995 fix)
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D31002
Warner Losh [Fri, 2 Jul 2021 21:58:19 +0000 (15:58 -0600)]
nvme: Fix alignment on nvme structures
Remove __packed from nvme_command, nvme_completion and
nvme_dsm_trim. Add super-alignment to nvme_completion since it's always
at least that aligned in hardware (and in our existing uses of it
embedded in structures). It generates better code in
nvme_qpair_process_completions on riscv64 because otherwise the ABI
assumes a 4-byte alignment, and the same on all other platforms.
Kristof Provost [Tue, 29 Jun 2021 08:26:40 +0000 (10:26 +0200)]
pf: Reduce the data returned in DIOCGETSTATESNV
This call is particularly slow due to the large amount of data it
returns. Remove all fields pfctl does not use. There is no functional
impact to pfctl, but it somewhat speeds up the call.
It might affect other (i.e. non-FreeBSD) code that uses the new
interface, but this call is very new, so there's unlikely to be any. No
releases contained the previous version, so we choose to live with the
ABI modification.
Kristof Provost [Mon, 28 Jun 2021 10:48:20 +0000 (12:48 +0200)]
pf tests: Stress state retrieval
Create and retrieve 20.000 states. There have been issues with nvlists
causing very slow state retrieval. We don't impose a specific limit on
the time required to retrieve the states, but do log it. In excessive
cases the Kyua timeout will fail this test.
Mateusz Guzik [Wed, 30 Jun 2021 14:15:25 +0000 (16:15 +0200)]
mbuf: add m_free_raw to be used instead of directly calling uma_zfree
The intent is to remove all direct zone_mbuf consumers so that ctor/dtor
from that zone can be reimplemented as wrappers around uma, avoiding an
indirect function call.
Alex Richardson [Fri, 2 Jul 2021 08:21:04 +0000 (09:21 +0100)]
Simplify and speed up the kyua build
Instead of having multiple kyua libraries, just include the files as part
of usr.bin/kyua. Previously, we would build each kyua source up to four
times: once as a .o file and once as a .pieo. Additionally, the kyua
libraries might be built again for compat32. As all the kyua libraries
amount to 102 C++ sources the build time is significant (especially when
using an assertions enabled compiler). This change ensures that we build
306 fewer .cpp source files as part of buildworld.
As for example pfctl -ss keeps calling it, it saves a lot of overhead
from elided parsing of /etc/nsswitch.conf and /etc/protocols.
Sample result when running a pre-nvlist binary with nfs root and dumping
7 mln states:
before: 24.817u 62.993s 1:28.52 99.1%
after: 8.064u 1.117s 0:18.87 48.5%
Alexander Motin [Thu, 1 Jul 2021 19:28:55 +0000 (15:28 -0400)]
mrsas(4): Report more correct maximum I/O size.
Subtract one SGE for the case of misaligned address. Also take into
account maximum number of sectors reported by firmware, that gives
nicer 256KB limit instead of 276KB calculated from the SGE limit.
While there, remove number of I/O size checks, duplicating what is
already checked by CAM and busdma(9).
phandle_t is a uint32_t type, <= 0 comparison doesn't work with it as intended.
This caused the ofw_pci code to attach to PCI bus on ACPI based systems.
Since 3eae4e106ac7 ("Fix error value returned by ofw_bus_gen_get_node().")
ofw subsystem can only return -1 for invalid nodes. Use that.
Currently all PCIE windows point to bus address 0x0, which does not match
the values obtained from hardware during EA.
Replace those values with CPU addresses, since in reality we
have a 1:1 mapping between the two.
This patch is queued for Linux v5.14 in linux-next tree: 6bee93d93111 "arm64: dts: fsl-ls1028a: Correct ECAM PCIE window ranges"
The r intc interrupt controller seems to do a lot of things :
- It can handle the NMI interrupt
- It have local interrupts for some device that also can be muxed with GIC
- It can serve as an forwarder for the GIC
It's mostly used for deepsleep/wakeup if I understood correctly and we do not
support this on arm64.
For now just forward everything to the GIC so interrupts works again for device
which now have this interrupts controller set since dts v5.12
Warner Losh [Thu, 1 Jul 2021 15:32:40 +0000 (09:32 -0600)]
hz.9: update stathz for current usage
Update the stathz description to reflect reality. profhz is the only
thing we should deprecate. Add some implementation notes that describe
the optimizations made to date.
This introduces a new, per-process flag, "NO_NEW_PRIVS", which
is inherited, preserved on exec, and cannot be cleared. The flag,
when set, makes subsequent execs ignore any SUID and SGID bits,
instead executing those binaries as if they not set.
The main purpose of the flag is implementation of Linux
PROC_SET_NO_NEW_PRIVS prctl(2), and possibly also unpriviledged
chroot.
When the setting of kern.ipc.maxsockbuf is less than what is
desired for I/O based on vfs.maxbcachebuf and vfs.nfs.bufpackets,
a console message of "Consider increasing kern.ipc.maxsockbuf".
is printed.
This patch modifies the message to provide a suggested value
for kern.ipc.maxsockbuf.
Note that the setting is only needed when the NFS rsize/wsize
is set to vfs.maxbcachebuf.
While here, make nfs_bufpackets global, so that it can be used
by a future patch that adds a sysctl to set the NFS server's
maximum I/O size. Also, remove "sizeof(u_int32_t)" from the maximum
packet length, since NFS_MAXXDR is already an "overestimate"
of the actual length.
Olivier Houchard [Wed, 30 Jun 2021 20:56:50 +0000 (22:56 +0200)]
arm: Make sure we can handle a thumb entry point.
Similarly to what's been done on arm64 with commit 712c060c94fd447c91b0e6218c12a431206b487a, when executing a binary, if the
entry point is a thumb symbol, then make sure we set the PSL_T flag, otherwise
the CPU will interpret it in ARM mode, and that will likely leads to an
undefined instruction.
Mitchell Horne [Thu, 27 May 2021 20:02:04 +0000 (17:02 -0300)]
libpmc: enable pmu_utils on arm64
This allows supported libpmc to query/select from the pmu-events table,
which may have a more complete set of events than what we define
manually. A future update to these definitions should greatly improve
this support. The alias table is empty for now, until this future import
is complete.
Add the Foundation's copyright for recent work on this file.
Reviewed by: ray (slightly earlier version)
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D30603
Mitchell Horne [Wed, 19 May 2021 16:11:33 +0000 (13:11 -0300)]
hwpmc_arm64: accept raw event codes for PMC_OP_PMCALLOCATE
Make it possible to specify event codes without an offset of
PMC_EV_ARMV8_FIRST, by setting a machine-dependent flag. This is
required to make use of event definitions from pmu-events.
Reviewed by: ray (slightly earlier version)
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D30602