Colin Percival [Mon, 10 Jan 2022 01:22:20 +0000 (17:22 -0800)]
x86: Speed up clock calibration
Prior to this commit, the TSC and local APIC frequencies were calibrated
at boot time by measuring the clocks before and after a one-second sleep.
This was simple and effective, but had the disadvantage of *requiring a
one-second sleep*.
Rather than making two clock measurements (before and after sleeping) we
now perform many measurements; and rather than simply subtracting the
starting count from the ending count, we calculate a best-fit regression
between the target clock and the reference clock (for which the current
best available timecounter is used). While we do this, we keep track
of an estimate of the uncertainty in the regression slope (aka. the ratio
of clock speeds), and stop measuring when we believe the uncertainty is
less than 1 PPM.
In order to avoid the risk of aliasing resulting from the data-gathering
loop synchronizing with (a multiple of) the frequency of the reference
clock, we add some additional spinning depending upon the iteration number.
For numerical stability and simplicity of implementation, we make use of
floating-point arithmetic for the statistical calculations.
On the author's Dell laptop, this reduces the time spent in calibration
from 2000 ms to 29 ms; on an EC2 c5.xlarge instance, it is reduced from
2000 ms to 2.5 ms.
When detaching, truss(1) sends SIGSTOP to the traced process to ensure
that it is detaching in the steady state. But it is possible, for
multithreaded process, that wait() call returns event other than our
SIGSTOP notification. As result, SIGSTOP might sit in some thread'
sigqueue, which makes SIGCONT a nop. Then, the process is stopped when
the queued SIGSTOP is acted upon.
To handle this, loop until we drain everything before SIGSTOP,
and see that the process is stopped.
Note that the earlier fix makes it safe to have some more debugging
events longering after SIGSTOP is acted upon. They will be ignored
after PT_DETACH.
Reviewed by: markj
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D33861
Doug Moore [Wed, 12 Jan 2022 17:03:53 +0000 (11:03 -0600)]
vm_reserv: use enhanced bitstring for popmaps
vm_reserv.c uses its own bitstring implemenation for popmaps. Using
the bitstring_t type from a standard header eliminates the code
duplication, allows some bit-at-a-time operations to be replaced with
more efficient bitstring range operations, and, in
vm_reserv_test_contig, allows bit_ffc_area_at to more efficiently
search for a big-enough set of consecutive zero-bits.
Make bitstring changes improve the vm_reserv code. Define a bit_ntest
method to test whether a range of bits is all set, or all clear.
Define bit_ff_at and bit_ff_area_at to implement the ffs and ffc
versions with a parameter to choose between set- and clear- bits.
Improve the area_at implementation. Modify the bit_nset and
bit_nclear implementations to allow code optimization in the cases
when start or end are multiples of _BITSTR_BITS.
Andrew Turner [Thu, 8 Jul 2021 13:15:55 +0000 (13:15 +0000)]
Add arm64 pointer authentication support
Pointer authentication allows userspace to add instructions to insert
a Pointer Authentication Code (PAC) into a register based on an address
and modifier and check if the PAC is correct. If the check fails it will
either return an invalid address or fault to the kernel.
As many of these instructions are a NOP when disabled and in earlier
revisions of the architecture this can be used, for example, to sign
the return address before pushing it to the stack making Return-oriented
programming (ROP) attack more difficult on hardware that supports them.
The kernel manages five 128 bit signing keys: 2 instruction keys, 2 data
keys, and a generic key. The instructions then use one of these when
signing the registers. Instructions that use the first four store the
PAC in the register being signed, however the instructions that use the
generic key store the PAC in a separate register.
Currently all userspace threads share all the keys within a process
with a new set of userspace keys being generated when executing a new
process. This means a forked child will share its keys with its parent
until it calls an appropriate exec system call.
In the kernel we allow the use of one of the instruction keys, the ia
key. This will be used to sign return addresses in function calls.
Unlike userspace each kernel thread has its own randomly generated.
Thread0 has a static key as does the early code on secondary CPUs.
This should be safe as there is minimal user interaction with these
threads, however we could generate random keys when the Armv8.5
Random number generation instructions are present.
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D31261
Andrew Turner [Wed, 29 Dec 2021 12:10:49 +0000 (12:10 +0000)]
Fix undefined behaviour in the USB controllers
The USB controller drivers assume they can cast a NULL pointer to a
struct and find the address of a member. KUBSan complains about this so
replace with the __offsetof and __containerof macros that use either a
builtin function where available, or the same NULL pointer on older
compilers without the builtin.
Reviewers: hselasky
Subscribers: imp
Reviewed by: hselasky
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D33865
Andriy Gapon [Wed, 12 Jan 2022 07:01:29 +0000 (09:01 +0200)]
mmc_da: implement d_dump method, sddadump
sddadump has been derived from sddastart.
mmc_sim interface has grown a new method, cam_poll, to support polled
operation.
mmc_sim code has been changed to provide a sim_poll hook only if the
controller implements the new method. The hooks is implemented in terms
of the new mmc_sim_cam_poll method.
Additionally, in-progress CCB-s now have CAM_REQ_INPROG status to
satisfy xpt_pollwait().
mmc_sim_cam_poll method has been implemented in dwmmc host controller.
unionfs: add stress2 scenarios for write references
Add some test cases, based on the existing nullfs10 scenario, to
ensure that unionfs write references are propagated between the
unionfs and underlying vnodes, including unionfs copy-on-write
operations
unionfs: allow vnode lock to be held shared during VOP_OPEN
do_execve() will hold the vnode lock shared when it calls VOP_OPEN(),
but unionfs_open() requires the lock to be held exclusive to
correctly synchronize node status updates. This requirement is
asserted in unionfs_get_node_status().
Change unionfs_open() to temporarily upgrade the lock as is already
done in unionfs_close(). Related to this, fix various cases throughout
unionfs in which vnodes are not checked for reclamation following lock
upgrades that may have temporarily dropped the lock. Also fix another
related issue in which unionfs_lock() can incorrectly add LK_NOWAIT
during a downgrade operation, which trips a lockmgr assertion.
John Baldwin [Tue, 11 Jan 2022 22:17:41 +0000 (14:17 -0800)]
crypto: Re-add encrypt/decrypt_multi hooks to enc_xform.
These callbacks allow multiple contiguous blocks to be manipulated in
a single call. Note that any trailing partial block for a stream
cipher must still be passed to encrypt/decrypt_last.
While here, document the setkey and reinit hooks and reorder the hooks
in 'struct enc_xform' to better reflect the life cycle.
Reviewed by: markj
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D33529
John Baldwin [Tue, 11 Jan 2022 22:16:05 +0000 (14:16 -0800)]
crypto: Add support for the XChaCha20-Poly1305 AEAD cipher.
This cipher is a wrapper around the ChaCha20-Poly1305 AEAD cipher
which accepts a larger nonce. Part of the nonce is used along with
the key as an input to HChaCha20 to generate a derived key used for
ChaCha20-Poly1305.
This cipher is used by WireGuard.
Reviewed by: markj
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D33523
John Baldwin [Tue, 11 Jan 2022 19:38:11 +0000 (11:38 -0800)]
Add list-old-{dirs,files,libs} targets.
These targets generate a raw list of the candidate old files roughly
corresponding to the values of OLD_DIRS, OLD_FILES, and OLD_LIBS.
Currently list-old-files also includes uncompressed manpages in
addition to compressed manpages.
Use these targets in the implementation of check-old-* and
delete-old-* to replace duplicated logic.
Reviewed by: imp, emaste
Sponsored by: The University of Cambridge, Google Inc.
Differential Revision: https://reviews.freebsd.org/D33327
Rick Macklem [Tue, 11 Jan 2022 17:40:07 +0000 (09:40 -0800)]
nfsd: Do not accept audit/alarm ACEs for the NFSv4 server
The UFS and ZFS file systems only support Allow/Deny ACEs
in the NFSv4 ACLs. This patch does not allow the server
to parse Audit/Alarm ACEs. The NFSv4 client is still
allowed to pase Audit/Alarm ACEs, since non-FreeBSD NFSv4
servers may use them.
This patch should not have a significant effect, since the
UFS and ZFS file systems will not handle these ACEs anyhow.
It simply serves as an additional "safety belt" for the
NFSv4 server.
Andrew Turner [Tue, 11 Jan 2022 11:55:32 +0000 (11:55 +0000)]
Use ${MACHINE} for the kernel modeule ldscript
For consistancy with the kernel linker script also use ${MACHINE} for
finding the kernel module linker script. As we currently only use this
for amd64 and i386 this is a no-op, but I'm planning on using this with
arm64 where ${MACHINE} != ${MACHINE_ARCH}.
Reviewed by: markj, kib, imp
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D33841
I now see that the implementation of the "dacl" operation
requires that the NFSv4 server to "automatic inheritance"
and I do not plan on doing this. As such, this patch is
harmless, but unneeded.
Andriy Gapon [Tue, 11 Jan 2022 13:56:07 +0000 (15:56 +0200)]
dwwdt: make it actually useful
Flip dwwdt_prevent_restart to false. What's the use of a watchdog if it
does not restart a hung system?
Add a knob for panic-ing on the first timeout, resetting on the second
one. This can be useful if interrupts can still work, otherwise a reset
recovers a system without any aid for debugging the hang.
The change also doubles the timeout that's programmed into the hardware.
The previous version of the code always had the interrupt on the first
timeout enabled, but it took no action on it. Only the second timeout
could be configured to reset the system. So, the hardware timeout was
set to a half of the user requested timeout. But now,we can take a
corrective action on the first timeout, so we use the user requested
timeout.
While here, define boolean sysctl-s as such.
Reviewed by: manu
MFC after: 3 weeks
Differential Revision: https://reviews.freebsd.org/D33534
Andriy Gapon [Tue, 11 Jan 2022 13:44:46 +0000 (15:44 +0200)]
dtrace: add a knob to control maximum size of principal buffers
We had a hardcoded limit of 1/128-th of physical memory that was further
subdivided between all CPUs as principal buffers are allocated on the
per-CPU basis. Actually, the buffers could use up 1/64-th of the
memmory because with the default switch policy there are two buffers per
CPU.
This commit allows to change that limit.
Note that the discussed limit is per dtrace command invocation.
The idea is to limit the size of a single malloc(9) call, not the total
memory size used by DTrace buffers.
Wojciech Macek [Tue, 11 Jan 2022 10:08:35 +0000 (11:08 +0100)]
ip_mroute: do not call epoch_waitwhen lock is taken
mrouter_done is called with RAW IP lock taken. Some annoying
printfs are visible on the console if INVARIANTS option is enabled.
Provide atomic-based mechanism which counts enters and exits from/to
critical section in ip_input and ip_output.
Before de-initialization of function pointers ensure (with busy-wait)
that mrouter de-initialization is visible to all readers and that we don't
remove pointers (like ip_mforward etc.) in the middle of packet processing.
Kristof Provost [Mon, 10 Jan 2022 17:43:35 +0000 (18:43 +0100)]
pf: postpone clearing of struct pf_pdesc
Postpone zeroing out pd until after the PFI_IFLAG_SKIP/M_SKIP_FIREWALL
checks. We don't need it until then, and it saves us a few CPU cycles in
some cases.
This isn't expected to make a measurable performance change though.
Bjoern A. Zeeb [Mon, 10 Jan 2022 22:12:53 +0000 (22:12 +0000)]
LinuxKPI: 802.11 fix locking in lkpi_stop_hw_scan()
In lkpi_stop_hw_scan() we have to unlock around cancelling the
hardware scan and an msleep to wait for the confirmation that the
scan ended. Otherwise we are sleeping with the non-sleepable
net80211 com lock held. At the same time we need to hold the lhw
lock for the msleep().
This lock change got lost in the refactoring of lkpi_iv_newstate().
Reported by: ambrisko, delphij
PR: 261075
MFC after: 3 days
Sponsored by: The FreeBSD Foundation
LinuxKPI: Use negative bit field size to trigger BUILD_BUG_ON_ZERO
compile time assertion on non-NULL pointers. Tests conducted show that
_Static_assert, negative array size method and current code does not
handle pointers well enough. Bit field method solves this problem.
This change is derrived from Linux implementation of BUILD_BUG_ON_ZERO.
LinuxKPI: Import MTRR support functions from drm-kmod
They are superseded by PAT and mostly useless nowadays but still can be
used on Pentium III/IV era processors. Unlike drm-kmod version, this one
ignores MTRR if PAT is available that fixes confusing "Failed to add WC
MTRR for [0xXXXX-0xYYYY]: 22; performance may suffer" message often
appearing during drm-kmod initialization process.
If child is slow reading from its input, or even completely stops doing
the read, script(1) hangs in write(2) to the pts master waiting until
there is a space in the terminal discipline buffer. This also stops
handling any outer io, as well as child output.
Work around the problem by making pts master fd non-blocking, and be
prepared for short writes to it. The data to be written to master is
buffered in the tailq which is processed when select(2) detects that
master is ready for write.
PR: 260938
Reported by: наб <nabijaczleweli@nabijaczleweli.xyz>
See also: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1003095
Reviewed by: markj
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D33789
Alexander Motin [Mon, 10 Jan 2022 14:35:01 +0000 (09:35 -0500)]
Revert "iflib: Relax timer period from 0.5 to 0.5-0.75s."
I've noticed relations between iflib_timer() vs ixl_admin_timer().
Both scheduled at the same 2Hz rate, but the second is rescheduling
the first each time, so if the first get any slower, it won't be
executed at all. Revert this until deeper investigation.
If a counter more than overflows just as we read it on switch out then,
if using sampling mode, we will negate this small value to give a huge
reload count, and if we later switch back in that context we will
validate that value against pm_reloadcount and panic an INVARIANTS
kernel with:
panic: [pmc,1470] pmcval outside of expected range cpu=2 ri=16 pmcval=fffff292 pm_reloadcount=10000
or similar. Presumably in a non-INVARIANTS kernel we will instead just
use the provided value as the reload count, which would lead to the
overflow not happing for a very long time (e.g. 78 minutes for a 48-bit
counter incrementing at an averate rate of 1GHz).
Instead, clamp the reload count to 0 (which corresponds precisely to the
value we would compute if it had just overflowed and no more), which
will result in hwpmc using the full original reload count again. This is
the approach used by core for Intel (for both fixed and programmable
counters).
As part of this, armv7 and arm64 are made conceptually simpler; rather
than skipping modifying the overflow count for sampling mode counters so
it's always kept as ~0, those special cases are removed so it's always
applicable and the concatentation of it and the hardware counter can
always be viewed as a 64-bit counter, which also makes them look more
like other architectures.
Whilst here, fix an instance of UB (shifting a 1 into the sign bit) for
amd in its sign-extension code.
Andrew Turner [Wed, 5 Jan 2022 15:12:01 +0000 (15:12 +0000)]
Move instructions into the arm64 exception vectors
We have 32 instructions in each exception vector on arm64. Previously
only one was used to branch to the handler function. We can split the
start of these functions and move some of the instructions into the
vectors.
Reviewed by: kib, markj
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D33751
Kirk McKusick [Mon, 10 Jan 2022 00:17:13 +0000 (16:17 -0800)]
Avoid unnecessary setting of UFS flag requesting fsck(8) be run.
When the kernel is requested to mount a filesystem with a bad superblock
check hash, it would set the flag in the superblock requesting that the
fsck(8) program be run. The flag is only written to disk as part of a
superblock update. Since the superblock always has its check hash updated
when it is written to disk, the problem for which the flag has been set
will no longer exist. Hence, it is counter-productive to set the flag
as it will just cause an unnecessary run of fsck if it ever gets written.