Mariusz Zaborski [Tue, 12 Jan 2021 18:38:10 +0000 (19:38 +0100)]
casper: convert macros to inline functions
In libcasper, the first argument to the function is a structure that
represents a connection to Casper. On systems without Casper, macros
are used to interpose the Casper functions to standard libc ones.
This may cause errors/warnings that the variable is not used.
With the inline function, there is no such problem.
When detaching the if_ure(4) driver, the TX active USB transfer array may
point to freed USB transfers. Given that the number of USB transfers is
very low, simply start all transfers every time there is a packet to
keep safe from use-after-free.
mhorne [Wed, 4 Nov 2020 17:51:10 +0000 (13:51 -0400)]
riscv pmap: add some pv list assertions
Ensure that we don't end up with a superpage in the vm_page_t's pv list.
This may help with debugging the panic reported in PR 250866, in which
l3 in pmap_remove_write() was found to be NULL. Adding a KASSERT to this
function will help narrow down the cause of this panic the next time it
occurs.
Andrew Turner [Tue, 12 Jan 2021 12:14:09 +0000 (12:14 +0000)]
Handle using a sub instruction in the arm64 fbt
Some stack frames are too large for a store pair instruction we already
detect in the arm64 fbt code. Add support for handling subtracting the
stack pointer directly.
Andrew Turner [Tue, 12 Jan 2021 11:37:06 +0000 (11:37 +0000)]
Only allow a store through sp in the arm64 fbt
When searching for an instruction to patch out in the arm64 function
boundary trace we search for a store pair with a write back. This
instruction is commonly used to store two registers to the stack
and update the stack pointer to hold space for more.
This works in many cases, however not all functions use this, e.g.
when the stack frame is too large. In these cases we may find another
instruction of the same type that doesn't store through the stack
pointer. Filter these instructions out and assume if we see one we
are past the function prologue.
Emmanuel Vadot [Tue, 12 Jan 2021 11:02:38 +0000 (12:02 +0100)]
linuxkpi: add kernel_fpu_begin/kernel_fpu_end
With newer AMD GPUs (>=Navi,Renoir) there is FPU context usage in the
amdgpu driver.
The `kernel_fpu_begin/end` implementations in drm did not even allow nested
begin-end blocks.
Emmanuel Vadot [Tue, 22 Dec 2020 18:15:01 +0000 (19:15 +0100)]
linuxkpi: Add shrinker support
A driver can register a shrinker that will be called when the kernel
wants to free some memory.
Add support for that in linuxkpi and call the registered shrinkers
when the lowmem event is triggered.
Emmanuel Vadot [Thu, 10 Dec 2020 16:47:11 +0000 (17:47 +0100)]
linuxkpi: Add more pci functions needed by DRM
-pci_get_class : This function search for a matching pci device based on
the class/subclass and returns a newly created pci_dev.
- pci_{save,restore}_state : This is analogous to ours with the same name
- pci_is_root_bus : Return true if this is the root bus
- pci_get_domain_bus_and_slot : This function search for a matching pci
device based on domain, bus and slot/function concat into a single
unsigned int (devfn) and returns a newly created pci_dev
- pci_bus_{read,write}_config* : Read/Write to the config space.
While here add some helper function to alloc and fill the pci_dev struct.
Emmanuel Vadot [Thu, 10 Dec 2020 17:38:41 +0000 (18:38 +0100)]
pci: Add pci_find_class_from
pci_find_class_from help finding one or multiple device matching
a class and subclass.
If the from argument is not null we will first loop in the device list
until we find the matching device and only then start to check if the
class/subclass matches.
Toomas Soome [Mon, 11 Jan 2021 21:54:23 +0000 (21:54 +0000)]
loader.efi: reworked framebuffer setup
Pass gfx_state to efi_find_framebuffer(), so we can pick between
GOP and UGA in efi_find_framebuffer(), also we can then
set up struct gen_fb in gfx_state from efifb and isolate efi fb data
processing into framebuffer.c.
This change does allow us to clean up efi_cons_init() and reduce
BS->LocateProtocol() calls.
A little downside is that we now need to translate gen_fb back to
efifb in bootinfo.c (for passing to kernel), and we need to add few
-I options to CFLAGS.
libthr malloc: support recursion on thr_malloc_umtx.
One possible way the recursion can happen is during fork: suppose
that fork is called from early code that did not triggered
jemalloc(3) initialization yet. Then we lock thr_malloc lock, and
call malloc_prefork() that might require initialization of jemalloc
pthread_mutexes, calling into libthr malloc. It is safe to allow
recursion for this occurence.
PR: 252579
Reported by: Vasily Postnicov <shamaz.mazum@gmail.com>
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
sigfastblock: do not skip cursig/postsig loop in ast()
Even if sigfastblock block is non-zero, non-blockable signals must be
checked on ast and delivered now. This also affects debugger ability
to attach, because issignal() also calls ptracestop() if there is
a pending stop for debugee.
Instead of checking for sigfastblock, and either setting PENDING flag
for usermode or doing signal delivery loop, always do the loop after
checking, and then handle PENDING bit. issignal() already does the right
thing for fast-blocked case, allowing only STOPs and SIGKILL delivery to
happen.
Reported by: Vasily Postnicov <shamaz.mazum@gmail.com>, markj
Reviewed by: markj
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D28089
Mateusz Guzik [Tue, 12 Jan 2021 08:47:32 +0000 (08:47 +0000)]
amd64: fix tlb shootdown when all cpus are passed in the bitmap
Right now the routine leaves the current CPU in the map, later tripping
on an assert when filling in the scoreboard: panic: IPI scoreboard is
zero, initiator 1 target 1
Instead pre-check if all CPUs are present in the map and remember that
outcome for later.
Alan Somers [Sun, 10 Jan 2021 03:23:05 +0000 (20:23 -0700)]
lio_listio: validate aio_lio_opcode
Previously, we would accept any kind of LIO_* opcode, including ones
that were intended for in-kernel use only like LIO_SYNC (which is not
defined in userland). The situation became more serious with 022ca2fc7fe08d51f33a1d23a9be49e6d132914e. After that revision, setting
aio_lio_opcode to LIO_WRITEV or LIO_READV would trigger an assertion.
Note that POSIX does not specify what should happen if aio_lio_opcode is
invalid.
Charlie Root [Tue, 12 Jan 2021 01:56:12 +0000 (18:56 -0700)]
ICMP checksum test: Fix for big endian
The in_cksum tests originally tried to simulate a BE environment by
swapping the byte order of the input. But that's overcomplicated, and
didn't actually work on real BE hardware. The correct testing strategy
is just to test on the native endianness, and run the tests in both BE
and LE environments.
Andrew Gallatin [Tue, 12 Jan 2021 01:03:37 +0000 (20:03 -0500)]
amd64: compare TLB shootdown target to all_cpus
On amd64, the pmap code passes all_cpus to
smp_targeted_tlb_shootdown() when unmapping from the
kernel pmap. This function has an optimized path to send IPIs
to all but itself, which it intends to do when the target
is all cpus. However, we need to compare the target cpu mask
with all_cpus, rather than using CPU_ISFULLSET(). Comparing with
CPU_ISFULLSET() will only work when we have MAXCPU cpus active in
the system, otherwise, we'll be sending repeated IPIs, rather than
a single IPI to all CPUs but ourself.
Fixing this should reduce the time spent in native_lapic_ipi_wait()
as we will be sending ipis in parallel, rather than one-by-one.
This is confirmed by dtrace.
Kirk McKusick [Tue, 12 Jan 2021 00:44:41 +0000 (16:44 -0800)]
Eliminate lock order reversal in UFS ffs_unmount().
UFS uses a new "mntfs" pseudo file system which provides private
device vnodes for a file system to safely access its disk device.
The original device vnode is saved in um_odevvp to hold the exclusive
lock on the device so that any attempts to open it for writing will
fail. But it is otherwise unused and has its BO_NOBUFS flag set to
enforce that file systems using mntfs vnodes do not accidentally
use the original devfs vnode. When the file system is unmounted,
um_odevvp is no longer needed and is released.
The lock order reversal happens because device vnodes must be locked
before UFS vnodes. During unmount, the root directory vnode lock
is held. When when calling vrele() on um_odevvp, vrele() attempts to
exclusive lock um_odevvp causing the lock order reversal. The problem
is eliminated by doing a non-blocking exclusive lock on um_odevvp
which will always succeed since there are no users of um_odevvp.
With um_odevvp locked, it can be released using vput which does not
attempt to do a blocking exclusive lock request and thus avoids the
lock order reversal.
For rate-based resources that support throttling (e.g.
readiops/writeips), this fixes a divide-by-zero panic when rctl(8)
passes 0 as the throttle value. For these resources, treat
zero-throttle requests as requests to suspend forward progress as long
as possible using the duration specified in
kern.racct.rctl.throttle_max.
Use rn_match instead of doing indirect calls in fib_algo.
Relevant inet/inet6 code has the control over deciding what
the RIB lookup function currently is. With that in mind,
explicitly set it to the current value (rn_match) in the
datapath lookups. This avoids cost on indirect call.
exec_new_vmspace: print useful error message on ctty if stack cannot be mapped.
After old vmspace is destroyed during execve(2), but before the new space
is fully constructed, an error during image activation cannot be returned
because there is no executing program to receive it.
In the relatively common case of failure to map stack, print some hints
on the control terminal. Note that user has enough knobs to cause stack
mapping error, and this is the most common reason for execve(2) aborting
the process.
Requested by: jhb
Reviewed by: emaste, jhb
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D28050
It is checked in vm_map_insert() and vm_map_protect() that PROT_WRITE |
PROT_EXEC are never specified together, if vm_map has MAP_WX flag set.
FreeBSD control flag allows specific binary to request WX exempt, and
there are per ABI boolean sysctls kern.elf{32,64}.allow_wx to enable/
disable globally.
Reviewed by: emaste, jhb
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D28050
Kristof Provost [Mon, 11 Jan 2021 13:09:08 +0000 (14:09 +0100)]
pfctl: Another set skip <group> fix
When retrieving the list of group members we cannot simply use
ifa_lookup(), because it expects the interface to have an IP (v4 or v6)
address. This means that interfaces with no address are not found.
This presents as interfacing being alternately marked as skip and not
whenever the rules are re-loaded.
Happily we only need to fix ifa_grouplookup(). Teach it to also accept
AF_LINK (i.e. interface) node_hosts.
amd64 pmap: do not sleep in pmap_allocpte_alloc() with zero referenced page table page.
Otherwise parallel pmap_allocpte_alloc() for nearby va might also fail
allocating page table page and free the page under us. The end result is
that we could dereference unmapped pte when doing cleanup after sleep.
Instead, on allocation failure, first free everything, only then we can
drop pmap mutex and sleep safely, right before returning to caller.
Split inner non-sleepable part of the pmap_allocpte_alloc() into a new
helper pmap_allocpte_nosleep().
Reviewed by: markj
Reported and tested by: pho
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D27956
amd64 pmap: rename _pmap_allocpte() to pmap_allocpte_alloc().
The function performs actual allocation of pte, as opposed to
pmap_allocpte() that uses existing free pte if pt page is already
there. This also moves function out of namespace similar to a language
reserved.
Reviewed by: markj
Tested by: pho
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D27956
Bump amount of queued packets in for unresolved ARP/NDP entries to 16.
Currently default behaviour is to keep only 1 packet per unresolved entry.
Ability to queue more than one packet was added 10 years ago, in r215207,
though the default value was kep intact.
Things have changed since that time. Systems tend to initiate multiple
connections at once for a variety of reasons.
For example, recent kern/252278 bug report describe happy-eyeball DNS
behaviour sending multiple requests to the DNS server.
The primary driver for upper value for the queue length determination is
memory consumption. Remote actors should not be able to easily exhaust
local memory by sending packets to unresolved arp/ND entries.
For now, bump value to 16 packets, to match Darwin implementation.
The proper approach would be to switch the limit to calculate memory
consumption instead of packet count and limit based on memory.
Mitchell Horne [Mon, 11 Jan 2021 00:30:41 +0000 (20:30 -0400)]
cgem: update 64-bit check
The cgem(4) driver was updated to support 64-bit bus addressing in facdd1cd2045. However, the committed version determines this in an
un-idiomatic way. Change the compile-time conditional to check
BUS_SPACE_MAXADDR, rather than comparing int and pointer sizes.
Robert Watson [Sat, 9 Jan 2021 08:38:11 +0000 (08:38 +0000)]
Changes that improve DTrace FBT reliability on freebsd/arm64:
- Implement a dtrace_getnanouptime(), matching the existing
dtrace_getnanotime(), to avoid DTrace calling out to a potentially
instrumentable function.
(These should probably both be under KDTRACE_HOOKS. Also, it's not clear
to me that they are correct implementations for the DTrace thread time
functions they are used in .. fixes for another commit.)
- Don't allow FBT to instrument functions involved in EL1 exception handling
that are involved in FBT trap processing: handle_el1h_sync() and
do_el1h_sync().
- Don't allow FBT to instrument DDB and KDB functions, as that makes it
rather harder to debug FBT problems.
Prior to these changes, use of FBT on FreeBSD/arm64 rapidly led to kernel
panics due to recursion in DTrace.
Reliable FBT on FreeBSD/arm64 is reliant on another change from @andrew to
have the aarch64 instrumentor more carefully check that instructions it
replaces are against the stack pointer, which can otherwise lead to memory
corruption. That change remains under review.
Roger Pau Monne [Thu, 25 Jun 2020 17:16:04 +0000 (19:16 +0200)]
xen/privcmd: implement the restrict ioctl
Use an interface compatible with the Linux one so that the user-space
libraries already using the Linux interface can be used without much
modifications.
This allows an open privcmd instance to limit against which domains it
can act upon.
Roger Pau Monne [Thu, 25 Jun 2020 16:25:29 +0000 (18:25 +0200)]
xen/privcmd: implement the dm op ioctl
Use an interface compatible with the Linux one so that the user-space
libraries already using the Linux interface can be used without much
modifications.
This allows user-space to make use of the dm_op family of hypercalls,
which are used by device models.
Ryan Libby [Mon, 11 Jan 2021 05:53:15 +0000 (21:53 -0800)]
pf: quiet -Wredundant-decls for pf_get_ruleset_number
In e86bddea9fe62d5093a1942cf21950b3c5ca62e5 sys/netpfil/pf/pf.h grew a
declaration of pf_get_ruleset_number. Now delete the old declaration
from sys/net/pfvar.h.
Ed Maste [Mon, 11 Jan 2021 00:02:56 +0000 (19:02 -0500)]
cmp: fix -s (silent) when used with skip offsets
-s causes cmp to print nothing for differing files, for use when only
the exit status is of interest.
-z compares the file size first, for regular files, and fails the
comparison early if they do not match.
Prior to this change -s implied -z as an optimization, but this is not
valid when file offsets are specified. Now, enable the -z optimization
for -s only if both skip arguments are not provided / 0.
Note that using -z with differing skip values will currently always
fail. We may want to compare size1 - skip1 with size2 - skip2 instaead,
and in any case the man page should be clarified.
PR: 252542
Fixes: 3e6902efc802ab57fc4e9bf798f2d271b152e7f9
Reported by: William Ahern
Reviewed by: markj
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D28071
Mark Johnston [Sun, 10 Jan 2021 22:46:32 +0000 (17:46 -0500)]
libdtrace: Format USDT symbols correctly based on symbol binding
Before we did not handle weak symbols correctly, sometimes resulting in
link errors from dtrace -G when processing object files where functions
with weak aliases contain USDT probes.
Reported by: rlibby
Tested by: rlibby
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
Restore the hwofs functionality temporarily disabled by 7ba6ecf216fb15e8b147db2 to prevent issues with iflib.
This patch brings the necessary changes to iflib to
enable howfs to allow interface restarts without
disrupting netmap applications actively using its
rings.
After this change, it becomes possible for multiple
non-cooperating netmap applications to use non-overlapping
subsets of the available netmap rings without clashing
with each other.
x86 tsc: mark %eax as earlyclobber in tscp_get_timecount_low().
i386 codegen insists on preloading tc_priv into register on i386, and
this register cannot be %eax because RDTSCP instruction clobbers it
before it is used.
Reported and tested by: dim
MFC after: 6 days
Sponsored by: The FreeBSD Foundation
Alan Cox [Sun, 10 Jan 2021 08:51:33 +0000 (02:51 -0600)]
Prefer the use of vm_page_domain() to vm_phys_domain().
When we already have the vm page in hand, use vm_page_domain() instead
of vm_phys_domain(). The former has a trivial constant-time
implementation whereas the latter iterates over the mem_affinity array.
iflib: add assert to prevent out-of-bounds array access
The iflib_queues_alloc() allocates isc_nrxqs iflib_dma_info structs
for each rxqset, and links each struct to a different free list.
As a result, it must be isc_nrxqs >= isc_nfl (plus the completion
queue, if present).
Add an assertion to make this constraint explicit.
Robert Watson [Fri, 1 Jan 2021 13:04:46 +0000 (13:04 +0000)]
Track pipe(2) reads and writes as rusage message receives and sends, a
feature misplaced during the transition from BSD 4.4's socket implementation
to the optimised FreeBSD pipe implementation.
netmap: iflib: enable/disable krings on any interface reinit
Since 1d238b07d5d4d9660ae0e0, krings are disabled before
a reinit cycle triggered by iflib_netmap_register.
However, this operation is actually necessary also for
any interface reinit triggered by other causes (i.e.,
ifconfig commands).
We achieve this goal by moving the krings enable/disable
operation inside iflib_stop() and iflib_init_locked().
Once here, this change also removes some redundant operations
from iflib_netmap_register(), that are already performed by
iflib_stop().
Jamie Gritton [Sun, 10 Jan 2021 05:05:06 +0000 (21:05 -0800)]
jail: Simplify handling of prison_deref()
Track the the current lock/reference state in a single variable,
rather than deducing the proper prison_deref() flags from a
combination of equations and hard-coded values.
Instead of providing ifuncs for each kind of fence, define ifuncs
that combine fence and invocation of RDTSC. This refactoring makes
introduction of RDTSCP use possible.
Reviewed by: gallatin, markj
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D27986
tsc: add RDTSCP or faster variants of get_timecount()
Use it in preference of Xfenced RDTSC if RDTSCP is supported. It is
recommended by both Intel and AMD. But, on AMD Zens and newer use
LFENCE, as recommended by AMD [*]. In particular, this means that now
AMD CPUs use more appropriate fence instead of too harsh MFENCe.
Add comment explaining the intent of the selection logic.
Instead of trying to maintain pg_jobc counter on each process group
update (and sometimes before), just calculate the counter when needed.
Still, for the benefit of the signal delivery code, explicitly mark
orphaned groups as such with the new process group flag.
This way we prevent bugs in the corner cases where updates to the counter
were missed due to complicated configuration of p_pptr/p_opptr/real_parent
(debugger).
Since we need to iterate over all children of the process on exit, this
change mostly affects the process group entry and leave, where we need
to iterate all process group members to detect orpaned status.
Proctree lock is needed for correct calculation and collection of the
job-control related data in kinfo_proc. There was even an XXX comment
about it.
Satisfy locking and lock ordering requirements by taking proctree lock
around pass over each bucket in proc_iterate(), and in sysctl_kern_proc()
and note_procstat_proc() for individual process reporting.
Reviewed by: jilles
Tested by: pho
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D27871
Increase the scope of the process group lock ownership. This ensures that
we are consistent in returning EIO for tty write from an orphan and delivery
of TTYOUT signals.
Reviewed by: jilles
Tested by: pho
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D27871
Often, we have a process locked and need to get locked process group.
In this case, because progress group lock is before process lock,
unlocking process allows the group to be freed. See for instance
tty_wait_background().
Make pgrp structures allocated from nofree zone, and ensure type stability
of the pgrp mutex.
Reviewed by: jilles
Tested by: pho
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D27871