jeff [Mon, 16 Dec 2019 20:15:04 +0000 (20:15 +0000)]
Repeat the spinlock_enter/exit pattern from amd64 on other architectures to
fix an assert violation introduced in r355784. Without this spinlock_exit()
may see owepreempt and switch before reducing the spinlock count. amd64
had been optimized to do a single critical enter/exit regardless of the
number of spinlocks which avoided the problem and this optimization had
not been applied elsewhere.
trasz [Mon, 16 Dec 2019 20:07:04 +0000 (20:07 +0000)]
Add compat.linux.emul_path, so it can be set to something other
than "/compat/linux". Useful when you have several compat directories
with different Linux versions and you don't want to clash with files
installed by linux-c7 packages.
Reviewed by: bcr (manpages)
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D22574
Keyboard drivers are generally registered via linker set. In these cases,
they're also available as kmods which use KPI for registering/unregistering
keyboard drivers outside of the linker set.
For built-in modules, we still fire off MOD_LOAD and maybe even MOD_UNLOAD
if an error occurs, leading to registration via linker set and at MOD_LOAD
time.
This is a minor optimization at best, but it keeps the internal kbd driver
tidy as a future change will merge the linker set driver list into its
internal keyboard_drivers list via SYSINIT and simplify driver lookup by
removing the need to consult the linker set.
mmel [Mon, 16 Dec 2019 14:08:49 +0000 (14:08 +0000)]
Fix LLVM libunwnwind _Unwind_Backtrace symbol version for ARM.
In original GNU libgcc, _Unwind_Backtrace is published with GCC_3.3 version
for all architectures but ARM. For ARM should be publishes with GCC_4.3.0
version. This was originally omitted in r255095, fixed in r318024 and omitted
aging in LLVM libunwind implementation in r354347.
For ARM _Unwind_Backtrace should be published as default with GCC_4.3.0
version , (because this is right original version) and again as
normal(not-default) with GCC_3.3 version (to maintain ABI compatibility
compiled/linked with wrong pre r318024 libgcc)
kevans [Mon, 16 Dec 2019 04:52:06 +0000 (04:52 +0000)]
kbd: patch linker set methods, too
This is needed after r355796. Some double-registration of kbd drivers needs
to be sorted out, then this sysinit will simply add these drivers into the
normal list and kill off any other bits in the driver that are aware of the
linker set, for simplicity.
kevans [Mon, 16 Dec 2019 03:12:53 +0000 (03:12 +0000)]
kbd: remove kbdsw, store pointer to driver in each keyboard_t
The previous implementation relied on a kbdsw array that mirrored the global
keyboards array. This is fine, but also requires extra locking consideration
when accessing to ensure that it's not being resized as new keyboards are
added.
The extra pointer costs little in a struct that there are relatively few of
on any given system, and simplifies locking requirements ever-so-slightly as
we only need to consider the locking requirements of whichever method is
being invoked.
__FreeBSD_version is bumped as any kbd modules will need rebuilt following
this change.
kevans [Mon, 16 Dec 2019 02:44:56 +0000 (02:44 +0000)]
kbd: provide default implementations of get_fkeystr/diag
Most keyboard drivers are using the genkbd implementations as it is;
formally use them for any that aren't set and make
genkbd_get_fkeystr/genkbd_diag private.
kevans [Mon, 16 Dec 2019 02:05:44 +0000 (02:05 +0000)]
keyboard switch definitions: standardize on c99 initializers
A future change will provide default implementations for some of these where
it makes sense and most of them are already using the genkbd
implementation (e.g. get_fkeystr, diag).
kevans [Mon, 16 Dec 2019 01:37:03 +0000 (01:37 +0000)]
kbd drivers: use kbdd_* indirection for diag invocation
These invocations were directly calling enkbd_diag(), rather than
indirection back through kbdd_diag/kbdsw. While they're functionally
equivent, invoking kbdd_diag where feasible (i.e. not in a diag
implementation) makes it easier to visually identify locking needs in these
other drivers.
mjg [Mon, 16 Dec 2019 00:07:51 +0000 (00:07 +0000)]
vfs: allow tail call optimisation in vops in the common case
Most frequently used vops boil down to checking SDT probes, doing the call and
checking again. There is no vop_post/pre in their case but the check after the
call prevents tail call optimisation from taking place. Instead, check once
upfront. Kernels with debug or vops with non-empty vop_post still don't short
circuit.
alc [Sun, 15 Dec 2019 22:41:57 +0000 (22:41 +0000)]
Apply a small optimization to pmap_remove_l3_range(). Specifically, hoist a
PHYS_TO_VM_PAGE() operation that always returns the same vm_page_t out of
the loop. (Since arm64 is configured as VM_PHYSSEG_SPARSE, the
implementation of PHYS_TO_VM_PAGE() is more costly than that of
VM_PHYSSEG_DENSE platforms, like amd64.)
tsoome [Sun, 15 Dec 2019 21:52:40 +0000 (21:52 +0000)]
loader: rewrite zfs vdev initialization
In some cases the pool discovery will get stuck in infinite loop while setting
up the vdev children.
To fix, we split the vdev setup into two parts, first we create vdevs based on
configuration we do get from pool label, then, we process pool config from MOS
and update the pool config if needed.
Testing done: confirm previously hung loader is not hung any more.
jeff [Sun, 15 Dec 2019 21:26:50 +0000 (21:26 +0000)]
schedlock 4/4
Don't hold the scheduler lock while doing context switches. Instead we
unlock after selecting the new thread and switch within a spinlock
section leaving interrupts and preemption disabled to prevent local
concurrency. This means that mi_switch() is entered with the thread
locked but returns without. This dramatically simplifies scheduler
locking because we will not hold the schedlock while spinning on
blocked lock in switch.
This change has not been made to 4BSD but in principle it would be
more straightforward.
jhibbits [Sun, 15 Dec 2019 21:20:18 +0000 (21:20 +0000)]
powerpc/powernv: Set the PTCR for the Nest MMU
The Nest MMU manages address translation for accelerators on the POWER9. To
do so, it needs a page table, so export the system page table to the Nest
MMU. This will quietly fail on pre-POWER9 systems that do not have a NMMU.
The NMMU is currently unused, so this change is currently effectively a NOP,
but the NMMU and VAS will eventually be used.
jeff [Sun, 15 Dec 2019 21:19:41 +0000 (21:19 +0000)]
schedlock 3/4
Eliminate lock recursion from turnstiles. This was simply used to avoid
tracking the top-level turnstile lock. explicitly check for it before
picking up and dropping locks.
ian [Sun, 15 Dec 2019 21:16:35 +0000 (21:16 +0000)]
Rewrite arm kernel stack unwind code to work when unwinding through modules.
The arm kernel stack unwinder has apparently never been able to unwind when
the path of execution leads through a kernel module. There was code that
tried to handle modules by looking for the unwind data in them, but it did
so by trying to find symbols which have never existed in arm kernel
modules. That caused the unwind code to panic, and because part of panic
handling calls into the unwind code, that just created a recursion loop.
Locating the unwind data in a loaded module requires accessing the Elf
section headers to find the SHT_ARM_EXIDX section. For preloaded modules
those headers are present in a metadata blob. For dynamically loaded
modules, the headers are present only while the loading is in progress; the
memory is freed once the module is ready to use. For that reason, there is
new code in kern/link_elf.c, wrapped in #ifdef __arm__, to extract the
unwind info while the headers are loaded. The values are saved into new
fields in the linker_file structure which are also conditional on __arm__.
In arm/unwind.c there is new code to locally cache the per-module info
needed to find the unwind tables. The local cache is crafted for lockless
read access, because the unwind code often needs to run in context where
sleeping is not allowed. A large comment block describes the local cache
list, so I won't repeat it all here.
jeff [Sun, 15 Dec 2019 21:11:15 +0000 (21:11 +0000)]
schedlock 1/4
Eliminate recursion from most thread_lock consumers. Return from
sched_add() without the thread_lock held. This eliminates unnecessary
atomics and lock word loads as well as reducing the hold time for
scheduler locks. This will eventually allow for lockless remote adds.
ian [Sun, 15 Dec 2019 18:05:18 +0000 (18:05 +0000)]
Support --all-repeats in uniq(1) for compatibility with gnu coreutils.
This adds a new -D/--all-repeats option to uniq(1), which outputs each copy
of any repeated lines (as opposed to a single copy of a repeated line). You
can specify a separator option to output a blank line before or after each
group of repeated lines. This adds compatibility with the GNU coreutils
version of uniq(1).
This change also re-groups the -c, -d, -D, -u options in the usage display
and man page to indicate that they are mutally exclusive of each other. This
matches the posix/opengroup definition of uniq(1) command line args. Note
that this change does NOT actually enforce the mutual exclusion in the code,
for now, it simply documents that the arguments should be considered
exclusive with each other.
cem [Sun, 15 Dec 2019 17:33:26 +0000 (17:33 +0000)]
Revert r355760, r355759
And remove the inline/deprecated attribute use entirely in stdlib.h, from
r355747. The intent was to provide a buildable API transitionary period, but
clearly that was counter-productive.
mmel [Sun, 15 Dec 2019 14:28:38 +0000 (14:28 +0000)]
Properly synchronize completion DMA buffers.
Within command completion processing the callback function may access
DMAed data buffer. Synchronize it before use, not after.
This allows to use NVMe disk on non-DMA coherent arm64 system.
kevans [Sun, 15 Dec 2019 04:22:50 +0000 (04:22 +0000)]
kbd: drop _KERNEL #ifdef in kbdreg.h
This #ifdef is misleading as there are actually no user-serviceable parts
inside and, as far as I can tell, there is no pollution leading from
userland to this header. Furthermore, it becomes a slight nuisance when
attempting to move things around in this header.
jeff [Sun, 15 Dec 2019 04:08:24 +0000 (04:08 +0000)]
Previously we did not support invalid pages in default objects. This means
that if fault fails to progress and needs to restart the loop it must free
the page it is working on and allocate again on restart. Resolve the few
places that need to be modified to support this condition and simply
deactivate the page. Presently, we only permit this when fault restarts
for busy contention. This has an added benefit of removing some object
trylocking in this case.
While here consolidate some page cleanup logic into fault_page_free() and
fault_page_release() to reduce redundant code and automate some teardown.
jeff [Sun, 15 Dec 2019 03:15:06 +0000 (03:15 +0000)]
Add a deferred free mechanism for freeing swap space that does not require
an exclusive object lock.
Previously swap space was freed on a best effort basis when a page that
had valid swap was dirtied, thus invalidating the swap copy. This may be
done inconsistently and requires the object lock which is not always
convenient.
Instead, track when swap space is present. The first dirty is responsible
for deleting space or setting PGA_SWAP_FREE which will trigger background
scans to free the swap space.
Simplify the locking in vm_fault_dirty() now that we can reliably identify
the first dirty.
jeff [Sun, 15 Dec 2019 02:02:27 +0000 (02:02 +0000)]
Slightly optimize locking in vm_map_copy_swap_entry(). Anonymous objects
require the object lock to synchronize collapse. Other swap objects such
as tmpfs do not.
jeff [Sun, 15 Dec 2019 02:00:32 +0000 (02:00 +0000)]
Handle pagein clustering in vm_page_grab_valid() so that it can be used by
exec_map_first_page(). This will also enable pagein clustering for other
interested consumers (tmpfs, md, etc).
pfg [Sun, 15 Dec 2019 01:56:56 +0000 (01:56 +0000)]
cdefs: use more accurate GCC version for the deprecated attribute.
The message argument in the "deprecated" attribute was introduced in GCC 4.5 *.
Use the accurate version number for consistency, as done already with other
attributes.
cem [Sat, 14 Dec 2019 21:52:49 +0000 (21:52 +0000)]
cdefs: Add __deprecated(message) function attribute macro
The legacy version of GCC4 currently in base does not support the
parameterized form of this function attribute, as recent introduced in
stdlib.h (r355747).
As we have done for other function attributes with similar compatibility
problems, add a version-compatibile definition in sys/cdefs.h. Note that
Clang defines itself to be GCC 4, so one must check for __clang__ in
addition to __GNUC__ version. On legacy GCC 4, the macro expands to just
the __deprecated__ attribute; on modern GCC or Clang, the macro expands to
the parameterized variant with the message.
Ignoring legacy or unsupported compilers, the macro is also beneficial in
that it is a bit more ergonomic than the full
__attribute__((__deprecated__())) boilerplate.
Reported by: CI (but not tinderbox); imp and others
Reviewed by: imp
Differential Revision: https://reviews.freebsd.org/D22817
rmacklem [Sat, 14 Dec 2019 21:49:47 +0000 (21:49 +0000)]
Update the mount_nfs.8 man page to include NFSv4.2.
r355677 added NFSv4.2 support to the NFS client. This patch updates the
mount_nfs.8 man page to reflect that.
It also clarifies that the "nolockd" option does not apply to NFSv4 mounts.
dougm [Sat, 14 Dec 2019 19:44:42 +0000 (19:44 +0000)]
Simplify the processing a leaf mask to find big-enough ranges of set
bits, by storing and modifying the complement of the original leaf
mask, and by avoiding some unnecessary intermediate variables in
computing the shift amounts. The logic is similar to what has recently
been committed to sys/sys/bitstring.h.
Compute better hint updates for the case when the cursor starts in
mid-leaf, and eliminates some otherwise viable solutions. Assume the
worst case, that all the eliminated offsets could have been solutions,
and you can still compute a better hint than we use now.
Eliminate some unnecessary conditional control flow.
mmel [Sat, 14 Dec 2019 14:56:34 +0000 (14:56 +0000)]
Add driver for Rockchip PCIe root complex found in RK3399 SOC.
Unfortunately, there are some limitations:
- memory aperture of his controller is only 16MiB, so it is nearly
unusable for graphic cards
- every attempt to generate type 1 config cycle always causes trap.
These config cycles are disabled now and we don't support cards
with PCIe switch.
- in some cases, attempt to do config cycle to (probably) not-yet ready
card also causes trap. This cannot be detected at runtime, but it seems
like very rare issue.
cem [Sat, 14 Dec 2019 08:28:10 +0000 (08:28 +0000)]
Deprecate sranddev(3) API
It serves no useful purpose and wasn't as popular as its equally meritless
cousin, srandomdev(3).
Setting aside the problems with rand(3) in general, the problem with this
interface is that the seed isn't shared with the caller (other than by
attacking the output of the generator, which is trivial, but not a hallmark of
pleasant API design). The (arguable) utility of rand(3) or random(3) is as a
semi-fast simulation generator which produces consistent results from a given
seed. These are mutually at odd. Furthermore, sometimes people got the
mistaken impression that a high quality random seed meant a weak generator like
rand(3) or random(3) could be used for things like cryptographic key
generation. This is absolutely not so.
The API was never part of a standard and was not widely used in tree. Existing
in-tree uses have all been removed.
rlibby [Sat, 14 Dec 2019 05:21:56 +0000 (05:21 +0000)]
uma dbg: flexible size for slab debug bitset too
Recently (r355315) the size of the struct uma_slab bitset field us_free
became dynamic instead of conservative. Now, make the debug bitset
size dynamic too. The debug bitset is INVARIANTS-only, so in fact we
don't care too much about the space savings that results from this, but
enabling minimally-sized slabs on INVARIANTS builds is still important
in order to be able to test new slab layouts effectively.
scottl [Fri, 13 Dec 2019 23:46:59 +0000 (23:46 +0000)]
Add accessors for the Vendor Specific Extended Capability (VSEC)
Parse out the VSEC. If the user invokes a second -c command line option,
do a hex dump of the vendor data.
imp [Fri, 13 Dec 2019 22:32:05 +0000 (22:32 +0000)]
Better copyright advice
Document the common practices around copyrights with "all rights reserved" in
them as new copyright notices get added.
It's an open question qhether to point people at the fact that since the Berne
convention was ratified, All rights reserved is largely obsolete.
https://en.wikipedia.org/wiki/All_rights_reserved#Obsolescence has the
details. The committer's guide will be revised shortly, and it's likely that's a
better place for this discussion. If not, I'll add a blurb here.
avg [Fri, 13 Dec 2019 22:04:13 +0000 (22:04 +0000)]
zfs boot: fix a crash in a rarely taken path in fzap_lookup
Instead of passing NULL to fzap_name_equal and crashing, just return
ENOENT. This happened when higher bits of a hash of the searched key
(its hash prefix) matched a hash prefix of some key in the ZAP, but the
full hash value of the searched key did not match any key in the ZAP.
I observerved this problem when loader tried to look up
"features_for_read" in a particular old pool that predates pool
features.
np [Fri, 13 Dec 2019 20:38:58 +0000 (20:38 +0000)]
cxgbe(4): Use the _XT variant of the CPL used to transmit NIC traffic.
CPL_TX_PKT_XT disables the internal parser on the chip and instead
relies on the driver to provide the exact length of the L2 and L3
headers. This allows hw checksumming and TSO to be used with L2 and
L3 encapsulations that the chip doesn't understand directly.
Note that netmap tx still uses the old CPL as it never uses the hw
to generate the checksum on tx.
bdragon [Fri, 13 Dec 2019 20:30:26 +0000 (20:30 +0000)]
[PowerPC] Fully define gdtoa settings on powerpc64.
The settings in arith.h were not fully defined on powerpc64 after the gdtoa
switchover. Generate them using arithchk.c, similar to what AMD64 did for
r114814.
Technically, none of this is necessary in FreeBSD gdtoa, but since the other
platforms have full definitions, we might as well have full definitions
too.
Approved by: jhibbits (in irc)
Differential Revision: https://reviews.freebsd.org/D22775
imp [Fri, 13 Dec 2019 19:39:33 +0000 (19:39 +0000)]
Create new wrapper function: bus_delayed_attach_children()
Delay the attachment of children, when requested, until after interrutps are
running. This is often needed to allow children to run transactions on i2c or
spi busses. It's a common enough idiom that it will be useful to have its own
wrapper.
Reviewed by: ian
Differential Revision: https://reviews.freebsd.org/D21465
jhb [Fri, 13 Dec 2019 19:21:58 +0000 (19:21 +0000)]
Support software breakpoints in the debug server on Intel CPUs.
- Allow the userland hypervisor to intercept breakpoint exceptions
(BP#) in the guest. A new capability (VM_CAP_BPT_EXIT) is used to
enable this feature. These exceptions are reported to userland via
a new VM_EXITCODE_BPT that includes the length of the original
breakpoint instruction. If userland wishes to pass the exception
through to the guest, it must be explicitly re-injected via
vm_inject_exception().
- Export VMCS_ENTRY_INST_LENGTH as a VM_REG_GUEST_ENTRY_INST_LENGTH
pseudo-register. Injecting a BP# on Intel requires setting this to
the length of the breakpoint instruction. AMD SVM currently ignores
writes to this register (but reports success) and fails to read it.
- Rework the per-vCPU state tracked by the debug server. Rather than
a single 'stepping_vcpu' global, add a structure for each vCPU that
tracks state about that vCPU ('stepping', 'stepped', and
'hit_swbreak'). A global 'stopped_vcpu' tracks which vCPU is
currently reporting an event. Event handlers for MTRAP and
breakpoint exits loop until the associated event is reported to the
debugger.
Breakpoint events are discarded if the breakpoint is not present
when a vCPU resumes in the breakpoint handler to retry submitting
the breakpoint event.
- Maintain a linked-list of active breakpoints in response to the GDB
'Z0' and 'z0' packets.
imp [Fri, 13 Dec 2019 18:35:48 +0000 (18:35 +0000)]
Move to using bool instead of boolean_t
While there are subtle semantic differences between bool and boolean_t, none of
them matter in these cases. Prefer true/false when dealing with bool
type. Preserve a couple of TRUEs since they are passed into int args into CAM.
Preserve a couple of FALSEs when used for status.done, an int.
markj [Fri, 13 Dec 2019 18:28:01 +0000 (18:28 +0000)]
Restore the reservation of boot pages for bucket zones after r355707.
uma_startup2() sets booted = BOOT_BUCKETS after calling bucket_init(),
but before that assignment, startup_alloc() will use pages from the
reserved pool, so the bucket zones themselves are still allocated using
startup pages.
bdragon [Fri, 13 Dec 2019 18:18:14 +0000 (18:18 +0000)]
[PowerPC] Enable TLS usage in system libraries on ELFv2.
Currently, __NO_TLS is defined to 1 on powerpc64. TLS usage works much
better on ELFv2 due to the modern tooling, so take the opportunity to
reenable TLS on ELFv2.
If you are using a self-built ELFv2 environment on powerpc64, you will
have to run installworld twice due to RuneLocale changes. This is the only
known regression, and if you are using the ELFv2 isos, you likely already
have the updated libraries installed, as this change is part of the
patchset that the isos integrate.
(No UPDATING note about this because ELFv2 is still an unofficial build.)
tsoome [Fri, 13 Dec 2019 12:36:16 +0000 (12:36 +0000)]
loader: cd9660_open() warn: is 'buf' large enough for 'struct iso_primary_descriptor'?
We do allocate amount of memory (void * or char *), and then assign this
buffer to struct iso_primary_descriptor *vd. Make sure we do
allocate enough bytes.
In fact we do allocate enough, but it is good idea to make sure this really
is so.
ae [Fri, 13 Dec 2019 11:47:58 +0000 (11:47 +0000)]
Make TCP options parsing stricter.
Rework tcpopts_parse() to be more strict. Use const pointer. Add length
checks for specific TCP options. The main purpose of the change is
avoiding of possible out of mbuf's data access.
rlibby [Fri, 13 Dec 2019 10:34:19 +0000 (10:34 +0000)]
libmemstat: unbreak build
r355706 added an instance of offsetof() to the UMA private kernel header
file uma_int.h. Userspace memstat_uma.c includes that header, and
chokes on offsetof() because apparently the definition in sys/types.h is
ifdef _KERNEL. Now, include sys/stddef.h which has an identical
definition.
Pointyhat to: rlibby
Sponsored by: Dell EMC Isilon
rlibby [Fri, 13 Dec 2019 09:32:16 +0000 (09:32 +0000)]
bitset: rename confusing macro NAND to ANDNOT
s/BIT_NAND/BIT_ANDNOT/, and for CPU and DOMAINSET too. The actual
implementation is "and not" (or "but not"), i.e. A but not B.
Fortunately this does appear to be what all existing callers want.
Don't supply a NAND (not (A and B)) operation at this time.
Discussed with: jeff
Reviewed by: cem
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D22791
rlibby [Fri, 13 Dec 2019 09:32:03 +0000 (09:32 +0000)]
uma: delay bucket_init() until we might actually enable buckets
This helps with a bootstrapping problem in upcoming work.
We don't first enable buckets until uma_startup2(), so we can delay
bucket creation until then. The other two paths to bucket_enable() are
both later, one in the pageout daemon (SI_SUB_KTHREAD_PAGE vs SI_SUB_VM)
and one in uma_timeout() (first activated in uma_startup3()). Note that
although some bucket functions are accessible before uma_startup2()
(e.g. bucket_select() in zone_ctor()), none of them inspect ubz_zone.
rlibby [Fri, 13 Dec 2019 09:31:59 +0000 (09:31 +0000)]
uma dbg: flexible size for slab debug bitset too
Recently (r355315) the size of the struct uma_slab bitset field us_free
became dynamic instead of conservative. Now, make the debug bitset
size dynamic too. The debug bitset is INVARIANTS-only, so in fact we
don't care too much about the space savings that results from this, but
enabling minimally-sized slabs on INVARIANTS builds is still important
in order to be able to test new slab layouts effectively.
trasz [Fri, 13 Dec 2019 09:28:44 +0000 (09:28 +0000)]
Add kern.geom.part.separator tunable. This makes it possible
to specify an optional separator to insert before partition name;
eg if it's set to "c/", you'll get "ada0c/s1" instead of "ada0s1".
(It cannot be set to just “/“, since ada0 is a device node, not
a directory.)
cem [Fri, 13 Dec 2019 05:42:57 +0000 (05:42 +0000)]
libtelnet: Replace bogus use of srandomdev + random to generate "public key pair"
I'm pretty skeptical that any crypto in telnet is worth using, but if we're
ostensibly generating keys, arc4random is strictly better than the previous
construct.