r336019 introduced ${SRCTOP}/sys to the include paths in order to pull in a
new sys/{c,}nv.h. This is wrong, because the build tree's ABI isn't
guaranteed to match what's running on the host system.
Fix instead by removing -I${SRCTOP}/sys and installing the libnv headers
with `make -C lib/libnv includes`... this may or may not get re-worked in
the future so that a userland lib isn't installing includes from sys/.
Matt Macy [Fri, 6 Jul 2018 10:10:00 +0000 (10:10 +0000)]
counter(9): unbreak amd64 following r336020
Apply temporary fix to counter until daylight hours.
The fact that the assembly for counter_u64_add relied on the sizeof(struct pcpu) was
the basis for the otherwise arbitrary offset never came up in D15933.
critical_{enter,exit} is now inline so the only real added overhead is the
added (mostly false) conditional branch in exit.
Matt Macy [Fri, 6 Jul 2018 02:06:03 +0000 (02:06 +0000)]
Back pcpu zone with domain correct pages
- Change pcpu zone consumers to use a stride size of PAGE_SIZE.
(defined as UMA_PCPU_ALLOC_SIZE to make future identification easier)
- Allocate page from the correct domain for a given cpu.
- Don't initialize pc_domain to non-zero value if NUMA is not defined
There are some misconceptions surrounding this field. It is the
_VM_ NUMA domain and should only ever correspond to valid domain
values as understood by the VM.
The former slab size of sizeof(struct pcpu) was somewhat arbitrary.
The new value is PAGE_SIZE because that's the smallest granularity
which the VM can allocate a slab for a given domain. If you have
fewer than PAGE_SIZE/8 counters on your system there will be some
memory wasted, but this is obviously something where you want the
cache line to be coming from the correct domain.
Reviewed by: jeff
Sponsored by: Limelight Networks
Differential Revision: https://reviews.freebsd.org/D15933
config(8): De-dupe hint/env vars within a single file
r335653 flipped the order in which hints/env files are concatenated to match
the order in which vars are processed by the kernel. This is the other
hammer to drop.
Use nv(9) to de-dupe entries within a single `hint` or `env` file, using the
latest value specified for a key. This leaves some duplicates if a variable
is specified in multiple hint/env files or via `envvar` in a kernel config,
but the reversed order of concatenation (from r335653) makes this a
non-issue as the latest-specified version will be seen first.
This change also silently rewrote hint bits to use the same sanitization
process that ian@ wrote for r335642. To the kernel, hints and env vars are
basically the same thing through early boot, then get merged into the
dynamic environment once kmem becomes available and the dynamic environment
is created. They should be subjected to the same restrictions.
libnv has been added to -legacy for the time being to support the build of
config(8) with the new cnvlist API.
Sean Eric Fagan [Thu, 5 Jul 2018 22:56:13 +0000 (22:56 +0000)]
This exposes ZFS user and group quotas via the normal
quatactl(2) mechanism. (Read-only at this point, however.)
In particular, this is to allow rpc.rquotad query quotas
for NFS mounts, allowing users to see their quotas on the
hosts using the datasets.
The changes specifically:
* Add new RPC entry points for querying quotas.
* Changes the library routines to allow non-UFS quotas.
* Changes rquotad to check for quotas on mounted filesystems,
rather than being limited to entries in /etc/fstab
* Lastly, adds a VFS entry-point for ZFS to query quotas.
Note that this makes one unavoidable behavioural change: if quotas
are enabled, then they can be queried, as opposed to the current
method of checking for quotas being specified in fstab. (With
ZFS, if there are user or group quotas, they're used, always.)
config(8): De-dupe hint/env vars within a single file
r335653 flipped the order in which hints/env files are concatenated to match
the order in which vars are processed by the kernel. This is the other
hammer to drop.
Use nv(9) to de-dupe entries within a single `hint` or `env` file, using the
latest value specified for a key. This leaves some duplicates if a variable
is specified in multiple hint/env files or via `envvar` in a kernel config,
but the reversed order of concatenation (from r335653) makes this a
non-issue as the latest-specified version will be seen first.
This change also silently rewrote hint bits to use the same sanitization
process that ian@ wrote for r335642. To the kernel, hints and env vars are
basically the same thing through early boot, then get merged into the
dynamic environment once kmem becomes available and the dynamic environment
is created. They should be subjected to the same restrictions.
It is possible that a fictitious unmanaged userspace mapping of
superpage is created on x86, e.g. by pmap_object_init_pt(), with the
physical address outside the vm_page_array[] coverage.
Noted and reviewed by: alc
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D16085
Andrew Turner [Thu, 5 Jul 2018 17:13:37 +0000 (17:13 +0000)]
Create a new macro for static DPCPU data.
On arm64 (and possible other architectures) we are unable to use static
DPCPU data in kernel modules. This is because the compiler will generate
PC-relative accesses, however the runtime-linker expects to be able to
relocate these.
In preparation to fix this create two macros depending on if the data is
global or static.
Sean Bruno [Thu, 5 Jul 2018 17:07:23 +0000 (17:07 +0000)]
Make ZSTD a real option via ZSTDIO.
It looks like the intent was to allow ZSTD support to be
compiled into the kernel with options ZSTDIO. But it doesn't look
like that was ever implemented or I'm missing how to do it.
I did a cursory audit of kernel config files and made a decision to
enable ZSTDIO in riscv GENERIC and mips MALTA configurations. All other
kernel configurations already had this option in their kernel configs
but they didn't do anything useful as the feature was declared as
"standard" prior to this.
Reviewed by: cem allanjude
Differential Revision: https://reviews.freebsd.org/D16007
Copyout(9) on 4/4 i386 needs correct vm_page_array[].
On the 4/4 i386, copyout(9) may need to call pmap_extract_and_hold()
on arbitrary userspace mapping. If the mapping is backed by the
non-managed cdev pager or by the sg pager, on dense configs we might
access arbitrary element of vm_page_array[], in particular, not
corresponding to a page from the memory segment. Initialize such pages
as fictitious with the corresponding physical address.
Reported by: bde
Reviewed by: alc, markj (previous version)
Sponsored by: The FreeBSD Foundation
Differential revision: https://reviews.freebsd.org/D16085
In x86 pmap_extract_and_hold(), there is no need to recalculate the
physical address, which is readily available after sucessfull
vm_page_pa_tryrelock().
Noted and reviewed by: alc
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D16085
kern_environment: use any provided environments, evict hintmode/envmode
At the moment, hintmode and envmode are used to indicate whether static
hints or static env have been provided in the kernel config(5) and the
static versions are mutually exclusive with loader(8)-provided environment.
hintmode *can* be reconfigured later to pull from the dynamic environment,
thus taking advantage of the loader(8) or post-kmem environment setting.
This changeset fixes both problems at once to move us from a semi-confusing
state to a consistent state: if an environment file, hints file, or
loader(8) environment are provided, we use them in a well-known order of
precedence:
Once the dynamic environment is setup this becomes a moot point. The
loader(8) and static environments are merged (respecting the above order of
precedence), and the static hints are merged in on an as-needed basis after
the dynamic environment has been setup.
Hints lookup are changed to respect all of the above. Before the dynamic
environment is setup, lookups use the above-mentioned order and fallback to
the next environment if a matching hint is not found. Once the dynamic
environment is setup, that is used on its own since it captures all of the
above information plus any dynamic kenv settings that came up later in boot.
The following tangentially related changes were made to res_find:
- A hintp cookie is now passed in so that related searches continue using
the chain of environments (or dynamic environment) without relying on
global state
- All three environments will be searched if they actually have valid hints
to use, rather than just choosing the first environment that actually had
a hint and rolling with that only
The hintmode sysctl has been ripped out. static_{env,hints}.disabled are
still honored and will disable their respective environments from being used
for hint lookups and from being merged into the dynamic environment, as
expected.
In x86 pmap_extract_and_hold(), there is no need to recalculate the
physical address, which is readily available after sucessfull
vm_page_pa_tryrelock().
Noted and reviewed by: alc
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D16085
kern_environment: use any provided environments, evict hintmode/envmode
At the moment, hintmode and envmode are used to indicate whether static
hints or static env have been provided in the kernel config(5) and the
static versions are mutually exclusive with loader(8)-provided environment.
hintmode *can* be reconfigured later to pull from the dynamic environment,
thus taking advantage of the loader(8) or post-kmem environment setting.
This changeset fixes both problems at once to move us from a semi-confusing
state to a consistent state: if an environment file, hints file, or
loader(8) environment are provided, we use them in a well-known order of
precedence:
Once the dynamic environment is setup this becomes a moot point. The
loader(8) and static environments are merged (respecting the above order of
precedence), and the static hints are merged in on an as-needed basis after
the dynamic environment has been setup.
Hints lookup are changed to respect all of the above. Before the dynamic
environment is setup, lookups use the above-mentioned order and fallback to
the next environment if a matching hint is not found. Once the dynamic
environment is setup, that is used on its own since it captures all of the
above information plus any dynamic kenv settings that came up later in boot.
The following tangentially related changes were made to res_find:
- A hintp cookie is now passed in so that related searches continue using
the chain of environments (or dynamic environment) without relying on
global state
- All three environments will be searched if they actually have valid hints
to use, rather than just choosing the first environment that actually had
a hint and rolling with that only
The hintmode sysctl has been ripped out. static_{env,hints}.disabled are
still honored and will disable their respective environments from being used
for hint lookups and from being merged into the dynamic environment, as
expected.
With the introduction of reapers and reaplists in r275800,
proc0 and init are setup as a circular dependency.
create_init() calls fork1() which calls do_fork(). There the
newproc (initproc) is setup with a reaper of proc0 who's reaper
points to itself. The newproc (initproc) is then put on its
reaper's (proc0) p_reaplist (initproc is a descendants of proc0
for proc0 to reap). Upon return to create_init(), proc0 is
added to initproc's p_reaplist (which would mean proc0 is a
descendant of init, for init to reap). This creates a
circular dependency which eventually leads to LIST corruptions
when trying to kill init and a proc0.
For the base system we never really hit this case during reboot.
The problem only became visible after adding more virtual process
spaces which could go away cleanly (work existing in an experimental
branch).
Reviewed by: kib
Sponsored by: iXsystems, Inc.
Differential Revision: https://reviews.freebsd.org/D15924
Ian Lepore [Thu, 5 Jul 2018 16:00:58 +0000 (16:00 +0000)]
Detach all children before beginning to tear down the hardware, instead of
doing it last. Also, remove the local tracking of whether usb's busdma
memory allocation got done, because it's safe to call the free_all
function even if it wasn't.
Ian Lepore [Thu, 5 Jul 2018 15:34:16 +0000 (15:34 +0000)]
Remove a test and early-out which just can't possibly be right. It causes
detach() to do nothing if attach() succeeded, which is the opposite of
what's needed. Also, move device_delete_children() from the end to the
beginning of detach(), so that children won't be trying to make use of the
hardware we're in the process of shutting down.
Ian Lepore [Thu, 5 Jul 2018 14:09:48 +0000 (14:09 +0000)]
Fix an out-of-bounds array access... the irq data for teardown is in two
arrays, as elements 0 and 1 of one array and elements 1 and 2 of the other.
Run the loop 0..1 instead of 1..2 and use named constants to offset into
one of the arrays.
Brooks Davis [Thu, 5 Jul 2018 13:13:48 +0000 (13:13 +0000)]
Make struct xinpcb and friends word-size independent.
Replace size_t members with ksize_t (uint64_t) and pointer members
(never used as pointers in userspace, but instead as unique
idenitifiers) with kvaddr_t (uint64_t). This makes the structs
identical between 32-bit and 64-bit ABIs.
On 64-bit bit systems, the ABI is maintained. On 32-bit systems,
this is an ABI breaking change. The ABI of most of these structs
was previously broken in r315662. This also imposes a small API
change on userspace consumers who must handle kernel pointers
becoming virtual addresses.
The initial work on bhyve NVMe device emulation was done by the GSoC student
Shunsuke Mie and was heavily modified in performan, functionality and
guest support by Leon Dang.
Tested with guest OS: FreeBSD Head, Linux Fedora fc27, Ubuntu 18.04,
OpenSuse 15.0, Windows Server 2016 Datacenter.
Tested with all accepted device paths: Real nvme, zdev and also with ram.
Tested on: AMD Ryzen Threadripper 1950X 16-Core Processor and
Intel(R) Xeon(R) CPU E5-2609 v2 @ 2.50GHz.
Alan Cox [Thu, 5 Jul 2018 02:04:18 +0000 (02:04 +0000)]
As of r335784, if pmap_enter() replaces a managed mapping by an unmanaged
mapping, then it leaks the unlinked PV entry. This change eliminates that
leak, freeing the PV entry.
Rick Macklem [Wed, 4 Jul 2018 19:46:26 +0000 (19:46 +0000)]
Fix the pNFS server so that it handles the "#mds_path" check for mirrors.
The recently added feature of the pNFS server will set an fsid for the
MDS file system to define the file system a DS should store files for.
For a case where a DS handling all file systems has failed, it was possible
for the code to check for a mirror with a specified fs, even though
nfsdev_mdsisset was 0, possibly causing a false successful check for a mirror.
This patch adds a check for nfsdev_mdsisset != 0 to avoid this.
It only affects the pNFS server for a rare case. Found via code inspection.
Sean Bruno [Wed, 4 Jul 2018 17:10:07 +0000 (17:10 +0000)]
fix locking within tcp_ipsec_pcbctl() to match ipsec4_pcbctl(), ipsec4_pcbctl()
IPSEC_PCBCTL() functions, which include tcp_ipsec_pcbctl(),
ipsec4_pcbctl(), and ipsec6_pcbctl(), should all have matching locking
semantics.
ipsec4_pcbctl() and ipsec6_pcbctl() expect the inp to be unlocked on
entry and exit and appear to be correctly implemented as such. But
tcp_ipsec_pcbctl() had other semantics. This patch fixes the semantics
for tcp_ipsec_pcbctl().
Andrew Gallatin [Wed, 4 Jul 2018 14:25:38 +0000 (14:25 +0000)]
mxge: fix panic at module unload
r333175 (multicast changes) exposed a bug where
mxge was not checking to see if the driver was being
unloaded while handing ioctls that touch hardware.
As a result, now that in6m_disconnect() is run from
an async gtaskq, it was busy-waiting in mxge_send_cmd()
while the mcast list was destroyed.
Some applications, notably PostgreSQL, want to call setproctitle()
very often. It's slow. Provide an alternative cheap way of updating
process titles without making any syscalls, instead requiring other
processes (top, ps etc) to do a bit more work to retrieve the data.
This uses a pre-existing code path inherited from ancient BSD, which
always did it that way.
Add a way for the process to request cleanup of the kernel cache of
the process arguments. New arguments length zero causes the drop of
the pargs instead of allocation of useless zero-length buffer.
remove unneeded inclusion of sys/interrupt.h from several files
It's likely that the header was needed in the past for swi(9).
But now that code does not use swi(9) or any other interfaces defined
in sys/interrupt.h.
loader: fdt: Try to load every possible DTB from u-boot
U-Boot setup a few variables :
- fdt_addr which is the board static dtb (most of the time loaded before
u-boot or coming from some hardware like a ROM)
- fdt_addr_r which is a location in RAM that holds the DTB loaded by
u-boot or before u-boot
In the case of u-boot + rpi firmware the DTB is loaded in RAM but the location
still end up in the fdt_addr variable and the fdt_addr_r variable exist.
Change the behavior so we test that a DTB exists for every possible variable :
- fdt_addr_r is checked first as if u-boot needed to modify it the
correct DTB will live there.
- fdt_addr is checked second as if we run on a hardware with DTB in ROM
it means that we what/need to run that
- fdtaddr looks like a FreeBSD-ism but since I'm not sure leave it.
Hiroki Sato [Wed, 4 Jul 2018 06:47:34 +0000 (06:47 +0000)]
- Fix a double unlock in inp_block_unblock_source() and
lock leakage in inp_leave_group() which caused a panic.
- Make order of CTR1() and IN_MULTI_LIST_LOCK() consistent
around inm_merge().
Will Andrews [Wed, 4 Jul 2018 03:36:46 +0000 (03:36 +0000)]
Revert r335833.
Several third-parties use at least some of these ioctls. While it would be
better for regression testing if they were used in base (or at least in the
test suite), it's currently not worth the trouble to push through removal.
muge(4): add DTB blob as one more possible source of MAC address
On FDT-enabled platforms check if DTB blob has MAC address configured by
a boot loader. This information passed as a "local-mac-address" or
"mac-address" property of the device node. For USB NICs node
can be found by looking for compatibility string "usbVVV,PPP" where
VVV - vendor id (hex) and PPP - product id (hex)
Matt Macy [Wed, 4 Jul 2018 02:47:16 +0000 (02:47 +0000)]
epoch(9): allow preemptible epochs to compose
- Add tracker argument to preemptible epochs
- Inline epoch read path in kernel and tied modules
- Change in_epoch to take an epoch as argument
- Simplify tfb_tcp_do_segment to not take a ti_locked argument,
there's no longer any benefit to dropping the pcbinfo lock
and trying to do so just adds an error prone branchfest to
these functions
- Remove cases of same function recursion on the epoch as
recursing is no longer free.
- Remove the the TAILQ_ENTRY and epoch_section from struct
thread as the tracker field is now stack or heap allocated
as appropriate.
Tested by: pho and Limelight Networks
Reviewed by: kbowling at llnw dot com
Sponsored by: Limelight Networks
Differential Revision: https://reviews.freebsd.org/D16066
Allow jail names (not just IDs) to be specified for: cpuset(1), ipfw(8),
sockstat(1), ugidfw(8)
These are the last of the jail-aware userland utilities that didn't work
with names.
PR: 229266
MFC after: 3 days
Differential Revision: D16047
John Baldwin [Tue, 3 Jul 2018 22:03:28 +0000 (22:03 +0000)]
Use 'e' instead of 'i' constraints with 64-bit atomic operations on amd64.
The ADD, AND, OR, and SUB instructions take at most a 32-bit
sign-extended immediate operand. 64-bit constants that do not fit into
that constraint need to be loaded into a register. The 'i' constraint
tells the compiler it can pass any integer constant to the assembler,
whereas the 'e' constrain only permits constants that fit into a 32-bit
sign-extended value. This fixes using
atomic_add/clear/set/subtract_long/64 with constants that do not fit into
a 32-bit sign-extended immediate.
Reported by: several folks
Tested by: Pete Wright <pete@nomadlogic.org>
MFC after: 2 weeks
Fix .depend.foo.o tracking for sys/conf/files defined compilations.
Some example files:
ia32_genassym.o
acpi_wakecode.o
The old mkdep method also lacked tracking these files.
Objects defined in sys/conf/files with no-obj and no-implicit-rule get their
own targets defined in the kernel Makefile but lack having their objects added
to DEPENDOBJS so never get a .depend file generated. Normally if an object is
in OBJS it will get a .depend file.
Fix this by looking for .o files in CLEAN and ensuring they are part of
the -MD filtering and .depend loading. This is a hack. Other solutions
could exist involving sys/conf/files or config(8) to auto add these to
DEPENDFILES/DEPENDOBJS but this method seems reliable enough without being
intrusive or error-prone for new files.