kib [Thu, 29 Aug 2019 07:25:27 +0000 (07:25 +0000)]
Centralize __pcpu definitions.
Many extern struct pcpu <something>__pcpu declarations were
copied/pasted in sources. The issue is that the definition is MD, but
it cannot be provided by machine/pcpu.h because the actual struct pcpu
is defined in sys/pcpu.h after the inclusion of machine/pcpu.h.
This forced the copying when other code needed direct access to
__pcpu. There is no way around it, due to machine/pcpu.h supplying
part of struct pcpu fields.
To work around the problem, add a new machine/pcpu_aux.h header, which
should fill any needed MD definitions after struct pcpu definition is
completed. This allows removing the copies of __pcpu spread around the
source. Also, on x86 it makes it possible to remove workarounds like
OFFSETOF_CURTHREAD or clang-specific warning suppressions.
Reported and tested by: lwhsu, bcran
Reviewed by: imp, markj (previous version)
Discussed with: jhb
Sponsored by: The FreeBSD Foundation
Differential revision: https://reviews.freebsd.org/D21418
avg [Thu, 29 Aug 2019 07:19:06 +0000 (07:19 +0000)]
zfs_ioc_snapshot: check user-prop permissions on snapshotted datasets
Previously, the permissions were checked on the pool which was obviously
incorrect.
After this change, zfs_check_userprops() only validates the properties
without any permission checks. The permissions are checked individually
for each snapshotted dataset.
This was also committed to ZoL: zfsonlinux/zfs@e6203d2
karels [Thu, 29 Aug 2019 02:44:18 +0000 (02:44 +0000)]
Fix address annotation in xml output from w
The libxo xml feature of adding an annotation with the "original"
address from the utmpx file if it is different than the final "from"
field was broken by r351379. This was pointed out by the gcc error
that save_p might be used uninitialized. Save the original address
as needed in each entry, don't just use the last one from the previous
loop.
mjg [Wed, 28 Aug 2019 20:34:24 +0000 (20:34 +0000)]
vfs: add VOP_NEED_INACTIVE
vnode usecount drops to 0 all the time (e.g. for directories during path lookup).
When that happens the kernel would always take the exclusive lock for the vnode
in order to call vinactive(). This blocks other threads that want to use the vnode
for lookup.
vinactive is very rarely needed and can be tested for without the vnode lock held.
This patch gives filesystems an opportunity to do it. Sample total wait time for
tmpfs over 500 minutes of poudriere -j 104:
mjg [Wed, 28 Aug 2019 19:40:57 +0000 (19:40 +0000)]
amd64: clean up cpu_switch.S
- LK macro (conditional on SMP for the lock prefix) is unused
- SETLK unnecessarily performs xchg. obtained value is never used and the
implicit lock prefix adds avoidable cost. Barrier provided by it does
not appear to be of any use.
- the lock waited for is almost never blocked, yet the loop starts with
a pause. Move it out of the common case.
Reviewed by: kib
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D19563
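The last item of the cpu_switch.S cleanup above (moving the pause out of the common case) can be illustrated in C. This is only a sketch of the idea, not the assembly from the commit: td_blocked, cpu_pause(), pause_count and release_after are hypothetical names, and the "release" is simulated inside cpu_pause() so the behaviour can be observed in a single thread.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical flag: true while the thread we wait for is still blocked. */
static atomic_bool td_blocked;

/* Instrumentation for the sketch only. */
static unsigned pause_count;    /* number of times we paused */
static unsigned release_after;  /* simulation: clear the flag after N pauses */

/* Stand-in for the x86 "pause" instruction. A real spin loop would issue
 * the instruction here; the sketch counts calls and simulates another CPU
 * releasing the lock. */
static void cpu_pause(void)
{
    pause_count++;
    if (release_after > 0 && --release_after == 0)
        atomic_store_explicit(&td_blocked, false, memory_order_release);
}

/* Wait until *blocked becomes false.
 * Fast path: a single load and no pause when the lock is not blocked.
 * Slow path: pause between re-checks. */
static void wait_until_unblocked(atomic_bool *blocked)
{
    if (!atomic_load_explicit(blocked, memory_order_acquire))
        return;
    do {
        cpu_pause();
    } while (atomic_load_explicit(blocked, memory_order_acquire));
}
```

In the uncontended case wait_until_unblocked() returns without ever calling cpu_pause(), which is the property the commit message describes.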
mav [Wed, 28 Aug 2019 17:39:46 +0000 (17:39 +0000)]
MFV/ZoL: Fix wrong assertion in libzfs diff error handling
In compare(), all error cases set the error code to EPIPE, so when an
error is set, the correct assertion to make is that the error is EPIPE,
not EINVAL.
Reviewed-by: Richard Elling <Richard.Elling@RichardElling.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ryan Moeller <ryan@freqlabs.com>
Closes #8743
zfsonlinux/zfs@9dc41a769df164875d974c2431b2453e70e16c41
Submitted by: Ryan Moeller <ryan@freqlabs.com>
MFC after: 1 week
Sponsored by: iXsystems, Inc.
Differential Revision: https://reviews.freebsd.org/D20118
markj [Wed, 28 Aug 2019 16:08:06 +0000 (16:08 +0000)]
Wire pages in vm_page_grab() when appropriate.
uiomove_object_page() and exec_map_first_page() would previously wire a
page after having grabbed it. Ask vm_page_grab() to perform the wiring
instead: this removes some redundant code, and is cheaper in the case
where the requested page is not resident since the page allocator can be
asked to initialize the page as wired, whereas a separate vm_page_wire()
call requires the page lock.
In vm_imgact_hold_page(), use vm_page_unwire_noq() instead of
vm_page_unwire(PQ_NONE). The latter ensures that the page is dequeued
before returning, but this is unnecessary since vm_page_free() will
trigger a batched dequeue of the page.
asomers [Wed, 28 Aug 2019 04:19:37 +0000 (04:19 +0000)]
fusefs: Fix some bugs regarding the size of the LISTXATTR list
* A small error in r338152 led to the returned size always being exactly
eight bytes too large.
* The FUSE_LISTXATTR operation works like Linux's listxattr(2): if the
caller does not provide enough space, then the server should return ERANGE
rather than return a truncated list. That's true even though in FUSE's
case the kernel doesn't provide space to the client at all; it simply
requests a maximum size for the list. We previously weren't handling the
case where the server returns ERANGE even though the kernel requested as
much size as the server had told us it needs; that can happen due to a
race.
* We also need to ensure that a pathological server that always returns
ERANGE no matter what size we request in FUSE_LISTXATTR won't cause an
infinite loop in the kernel. As of this commit, it will instead cause an
infinite loop that exits and enters the kernel on each iteration, allowing
signals to be processed.
Reviewed by: cem
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D21287
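The size-probe and ERANGE race described in the fusefs LISTXATTR commit above can be sketched in plain C. Everything here is hypothetical (struct fake_server, server_listxattr(), client_listxattr(), the max_tries bound); it is not the fusefs code. In particular, the sketch gives up after a bounded number of attempts, whereas the commit message says the kernel instead leaves and re-enters on each iteration so that signals can be processed.

```c
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical server whose attribute list may grow between requests. */
struct fake_server {
    size_t len;         /* current size of the list in bytes */
    int    races_left;  /* simulation: grow the list after this many probes */
};

/* size == 0 is a probe: report the needed size in *needed.
 * Otherwise fill buf, or return ERANGE if size is too small.
 * The list is never truncated. */
static int server_listxattr(struct fake_server *s, char *buf, size_t size,
    size_t *needed)
{
    if (size == 0) {
        *needed = s->len;
        if (s->races_left > 0) {    /* the list grows right after the probe */
            s->races_left--;
            s->len += 4;
        }
        return 0;
    }
    if (size < s->len)
        return ERANGE;
    memset(buf, 'x', s->len);       /* placeholder payload */
    *needed = s->len;
    return 0;
}

/* Probe for the size, then fetch. If the list grew in between, the fetch
 * fails with ERANGE and we probe again, at most max_tries times. */
static int client_listxattr(struct fake_server *s, char *buf, size_t bufsize,
    size_t *out_len, int max_tries)
{
    for (int i = 0; i < max_tries; i++) {
        size_t needed;
        int error = server_listxattr(s, NULL, 0, &needed);
        if (error != 0)
            return error;
        if (needed == 0) {          /* empty list: nothing to fetch */
            *out_len = 0;
            return 0;
        }
        if (needed > bufsize)       /* caller's buffer is too small */
            return ERANGE;
        error = server_listxattr(s, buf, needed, out_len);
        if (error != ERANGE)
            return error;
    }
    return ERANGE;                  /* the server kept answering ERANGE */
}
```

With races_left set to 1 the second attempt succeeds; with a server that always grows the list, the loop terminates with ERANGE instead of spinning forever.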
jhb [Tue, 27 Aug 2019 21:29:37 +0000 (21:29 +0000)]
Adjust the deprecated warnings for /dev/crypto to be less noisy.
Warn when actual operations are performed instead of when sessions are
created. The /dev/crypto engine in OpenSSL 1.0.x tries to create
sessions for all possible algorithms each time it is initialized
resulting in spurious warnings.
Reported by: Mike Tancsa
MFC after: 3 days
Sponsored by: Chelsio Communications
mjg [Tue, 27 Aug 2019 20:51:17 +0000 (20:51 +0000)]
unionfs: stop passing LK_INTERLOCK to VOP_UNLOCK
This is part of the preparation to remove flags argument from VOP_UNLOCK.
Also has a side effect of fixing stacking on top of nullfs broken by r351472.
Reported by: cy
Sponsored by: The FreeBSD Foundation
manu [Tue, 27 Aug 2019 18:00:01 +0000 (18:00 +0000)]
arm64: rk3399: pinctrl: Add gpio banks and fix iomux
Since r351187 the pinctrl driver needs to know the gpio bank as it
directly attaches the gpio driver to handle some setup that might
be present in the dts. Add the gpio banks table for rk3399.
While here, fix some IOMUX definitions that prevented booting
on RK3399 as pinctrl wasn't configured correctly.
manu [Tue, 27 Aug 2019 17:59:09 +0000 (17:59 +0000)]
arm64: rk3328: pinctrl: Add gpio banks and fix iomux
Since r351187 the pinctrl driver needs to know the gpio bank as it
directly attaches the gpio driver to handle some setup that might
be present in the dts. Add the gpio banks table for rk3328.
While here, fix some IOMUX definitions that prevented booting
on RK3328 as pinctrl wasn't configured correctly.
mav [Tue, 27 Aug 2019 16:41:06 +0000 (16:41 +0000)]
Always check cam_periph_error() status for ERESTART.
Even if we do not expect retries, we had better be sure, since otherwise it
may result in a use-after-free kernel panic. I've noticed that it retries
SCSI_STATUS_BUSY even with SF_NO_RECOVERY | SF_NO_RETRY.
markj [Tue, 27 Aug 2019 14:06:34 +0000 (14:06 +0000)]
Fix several logic issues in domainset_empty_vm().
- Don't add 1 to the result of DOMAINSET_FLS.
- Do not modify domainsets containing only empty domains.
- Always flatten a _PREFER policy to _ROUNDROBIN if the preferred
domain is empty. Previously we were doing this only when ds_cnt > 1.
These bugs could cause hangs during boot if a VM domain is empty.
Tested by: hselasky
Reviewed by: hselasky, kib
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D21420
trasz [Tue, 27 Aug 2019 11:46:22 +0000 (11:46 +0000)]
Introduce <sys/qmath.h>, a fixed-point math library from Netflix.
This makes it possible to perform mathematical operations on
fractional values without using floating point. It operates on Q
numbers, which are integer-sized, opaque structures initialized
to hold a chosen number of integer and fractional bits.
For a general description of the Q number system, see the "Fixed Point
Representation & Fractional Math" whitepaper[1]; for the actual
API see the qmath(3) man page.
This is one of the dependencies for the upcoming stats(3) framework[2]
that will be applied to the TCP stack in a later commit.
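As a rough illustration of arithmetic on fractional values without floating point, here is a minimal fixed-point sketch with a fixed 16.16 split. It does not use the qmath API (see qmath(3) for that, where the number of integer and fractional bits is chosen per Q number); the type q16_16 and all function names below are hypothetical.

```c
#include <stdint.h>

/* Hypothetical fixed-point type: 16 integer bits, 16 fractional bits. */
typedef int32_t q16_16;

#define Q_FRAC_BITS 16
#define Q_ONE       ((int64_t)1 << Q_FRAC_BITS)   /* the value 1.0 */

/* Convert an integer to fixed point. */
static q16_16 q_from_int(int32_t i)
{
    return (q16_16)(i * Q_ONE);
}

/* Build num/den as a fixed-point value (den must be non-zero). */
static q16_16 q_from_ratio(int32_t num, int32_t den)
{
    return (q16_16)(((int64_t)num * Q_ONE) / den);
}

/* Addition works directly on the raw representation. */
static q16_16 q_add(q16_16 a, q16_16 b)
{
    return a + b;
}

/* The product of two Q16.16 values carries 32 fractional bits, so it is
 * computed in 64 bits and scaled back down by 2^16. */
static q16_16 q_mul(q16_16 a, q16_16 b)
{
    return (q16_16)(((int64_t)a * (int64_t)b) / Q_ONE);
}

/* Integer part, truncated toward zero. */
static int32_t q_to_int(q16_16 a)
{
    return (int32_t)(a / Q_ONE);
}
```

For example, q_mul(q_from_ratio(3, 2), q_from_ratio(5, 2)) yields the same raw value as q_from_ratio(15, 4), i.e. 1.5 * 2.5 = 3.75. Overflow is not checked in this sketch.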
mmel [Tue, 27 Aug 2019 09:20:01 +0000 (09:20 +0000)]
Add support for RK3288 into existing RockChip drivers.
This patch ensures only the minimal level of compatibility necessary to boot
on RK3288-based boards. GPIO and pinctrl interaction, missing in the current
implementation, will be improved by a separate patch in the near future.
np [Tue, 27 Aug 2019 04:19:40 +0000 (04:19 +0000)]
cxgbe/t4_tom: Initialize all TOE connection parameters in one place.
Remove now-redundant items from toepcb and synq_entry and the code to
support them.
Let the driver calculate tx_align, rx_coalesce, and sndbuf by default.
np [Tue, 27 Aug 2019 01:16:02 +0000 (01:16 +0000)]
cxgbe/t4_tom: Limit work requests with immediate payload to a single
descriptor. The per-tid tx credits are in demand during active Tx and
it's best not to use too many just for payload.
jhb [Tue, 27 Aug 2019 00:01:56 +0000 (00:01 +0000)]
Add kernel-side support for in-kernel TLS.
KTLS adds support for in-kernel framing and encryption of Transport
Layer Security (1.0-1.2) data on TCP sockets. KTLS only supports
offload of TLS for transmitted data. Key negotiation must still be
performed in userland. Once completed, transmit session keys for a
connection are provided to the kernel via a new TCP_TXTLS_ENABLE
socket option. All subsequent data transmitted on the socket is
placed into TLS frames and encrypted using the supplied keys.
Any data written to a KTLS-enabled socket via write(2), aio_write(2),
or sendfile(2) is assumed to be application data and is encoded in TLS
frames with an application data type. Individual records can be sent
with a custom type (e.g. handshake messages) via sendmsg(2) with a new
control message (TLS_SET_RECORD_TYPE) specifying the record type.
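To make "placed into TLS frames" with a record type concrete, the sketch below builds the 5-byte TLS record header (content type, protocol version, payload length) defined by the TLS 1.0-1.2 specifications. It is userland illustration only, not the kernel's ktls_frame(); the enum and function names are hypothetical, while the numeric content types 22 (handshake) and 23 (application data) come from the TLS specification.

```c
#include <stddef.h>
#include <stdint.h>

/* TLS ContentType values from the TLS specification; names are
 * hypothetical and chosen for this sketch only. */
enum {
    SKETCH_TLS_HANDSHAKE        = 22,
    SKETCH_TLS_APPLICATION_DATA = 23
};

#define SKETCH_TLS_HDR_LEN 5

/* Write a TLS record header into hdr and return its length.
 * Layout: type (1 byte), version major/minor (2 bytes),
 * payload length in network byte order (2 bytes). */
static size_t sketch_tls_write_header(uint8_t hdr[SKETCH_TLS_HDR_LEN],
    uint8_t type, uint8_t vmajor, uint8_t vminor, uint16_t len)
{
    hdr[0] = type;
    hdr[1] = vmajor;
    hdr[2] = vminor;
    hdr[3] = (uint8_t)(len >> 8);
    hdr[4] = (uint8_t)(len & 0xff);
    return SKETCH_TLS_HDR_LEN;
}
```

Data written with write(2) would correspond to SKETCH_TLS_APPLICATION_DATA here, while a record sent with an explicit type would carry that type in the first byte instead. TLS 1.2 uses version bytes 3 and 3.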
At present, rekeying is not supported though the in-kernel framework
should support rekeying.
KTLS makes use of the recently added unmapped mbufs to store TLS
frames in the socket buffer. Each TLS frame is described by a single
ext_pgs mbuf. The ext_pgs structure contains the header of the TLS
record (and trailer for encrypted records) as well as references to
the associated TLS session.
KTLS supports two primary methods of encrypting TLS frames: software
TLS and ifnet TLS.
Software TLS marks mbufs holding socket data as not ready via
M_NOTREADY similar to sendfile(2) when TLS framing information is
added to an unmapped mbuf in ktls_frame(). ktls_enqueue() is then
called to schedule TLS frames for encryption. In the case of
sendfile(2), sendfile_iodone() calls ktls_enqueue() instead of pru_ready(), leaving
the mbufs marked M_NOTREADY until encryption is completed. For other
writes (vn_sendfile when pages are available, write(2), etc.), the
PRUS_NOTREADY is set when invoking pru_send() along with invoking
ktls_enqueue().
A pool of worker threads (the "KTLS" kernel process) encrypts TLS
frames queued via ktls_enqueue(). Each TLS frame is temporarily
mapped using the direct map and passed to a software encryption
backend to perform the actual encryption.
(Note: The use of PHYS_TO_DMAP could be replaced with sf_bufs if
someone wished to make this work on architectures without a direct
map.)
KTLS supports pluggable software encryption backends. Internally,
Netflix uses proprietary pure-software backends. This commit includes
a simple backend in a new ktls_ocf.ko module that uses the kernel's
OpenCrypto framework to provide AES-GCM encryption of TLS frames. As
a result, software TLS is now a bit of a misnomer as it can make use
of hardware crypto accelerators.
Once software encryption has finished, the TLS frame mbufs are marked
ready via pru_ready(). At this point, the encrypted data appears as
regular payload to the TCP stack stored in unmapped mbufs.
ifnet TLS permits a NIC to offload the TLS encryption and TCP
segmentation. In this mode, a new send tag type (IF_SND_TAG_TYPE_TLS)
is allocated on the interface a socket is routed over and associated
with a TLS session. TLS records for a TLS session using ifnet TLS are
not marked M_NOTREADY but are passed down the stack unencrypted. The
ip_output_send() and ip6_output_send() helper functions that apply
send tags to outbound IP packets verify that the send tag of the TLS
record matches the outbound interface. If so, the packet is tagged
with the TLS send tag and sent to the interface. The NIC device
driver must recognize packets with the TLS send tag and schedule them
for TLS encryption and TCP segmentation. If the outbound
interface does not match the interface in the TLS send tag, the packet
is dropped. In addition, a task is scheduled to refresh the TLS send
tag for the TLS session. If a new TLS send tag cannot be allocated,
the connection is dropped. If a new TLS send tag is allocated,
however, subsequent packets will be tagged with the correct TLS send
tag. (This latter case has been tested by configuring both ports of a
Chelsio T6 in a lagg and failing over from one port to another. As
the connections migrated to the new port, new TLS send tags were
allocated for the new port and connections resumed without being
dropped.)
ifnet TLS can be enabled and disabled on supported network interfaces
via new '[-]txtls[46]' options to ifconfig(8). ifnet TLS is supported
across both vlan devices and lagg interfaces using failover or lacp with
flowid enabled.
Applications may request the current KTLS mode of a connection via a
new TCP_TXTLS_MODE socket option. They can also use this socket
option to toggle between software and ifnet TLS modes.
In addition, a testing tool is available in tools/tools/switch_tls.
This is modeled on tcpdrop and uses similar syntax. However, instead
of dropping connections, -s is used to force KTLS connections to
switch to software TLS and -i is used to switch to ifnet TLS.
Various sysctls and counters are available under the kern.ipc.tls
sysctl node. The kern.ipc.tls.enable node must be set to true to
enable KTLS (it is off by default). The use of unmapped mbufs must
also be enabled via kern.ipc.mb_use_ext_pgs to enable KTLS.
KTLS is enabled via the KERN_TLS kernel option.
This patch is the culmination of years of work by several folks
including Scott Long and Randall Stewart for the original design and
implementation; Drew Gallatin for several optimizations including the
use of ext_pgs mbufs, the M_NOTREADY mechanism for TLS records
awaiting software encryption, and pluggable software crypto backends;
and John Baldwin for modifications to support hardware TLS offload.
markj [Mon, 26 Aug 2019 20:20:10 +0000 (20:20 +0000)]
Fix a few nits in vm_pqbatch_process_page().
- Don't bother masking off non-queue state flags when loading the
page's atomic state, since it is only required for one of the
function's assertions. Update the assertion instead.
- Remove an incorrect comment regarding synchronization with the
page daemon. The page daemon only ever checks for PGA_ENQUEUED
with the page queue lock held.
- When clearing requeue flags, only clear the flags that have been
acted upon.
mav [Mon, 26 Aug 2019 17:54:19 +0000 (17:54 +0000)]
Announce PCI Segment Groups supported to PCI host _OSC.
According to ACPI 6.3 specification:
The OS sets this bit to 1 if it supports PCI Segment Groups as defined
by the _SEG object, and access to the configuration space of devices
in PCI Segment Groups as described by this specification. Otherwise,
the OS sets this bit to 0.
As far as I can see, we have supported both of those as PCI domains for quite a while.
According to my tests and errata for several generations of Intel CPUs,
PCIe hot-plug command completion reporting is not very reliable.
At least on my Supermicro X11DPi-NT board I never saw it reported.
Before this change, the timeout code detached devices and tried to disable
the slot, which in my case resulted in a hot-plugged device being detached
just a second after it was successfully detected and attached. This
change removes that, so in case of a timeout it just prints the error and
continues operation. Linux does the same.
jhb [Mon, 26 Aug 2019 17:25:07 +0000 (17:25 +0000)]
Stop using des_cblock * for arguments to DES functions.
This amounts to a char ** since it is a char[8] *. Evil casts mostly
resolved the fact that what was actually passed in were plain char *.
Instead, change the DES functions to use 'unsigned char *' for keys
and for input and output buffers.
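The type problem can be shown with a small sketch. It assumes des_cblock is an 8-byte unsigned char array typedef; cblock, old_copy_key() and new_copy_key() are hypothetical names used only for illustration and are not the DES functions touched by the commit.

```c
#include <string.h>

/* Assumed to have the same shape as des_cblock: an array of 8 bytes. */
typedef unsigned char cblock[8];

/* Old style: the parameter is a pointer to an 8-byte array, so the
 * callee has to dereference it once to reach the bytes. */
static void old_copy_key(cblock *key, unsigned char out[8])
{
    memcpy(out, *key, 8);
}

/* New style: a plain byte pointer, which is what callers actually hold. */
static void new_copy_key(const unsigned char *key, unsigned char out[8])
{
    memcpy(out, key, 8);
}
```

A caller holding a plain buffer, say unsigned char raw[8], has to write old_copy_key((cblock *)raw, out) with a cast, but can call new_copy_key(raw, out) directly.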
markj [Sun, 25 Aug 2019 21:14:46 +0000 (21:14 +0000)]
Handle UMA_ANYDOMAIN in kstack_import().
The kernel thread stack zone performs first-touch allocations by
default, and must handle the case where the local memory domain
is empty. For most UMA zones this is handled in the keg layer,
but cache zones currently must implement a policy for this case.
Simply use a round-robin policy if UMA_ANYDOMAIN is passed.
Reported and tested by: bcran
Reviewed by: kib
Sponsored by: The FreeBSD Foundation
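A round-robin fallback of the kind described for kstack_import() can be sketched as follows. This is not the kernel code: SKETCH_ANYDOMAIN stands in for UMA_ANYDOMAIN, pick_domain() and rr_next are hypothetical, and a real kernel implementation would need an atomic or per-CPU counter rather than a plain static.

```c
/* Hypothetical stand-in for UMA_ANYDOMAIN. */
#define SKETCH_ANYDOMAIN (-1)

/* Rotating counter; not thread-safe, for illustration only. */
static unsigned rr_next;

/* Return the requested domain, or rotate through all domains when the
 * caller expressed no preference. ndomains must be greater than zero. */
static int pick_domain(int requested, int ndomains)
{
    if (requested != SKETCH_ANYDOMAIN)
        return requested;
    int d = (int)(rr_next % (unsigned)ndomains);
    rr_next++;
    return d;
}
```

With three domains, successive calls with SKETCH_ANYDOMAIN return 0, 1, 2, 0, and so on, while an explicit domain request is passed through unchanged.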
imp [Sun, 25 Aug 2019 19:39:31 +0000 (19:39 +0000)]
Fix bogusly declared WERRORs in kernel build
Many arm kernel configs bogusly specified WERROR=-Werror. There's no
reason for this because the default is that and there's no reason to
override. These date from a time when we needed to add additional
warning->error suppression. They are obsolete and were cut and paste
propagated from file to file.
Comment out all the WERROR=.... lines in powerpc. They aren't bogus,
but were appropriate for the old defaults for gcc4.2.1. Now that we've
made the policy decision to suppress -Werror by default on these
platforms, it is appropriate to comment these out. People wishing to
fix these errors can still uncomment them, or say WERROR=-Werror
on the command line.
Fix two instances (cut and paste propagation) of hard-coded -Werror
in x86 code. Replace with ${WERROR} instead. This is a no-op change
except for people who build WERROR=-Wno-error :).
0mp [Sun, 25 Aug 2019 17:55:31 +0000 (17:55 +0000)]
mixer(8): Report an error if the passed value is an empty string
This patch fixes a bug that made the mixer command enter
an infinite loop when instructed to set the value of a device
to an empty string (e.g., `mixer vol ""`).
Additionally, some tests for mixer(8) are being added.
dougm [Sun, 25 Aug 2019 07:06:51 +0000 (07:06 +0000)]
vm_map_simplify_entry considers merging an entry with its two
neighbors, and is used in a way so that if entries a and b cannot be
merged, we consider them twice, first not-merging a with its successor
b, and then not-merging b with its predecessor a. This change replaces
vm_map_simplify_entry with vm_map_try_merge_entries, which compares
two adjacent entries only, and uses it to avoid duplicated
merge-checks.
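The idea of comparing each adjacent pair exactly once can be sketched on a sorted array of ranges. This is an illustration only, not vm_map_try_merge_entries(): struct entry, can_merge(), merge_entries() and the merge_checks counter are hypothetical, and the real merge conditions on map entries involve many more fields than the single prot value used here.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical map entry covering [start, end) with one attribute. */
struct entry {
    unsigned long start;
    unsigned long end;
    int prot;
};

/* Counts how many pair comparisons were made (instrumentation only). */
static unsigned merge_checks;

/* Two adjacent entries merge if they are contiguous and share attributes. */
static bool can_merge(const struct entry *prev, const struct entry *next)
{
    merge_checks++;
    return prev->end == next->start && prev->prot == next->prot;
}

/* Merge mergeable neighbours of a sorted array in place and return the
 * new number of entries. Each entry is compared only with the (possibly
 * already merged) entry before it, so there are n - 1 checks in total. */
static size_t merge_entries(struct entry *e, size_t n)
{
    if (n == 0)
        return 0;
    size_t out = 0;
    for (size_t i = 1; i < n; i++) {
        if (can_merge(&e[out], &e[i]))
            e[out].end = e[i].end;   /* extend the previous entry */
        else
            e[++out] = e[i];         /* keep as a separate entry */
    }
    return out + 1;
}
```

A scheme that checks every entry against both its predecessor and its successor would evaluate each non-mergeable pair twice; here four entries cost three checks.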
mjg [Sun, 25 Aug 2019 05:13:15 +0000 (05:13 +0000)]
nullfs: reduce areas protected by vnode interlock
Some places only take the interlock to hold the vnode, which was a requirement
before hold counts started being manipulated with atomics. Use the newly introduced
vholdnz to bump the count.
Reviewed by: kib
Tested by: pho
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D21358
kib [Sat, 24 Aug 2019 15:31:31 +0000 (15:31 +0000)]
amd64: rework PCPU allocation
Move pcpu KVA out of .bss into dynamically allocated VA at
pmap_bootstrap(). This avoids demoting superpage mapping .data/.bss.
It also makes it possible to use pmap_qenter() for installation of
domain-local pcpu page on NUMA configs.
Refactor pcpu and IST initialization by moving it to helper functions.
Reviewed by: markj
Tested by: pho
Discussed with: jeff
Sponsored by: The FreeBSD Foundation
Differential revision: https://reviews.freebsd.org/D21320
kib [Sat, 24 Aug 2019 14:29:13 +0000 (14:29 +0000)]
Make stack grow use the same gap as stack create.
Store stack_guard_page * PAGE_SIZE into the gap->next_read field at
the time of the stack creation. This makes the used guard size
consistent between stack creation and stack grow time.
Suggested by: alc
Reviewed by: alc, markj
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D21384
imp [Fri, 23 Aug 2019 22:52:58 +0000 (22:52 +0000)]
It turns out the duplication is only mostly harmless.
While it worked with the kernel, it wasn't working with the loader.
It failed to handle dependencies correctly. The reason for that is
that we never created a nvme module with the DRIVER_MODULE, but
instead a nvme_pci and nvme_ahci module. Create a real nvme module
that nvd can be dependent on so it can import the nvme symbols it
needs from there.
Arguably, nvd should just be a simple child of nvme, but transitioning
to that (and winning that argument given why it was done this way) is
beyond the scope of this change.
markj [Fri, 23 Aug 2019 19:53:11 +0000 (19:53 +0000)]
Stop clearing page flags in vm_page_pqbatch_submit().
All existing callers guarantee that the page does not have a
pre-existing dequeue pending. Thus, if the page is dequeued before
pqbatch_submit() acquires the page queue lock, we do not need to do
anything since vm_page_dequeue_complete() takes care of clearing all
page queue state flags for us.
With this change, vm_page_pqbatch_submit() has the nice property that it
does not directly modify any fields in the page structure.
markj [Fri, 23 Aug 2019 19:49:29 +0000 (19:49 +0000)]
Make vm_pqbatch_submit_page() externally visible.
It will become useful for the page daemon to be able to directly create
a batch queue entry for a page, and without modifying the page
structure. Rename vm_pqbatch_submit_page() to vm_page_pqbatch_submit()
to keep the namespace consistent. No functional change intended.
kib [Fri, 23 Aug 2019 19:40:10 +0000 (19:40 +0000)]
De-commission the MNTK_NOINSMNTQ kernel mount flag.
After all the changes, its dynamic scope is the same as for MNTK_UNMOUNT,
except that it allows the syncer vnode to be re-installed on unmount failure.
But the case of syncer was already handled by using the VV_FORCEINSMQ
flag for quite some time.
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
jhb [Fri, 23 Aug 2019 18:26:34 +0000 (18:26 +0000)]
Fix universe to include arm LINT kernel configs.
Strip comments from the NOTES.armv[57] files as is done for other
NOTES files when building the corresponding LINT configs. Without
this, the LINT configs contained the NO_UNIVERSE comment from the
NOTES.armv[57] files.
imp [Fri, 23 Aug 2019 16:42:39 +0000 (16:42 +0000)]
Turn off -Werror for gcc 4.2.1 for userland
As discussed on arch@, gcc 4.2.1 is on its way out. Turn off Werror on gcc
versions < 5.0 permanently. This will allow older platforms to continue to compile
w/o new errors once we take them out of universe by default. This will also free
developers from chasing down obsolete warnings that produce no beneficial
changes to the source.
imp [Fri, 23 Aug 2019 16:42:04 +0000 (16:42 +0000)]
Turn off -Werror for gcc 4.2.1
As part of marching gcc 4.2.1 out of the tree, turn off -Werror on gcc 4.2.1
compiles by default. It generates too many false positives and breaks CI
for no benefit.
asomers [Fri, 23 Aug 2019 15:22:20 +0000 (15:22 +0000)]
ping6: Rename options for better consistency with ping
Now equivalent options have the same flags, and nonequivalent options have
different flags. This is a prelude to merging the two commands.
Submitted by: Ján Sučan <sucanjan@gmail.com>
MFC: Never
Sponsored by: Google LLC (Google Summer of Code 2019)
Differential Revision: https://reviews.freebsd.org/D21345