rmacklem [Thu, 27 Jun 2019 23:10:40 +0000 (23:10 +0000)]
Add non-blocking trylock variants for the rangelock functions.
A future patch that will add a Linux compatible copy_file_range(2) syscall
needs to be able to lock the byte ranges of two files concurrently.
To do this without a risk of deadlock, a non-blocking variant of
vn_rangelock_rlock() called vn_rangelock_tryrlock() was needed.
This patch adds it, along with vn_rangelock_trywlock().
The patch also adds a couple of comments that I hope clarify how the
algorithm used in kern_rangelock.c works.
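The deadlock-avoidance idea can be illustrated outside the kernel. The sketch below is a user-space analogy only, assuming POSIX threads: two pthread mutexes stand in for the byte-range locks on the two files, and pthread_mutex_trylock() plays the role of vn_rangelock_tryrlock()/vn_rangelock_trywlock(). The helpers lock_pair() and unlock_pair() are hypothetical and are not taken from the copy_file_range(2) patch.

```c
#include <assert.h>
#include <pthread.h>
#include <sched.h>

/*
 * User-space analogy only: hold one lock with a blocking acquire and only
 * *try* the other. On failure, drop everything and retry, blocking next time
 * on the lock that was contended. A thread therefore never blocks while it
 * holds a lock that another thread may be waiting for.
 */
void lock_pair(pthread_mutex_t *a, pthread_mutex_t *b)
{
	pthread_mutex_t *first = a, *second = b, *tmp;

	for (;;) {
		pthread_mutex_lock(first);          /* may block; holds nothing else */
		if (pthread_mutex_trylock(second) == 0)
			return;                     /* got both locks */
		pthread_mutex_unlock(first);        /* back off to avoid deadlock */
		tmp = first;                        /* block on the busy one next */
		first = second;
		second = tmp;
		sched_yield();
	}
}

void unlock_pair(pthread_mutex_t *a, pthread_mutex_t *b)
{
	pthread_mutex_unlock(a);
	pthread_mutex_unlock(b);
}
```

A thread only ever blocks while holding nothing, so two threads locking the same pair in opposite orders cannot deadlock; the cost is a possible retry loop under contention.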
jhb [Thu, 27 Jun 2019 22:50:11 +0000 (22:50 +0000)]
Fix comment in sofree() to reference sbdestroy().
r160875 added sbdestroy() as a wrapper around sbrelease_internal to be
called from sofree(), yet the comment added in the same revision to
sofree() still mentions sbrelease_internal().
asomers [Thu, 27 Jun 2019 22:24:56 +0000 (22:24 +0000)]
fusefs: fix a memory leak regarding FUSE_INTERRUPT
We were leaking the fuse ticket if the original operation completed before
the daemon received the INTERRUPT operation. Fixing this was easier than I
expected.
bcran [Thu, 27 Jun 2019 22:06:41 +0000 (22:06 +0000)]
Increase EFI_STAGING_SIZE to 100MB on x64
To avoid failures when the large 18MB nvidia.ko module is being loaded,
increase EFI_STAGING_SIZE from 64MB to 100MB on x64 systems.
Leave the other platforms at 64MB.
asomers [Thu, 27 Jun 2019 20:18:12 +0000 (20:18 +0000)]
fusefs: recycle vnodes after their last unlink
Previously fusefs would never recycle vnodes. After VOP_INACTIVE, they'd
linger around until unmount or the vnlru reclaimed them. This commit
essentially activates and inlines the old reclaim_revoked sysctl, and fixes
some issues dealing with the attribute cache and multiply linked files.
jhb [Thu, 27 Jun 2019 19:36:30 +0000 (19:36 +0000)]
Hold an explicit reference on the socket for the aiotx task.
Previously, the aiotx task relied on the aio jobs in the queue to hold
a reference on the socket. However, when the last job is completed,
there is nothing left to hold a reference to the socket buffer lock
used to check if the queue is empty. In addition, if the last job on
the queue is cancelled, the task can run with no queued jobs holding a
reference to the socket buffer lock the task uses to notice the queue
is empty.
Fix these races by holding an explicit reference on the socket when
the task is queued and dropping that reference when the task
completes.
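A minimal, single-threaded sketch of the ownership rule described above, assuming a plain integer reference count (the real code uses the kernel's socket reference and locking primitives): the reference is taken when the task is queued and dropped by the task itself, so the socket outlives the task even when no queued job holds it. All names here are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical socket with a plain (non-atomic, unlocked) reference count. */
struct sock_sketch {
	int refs;
	bool task_queued;
	bool freed;             /* stands in for the socket being freed */
};

void sock_rele(struct sock_sketch *so)
{
	if (--so->refs == 0)
		so->freed = true;
}

/* Queue the task once, taking a reference that the task itself owns. */
void aiotx_queue(struct sock_sketch *so)
{
	if (so->task_queued)
		return;
	so->refs++;
	so->task_queued = true;
}

/* Task body: do the work, then drop the reference taken at queue time. */
void aiotx_task(struct sock_sketch *so)
{
	so->task_queued = false;
	/* ... drain jobs; the socket stays valid even if the queue is empty ... */
	sock_rele(so);
}
```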
avg [Thu, 27 Jun 2019 15:46:06 +0000 (15:46 +0000)]
gpiobus: provide a new hint, pin_list
"pin_list" allows child pins to be specified as a list of pin numbers.
The existing hint "pins" serves the same purpose but with a 32-bit wide bit
mask. One problem with that is that a controller can have more than 32
pins; amdgpio is one example. Also, a list of numbers is a little bit
more human friendly than a matching bit mask. As a side note, it seems
that in FDT pins are typically specified by their numbers as well.
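For comparison with the new list form, the sketch below expands an existing 32-bit "pins" mask into pin numbers, assuming bit i selects pin i. pins_mask_to_list() is a hypothetical helper; it also shows why a mask cannot describe pin 32 or higher.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: expand a 32-bit "pins" mask into the pin numbers it
 * selects (bit i set => pin i). Returns the number of pins, or -1 if the
 * caller's array is too small. Pins >= 32 are simply not expressible.
 */
int pins_mask_to_list(uint32_t mask, uint32_t *pins, int maxpins)
{
	int npins = 0;

	for (uint32_t i = 0; i < 32; i++) {
		if ((mask & (UINT32_C(1) << i)) == 0)
			continue;
		if (npins == maxpins)
			return -1;
		pins[npins++] = i;
	}
	return npins;
}
```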
This commit also adds accessors for instance variables (IVARs) that
define the child pins. My primary goal is to allow a child to be
configured programmatically rather than via hints (assuming that FDT is
not supported on a platform). Also, while a child should not care about
specific pin numbers that are allocated to it, it could be interested in
how many were actually assigned to it.
While there, I removed the "flags" instance variable. It was unused.
kevans [Thu, 27 Jun 2019 14:03:32 +0000 (14:03 +0000)]
bectl(8): create non-recursive boot environments
bectl advertises that it has the ability to create recursive and
non-recursive boot environments. This patch implements that functionality
using the be_create_depth API provided by libbe. With this patch, bectl now
works as bectl(8) describes with regard to creating recursive/non-recursive
boot environments.
Submitted by: Rob Fairbanks <rob.fx907 gmail com> (with minor changes)
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D20240
asomers [Thu, 27 Jun 2019 00:00:48 +0000 (00:00 +0000)]
fusefs: fix some memory leaks
Fix memory leaks relating to FUSE_BMAP and FUSE_CREATE. There are still
leaks relating to FUSE_INTERRUPT, but they'll be harder to fix since the
server is legally allowed to never respond to a FUSE_INTERRUPT operation.
cognet [Wed, 26 Jun 2019 22:06:40 +0000 (22:06 +0000)]
In get_fpcontext32() and set_fpcontext32(), we can't just use memcpy() to
copy the VFP registers.
armv7 VFP uses 32 64-bit FP registers (which can also be used in pairs as
16 128-bit registers), while aarch64 uses 32 128-bit FP registers, so
we have to copy the value of each register individually.
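The layout mismatch can be sketched as follows. The structures are hypothetical stand-ins for the 32-bit and 64-bit VFP contexts, __uint128_t is a GCC/Clang extension, and the sketch assumes, as the message describes, that register i of one layout corresponds to register i of the other. A single memcpy() of the 256-byte armv7 array would instead pack two 64-bit registers into each of the first 16 128-bit slots.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical layouts: 32 x 64-bit (armv7) and 32 x 128-bit (aarch64). */
struct vfp32_ctx { uint64_t d[32]; };
struct vfp64_ctx { __uint128_t v[32]; };

/* Widen each 64-bit register into the low half of its 128-bit slot. */
void vfp32_to_64(const struct vfp32_ctx *src, struct vfp64_ctx *dst)
{
	for (int i = 0; i < 32; i++)
		dst->v[i] = (__uint128_t)src->d[i];
}

/* Narrow back, keeping only the low 64 bits of each 128-bit slot. */
void vfp64_to_32(const struct vfp64_ctx *src, struct vfp32_ctx *dst)
{
	for (int i = 0; i < 32; i++)
		dst->d[i] = (uint64_t)src->v[i];
}
```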
alc [Wed, 26 Jun 2019 21:43:41 +0000 (21:43 +0000)]
Revert one of the changes from r349323. Specifically, undo the change
that replaced a pmap_invalidate_page() with a dsb(ishst) in
pmap_enter_quick_locked(). Even though this change is in principle
correct, I am seeing occasional, spurious bus errors that are only
reproducible without this pmap_invalidate_page(). (None of adding an
isb, "upgrading" the dsb to wait on loads as well as stores, or
disabling superpage mappings eliminates the bus errors.) Add an XXX
comment explaining why the pmap_invalidate_page() is being performed.
rgrimes [Wed, 26 Jun 2019 21:19:43 +0000 (21:19 +0000)]
Emulate the "TEST r/m{16,32,64}, imm{16,32,32}" instructions (opcode F7H).
This adds emulation for:
test r/m16, imm16
test r/m32, imm32
test r/m64, imm32 sign-extended to 64
OpenBSD guests compiled with clang 8.0.0 use TEST directly against a
Local APIC register instead of a separate read via MOV followed by a
TEST against the register.
PR: 238794
Submitted by: jhb
Reported by: Jason Tubnor jason@tubnor.net
Tested by: Jason Tubnor jason@tubnor.net
Reviewed by: markj, Patrick Mooney patrick.mooney@joyent.com
MFC after: 3 days
Differential Revision: https://reviews.freebsd.org/D20755
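The semantics being emulated can be sketched as a flag computation. emulate_test_imm() is a hypothetical helper, not the vmm instruction emulator: TEST ANDs the operands without storing the result, sets SF, ZF and PF from that result, and clears CF and OF; for the 64-bit form the 32-bit immediate is sign-extended first. AF is architecturally undefined and is left untouched here. The flag bit positions are the standard RFLAGS ones.

```c
#include <assert.h>
#include <stdint.h>

/* RFLAGS bits touched by TEST. */
#define PSL_C  0x0001u  /* carry */
#define PSL_PF 0x0004u  /* parity */
#define PSL_Z  0x0040u  /* zero */
#define PSL_N  0x0080u  /* sign */
#define PSL_V  0x0800u  /* overflow */

/* Sketch of "TEST r/m, imm" for 2-, 4- and 8-byte operands. */
uint64_t emulate_test_imm(uint64_t rflags, uint64_t rm, int32_t imm, int size)
{
	uint64_t mask, src, result;
	uint8_t low;
	int bits = 0;

	switch (size) {
	case 2: mask = 0xffffu; src = (uint16_t)imm; break;
	case 4: mask = 0xffffffffu; src = (uint32_t)imm; break;
	case 8: mask = ~UINT64_C(0); src = (uint64_t)(int64_t)imm; break; /* sign-extend */
	default: return rflags;                 /* not handled in this sketch */
	}
	result = rm & src & mask;               /* computed, never stored */

	rflags &= ~(uint64_t)(PSL_C | PSL_PF | PSL_Z | PSL_N | PSL_V);
	if (result == 0)
		rflags |= PSL_Z;
	if (result & ((mask >> 1) + 1))         /* top bit of the operand size */
		rflags |= PSL_N;
	low = (uint8_t)result;                  /* PF looks at the low byte only */
	for (int i = 0; i < 8; i++)
		bits += (low >> i) & 1;
	if ((bits & 1) == 0)
		rflags |= PSL_PF;
	return rflags;
}
```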
asomers [Wed, 26 Jun 2019 20:25:57 +0000 (20:25 +0000)]
fusefs: annotate deliberate file descriptor leaks in the tests
Closing a file descriptor causes FUSE activity that is superfluous to the
purpose of most tests, but would nonetheless require matching expectations.
Rather than do that, most tests deliberately leak file descriptors instead.
This commit moves the leakage from each test into two trivial functions:
leak and leakdir. Hopefully Coverity will only complain about those
functions and not all of their callers.
markj [Wed, 26 Jun 2019 20:11:52 +0000 (20:11 +0000)]
Avoid a divide-by-zero when bad checksum counters overflow.
A mixture of IP or UDP packets with valid and invalid checksums could
cause {ip,udp}_packets_bad_checksum to wrap around to 0, resulting
in a division by zero.
This is packet.c rev. 1.27 from OpenBSD.
admbugs: 552
Obtained from: OpenBSD
MFC after: 3 days
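The shape of the fix can be sketched as a guarded ratio test. The function name and the thresholds (more than 4 packets seen, fewer than 2 packets per bad checksum) are illustrative assumptions, not a copy of packet.c; the essential point is that the divisor is checked before the division, since a wrapped counter reads as 0.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Hypothetical check. Unsigned counters silently wrap to 0 after UINT_MAX
 * increments, so "bad" must be tested before it is used as a divisor.
 */
bool too_many_bad_checksums(unsigned int seen, unsigned int bad)
{
	return seen > 4 && bad != 0 && seen / bad < 2;
}
```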
asomers [Wed, 26 Jun 2019 19:10:39 +0000 (19:10 +0000)]
fusefs: run the io tests with direct io, too
Now the io tests are run in all cache modes. The fusefs test suite can now
get adequate coverage without changing the value of
vfs.fusefs.data_cache_mode, which is only needed for legacy file systems
now.
markj [Wed, 26 Jun 2019 17:37:51 +0000 (17:37 +0000)]
Add a return value to vm_page_remove().
Use it to indicate whether the page may be safely freed following
its removal from the object. Also change vm_page_remove() to assume
that the page's object pointer is non-NULL, and have callers perform
this check instead.
This is a step towards an implementation of an atomic reference counter
for each physical page structure.
As of protocol 7.23, fuse file systems can specify their cache behavior on a
per-mountpoint basis. If they set FUSE_WRITEBACK_CACHE in
fuse_init_out.flags, then they'll get the writeback cache. If not, then
they'll get the writethrough cache. If they set FOPEN_DIRECT_IO in every
FUSE_OPEN response, then they'll get no cache at all.
The old vfs.fusefs.data_cache_mode sysctl is ignored for servers that use
protocol 7.23 or later. However, it's retained for older servers,
especially for those running in jails that lack access to the new protocol.
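The selection rules above can be sketched as a small decision function. pick_cache_mode() is hypothetical; the flag values are the ones in the FUSE protocol header (fuse_kernel.h), and the 0/1/2 meaning of vfs.fusefs.data_cache_mode (uncached, writethrough, writeback) follows fusefs(5). The sketch assumes FOPEN_DIRECT_IO is honored at every protocol version, and simplifies "set FOPEN_DIRECT_IO in every FUSE_OPEN response" to a per-open check.

```c
#include <assert.h>
#include <stdint.h>

#define FUSE_WRITEBACK_CACHE (1u << 16)  /* fuse_init_out.flags */
#define FOPEN_DIRECT_IO      (1u << 0)   /* fuse_open_out.open_flags */

enum cache_mode { CACHE_NONE = 0, CACHE_WRITETHROUGH = 1, CACHE_WRITEBACK = 2 };

/*
 * Hypothetical helper. minor is the negotiated protocol minor version
 * (7.minor); legacy_sysctl stands in for vfs.fusefs.data_cache_mode.
 */
enum cache_mode pick_cache_mode(uint32_t minor, uint32_t init_flags,
    uint32_t open_flags, int legacy_sysctl)
{
	if (open_flags & FOPEN_DIRECT_IO)
		return CACHE_NONE;              /* bypass the cache for this open */
	if (minor < 23)
		return (enum cache_mode)legacy_sysctl;  /* old servers: sysctl */
	return (init_flags & FUSE_WRITEBACK_CACHE) ?
	    CACHE_WRITEBACK : CACHE_WRITETHROUGH;
}
```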
This commit also fixes two other minor test bugs:
* WriteCluster:SetUp was using an uninitialized variable.
* Read.direct_io_pread wasn't verifying that the cache was actually
bypassed.
cognet [Wed, 26 Jun 2019 16:56:56 +0000 (16:56 +0000)]
Fix debugging of 32-bit arm binaries on arm64.
In set_regs32()/fill_regs32(), we have to get/set SP and LR from/to
tf_x[13] and tf_x[14].
set_regs() and fill_regs() may be called for a 32-bit process if the process
is ptrace'd from a 64-bit debugger. So, in set_regs() and fill_regs(), get
or set PC and SPSR where the debugger expects them, in tf_x[15] and
tf_x[16].
markj [Wed, 26 Jun 2019 16:38:30 +0000 (16:38 +0000)]
libdwarf: Use the cached strtab pointer when reading string attributes.
Previously we would perform a linear search of the DWARF section
list for ".debug_str". However, libdwarf always caches a pointer to
the strtab image in its debug descriptor. Using it gives a modest
performance improvement when iterating over the attributes of each
DIE.
Reviewed by: emaste
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D20759
marius [Wed, 26 Jun 2019 15:28:21 +0000 (15:28 +0000)]
o In iflib_txq_drain():
- Remove desc_used, which is only ever written to.
- Remove a dead store to reclaimed.
- Don't recycle avail.
- Sort variables according to style(9).
These changes will make a subsequent commit easier to read.
o In iflib_tx_credits_update(), don't bother checking whether the
ift_txd_credits_update method pointer is NULL; _iflib_pre_assert()
asserts upfront that this method has been assigned and functions
like iflib_{fast_intr_rxtx,netmap_timer_adjust,txq_can_drain}()
and _task_fn_tx() were already unconditionally relying on the
method being callable.
asomers [Wed, 26 Jun 2019 15:15:24 +0000 (15:15 +0000)]
fusefs: delete some unused mount options
The fusefs kernel module allegedly supported no_attrcache, no_readahed,
no_datacache, no_namecache, and no_mmap mount options, but the mount_fusefs
binary never did. So there was no way to ever activate these options.
Delete them. Some of them have alternatives:
no_attrcache: set the attr_valid time to 0 in FUSE_LOOKUP and FUSE_GETATTR
responses.
no_readahed: set max_readahead to 0 in the FUSE_INIT response.
no_datacache: set the vfs.fusefs.data_cache_mode sysctl to 0, or (coming
soon) set the attr_valid time to 0 and set FUSE_AUTO_INVAL_DATA in
the FUSE_INIT response.
no_namecache: set entry_valid time to 0 in FUSE_LOOKUP and FUSE_GETATTR
responses.
hselasky [Wed, 26 Jun 2019 12:04:54 +0000 (12:04 +0000)]
Only call libusb_hotplug_enumerate() once from libusb_hotplug_register_callback().
Otherwise, when registering multiple filters, the same USB device may appear
twice in the list.
MFC after: 3 days
Sponsored by: Mellanox Technologies
hselasky [Wed, 26 Jun 2019 11:28:08 +0000 (11:28 +0000)]
Fix support for LIBUSB_HOTPLUG_ENUMERATE in libusb. Currently all
devices are enumerated regardless of the LIBUSB_HOTPLUG_ENUMERATE
flag. Make sure that when the flag is not specified, no arrival events are
generated for currently enumerated devices.
MFC after: 3 days
Sponsored by: Mellanox Technologies
avg [Wed, 26 Jun 2019 07:38:31 +0000 (07:38 +0000)]
gpio.4: document device hints common to all devices on gpiobus
The "at" keyword is documented in device.hints(5) for all buses, but it does
not hurt to add another reference to it.
The "pins" keyword is specific to gpiobus.
At least these two hints should be configured for any gpiobus device on
a hints-based system.
asomers [Wed, 26 Jun 2019 02:09:22 +0000 (02:09 +0000)]
fusefs: implement the "time_gran" feature.
If a server supports a timestamp granularity other than 1ns, it can tell the
client this as of protocol 7.23. The client will use that granularity when
updating its cached timestamps during write. This way the timestamps won't
appear to change following flush.
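A minimal sketch of applying the granularity to a locally updated timestamp, assuming time_gran is in nanoseconds and between 1 and 10^9 (a value of 0 is treated like 1 here). timespec_trunc_gran() is a hypothetical helper, not the fusefs function.

```c
#include <assert.h>
#include <stdint.h>
#include <time.h>

/*
 * Hypothetical helper: round a timestamp down to the server's granularity
 * so that the cached value matches what the server will report later.
 */
void timespec_trunc_gran(struct timespec *ts, uint32_t gran)
{
	if (gran <= 1)
		return;                         /* 1 ns (or unset): nothing to do */
	if (gran >= 1000000000u)
		ts->tv_nsec = 0;                /* whole seconds only */
	else
		ts->tv_nsec -= ts->tv_nsec % (long)gran;
}
```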
jhibbits [Wed, 26 Jun 2019 01:14:39 +0000 (01:14 +0000)]
powerpc/booke: Handle misaligned floating point loads/stores as on AIM
Misaligned floating point loads and stores are already handled for AIM, but
use the DSISR to obtain the necessary data. Book-E does not have the DSISR,
so these fixups are not performed, leading to a SIGBUS on misaligned FP
loads or stores. Obtain the necessary data on the Book-E side, similar to
how it is done for SPE.
cy [Wed, 26 Jun 2019 00:53:49 +0000 (00:53 +0000)]
While working on PR/238796 I discovered an unused variable in frdest,
the next hop structure. It is likely this contributes to PR/238796
though other factors remain to be investigated.
asomers [Wed, 26 Jun 2019 00:06:41 +0000 (00:06 +0000)]
fusefs: delete obsolete comments in the tests
I originally thought that the kernel would be responsible for ctime in
protocol 7.23. But now I realize that's not the case. The server is
responsible for ctime. The kernel only sets it when there are dirty writes
cached, because that's when the server can't.
asomers [Wed, 26 Jun 2019 00:03:37 +0000 (00:03 +0000)]
fusefs: set ctime during FUSE_SETATTR following a write
As of r349396 the kernel will internally update the mtime and ctime of files
on write. It will also flush the mtime should a SETATTR happen before the
data cache gets flushed. Now it will flush the ctime too, if the server is
using protocol 7.23 or higher.
This is the only case in which the kernel will explicitly set a file's
ctime, since neither utimensat(2) nor any other user interfaces allow it.
asomers [Tue, 25 Jun 2019 23:40:18 +0000 (23:40 +0000)]
fusefs: automatically update mtime and ctime on write
Writing should implicitly update a file's mtime and ctime. For fuse, the
server is supposed to do that. But the client needs to do it too, because
the FUSE_WRITE response does not include time attributes, and it's not
desirable to issue a GETATTR after every WRITE. When using the writeback
cache, there's another hitch: the kernel should ignore the mtime and ctime
fields in any GETATTR response for files with a dirty write cache.
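The two rules can be sketched with a hypothetical cached-attribute structure: a write stamps mtime and ctime locally, and while the writeback cache is dirty, times from a GETATTR reply are ignored. Clearing the dirty flag, which would happen on flush, is left to the caller in this sketch.

```c
#include <assert.h>
#include <stdbool.h>
#include <time.h>

/* Hypothetical per-file cached attributes, for illustration only. */
struct cached_attr {
	struct timespec mtime, ctime;
	bool dirty;             /* writeback cache holds unflushed data */
};

/* A local write updates both times without a GETATTR round trip. */
void on_write(struct cached_attr *a, struct timespec now, bool writeback)
{
	a->mtime = now;
	a->ctime = now;
	if (writeback)
		a->dirty = true;
}

/* Merge a GETATTR reply; the server's times are stale while data is dirty. */
void on_getattr(struct cached_attr *a, struct timespec srv_mtime,
    struct timespec srv_ctime)
{
	if (a->dirty)
		return;
	a->mtime = srv_mtime;
	a->ctime = srv_ctime;
}
```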
dougm [Tue, 25 Jun 2019 20:25:16 +0000 (20:25 +0000)]
Eliminate some uses of the prev and next fields of vm_map_entry_t.
Since the only caller to vm_map_splay is vm_map_lookup_entry, move the
implementation of vm_map_splay into vm_map_lookup_helper, called by
vm_map_lookup_entry.
vm_map_lookup_entry returns the greatest entry less than or equal to a
given address, but in many cases the caller wants the least entry
greater than or equal to the address and uses the next pointer to get
to it. Provide an alternative interface to lookup,
vm_map_lookup_entry_ge, to provide the latter behavior, and let
callers use one or the other rather than having them use the next
pointer after a lookup miss to get what they really want.
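The difference between the two lookups can be sketched on a sorted array of [start, end) ranges. Binary search stands in for the map's splay tree, and lookup_le()/lookup_ge() are hypothetical names that mirror, under that simplification, what vm_map_lookup_entry() and vm_map_lookup_entry_ge() are described as returning.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical map entry: [start, end); entries sorted and disjoint. */
struct entry { uintptr_t start, end; };

/* Index of the greatest entry with start <= addr, or -1 if there is none. */
ptrdiff_t lookup_le(const struct entry *e, size_t n, uintptr_t addr)
{
	size_t lo = 0, hi = n;          /* search for first start > addr */

	while (lo < hi) {
		size_t mid = lo + (hi - lo) / 2;
		if (e[mid].start <= addr)
			lo = mid + 1;
		else
			hi = mid;
	}
	return (ptrdiff_t)lo - 1;
}

/* Index of the entry containing addr, else the next one above it (n if none). */
size_t lookup_ge(const struct entry *e, size_t n, uintptr_t addr)
{
	ptrdiff_t i = lookup_le(e, n, addr);

	if (i >= 0 && addr < e[i].end)
		return (size_t)i;       /* addr falls inside entry i */
	return (size_t)(i + 1);         /* first entry starting after addr */
}
```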
In vm_map_growstack, the caller wants an entry that includes a given
address, and either the preceding or next entry depending on the value
of eflags in the first entry. Incorporate that behavior into
vm_map_lookup_helper, the function that implements all of these
lookups.
Eliminate some inessential temporary variables used with
vm_map_lookup_entry.
emaste [Tue, 25 Jun 2019 19:06:43 +0000 (19:06 +0000)]
bhyve: avoid theoretical stack buffer overflow from integer overflow
Use the proper size_t type to match strlen's return type. This is not
exploitable in practice as this parses command line arguments, which
are limited to well below 2^31 bytes.
This is a minimal change to address the reported issue; hda_parse_config
and the rest of this file will benefit from further review.
Reported by: Fakhri Zulkifli
Reviewed by: jhb, markj
MFC after: 3 days
Sponsored by: The FreeBSD Foundation
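A hypothetical bounded copy illustrating the type fix: the length stays in size_t, the type strlen() returns, so a very long argument cannot wrap into a negative int that passes a bounds check. copy_option() is not the bhyve function; it only shows the pattern.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: copy src into dst only if it fits with its NUL. */
int copy_option(char *dst, size_t dstsize, const char *src)
{
	size_t len = strlen(src);       /* size_t, never truncated to int */

	if (dstsize == 0 || len >= dstsize)
		return -1;
	memcpy(dst, src, len + 1);      /* includes the terminator */
	return 0;
}
```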
kevans [Tue, 25 Jun 2019 18:47:40 +0000 (18:47 +0000)]
libbe(3): restructure be_mount, skip canmount check for BE dataset
Further cleanup after r349380; loader and kernel will both ignore canmount
on the root dataset as well, so we should not be so strict about it when
mounting it. be_mount is restructured to make it more clear that depth==0 is
special, and to not try fetching these properties that we won't care about.
asomers [Tue, 25 Jun 2019 18:36:11 +0000 (18:36 +0000)]
fusefs: writes should update the file size, even when data_cache_mode=0
Writes that extend a file should update the file's size. r344185 restricted
that behavior for fusefs to only happen when the data cache was enabled.
That probably made sense at the time because the attribute cache wasn't
fully baked yet. Now that it is, we should always update the cached file
size during write.
mav [Tue, 25 Jun 2019 18:35:23 +0000 (18:35 +0000)]
Avoid extra taskq_dispatch() calls by DMU.
DMU sync code calls taskq_dispatch() for each sublist of os_dirty_dnodes
and os_synced_dnodes. Since the number of sublists by default is equal
to the number of CPUs, it will dispatch an equal, potentially large, number
of tasks, waking up many CPUs to handle them, even if only one or a few of
the sublists actually have any work to do.
This change adds a check for empty sublists to avoid this.
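A minimal sketch of the check, with hypothetical stand-ins for a sublist and for taskq_dispatch(): the loop skips sublists with no items, so only sublists that have work cause a dispatch.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for a multilist sublist and taskq_dispatch(). */
struct sublist { size_t nitems; };

static int dispatched;

static void dispatch(struct sublist *sl)
{
	(void)sl;
	dispatched++;           /* the real code would wake a taskq thread */
}

/* Queue work only for sublists that actually contain dnodes. */
void sync_sublists(struct sublist *sls, size_t n)
{
	for (size_t i = 0; i < n; i++) {
		if (sls[i].nitems == 0)
			continue;       /* empty: skip the dispatch entirely */
		dispatch(&sls[i]);
	}
}
```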
kevans [Tue, 25 Jun 2019 18:13:39 +0000 (18:13 +0000)]
libbe(3): mount: the BE dataset is mounted at /
Other parts of libbe(3) were fairly strict on the mountpoint property of the
BE dataset, and be_mount was not much better. It was improved in r347027 to
allow mountpoint=none for depth==0, but this bit was still sensitive to
mountpoint != / and mountpoint != none. Given that other parts of libbe(3)
no longer restrict the mountpoint property here, and the rest of the base
system is generally OK and will assume that a BE is mounted at /, let's do
the same.
asomers [Tue, 25 Jun 2019 17:24:43 +0000 (17:24 +0000)]
fusefs: rewrite vop_getpages and vop_putpages
Use the standard facilities for getpages and putpages instead of bespoke
implementations that don't work well with the writeback cache. This has
several corollaries:
* Change the way we handle short reads _again_. vfs_bio_getpages doesn't
provide any way to handle unexpected short reads. Plus, I found some more
lock-order problems. So now when the short read is detected we'll just
clear the vnode's attribute cache, forcing the file size to be requeried
the next time it's needed. VOP_GETPAGES doesn't have any way to indicate
a short read to the "caller", so we just bzero the rest of the page
whenever a short read happens.
* Change the way we decide when to set the FUSE_WRITE_CACHE bit. We now set
it for clustered writes even when the writeback cache is not in use.
luporl [Tue, 25 Jun 2019 17:15:44 +0000 (17:15 +0000)]
[PowerPC64] Don't mark module data as static
Fixes panic when loading ipfw.ko and if_epair.ko built with modern compiler.
Similar to arm64 and riscv, when using a modern compiler (!gcc4.2), the
generated code tries to access data in the wrong location, causing a kernel
panic (data storage interrupt trap) when loading if_epair and ipfw.
Issue was reproduced with kernel/module compiled using gcc8 and clang8. It
affects both ELFv1 and ELFv2 ABI environments.
mav [Tue, 25 Jun 2019 17:00:53 +0000 (17:00 +0000)]
Fix strsep_quote() on strings without quotes.
For strings without quotes and escapes dstptr and srcptr are equal, so
zeroing *dstptr before checking *srcptr is not a good idea. In practice
it means that in -maproot=65534:65533 everything after the colon is lost.
The problem was there since r293305, but before r346976 it was covered by
improper strsep_quote() usage.
PR: 238725
MFC after: 3 days
Sponsored by: iXsystems, Inc.
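The ordering bug is easy to see in a simplified re-implementation. The sketch below is hypothetical (strsep_quote_sketch() is not the mountd function) and assumes double quotes group text and a backslash escapes the next character; the point is that the terminator is written only after *src has been examined.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical simplified strsep_quote(): split *stringp at the first delim
 * character outside double quotes, removing quotes and backslash escapes in
 * place. dst lags src, or equals it when there are no quotes or escapes.
 */
char *strsep_quote_sketch(char **stringp, const char *delim)
{
	char *src, *dst, *ret;
	int inquote = 0;

	if (stringp == NULL || *stringp == NULL)
		return NULL;
	src = dst = ret = *stringp;
	while (*src != '\0') {
		if (*src == '\\' && src[1] != '\0') {
			src++;                  /* drop the backslash */
			*dst++ = *src++;        /* keep the escaped char */
			continue;
		}
		if (*src == '"') {
			inquote = !inquote;     /* drop the quote itself */
			src++;
			continue;
		}
		if (!inquote && strchr(delim, *src) != NULL)
			break;
		*dst++ = *src++;
	}
	/*
	 * Decide where the next token starts before terminating this one:
	 * with dst == src, writing '\0' first would overwrite the delimiter
	 * and make the rest of the string look empty.
	 */
	*stringp = (*src == '\0') ? NULL : src + 1;
	*dst = '\0';
	return ret;
}
```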
asomers [Tue, 25 Jun 2019 16:49:20 +0000 (16:49 +0000)]
fusefs: fix multiple issues with the io tests
* During TearDown, close the test file before the backing file. That way
the backing file artifact will have the correct contents after the test
completes. It doesn't matter when running in Kyua, but it may when
running the test manually.
* Add a closeopen operation that mimics what FSX does with the "-c" option.
* Skip mmap-related tests when vfs.fusefs.data_cache_mode == 0
hselasky [Tue, 25 Jun 2019 11:54:41 +0000 (11:54 +0000)]
Convert all IPv4 and IPv6 multicast memberships into using a STAILQ
instead of a linear array.
The multicast memberships for the inpcb structure are protected by a
non-sleepable lock, INP_WLOCK(), which needs to be dropped when
calling the underlying possibly sleeping if_ioctl() method. When using
a linear array to keep track of multicast memberships, the computed
memory location of the multicast filter may suddenly change, due to
concurrent insertion or removal of elements in the linear array. This
in turn leads to various invalid memory access issues and kernel
panics.
To avoid this problem, put all multicast memberships on a STAILQ based
list. Then the memory locations of the IPv4 and IPv6 multicast filters
stay fixed during their lifetime, and use-after-free and memory leak
issues are easier to track, for example with: vmstat -m | grep multi
All list manipulation has been factored into inline functions
including some macros, to easily allow for a future hash-list
implementation, if needed.
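A minimal user-space sketch of why a list helps, assuming the BSD <sys/queue.h> STAILQ macros (also provided by glibc): each membership is allocated on its own, so a pointer to one stays valid while others are inserted or removed. With a realloc()-grown array, the same pointer could be left dangling. The membership structure and helpers are hypothetical, not the inpcb code.

```c
#include <assert.h>
#include <stdlib.h>
#include <sys/queue.h>

/* Hypothetical membership record; each one is a separate allocation. */
struct membership {
	int group;
	STAILQ_ENTRY(membership) link;
};
STAILQ_HEAD(mship_list, membership);

struct membership *mship_add(struct mship_list *l, int group)
{
	struct membership *m = malloc(sizeof(*m));

	if (m == NULL)
		return NULL;
	m->group = group;
	STAILQ_INSERT_TAIL(l, m, link);  /* other elements never move */
	return m;
}

void mship_del(struct mship_list *l, struct membership *m)
{
	STAILQ_REMOVE(l, m, membership, link);
	free(m);
}
```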
dougm [Tue, 25 Jun 2019 07:44:37 +0000 (07:44 +0000)]
vm_map_protect may return an INVALID_ARGUMENT or PROTECTION_FAILURE
error response after clipping the first map entry in the region to be
reserved. This creates a pair of matching entries that should have
been "simplified" back into one, or never created. This change defers
the clipping of that entry until those two vm_map_protect failure
cases have been ruled out.