Follow-up to r346046. These two commits implement fuse cache timeouts for
both entries and attributes. They also remove the vfs.fusefs.lookup_cache
enable sysctl, which is no longer needed now that cache timeouts are
honored.
The FUSE protocol allows the server to specify the timeout period for the
client's attribute and entry caches. This commit implements the timeout
period for the attribute cache. The entry cache's timeout period is
currently disabled because it panics, and is guarded by the
vfs.fusefs.lookup_cache_expire sysctl.
PR: 235773
Reported by: cem
Sponsored by: The FreeBSD Foundation
FUSE_LOOKUP, FUSE_GETATTR, FUSE_SETATTR, FUSE_MKDIR, FUSE_LINK,
FUSE_SYMLINK, FUSE_MKNOD, and FUSE_CREATE all return file attributes with a
cache validity period. fusefs will now cache the attributes, if the server
returns a non-zero cache validity period.
This change does _not_ implement finite attr cache timeouts. That will
follow as part of PR 235773.
PR: 235775
Reported by: cem
Sponsored by: The FreeBSD Foundation
VOP_ACCESS was never fully implemented in fusefs. This change:
* Removes the FACCESS_DO_ACCESS flag, which pretty much disabled the whole
vop.
* Removes a quixotic special case for VEXEC on regular files. I don't know
why that was in there.
* Removes another confusing special case for VADMIN.
* Removes the FACCESS_NOCHECKSPY flag. It seemed to be a performance
optimization, but I'm unconvinced that it was a net positive.
* Updates test cases.
This change does NOT implement -o default_permissions. That will be handled
separately.
fusefs: enforce -onoallow_other even beneath the mountpoint
When -o allow_other is not in use, fusefs is supposed to prevent access to
the filesystem by any user other than the one who owns the daemon. Our
fusefs implementation was only enforcing that restriction at the mountpoint
itself. That was usually good enough because lookup usually descends from
the mountpoint. However, there are cases when it doesn't, such as when
using openat relative to a file beneath the mountpoint.
These tests were actually fixed by r345398, r345390 and r345392, but I
neglected to reenable them. Too bad googletest doesn't have the notion of
an Expected Failure like ATF does.
PR: 236474, 236473
Sponsored by: The FreeBSD Foundation
cem [Thu, 4 Apr 2019 23:32:27 +0000 (23:32 +0000)]
sort(1): randomcoll: Skip the memory allocation entirely
There's no reason to order based on strcmp of ASCII digests instead of
memcmp of the raw digests.
While here, remove collision fallback. If you collide two MD5s, they're
probably the same string anyway. If robustness against MD5 collisions is
desired, maybe we shouldn't use MD5.
None of the behavior of sort -R is specified by POSIX, so we're free to
implement this however we like. E.g., using a 128-bit counter and block cipher
to generate unique indices for each line of input.
PR: 230792 (2/many)
Relnotes: This will change the sort order for a given dataset with a
given seed. Other similarly breaking changes are planned.
Sponsored by: Dell EMC Isilon
Revert r320698, since the related userland changes were reverted by r338192.
r338192 reverted the changes to nfsuserd so that it could use an AF_LOCAL
socket, since it resulted in a vnode locking panic().
Post r338192 nfsuserd daemons use the old AF_INET socket for upcalls and
do not use these kernel changes.
I left them in for a while, so that nfsuserd daemons built from head sources
between r320757 (Jul. 6, 2017) and r338192 (Aug. 22, 2018) would need them
by default.
This only affects head, since the changes were never MFC'd.
I will add an UPDATING entry, since an nfsuserd daemon built from head
sources between r320757 and r338192 will not run unless the "-use-udpsock"
option is specified. (This command line option is only in the affected
revisions of the nfsuserd daemon.)
I suspect few will be affected by this, since most who run systems built
from head sources (not stable or releases) will have rebuilt their nfsuserd
daemon from sources post r338192 (Aug. 22, 2018)
This is being reverted in preparation for an update to include AF_INET6
support to the code.
If a fuse file system returne FOPEN_KEEP_CACHE in the open or create
response, then the client is supposed to _not_ clear its caches for that
file. I don't know why clearing the caches would be the default given that
there's a separate flag to bypass the cache altogether, but that's the way
it is. fusefs(5) will now honor this flag.
Our behavior is slightly different than Linux's because we reuse file
handles. That means that open(2) wont't clear the cache if there's a
reusable file handle, even if the file server wouldn't have sent
FOPEN_KEEP_CACHE had we opened a new file handle like Linux does.
This bug was long present, but was exacerbated by r345876.
The problem is that fiov_refresh was bzero()ing a buffer _before_ it
reallocated that buffer. That's obviously the wrong order. I fixed the
order in r345876, which exposed the main problem. Previously, the first 160
bytes of the buffer were getting bzero()ed when it was first allocated in
fiov_init. Subsequently, as that buffer got recycled between callers, the
portion used by the _previous_ caller was getting bzero()ed by the current
caller in fiov_refresh. The problem was never visible simply because no
caller was trying to use more than 160 bytes.
Now the buffer gets properly bzero()ed both at initialization time and any
time it gets enlarged or reallocated.
Use IN_foo() macros from sys/netinet/in.h inplace of handcrafted code
There are a few places that use hand crafted versions of the macros
from sys/netinet/in.h making it difficult to actually alter the
values in use by these macros. Correct that by replacing handcrafted
code with proper macro usage.
- Remove issues that no longer apply thanks to devfs
- Add language pointing out devfs's role and referencing its config
- Add a "historical notes" section and move discussion of block vs character devs to it, including pointing out the removal of block devs
- Modernize some examples
If a FUSE daemon returns FOPEN_DIRECT_IO when a file is opened, then it's
allowed to write less data than was requested during a FUSE_WRITE operation
on that file handle. fusefs should simply return a short write to userland.
The old code attempted to resend the unsent data. Not only was that
incorrect behavior, but it did it in an ineffective way, by attempting to
"rewind" the uio and uiomove the unsent data again.
This commit correctly handles short writes by returning directly to
userland if FOPEN_DIRECT_IO was set. If it wasn't set (making the short
write technically a protocol violation), then we resend the unsent data.
But instead of rewinding the uio, just resend the data that's already in the
kernel.
That necessitated a few changes to fuse_ipc.c to reduce the amount of bzero
activity. fusefs may be marginally faster as a result.
Fix malloc stats for the RPCSEC_GSS server code when DEBUG is enabled.
The code enabled when "DEBUG" is defined uses mem_alloc(), which is a
malloc(.., M_RPC, M_WAITOK | M_ZERO), but then calls gss_release_buffer()
which does a free(.., M_GSSAPI) to free the memory.
This patch fixes the problem by replacing mem_alloc() with a
malloc(.., M_GSSAPI, M_WAITOK | M_ZERO).
This bug affects almost no one, since the sources are not normally built
with "DEBUG" defined.
Implement tests for online expansion:
- init, init -R
- onetime, onetime -R
- 512 and 4k sectors
- encryption only
- encryption and authentication
- configure -r/-R for detached providers
- configure -r/-R for attached providers
- all keys allocated (10, 20 and 30MB provider sizes)
- keys allocated on demand (10, 20 and 30PB provider sizes)
- reading and writing to provider after expansion (10-30MB only)
- checking if metadata in old location is cleared.
Implement automatic online expansion of GELI providers - if the underlying
provider grows, GELI will expand automatically and will move the metadata
to the new location of the last sector.
This functionality is turned on by default. It can be turned off with the
-R flag, but it is not recommended - if the underlying provider grows and
automatic expansion is turned off, it won't be possible to attach this
provider again, as the metadata is no longer located in the last sector.
If the automatic expansion is turned off and the underlying provider grows,
GELI will only log a message with the previous size of the provider, so
recovery can be easier.
phil [Wed, 3 Apr 2019 21:55:39 +0000 (21:55 +0000)]
Import libxo-1.0.2
from 1.0.0:
Add "continuation" flag, to allow multiple "xo" invocations in a single line of output (#58)
Add --top-wrap to make top-level JSON wrappers
Add --{open,close}-{list,instace} options
Add xo_xml_leader(), to detect use of some bogus XML tags. It's still bad form, but it's a little safer now
Avoid call to xo_write before xo_flush, since the latter calls the former
Check return code from xo_flush_h properly (<0) (FreeBSD Bug 236935)
For JSON output, avoid newline before a container's close brace (#62)
Merge branch 'text_only' of https://github.com/zvr/libxo into zvr-text_only
Use XO_USE_INT_RETURN_CODES, not USE_INT_RETURN_CODES
add docs for --continuation
add docs for --not-first
call xo_state_set_flags before values and close containers; add XOIF_MADE_OUTPUT flag to track state; make proper empty JSON objects in xo_finish
color_map code has to be #ifdef'd out, since the struct definition
correct xo_flush_func_t (doesn't use xo_ssize_t)
make depth change for --top-wrap only for JSON
fix to handle --top-wrap in "xo" by being more consistent with handling trailing newlines
fix to handle text-only version #64 (from zvr)
fix xo_buf_has_room for round up to the next XO_BUFSIZ, not just add XO_BUFSIZ to the size (FreeBSD Bug 236937)
update docs for new "xo" options
update functions to use xo_ssize_t
update test cases
from 1.0.1:
Add EINTEGRITY to .pot files under test/gettext/ (fix from FreeBSD)
from 1.0.2:
handle failure from xo_vnsprintf; don't add -1 to "rc"
PR: 236937, 236935
Submitted by: phil
Reported by: Alfonso S. Siciliano <alfix86@gmail.com>
MFC after: 2 weeks
If MACHINE_ARCH doesn't match TARGET_ARCH, and we're not in the special
case of building i386 images on an amd64 host, we need to pull in the
qemu-user-static package; this allows us to run some commands inside
the VM disk image chroot, most notably to install packages.
In r337703 DTS files were updated to Linux 4.18, including Linux commit 4d8b032d3c03f4e9788a18bbb51b10e6c9e8a56b which removed the `phy_id`
property from am335x-bone-common (as the property was deprecated).
Use `phy-handle` via fdt_get_phyaddr, keeping the existing code as a
fallback for old DTBs.
PR: 236624
Submitted by: manu, Gerald Aryeetey <aryeeteygerald_rogers.com>
Reported by: Gerald Aryeetey
Reviewed by: manu
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D19814
The original fusefs import, r238402, contained a bug in fuse_vnop_close that
could close a directory's file handle while there were still other open file
descriptors. The code looks deliberate, but there is no explanation for it.
This necessitated a workaround in fuse_vnop_readdir that would open a new
file handle if, "for some mysterious reason", that vnode didn't have any
open file handles. r345781 had the effect of causing the workaround to
panic, making the problem more visible.
This commit removes the workaround and the original bug, which also fixes
the panic.
The FUSE protocol says that FUSE_FLUSH should be send every time a file
descriptor is closed. That's not quite possible in FreeBSD because multiple
file descriptors can share a single struct file, and closef doesn't call
fo_close until the last close. However, we can still send FUSE_FLUSH on
every VOP_CLOSE, which is probably good enough.
There are two purposes for FUSE_FLUSH. One is to allow file systems to
return EIO if they have an error when writing data that's cached
server-side. The other is to release POSIX file locks (which fusefs(5) does
not yet support).
PR: 236405, 236327
Sponsored by: The FreeBSD Foundation
libbe(3): Add a serial to the generated snapshot names
To use bectl in an example, when one creates a new boot environment with
either `bectl create <be>` or `bectl create -e <otherbe> <be>`, libbe will
take a snapshot of the original boot environment to clone. Previously, this
used %F-%T date format as the snapshot name, but this has some limitations-
attempting to create multiple boot environments in quick succession may
collide if done within the same second.
Tack a serial onto it to reduce the chances of a collision... we could still
collide if multiple processes/threads are creating boot environments at the
same time, but this is likely not a big concern as this has only been
reported as occurring in freebsd-ci setup.
msdosfs: zero tail of the last block on truncation for VREG vnodes as well.
Despite the call to vtruncbuf() from detrunc(), which results in
zeroing part of the partial page after EOF, there still is a
possibility to retain the stale data which is revived on file
enlargement. If the filesystem block size is greater than the page
size, partial block might keep other after-EOF pages wired and they
get reused then. Fix it by zeroing whole part of the partial buffer
after EOF, not relying on vnode_pager_setsize().
PR: 236977
Reported by: asomers
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Follow the declared behaviour that specifies server string format in
bsnmpclient(3).
snmp_parse_server() function accepts string where some fields can be
omitted: [trans::][community@][server][:port]
"trans" field can be "udp", "udp6", "dgram" and "stream".
"community" can be empty string, if it is omitted, the default value
will be used. For read_community it is "public", for write_comminity
it is "private". "server" field can be hostname, IPv4 address or IPv6
address. IPv6 address should be specified in brackets "[]".
If port is omitted, the default value "snmp" will be used for "udp"
and "udp6" transports. So, now for bsnmpget(1) and bsnmwalk(1) it is
not required to specify all fields in argument of '-s' option. E.g.
Harvesting has to compete for the TPM chip with userspace.
Before this change the callout could hijack an unread buffer
causing a userspace call to the TPM to fail.
powerpc: Allow emulating optional FPU instructions on CPUs with an FPU
The e5500 has an FPU, but lacks the optional fsqrt instruction. This
instruction gets emulated in the kernel, but the emulation uses stale data,
from the last switch out, and does not return the result of the operation
immediately. Fix both of these conditions by saving and restoring the FPRs
around the emulation point.
Create kernel module to parse Veriexec manifest based on envs
The current approach of injecting manifest into mac_veriexec is to
verify the integrity of it in userspace (veriexec (8)) and pass its
entries into kernel using a char device (/dev/veriexec).
This requires verifying root partition integrity in loader,
for example by using memory disk and checking its hash.
Otherwise if rootfs is compromised an attacker could inject their own data.
This patch introduces an option to parse manifest in kernel based on envs.
The loader sets manifest path and digest.
EVENTHANDLER is used to launch the module right after the rootfs is mounted.
It has to be done this way, since one might want to verify integrity of the init file.
This means that manifest is required to be present on the root partition.
Note that the envs have to be set right before boot to make sure that no one can spoof them.
powerpc: Apply r178139 from sparc64 to powerpc's fpu_sqrt
This fix was committed less than 2 months after the code was forked into the
powerpc kernel. Though powerpc doesn't use quad-precision floating point,
or need it for emulation, the changes do look like correctness fixes
overall.
This was found while trying to get fsqrt emulation working on e5500, which
does have a real FPU, but lacks the fsqrt instruction. This is not the
complete fix, the rest is to be committed separately.
fusefs: during ftruncate, discard cached data past truncation point
During truncate, fusefs was discarding entire cached blocks, but it wasn't
zeroing out the unused portion of a final partial block. This resulted in
reads returning stale data.
PR: 233783
Reported by: fsx
Sponsored by: The FreeBSD Foundation
Fix a race in the RPCSEC_GSS server code that caused crashes.
When a new client structure was allocated, it was added to the list
so that it was visible to other threads before the expiry time was
initialized, with only a single reference count.
The caller would increment the reference count, but it was possible
for another thread to decrement the reference count to zero and free
the structure before the caller incremented the reference count.
This could occur because the expiry time was still set to zero when
the new client structure was inserted in the list and the list was
unlocked.
This patch fixes the race by initializing the reference count to two
and initializing all fields, including the expiry time, before inserting
it in the list.
cxgbe(4): Add a flag to indicate that bits in interrupt cause but not in
interrupt enable are not fatal.
The firmware sets up all the interrupt enables based on run time
configuration, which means the information in the enables is more
accurate than what's compiled into the driver. This change also allows
the fatal bits to be updated without any changes in the driver in some
cases.
This commit cleans up after recent commits, especially 345766, 345768, and
345781. There is no functional change. The most important change is to add
comments documenting why we can't send flags like O_APPEND in
FUSE_WRITE_OPEN.
dim [Tue, 2 Apr 2019 18:01:54 +0000 (18:01 +0000)]
Fix regression in top(1) after r344381, causing informational messages
to no longer be displayed. This was because the reimplementation of
setup_buffer() did not copy the previous contents into any reallocated
buffer.
Reported by: James Wright <james.wright@jigsawdezign.com>
PR: 236947
MFC after: 3 days
In particular:
- suspend the mount around vflush() to avoid new writes come after the
vnode is processed;
- flush pending metadata updates (mostly node times);
- remap all rw mappings of files from the mount into ro.
It is not clear to me how to handle writeable mappings on rw->ro for
tmpfs best. Other filesystems, which use vnode vm object, call
vgone() on vnodes with writers, which sets the vm object type to
OBJT_DEAD, and keep the resident pages and installed ptes as is. In
particular, the existing mappings continue to work as far as
application only accesses resident pages, but changes are not flushed
to file.
For tmpfs the vm object of VREG vnodes also serves as the data pages
container, giving single copy of the mapped pages, so it cannot be set
to OBJT_DEAD. Alternatives for making rw mappings ro could be either
invalidating them at all, or marking as CoW.
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
Differential revision: https://reviews.freebsd.org/D19737
This patch adds a new table begemotSnmpdTransInetTable that uses the
InetAddressType textual convention and can be used to create listening
ports for IPv4, IPv6, zoned IPv6 and based on DNS names. It also supports
future extension beyond UDP by adding a protocol identifier to the table
index. In order to support this gensnmptree had to be modified.
* Crank the OPAL state machine during the receive loop, to make sure the
pollers are executed
* Add a proper detach function, so the module can be unloaded and reloaded
at runtime.
It still doesn't reliably work 100% of the time on POWER9, and it appears
timing and/or cache related. It may work on POWER8 now.
powernv: Port OPAL asynchronous framework to use the new message framework
Since OPAL_GET_MSG does not discriminate between message types, asynchronous
completion events may be received in the OPAL_GET_MSG call, which dequeues
them from the list, thus preventing OPAL_CHECK_ASYNC_COMPLETION from
succeeding. Handle this case by integrating with the messaging framework.
Summary:
OPAL needs to be kicked periodically in order for the firmware to make
progress on its tasks. To do so, create a heartbeat thread to perform this task
every N milliseconds, defined by the device tree. This task is also a central
location to handle all messages received from OPAL.
Integrate capsicum-test into the FreeBSD test suite
This change takes capsicum-test from upstream and applies some local changes to make the
tests work on FreeBSD when executed via Kyua.
The local modifications are as follows:
1. Make `OpenatTest.WithFlag` pass with the new dot-dot lookup behavior in FreeBSD 12.x+.
2. capsicum-test references a set of helper binaries: `mini-me`, `mini-me.noexec`, and
`mini-me.setuid`, as part of the execve/fexecve tests, via execve, fexecve, and open.
It achieves this upstream by assuming `mini-me*` is in the current directory, however,
in order for Kyua to execute `capsicum-test`, it needs to provide a full path to
`mini-me*`. In order to achieve this, I made `capsicum-test` cache the executable's
path from argv[0] in main(..) and use the cached value to compute the path to
`mini-me*` as part of the execve/fexecve testcases.
3. The capsicum-test test suite assumes that it's always being run on CAPABILITIES enabled
kernels. However, there's a chance that the test will be run on a host without a
CAPABILITIES enabled kernel, so we must check for the support before running the tests.
The way to achieve this is to add the relevant `feature_present("security_capabilities")`
check to SetupEnvironment::SetUp() and skip the tests when the support is not available.
While here, add a check for `kern.trap_enotcap` being enabled. As noted by markj@ in
https://github.com/google/capsicum-test/issues/23, this sysctl being enabled can trigger
non-deterministic failures. Therefore, the tests should be skipped if this sysctl is
enabled.
All local changes have been submitted to the capsicum-test project
(https://github.com/google/capsicum-test) and are in various stages of review.
Please see the following pull requests for more details:
1. https://github.com/google/capsicum-test/pull/35
2. https://github.com/google/capsicum-test/pull/41
3. https://github.com/google/capsicum-test/pull/42
fusefs: send FUSE_OPEN for every open(2) with unique credentials
By default, FUSE performs authorization in the server. That means that it's
insecure for the client to reuse FUSE file handles between different users,
groups, or processes. Linux handles this problem by creating a different
FUSE file handle for every file descriptor. FreeBSD can't, due to
differences in our VFS design.
This commit adds credential information to each fuse_filehandle. During
open(2), fusefs will now only reuse a file handle if it matches the exact
same access mode, pid, uid, and gid of the calling process.
There is some code duplication in error handling paths in a few functions.
Create a function for printing such errors in human-readable way and get rid
of duplicates.
Import proof-of-concept for handling `GTEST_SKIP()` in `Environment::SetUp`
Per the upstream pull-request [1]:
```
gtest prior to this change would completely ignore `GTEST_SKIP()` if
called in `Environment::SetUp()`, instead of bailing out early, unlike
`Test::SetUp()`, which would cause the tests themselves to be skipped.
The only way (prior to this change) to skip the tests would be to
trigger a fatal error via `GTEST_FAIL()`.
Desirable behavior, in this case, when dealing with
`Environment::SetUp()` is to check for prerequisites on a system
(example, kernel supports a particular featureset, e.g., capsicum), and
skip the tests. The alternatives prior to this change would be
undesirable:
- Failing sends the wrong message to the test user, as the result of the
tests is indeterminate, not failed.
- Having to add per-test class abstractions that override `SetUp()` to
test for the capsicum feature set, then skip all of the tests in their
respective SetUp fixtures, would be a lot of human and computational
work; checking for the feature would need to be done for all of the
tests, instead of once for all of the tests.
For those reasons, making `Environment::SetUp()` handle `GTEST_SKIP()`,
by not executing the testcases, is the most desirable solution.
In order to properly diagnose what happened when running the tests if
they are skipped, print out the diagnostics in an ad hoc manner.
Update the documentation to note this change and integrate a new test,
gtest_skip_in_environment_setup_test, into the test suite.
This change addresses #2189.
Signed-off-by: Enji Cooper <yaneurabeya@gmail.com>
```
The goal with my merging in this change is to avoid requiring extensive
refactoring/retesting of test suites when ensuring prerequisites are met,
e.g., checking for a CAPABILITIES-enabled kernel before running capsicum-test
(see D19758 for more details).
The proof-of-concept is being imported before accepted by the upstream
project due to the fact that the upstream project is undergoing a potential
development freeze and the maintainers aren't responding to my PR.
'be_destroy' can destroy a boot environment (by name) or a given snapshot.
If the target to be destroyed is a dataset, check if it's mounted. We don't
want to check if the origin dataset is mounted when destroying a snapshot.
O_EXEC is useful for fexecve(2) and fchdir(2). Treat it as another fufh
type alongside the existing RDONLY, WRONLY, and RDWR. Prior to r345742 this
would've caused a memory and performance penalty.
r345742 replaced fusefs's fufh array with a fufh list. But it left a few
array idioms in place. This commit replaces those idioms with more
efficient list idioms. One location is in fuse_filehandle_close, which now
takes a pointer argument. Three other locations are places that had to loop
over all of a vnode's fuse filehandles.
mckusick [Sun, 31 Mar 2019 21:34:58 +0000 (21:34 +0000)]
When using the force option to shut down a memory-disk device,
I/O operations already in its queue were not being properly drained.
The GEOM framework does the queue draining, but the device driver
needs to wait for the draining to happen. The waiting is done by
adding a g_md_providergone() function to wait for the I/O operations
to finish up.
It is likely that every GEOM provider that implements orphaning
attached GEOM consumers needs to use the "providergone" mechanism
for this same reason, but some of them do not do so. Apparently
Kenneth Merry (ken@) added the drain for just such races, but he
missed adding it to some of the device drivers that needed it.