Add v_inval_buf_range, like vtruncbuf but for a range of a file
v_inval_buf_range invalidates all buffers within a certain LBA range of a
file. It will be used by fusefs(5). This commit is a partial merge of
r346162, r346606, and r346756 from projects/fuse2.
Reviewed by: kib
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D21032
If not limited by write_same_max_lba option, split operation into several
2^^31 blocks chunks in a loop. For large disks it may take a while, so
setting write_same_max_lba may be useful to avoid timeouts.
Remove support for kernel.tramp and kernel.tramp.gz
Nothing uses these anymore. They were for super small armv4 boards without
uboot. We removed armv4 support before 13.0, but neglected to garbage collect
this at the same time. Today, both flavors of armv5 kernels (mv and ralink) boot
via uboot which has its own compression scheme for boards that need it.
Note: OLDFILES has not been updated beacuse installkernel will move the whole
directory out of the way before installing the new kernel.
Substitute driver-defined IS_P2ALIGNED() with EFX_IS_P2ALIGNED()
defined in libefx.
Add type argument and cast value and alignment to one specified type.
Reported by: Andrea Valsania <andrea.valsania at answervad.it>
Reviewed by: philip
Sponsored by: Solarflare Communications, Inc.
MFC after: 2 days
Differential Revision: https://reviews.freebsd.org/D21076
sfxge(4): fix align to power of 2 when align has smaller type
Substitute driver-defined P2ALIGN() with EFX_P2ALIGN() defined in
libefx.
Cast value and alignment to one specified type to guarantee result
correctness.
Reported by: Andrea Valsania <andrea.valsania at answervad.it>
Reviewed by: philip
Sponsored by: Solarflare Communications, Inc.
MFC after: 2 days
Differential Revision: https://reviews.freebsd.org/D21075
sfxge(4): fix power of 2 round up when align has smaller type
Substitute driver-defined P2ROUNDUP() h with EFX_P2ROUNDUP()
defined in libefx.
Cast value and alignment to one specified type to guarantee result
correctness.
Reported by: Andrea Valsania <andrea.valsania at answervad.it>
Reviewed by: philip
Sponsored by: Solarflare Communications, Inc.
MFC after: 2 days
Differential Revision: https://reviews.freebsd.org/D21074
Lock the vnode before calling ufs_bmap_seekdata().
r346932 replaced a call to vn_bmap_seekhole() with a call to
ufs_bmap_seekdata(). Although vn_bmap_seekhole() locks the vnode,
ufs_bmap_seekdata() assumes it is already locked.
This patch adds locking of the vnode before the ufs_bmap_seekdata() call.
If the vn_lock() call fails, it returns EBADF since that is the normal
error returned when a file system is forced dismounted and is already
listed as an error return in the lseek(2) man page.
virtio: Fix running on machines with memory above 0xffffffff
We want to allocate a contiguous memory block anywhere in memory, but
expressed this as having to be between 0 and 0xffffffff. This limits us
on 64-bit machines, and outright breaks on machines where memory is
mapped above that address range.
Allow the full address range to be used for this allocation.
dim [Fri, 26 Jul 2019 18:49:20 +0000 (18:49 +0000)]
Pull in r366369 from upstream llvm trunk (by Francis Visoiu Mistrih):
[CodeGen][NFC] Simplify checks for stack protector index checking
Use `hasStackProtectorIndex()` instead of `getStackProtectorIndex()
>= 0`.
Pull in r366371 from upstream llvm trunk (by Francis Visoiu Mistrih):
[PEI] Don't re-allocate a pre-allocated stack protector slot
The LocalStackSlotPass pre-allocates a stack protector and makes sure
that it comes before the local variables on the stack.
We need to make sure that later during PEI we don't re-allocate a new
stack protector slot. If that happens, the new stack protector slot
will end up being **after** the local variables that it should be
protecting.
Therefore, we would have two slots assigned for two different stack
protectors, one at the top of the stack, and one at the bottom. Since
PEI will overwrite the assigned slot for the stack protector, the
load that is used to compare the value of the stack protector will
use the slot assigned by PEI, which is wrong.
For this, we need to check if the object is pre-allocated, and re-use
that pre-allocated slot.
Pull in r367068 from upstream llvm trunk (by Francis Visoiu Mistrih):
[CodeGen] Don't resolve the stack protector frame accesses until PEI
Currently, stack protector loads and stores are resolved during
LocalStackSlotAllocation (if the pass needs to run). When this is the
case, the base register assigned to the frame access is going to be
one of the vregs created during LocalStackSlotAllocation. This means
that we are keeping a pointer to the stack protector slot, and we're
using this pointer to load and store to it.
In case register pressure goes up, we may end up spilling this
pointer to the stack, which can be a security concern.
Instead, leave it to PEI to resolve the frame accesses. In order to
do that, we make all stack protector accesses go through frame index
operands, then PEI will resolve this using an offset from sp/fp/bp.
Together, these fix a issue where the stack protection feature in LLVM's
ARM backend can be rendered ineffective when the stack protector slot is
re-allocated so that it appears after the local variables that it is
meant to protect, leaving the function potentially vulnerable to a
stack-based buffer overflow.
Reported by: andrew
Security: https://kb.cert.org/vuls/id/129209/
MFC after: 3 days
commit log from upstream:
Fix race in parallel mount's thread dispatching algorithm
Strategy of parallel mount is as follows.
1) Initial thread dispatching is to select sets of mount points that
don't have dependencies on other sets, hence threads can/should run
lock-less and shouldn't race with other threads for other sets. Each
thread dispatched corresponds to top level directory which may or may
not have datasets to be mounted on sub directories.
2) Subsequent recursive thread dispatching for each thread from 1)
is to mount datasets for each set of mount points. The mount points
within each set have dependencies (i.e. child directories), so child
directories are processed only after parent directory completes.
The problem is that the initial thread dispatching in
zfs_foreach_mountpoint() can be multi-threaded when it needs to be
single-threaded, and this puts threads under race condition. This race
appeared as mount/unmount issues on ZoL for ZoL having different
timing regarding mount(2) execution due to fork(2)/exec(2) of mount(8).
`zfs unmount -a` which expects proper mount order can't unmount if the
mounts were reordered by the race condition.
There are currently two known patterns of input list `handles` in
`zfs_foreach_mountpoint(..,handles,..)` which cause the race condition.
1) #8833 case where input is `/a /a /a/b` after sorting.
The problem is that libzfs_path_contains() can't correctly handle an
input list with two same top level directories.
There is a race between two POSIX threads A and B,
* ThreadA for "/a" for test1 and "/a/b"
* ThreadB for "/a" for test0/a
and in case of #8833, ThreadA won the race. Two threads were created
because "/a" wasn't considered as `"/a" contains "/a"`.
2) #8450 case where input is `/ /var/data /var/data/test` after sorting.
The problem is that libzfs_path_contains() can't correctly handle an
input list containing "/".
There is a race between two POSIX threads A and B,
* ThreadA for "/" and "/var/data/test"
* ThreadB for "/var/data"
and in case of #8450, ThreadA won the race. Two threads were created
because "/var/data" wasn't considered as `"/" contains "/var/data"`.
In other words, if there is (at least one) "/" in the input list,
the initial thread dispatching must be single-threaded since every
directory is a child of "/", meaning they all directly or indirectly
depend on "/".
In both cases, the first non_descendant_idx() call fails to correctly
determine "path1-contains-path2", and as a result the initial thread
dispatching creates another thread when it needs to be single-threaded.
Fix a conditional in libzfs_path_contains() to consider above two.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed by: Sebastien Roy <sebastien.roy@delphix.com> Signed-off-by: Tomohiro Kusumi <kusumi.tomohiro@gmail.com>
PR: 237517, 237397, 239243
Submitted by: Matthew D. Fuller <fullermd@over-yonder.net> (by email)
MFC after: 3 days
This snapshot among other things includes a fix for a crash of mandoc with empty
tbl reported by rea@ (his regression test has been incorporated upstream)
Add reporting of SCSI Feature Sets VPD page from SPC-5.
CTL implements all defined feature sets except Drive Maintenance 2016,
which is not very applicable to such a virtual device, and implemented
only partially now. But may be it could be fixed later at least for
completeness.
Simplify the handling of superpages in pmap_clear_modify(). Specifically,
if a demotion succeeds, then all of the 4KB page mappings within the
superpage-sized region must be valid, so there is no point in testing the
validity of the 4KB page mapping that is going to be write protected.
The timeout field in the CAPS register is defined to be 8 bits, so its type was
uint8_t. We recently started adding 1 to it to cope with rogue devices that
listed 0 timeout time (which is impossible). However, in so doing, other devices
that list 0xff (for a 2 minute timeout) were broken when adding 1
overflowed. Widen the type to be uint32_t like its source register to avoid the
issue.
Added allocation retry loop in alloc_pvo_entry(), to wait for
memory to become available if the caller specifies the M_WAITOK flag.
Also, the loop in moa64_enter() was removed, as moea64_pvo_enter()
never returns ENOMEM. It is alloc_pvo_entry() memory allocation that
can fail and must be retried.
r350320 committed the wrong version of generated syscall.mk.
This commit is for the correct version. (The incorrect one had the order
of the last two entries reversed, due to my testing with copy_file_range
at 568 instead of 569.) This misordering should not have been a problem,
but is now fixed.
r350315 created a Linux compatible copy_file_range(2) syscall.
It uses a VOP method called VOP_COPY_FILE_RANGE so that file systems,
such as the NFSv4.2 client can do file system specific copying.
For NFSv4.2, this allows the copying to be done locally on the NFS server,
avoiding transferring the data across the wire twice.
Add kernel support for a Linux compatible copy_file_range(2) syscall.
This patch adds support to the kernel for a Linux compatible
copy_file_range(2) syscall and the related VOP_COPY_FILE_RANGE(9).
This syscall/VOP can be used by the NFSv4.2 client to implement the
Copy operation against an NFSv4.2 server to do file copies locally on
the server.
The vn_generic_copy_file_range() function in this patch can be used
by the NFSv4.2 server to implement the Copy operation.
Fuse may also me able to use the VOP_COPY_FILE_RANGE() method.
vn_generic_copy_file_range() attempts to maintain holes in the output
file in the range to be copied, but may fail to do so if the input and
output files are on different file systems with different _PC_MIN_HOLE_SIZE
values.
Separate commits will be done for the generated syscall files and userland
changes. A commit for a compat32 syscall will be done later.
fusefs file systems may have a fsname subtype (set by mount_fusefs's "-o
subtype" option) that gets appended to the fsname as returned by statfs(2).
The subtype is set on a per-mount basis so it isn't part of the struct
vfsconf. Special-case getvfsbyname to match either the full "fusefs.foobar"
or short "fusefs" fsname.
This is a merge of r348007, r348054, and r350093 from projects/fuse2
Reviewed by: kib
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D21043
Summary:
It turns out statistics accounting is very expensive in the pmap driver,
and doesn't seem necessary in the common case. Make this optional
behind a MOEA64_STATS #define, which one can set if they really need
statistics.
Fix the fix to the logic bug. Upon further testing, the bug is that we shadoow
opt.vendor with vendor. We shouldn't. Delete the latter and use the former
everywhere and restore the prior logic which is now correct.
turnstile_{lock,unlock}() were added for use in epoch. turnstile_lock()
returned NULL to indicate that the calling thread had lost a race and
the turnstile was no longer associated with the given lock, or the lock
owner. However, reader-writer locks may not have a designated owner,
in which case turnstile_lock() would return NULL and
epoch_block_handler_preempt() would leak spinlocks as a result.
Apply a minimal fix: return the lock owner as a separate return value.
Reviewed by: kib
MFC after: 3 days
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D21048
Commit text by Jake:
If a driver's IFDI_ATTACH_PRE function fails, the iflib_device_register
function will free the ctx pointer. However, it does not reset the
device softc pointer to NULL.
This will result in memory corruption as a future access to the now
invalid pointer will corrupt memory that is later allocated on top of
the same memory location.
The iflib_device_deregister function correctly resets the softc pointer
by using device_set_softc().
This clears up the invalid dangling pointer and prevents memory
corruption that could lead to a panic or undefined behavior if the
device's driver failed to attach.
libsysdecode: add explicit dependencies on recently changed headers
r349369 removed IP_MIN_MEMBERSHIPS and IPV6_MIN_MEMBERSHIPS, and r349893
removed TCP_RACK_SESS_CWV. libsysdecode lacked dependencies to trigger a
rebuild of tables.h.
Add explicit dependencies as a workaround to address these specific
cases; a holistic solution is still needed.
Add a sysctl variable ts_offset_per_conn to change the computation
of the TCP TS offset from taking the IP addresses and the TCP port
numbers into account to a version just taking only the IP addresses
into account. This works around broken middleboxes or endpoints.
The default is to keep the behaviour, which is also the behaviour
recommended in RFC 7323.
Fix the register layout for the Buffer Descript List Entry. It
got jumbled around during some other cleanups and was causing
audio failures on some guests.
ixgbe(4): Fix enabling/disabling and reconfiguration of queues
- Wrong order of casting and bit shift caused that enabling and disabling
queues didn't work properly for queues number larger than 32. Use literals
with right suffix instead.
- TX ring tail address was not updated during reinitiailzation of TX
structures. It could block sending traffic.
- Also remove unused variables 'eims' and 'active_queues'.
andrew [Tue, 23 Jul 2019 14:40:37 +0000 (14:40 +0000)]
Ensure the arm64 ID register fields are 64 bit types.
Previously only some of the ID register fields were 64 bit. To allow
for a script to generate these mark them all 64 bit. To allow for their
use in assembly we need to use the UINT64_C macro via a new UL macro
to stop the lines from being too long.
After r343631 pfil hooks are invoked in net_epoch_preempt section,
this allows to avoid extra locking. Add NET_EPOCH_ASSER() assertion
to each ipfw_bpf_*tap*() call to require to be called from inside
epoch section.
Use NET_EPOCH_WAIT() in ipfw_clone_destroy() to wait until it becomes
safe to free() ifnet. And use on-stack ifnet pointer in each
ipfw_bpf_*tap*() call to avoid NULL pointer dereference in case when
V_*log_if global variable will become NULL during ipfw_bpf_*tap*() call.
While for ATA disks resize is even more rare situation than for SCSI, it
may happen in case of HPA or AMA being used. Make ATA XPT report minor
IDENTIFY DATA change to upper layers with AC_GETDEV_CHANGED, and ada(4)
periph driver handle that event, recalculating all the disk properties and
signalling resize to GEOM. Since ATA has no mechanism of UNIT ATTENTIONs,
like SCSI, it has no way to detect that something has changed. That is why
this functionality depends on explicit reprobe via XPT_REPROBE_LUN call.
MFC after: 2 weeks
Relnotes: yes
Sponsored by: iXsystems, Inc.
ata_xpt: Use the correct union member when accessing valid.
In principle this should not matter as it's a union and they point to
the same memory location but based on the code above we should be
accessing .sata and not .ata.
Remove the USE_RFC2292BIS option and reap dead code
This option was imported as part of the KAME project in r62627 (in 2000).
It was turned on unconditionally in r121472 (in 2003) and has been on ever
since. The old alternative code has bitrotted. Reap the dead code.
o Add support for BERI IOMMU device
o Add an experimental IOMMU support to xDMA framework
The BERI IOMMU device is the part of CHERI device-model project [1]. It
translates memory addresses for various BERI peripherals modelled in
software. It accepts FreeBSD/mips64 page directories format and manages
BERI TLB.
powerpc64/mmu: Make moea64_pvo_enter() return if an entry already exists
Summary:
Instead of searching for a PVO entry before adding, take advantage of
the fact that RB_INSERT() returns NULL if it inserts, and the existing entry if
an entry exists, without inserting a new entry. This saves an extra tree
traversal in the cases where the PVO does not exist.
With the introduction of software dirty bit emulation for managed mappings,
we should test ATTR_SW_DBM, not ATTR_AP_RW, to determine whether to set
PGA_WRITEABLE. In effect, we are currently setting PGA_WRITEABLE based on
whether the dirty bit is preset, not whether the mapping is writeable.
Correct this mistake.
Check and avoid overflow when incrementing fp->f_count in
fget_unlocked() and fhold().
On sufficiently large machine, f_count can be legitimately very large,
e.g. malicious code can dup same fd up to the per-process
filedescriptors limit, and then fork as much as it can.
On some smaller machine, I see
kern.maxfilesperproc: 939132
kern.maxprocperuid: 34203
which already overflows u_int. More, the malicious code can create
transient references by sending fds over unix sockets.
I realized that this check is missed after reading
https://secfault-security.com/blog/FreeBSD-SA-1902.fd.html