Rick Macklem [Sat, 8 May 2021 00:30:56 +0000 (17:30 -0700)]
nfscl: Add support for va_birthtime to NFSv4
There is an NFSv4 file attribute called TimeCreate
that can be used for va_birthtime.
r362175 added some support for use of TimeCreate.
This patch completes support of va_birthtime by adding
support for setting this attribute to the server.
It also enables the client to
acquire and set the attribute for an NFSv4
server that supports the attribute.
Mark Johnston [Fri, 14 May 2021 14:07:56 +0000 (10:07 -0400)]
kqueue timer: Remove detached knotes from the process stop queue
There are some scenarios where a timer event may be detached when it is
on the process' kqueue timer stop queue. If kqtimer_proc_continue() is
called after that point, it will iterate over the queue and access freed
timer structures.
It is also possible, at least in a multithreaded program, for a stopped
timer event to be scheduled without removing it from the process' stop
queue. Ensure that we do not doubly enqueue the event structure in this
case.
Reported by: syzbot+cea0931bb4e34cd728bd@syzkaller.appspotmail.com
Reported by: syzbot+9e1a2f3734652015998c@syzkaller.appspotmail.com
Reviewed by: kib
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D30251
Andriy Gapon [Fri, 7 May 2021 07:17:57 +0000 (10:17 +0300)]
storvsc: fix auto-sense reporting
I saw a situation where the driver set CAM_AUTOSNS_VALID on a failed ccb
even though SRB_STATUS_AUTOSENSE_VALID was not set in the status.
The actual sense data remained all zeros.
The problem seems to be that create_storvsc_request() always sets
hv_storvsc_request::sense_info_len, so checking for sense_info_len != 0
is not enough to determine if any auto-sense data is actually available.
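The condition described above can be sketched in isolation. This is a
minimal illustration, not the driver code: struct ex_request and the
EX_AUTOSENSE_VALID bit value are hypothetical stand-ins.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the status bit; not the driver's real value. */
#define EX_AUTOSENSE_VALID	0x80u

/* Hypothetical, reduced view of a completed request. */
struct ex_request {
	uint8_t	srb_status;
	size_t	sense_info_len;
};

/*
 * Sense data counts as available only when the status says so.
 * A non-zero length alone proves nothing if the request setup path
 * always fills in the buffer length.
 */
static bool
ex_autosense_available(const struct ex_request *req)
{
	assert(req != NULL);
	return ((req->srb_status & EX_AUTOSENSE_VALID) != 0 &&
	    req->sense_info_len != 0);
}
```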
Andriy Gapon [Thu, 6 May 2021 18:49:37 +0000 (21:49 +0300)]
PCI hot-plug: use dedicated taskqueue for device attach / detach
Attaching and detaching devices can be heavy-weight and detaching can
sleep waiting for events. For that reason using the system-wide
single-threaded taskqueue_thread is not really appropriate.
There is even a possibility for a deadlock if taskqueue_thread is used
for detaching.
In fact, there is an easy to reproduce deadlock involving nvme, pass
and a sudden removal of an NVMe device.
A pass peripheral would not release a reference on an nvme sim until
pass_shutdown_kqueue() is executed via taskqueue_thread. But the
taskqueue's thread is blocked in nvme_detach() -> ... -> cam_sim_free()
because of the outstanding reference.
Colin Percival [Sat, 15 May 2021 05:57:38 +0000 (22:57 -0700)]
MFC fixes to hostuuid handling
330f110b:
Fix 'hostuuid: preload data malformed' warning
If the preloaded hostuuid value is invalid and verbose booting is
enabled, a warning is printed. This printf had two bugs:
1. It was missing a trailing \n character.
2. The malformed UUID is printed with %s even though it is not known
to be NUL-terminated.
This commit adds the missing \n and uses %.*s with the (already known)
length of the preloaded UUID to ensure that we don't read past the end
of the buffer.
Reported by: kevans
Fixes: c3188289 Preload hostuuid for early-boot use
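A userspace sketch of the %.*s technique, assuming a buffer with a known
length and no terminator. The helper name and the exact message layout
are illustrative only.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>
#include <stdio.h>

/*
 * Format a warning about data that is not known to be NUL-terminated.
 * "%.*s" takes an int precision that caps how many bytes are read,
 * so the read never runs past data[len - 1].
 */
static int
ex_format_warning(char *out, size_t outsz, const char *data, size_t len)
{
	assert(out != NULL && data != NULL && len <= INT_MAX);
	return (snprintf(out, outsz,
	    "hostuuid: preload data malformed: '%.*s'\n", (int)len, data));
}
```

With a plain %s the same call would keep reading until it happened to
find a zero byte.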
b6be9566:
Fix buffer overflow in preloaded hostuuid cleaning
When a module of type "hostuuid" is provided by the loader,
prison0_init strips any trailing whitespace and ASCII control
characters by (a) adjusting the buffer length, and (b) zeroing out
the characters in question, before storing it as the system's
hostuuid.
The buffer length adjustment was correct, but the zeroing overwrote
one byte higher in memory than intended -- in the typical case,
zeroing one byte past the end of the hostuuid buffer. Due to the
layout of buffers passed by the boot loader to the kernel, this will
be the first byte of a subsequent buffer.
This was *probably* harmless; prison0_init runs after preloaded kernel
modules have been linked and after the preloaded /boot/entropy cache
has been processed, so in both cases having the first byte overwritten
will not cause problems. We cannot however rule out the possibility
that other objects which are preloaded by the loader could suffer from
having the first byte overwritten.
Since the zeroing does not in fact serve any purpose, remove it and
trim trailing whitespace and ASCII control characters by adjusting
the buffer length alone.
Fixes: c3188289 Preload hostuuid for early-boot use
Reviewed by: kevans, markj
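Trimming by length alone can be sketched as below. This is an assumed
userspace equivalent, not the prison0_init code; it treats every byte at
or below 0x20 (space and the ASCII control characters beneath it) as
trimmable.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Return the length of buf once trailing whitespace and control
 * characters are dropped.  The buffer itself is never written, so
 * nothing outside it can be clobbered.
 */
static size_t
ex_trimmed_len(const char *buf, size_t len)
{
	assert(buf != NULL || len == 0);
	while (len > 0 && (unsigned char)buf[len - 1] <= 0x20)
		len--;
	return (len);
}
```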
Mark Johnston [Thu, 13 May 2021 12:33:23 +0000 (08:33 -0400)]
fork: Suspend other threads if both RFPROC and RFMEM are not set
Otherwise, a multithreaded parent process may trigger races in
vm_forkproc() if one thread calls rfork() with RFMEM set and another
calls rfork() without RFMEM.
Also simplify vm_forkproc() a bit: vmspace_unshare() already checks to
see if the address space is shared.
Reported by: syzbot+0aa7c2bec74c4066c36f@syzkaller.appspotmail.com
Reported by: syzbot+ea84cb06937afeae609d@syzkaller.appspotmail.com
Reviewed by: kib
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D30220
Mark Johnston [Thu, 13 May 2021 12:33:37 +0000 (08:33 -0400)]
posix timers: Check for overflow when converting to ns
Disallow a time or timer period value when the conversion to nanoseconds
would overflow. Otherwise it is possible to trigger a division by zero
in realtime_expire_l(), where we compute the number of overruns by
dividing by the timer interval.
Fixes: 7995dae9 ("posix timers: Improve the overrun calculation")
Reported by: syzbot+5ab360bd3d3e3c5a6e0e@syzkaller.appspotmail.com
Reported by: syzbot+157b74ff493140d86eac@syzkaller.appspotmail.com
Reviewed by: kib
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D30233
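A sketch of an overflow-checked conversion, assuming a signed 64-bit
nanosecond count. The helper name is hypothetical and the kernel's actual
check may be structured differently.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <time.h>

#define EX_NS_PER_SEC	INT64_C(1000000000)

/*
 * Convert ts to nanoseconds.  Returns false, leaving *ns untouched,
 * if ts is malformed or the result would not fit in an int64_t.
 */
static bool
ex_timespec_to_ns(const struct timespec *ts, int64_t *ns)
{
	assert(ts != NULL && ns != NULL);
	if (ts->tv_sec < 0 || ts->tv_nsec < 0 ||
	    ts->tv_nsec >= EX_NS_PER_SEC)
		return (false);
	/* sec * 1e9 + nsec <= INT64_MAX, rearranged to avoid overflow. */
	if ((int64_t)ts->tv_sec > (INT64_MAX - ts->tv_nsec) / EX_NS_PER_SEC)
		return (false);
	*ns = (int64_t)ts->tv_sec * EX_NS_PER_SEC + ts->tv_nsec;
	return (true);
}
```

Rejecting the value up front means later arithmetic on the interval
never sees a wrapped result.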
Cyril Zhang [Thu, 13 May 2021 12:55:06 +0000 (08:55 -0400)]
sort: Cache value of MB_CUR_MAX
Every usage of MB_CUR_MAX results in a call to __mb_cur_max. This is
inefficient and redundant. Caching the value of MB_CUR_MAX in a global
variable removes these calls and speeds up the runtime of sort. For
numeric sorting, runtime is almost halved in some tests.
PR: 255551
PR: 255840
Reviewed by: markj
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D30170
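The caching idea, sketched with hypothetical names (ex_mb_cur_max,
ex_set_locale); sort's own variable and refresh point may differ.

```c
#include <assert.h>
#include <locale.h>
#include <stddef.h>
#include <stdlib.h>

/* Cached copy of MB_CUR_MAX, refreshed whenever the locale is set. */
static size_t ex_mb_cur_max = 1;

static void
ex_set_locale(const char *name)
{
	assert(name != NULL);
	/*
	 * setlocale() may fail and leave the locale unchanged; the cache
	 * is refreshed either way so it always matches the active locale.
	 */
	(void)setlocale(LC_ALL, name);
	ex_mb_cur_max = MB_CUR_MAX;
}

/* Hot paths test the cached value instead of evaluating MB_CUR_MAX. */
static int
ex_is_multibyte_locale(void)
{
	return (ex_mb_cur_max > 1);
}
```

The cache is only valid as long as every locale change goes through the
one function that refreshes it.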
netgraph/ng_bridge: Handle send errors during loop handling
If sending out a packet fails during the loop over all links, the
allocated memory is leaked and not all links receive a copy. This
patch fixes those problems, clarifies a premature abort of the loop,
and fixes a minor style(9) bug.
We generally like to avoid style changes when other changes are not
planned. In this case there are some makesyscalls.lua changes in the
pipeline, and this cleans up style nits in generated files that were
highlighted by experiments with clang-format.
Reviewed by: brooks, kevans
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D30235
Guangyuan Yang [Sun, 16 May 2021 05:37:09 +0000 (01:37 -0400)]
kerberos.8: Replace dead link
Replace it with a tutorial hosted on kerberos.org and the classic
"dialogue" from Bill Bryant. The change has been reported and
merged upstream (https://github.com/heimdal/heimdal/commit/7f3445f1b7).
Mark Johnston [Wed, 12 May 2021 15:49:24 +0000 (11:49 -0400)]
nd6: Avoid using an uninitialized sockaddr in nd6_prefix_offlink()
Commit 81728a538 ("Split rtinit() into multiple functions.") removed
the initialization of sa6, but not one of its uses. This meant that we
were passing an uninitialized sockaddr as the address to
lltable_prefix_free(). Remove the variable outright to fix the problem.
The caller is expected to hold a reference on pr.
Fixes: 81728a538 ("Split rtinit() into multiple functions.")
Reported by: KMSAN
Reviewed by: donner
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D30166
Mark Johnston [Wed, 12 May 2021 14:05:37 +0000 (10:05 -0400)]
if: Remove unnecessary validation in the SIOCSIFNAME handler
A successful copyinstr() call guarantees that the returned string is
nul-terminated. Furthermore, the removed check would harmlessly compare
an uninitialized byte with '\0' if the new name is shorter than
IFNAMESIZ - 1.
Reported by: KMSAN
Sponsored by: The FreeBSD Foundation
Kristof Provost [Wed, 12 May 2021 17:13:40 +0000 (19:13 +0200)]
tests: Only log critical errors from scapy
Since 2.4.5 scapy started issuing warnings about a few different
configurations during our tests. These are harmless, but they generate
stderr output, which upsets atf_check.
Configure scapy to only log critical errors (and thus not warnings) to
fix these tests.
IEEE Std 802.1D-2004 Section 17.14 defines permitted ranges for timers.
Incoming BPDU messages should be checked against the permitted ranges.
The rest of 17.14 appears to be enforced already.
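A range check of this kind might look as follows. The bounds are my
reading of the standard's timer table (hello time 1-2 s, max age 6-40 s,
forward delay 4-30 s) and should be verified against the text; the
struct is hypothetical and uses whole seconds rather than the on-wire
encoding.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical decoded timer values, in whole seconds. */
struct ex_bpdu_timers {
	unsigned hello_time;
	unsigned max_age;
	unsigned forward_delay;
};

/* Accept a BPDU's timers only if each lies in its permitted range. */
static bool
ex_bpdu_timers_ok(const struct ex_bpdu_timers *t)
{
	assert(t != NULL);
	return (t->hello_time >= 1 && t->hello_time <= 2 &&
	    t->max_age >= 6 && t->max_age <= 40 &&
	    t->forward_delay >= 4 && t->forward_delay <= 30);
}
```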
sbin/ipfw: Fix parsing error in table based forward
The argument parser does not recognise the optional port for a
"tablearg" argument. The fix simplifies the code by making the internal
representation explicit for the parser. Includes the fix from D30208.
Mark Johnston [Mon, 3 May 2021 16:51:04 +0000 (12:51 -0400)]
Add missing sockaddr length and family validation to various protocols
Several protocol methods take a sockaddr as input. In some cases the
sockaddr lengths were not being validated, or were validated after some
out-of-bounds accesses could occur. Add requisite checking to various
protocol entry points, and convert some existing checks to assertions
where appropriate.
Reported by: syzkaller+KASAN
Reviewed by: tuexen, melifaro
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D29519
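The ordering that matters here (length first, then family, then any
family-specific field) can be sketched for AF_INET. The helper is
hypothetical and takes the length as a separate argument, as bind(2)
does, rather than reading a BSD sa_len field.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/socket.h>
#include <netinet/in.h>

/*
 * Validate a caller-supplied address before touching sockaddr_in
 * fields.  Returns 0 on success or an errno value.
 */
static int
ex_check_sockaddr_in(const struct sockaddr *sa, socklen_t len)
{
	assert(sa != NULL);
	/* The length must cover the whole structure before any read. */
	if (len < sizeof(struct sockaddr_in))
		return (EINVAL);
	if (sa->sa_family != AF_INET)
		return (EAFNOSUPPORT);
	return (0);
}
```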
Kristof Provost [Tue, 4 May 2021 17:23:15 +0000 (19:23 +0200)]
in6_mcast: Return EADDRINUSE when we've already joined the group
Distinguish between truly invalid requests and those that fail because
we've already joined the group. Both cases fail, but differentiating
them allows userspace to make more informed decisions about what the
error means.
For example, radvd tries to join the all-routers group on every SIGHUP.
This fails, because it's already joined it, but this failure should be
ignored (rather than treated as a sign that the interface's multicast is
broken).
This puts us in line with OpenBSD, NetBSD and Linux.
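From userspace, the distinction lets a daemon ignore the benign case. A
sketch with a hypothetical helper:

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/*
 * Decide whether a failed multicast join is a real error.
 * EADDRINUSE means the group was already joined on this socket.
 *
 * Intended use (not executed here):
 *	if (setsockopt(s, IPPROTO_IPV6, IPV6_JOIN_GROUP, &mreq,
 *	    sizeof(mreq)) == -1 && ex_join_failure_is_fatal(errno))
 *		... report the error ...
 */
static bool
ex_join_failure_is_fatal(int error)
{
	assert(error != 0);
	return (error != EADDRINUSE);
}
```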
AIM (adaptive interrupt moderation) was part of the BSD11 driver but was
lost in the IFLIB migration. Reintroduce AIM in the IFLIB-based IXGBE
driver.
One caveat is that in the BSD11 driver a queue comprises both an Rx and a
Tx ring. Starting with BSD12, Rx and Tx have their own queues and rings,
and the IRQ is configured only for the Rx side. So, with AIM re-enabled,
only Rx stats are considered when configuring the EITR register, in
contrast to BSD11 where both Rx and Tx stats were used.
Rick Macklem [Sun, 2 May 2021 23:04:27 +0000 (16:04 -0700)]
copy_file_range(2): improve copying of a large hole to EOF
PR#255523 reported that a file copy for a file with a large hole
to EOF on ZFS ran slowly over NFSv4.2.
The problem was that vn_generic_copy_file_range() would
loop around reading the hole's data and then see it is all
0s. It was coded this way since UFS always allocates a data
block near the end of the file, such that a hole to EOF never exists.
This patch modifies vn_generic_copy_file_range() to check for an
ENXIO returned from VOP_IOCTL(..FIOSEEKDATA..) and handle that
case as a hole to EOF. asomers@ confirms that it works for his
ZFS test case.
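The userspace analogue of this check is lseek(2) with SEEK_DATA, which
fails with ENXIO when no data exists at or after the offset. A sketch
with a hypothetical helper name:

```c
#define _GNU_SOURCE		/* exposes SEEK_DATA on glibc */
#include <assert.h>
#include <errno.h>
#include <fcntl.h>		/* callers typically get fd from open(2) */
#include <sys/types.h>
#include <unistd.h>

/*
 * Returns 1 if there is data at or after off, 0 if everything from
 * off to EOF is a hole (or off is at/after EOF), and -1 with errno
 * set on any other error.  Note that a successful lseek() moves the
 * file offset.
 */
static int
ex_data_remaining(int fd, off_t off)
{
	assert(fd >= 0 && off >= 0);
	if (lseek(fd, off, SEEK_DATA) != (off_t)-1)
		return (1);
	return (errno == ENXIO ? 0 : -1);
}
```

A copy loop that sees 0 here can stop reading and simply extend the
destination to the source's size.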
Navdeep Parhar [Sat, 1 May 2021 23:53:50 +0000 (16:53 -0700)]
cxgbe(4): Use ifaddr_event_ext instead of ifaddr_event for CLIP management.
The _ext event notification includes the address being added/removed and
that gives the driver an easy way to ignore non-IPv6 addresses. Remove
'tom' from the handler's name while here, it was moved out of t4_tom a
long time ago.
cxgbe(4): Do not panic when tx is called with invalid checksum requests.
There is no need to panic in if_transmit if the checksums requested are
inconsistent with the frame being transmitted. This typically indicates
that the kernel and driver were built with different INET/INET6 options,
or there is some other kernel bug. The driver should just throw away
the requests that it doesn't understand and move on.
cxgbe(4): Add flag to reliably stop the driver from accessing hw stats.
There are two kinds of routines in the driver that read statistics from
the hardware: the cxgbe_* variants read the per-port MPS/MAC registers
and the vi_* variants read the per-VI registers. They can be called
from the 1Hz callout or if_get_counter. All stats collection now takes
place under the callout lock and there is a new flag to indicate that
these routines should not access any hardware register.
Navdeep Parhar [Tue, 30 Mar 2021 04:35:05 +0000 (21:35 -0700)]
cxgbe/t4_tom: restore socket's protosw before entering TIME_WAIT.
This fixes a panic due to stale so->so_proto if t4_tom is unloaded and
one or more connections that were previously offloaded are still around
in TIME_WAIT state.
Navdeep Parhar [Wed, 24 Mar 2021 01:01:01 +0000 (18:01 -0700)]
cxgbe(4): Allow a T6 adapter to switch between TOE and NIC TLS mode.
The hw.cxgbe.kern_tls tunable was used for this in the past and if it
was set then all T6 adapters would be configured for NIC TLS operation
and could not be reconfigured for TOE without a reload. With this
change ifconfig can be used to manipulate toe and txtls caps like any
other caps. hw.cxgbe.kern_tls continues to work as usual but its
effects are not permanent any more.
* Enable nic_ktls_ofld in the default configuration file and use the
firmware instead of direct register manipulation to apply/rollback
NIC TLS configuration. This allows the driver to switch the hardware
between TOE and NIC TLS mode in a safe manner. Note that the
configuration is adapter-wide and not per-port.
* Remove the kern_tls config file as it works with 100G T6 cards only
and leads to firmware crashes with 25G cards. The configurations
included with the driver (with the exception of the FPGA configs) are
supposed to work with all adapters.
Navdeep Parhar [Fri, 19 Mar 2021 20:28:11 +0000 (13:28 -0700)]
cxgbe(4): create a separate helper routine to write the global RSS key.
While here, make sure only the PF driver attempts to program the global
RSS key (with options RSS). The VF driver doesn't have access to those
device registers.
Navdeep Parhar [Fri, 19 Mar 2021 19:30:57 +0000 (12:30 -0700)]
cxgbe(4): make it safe to call setup_memwin repeatedly.
A repeat call will recreate the memory windows in the hardware and move
them to their last-known positions without repeating any of the software
initialization.
Navdeep Parhar [Fri, 5 Mar 2021 19:28:18 +0000 (11:28 -0800)]
cxgbe(4): Fix an assertion that is not valid during attach.
Firmware access from t4_attach takes place without any synchronization.
The driver should not panic (debug kernels) if something goes wrong in
early communication with the firmware. It should still load so that
it's possible to poke around with cxgbetool.
Navdeep Parhar [Fri, 19 Feb 2021 22:18:08 +0000 (14:18 -0800)]
cxgbe(4): Use the correct filter width for T5+.
T5 and above have extra bits for the optional filter fields. This is a
correctness issue and not just a waste because a filter mode valid on a
T4 (36b) may not be valid on a T5+ (40b).
Navdeep Parhar [Fri, 19 Feb 2021 21:47:18 +0000 (13:47 -0800)]
cxgbe(4): Add a driver ioctl to set the filter mask.
Allow the filter mask (aka the hashfilter mode when hashfilters are
in use) to be set any time it is safe to do so. The requested mask
must be a subset of the filter mode already. The driver will not change
the mode or ingress config just to support a new mask.
Navdeep Parhar [Fri, 19 Feb 2021 21:05:19 +0000 (13:05 -0800)]
cxgbe(4): Use firmware commands to get/set filter configuration.
1. Query the firmware for filter mode, mask, and related ingress config
instead of trying to figure them out from hardware registers. Read
configuration from the registers only when the firmware does not
support this query.
2. Use the firmware to set the filter mode. This is the correct way to
do it and is more flexible as well. The filter mode (and associated
ingress config) can now be changed any time it is safe to do so.
The user can specify a subset of a valid mode and the driver will
enable enough bits to make sure that the mode is maxed out -- that
is, it is not possible to set another bit without exceeding the
total width for optional filter fields. This is a hardware
requirement that was not enforced by the driver previously.
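The "maxed out" rule can be sketched as a single greedy pass. The field
widths below are hypothetical, not the real T4/T5 layout, and the
driver's actual choice of which bits to add may differ.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical optional-field widths in bits, indexed by mode bit. */
static const unsigned ex_field_width[] = { 1, 17, 3, 8, 16, 9, 2, 3 };
#define EX_NFIELDS (sizeof(ex_field_width) / sizeof(ex_field_width[0]))

/* Total width of all fields enabled in mode. */
static unsigned
ex_mode_width(uint32_t mode)
{
	unsigned w = 0;

	for (size_t i = 0; i < EX_NFIELDS; i++)
		if (mode & (UINT32_C(1) << i))
			w += ex_field_width[i];
	return (w);
}

/*
 * Extend req until no further field fits within max_width.
 * Returns false if req alone is already too wide.
 */
static bool
ex_max_out_mode(uint32_t req, unsigned max_width, uint32_t *out)
{
	unsigned w = ex_mode_width(req);

	assert(out != NULL && (req >> EX_NFIELDS) == 0);
	if (w > max_width)
		return (false);
	for (size_t i = 0; i < EX_NFIELDS; i++) {
		uint32_t bit = UINT32_C(1) << i;

		if ((req & bit) == 0 && w + ex_field_width[i] <= max_width) {
			req |= bit;
			w += ex_field_width[i];
		}
	}
	*out = req;
	return (true);
}
```

One pass is enough: the width only grows, so a field that did not fit
when it was considered cannot fit later.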
Navdeep Parhar [Thu, 18 Feb 2021 09:15:46 +0000 (01:15 -0800)]
cxgbe(4): Break up t4_read_chip_settings.
Read the PF-only hardware settings directly in get_params__post_init.
Split the rest into two routines used by both the PF and VF drivers: one
that reads the SGE rx buffer configuration and another that verifies
miscellaneous hardware configuration.
Alexander Motin [Sun, 2 May 2021 23:35:28 +0000 (19:35 -0400)]
Improve UMA cache reclamation.
When estimating working set size, measure only allocation batches, not free
batches. Allocation and free patterns can be very different. For example,
on a vm_lowmem event ZFS can free a few gigabytes of memory to UMA in one
call, but that does not mean it will request the same amount back as
quickly; in fact it won't.
Update working set size on every reclamation call, shrinking caches faster
under pressure. Without this, repeated vm_lowmem events squeezed more and
more memory out of real consumers only to leave it stuck in UMA caches. I
saw ZFS cut its ARC size in half before the previous algorithm, after a
periodic WSS update, decided to reclaim UMA caches.
Introduce voluntary reclamation of UMA caches not used for a long time. For
each zdom, track a long-term minimum cache size watermark, freeing some unused
items every UMA_TIMEOUT after the first 15 minutes without cache misses. Freed
memory can be put to better use by other consumers. For example, ZFS won't
grow its ARC unless it sees free memory, since it does not know that the cached
memory is not really used. And even if the memory is not really needed,
periodic frees during inactivity periods should reduce its fragmentation.
Reviewed by: markj, jeff (previous version)
MFC after: 2 weeks
Sponsored by: iXsystems, Inc.
Differential Revision: https://reviews.freebsd.org/D29790
Mark Johnston [Tue, 11 May 2021 21:36:12 +0000 (17:36 -0400)]
cryptodev: Fix some input validation bugs
- When we do not have a separate IV, make sure that the IV length
specified by the session is not larger than the payload size.
- Disallow AEAD requests without a separate IV. crp_sanity() asserts
that CRYPTO_F_IV_SEPARATE is set for AEAD requests, and some (but not
all) drivers require it.
- Return EINVAL for AEAD requests if an IV is specified but the
transform does not expect one.
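The three checks above, sketched against a hypothetical, simplified
request structure (not the cryptodev data structures). It assumes an
inline IV is taken from the payload and that every rejection uses
EINVAL.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, reduced description of a request. */
struct ex_crypt_req {
	size_t	payload_len;
	size_t	iv_len;		/* IV length the session expects, 0 if none */
	bool	iv_separate;	/* caller supplied the IV in its own buffer */
	bool	aead;
};

/* Returns 0 if the request passes the checks, else EINVAL. */
static int
ex_validate(const struct ex_crypt_req *r)
{
	assert(r != NULL);
	/* AEAD requests must carry a separate IV. */
	if (r->aead && !r->iv_separate)
		return (EINVAL);
	/* An IV was supplied but the transform does not take one. */
	if (r->aead && r->iv_separate && r->iv_len == 0)
		return (EINVAL);
	/* An inline IV cannot be longer than the payload holding it. */
	if (!r->iv_separate && r->iv_len > r->payload_len)
		return (EINVAL);
	return (0);
}
```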
Dmitry Wagin [Tue, 23 Mar 2021 16:01:15 +0000 (12:01 -0400)]
libc: Some enhancements to syslog(3)
- Defined MAXLINE constant (8192 octets by default instead of 2048) to
set the limit in one place. It sets the maximum number of characters in
a syslog message. RFC5424 doesn't limit the maximum size of the message.
Named after MAXLINE in syslogd(8).
- Fixed size of fmt_cpy buffer up to MAXLINE for rendering formatted
(%m) messages.
- Introduced autoexpansion of sending socket buffer up to MAXLINE.
Dmitry Wagin [Tue, 23 Mar 2021 16:15:28 +0000 (12:15 -0400)]
syslogd: Increase message size limits
Add a -M option to control the maximum length of forwarded messages.
syslogd(8) used to truncate forwarded messages to 1024 bytes, but since
commit 1a874a126a54 ("Add RFC 5424 syslog message output to syslogd.")
it applies a more conservative limit of 480 bytes for IPv4 per RFC 5426
section 3.2. Restore the old default behaviour of truncating to 1024
bytes. RFC 5424 specifies no upper limit on the length of forwarded
messages, while for RFC 3164 the limit is 1024 bytes.
Increase MAXLINE to 8192 bytes to correspond to commit 672ef817a192.
Replace the size of bootfile[] with MAXPATHLEN, the size used by
getbootfile(3) for its returned value. Using (MAXLINE+1) as the size of
bootfile[] is excessive.