Alex Richardson [Mon, 25 Jan 2021 14:11:45 +0000 (14:11 +0000)]
qeueue.h: Add {SLIST,STAILQ,LIST,TAILQ}_END()
We provide these for compat with other queue.h headers since some software
assumes it exists (e.g. the libevent contrib code), but we are not
encouraging their use (NULL should be used instead).
This fixes the following warning (which should arguable be an error since
it results in a function call to an undefined function):
.../contrib/libevent/buffer.c:495:16: warning: implicit declaration of function 'LIST_END' is invalid in C99 [-Wimplicit-function-declaration]
cbent != LIST_END(&buffer->callbacks);
^
.../contrib/libevent/buffer.c:495:13: warning: comparison between pointer and integer ('struct evbuffer_cb_entry *' and 'int') [-Wpointer-integer-compare]
cbent != LIST_END(&buffer->callbacks);
~~~~~ ^ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Alexander Motin [Tue, 2 Feb 2021 18:37:13 +0000 (13:37 -0500)]
Make DataSN counter of solicited Data-Out local.
DataSN for solicited Data-Out is per-R2T. Since we handle whole R2T
in one go, we don't need to store it anywhere, especially in global
per-command structure. This may allow us to handle multiple R2T per
command at once, if we decide, or may be relax locking.
Rename the second use of that field to io_referenced_task_tag.
Mark Johnston [Thu, 25 Feb 2021 15:04:44 +0000 (10:04 -0500)]
buf: Fix the dirtybufthresh check
dirtybufthresh is a watermark, slightly below the high watermark for
dirty buffers. When a delayed write is issued, the dirtying thread will
start flushing buffers if the dirtybufthresh watermark is reached. This
helps ensure that the high watermark is not reached, otherwise
performance will degrade as clustering and other optimizations are
disabled (see buf_dirty_count_severe()).
When the buffer cache was partitioned into "domains", the dirtybufthresh
threshold checks were not updated. Fix this.
Mark Johnston [Thu, 25 Feb 2021 15:04:44 +0000 (10:04 -0500)]
sendfile: Use the pager size to determine the file extent when possible
Previously sendfile would issue a VOP_GETATTR and use the returned size,
i.e., the file size. When paging in file data, sendfile_swapin() will
use the pager to determine whether it needs to zero-fill, most often
because of a hole in a sparse file. An attempt to page in beyond the
end of a file is treated this way, and occurs when the requested page is
past the end of the pager. In other words, both the file size and pager
size were used interchangeably.
With ZFS, updates to the pager and file sizes are not synchronized by
the exclusive vnode lock, at least partially due to its use of
MNTK_SHARED_WRITES. In particular, the pager size is updated after the
file size, so in the presence of a writer concurrently extending the
file, sendfile could incorrectly instantiate "holes" in the page cache
pages backing the file, which manifests as data corruption when reading
the file back from the page cache. The on-disk copy is unaffected.
Fix this by consistently using the pager size when available.
Reported by: dumbbell
Reviewed by: chs, kib
Tested by: dumbbell, pho
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D28811
Kristof Provost [Wed, 24 Feb 2021 15:40:37 +0000 (16:40 +0100)]
bridge tests: Test that we also forward on some interfaces
Ensure that we not only block on some interfaces, but also forward on
some. Without the previous commit we wound up discarding on all ports,
rather than only on the ports needed to break the loop.
ipfw: make algo name argument optional for some table types
Most of table types currently supported by ipfw have only one
algorithm implementation. When user creates such tables, allow
to omit algo name in arguments. E.g. now it is possible:
ipfw table T1 create type number
ipfw table T2 create type iface
ipfw table T3 create type flow
Kyle Evans [Fri, 26 Feb 2021 21:46:47 +0000 (15:46 -0600)]
jail: allow root to implicitly widen its cpuset to attach
The default behavior for attaching processes to jails is that the jail's
cpuset augments the attaching processes, so that it cannot be used to
escalate a user's ability to take advantage of more CPUs than the
administrator wanted them to.
This is problematic when root needs to manage jails that have disjoint
sets with whatever process is attaching, as this would otherwise result
in a deadlock. Therefore, if we did not have an appropriate common
subset of cpus/domains for our new policy, we now allow the process to
simply take on the jail set *if* it has the privilege to widen its mask
anyways.
With the new logic, root can still usefully cpuset a process that
attaches to a jail with the desire of maintaining the set it was given
pre-attachment while still retaining the ability to manage child jails
without jumping through hoops.
A test has been added to demonstrate the issue; cpuset of a process
down to just the first CPU and attempting to attach to a jail without
access to any of the same CPUs previously resulted in EDEADLK and now
results in taking on the jail's mask for privileged users.
Rick Macklem [Sun, 28 Feb 2021 01:54:05 +0000 (17:54 -0800)]
nfsclient: fix panic in cache_enter_time()
Juraj Lutter (otis@) reported a panic "dvp != vp not true" in
cache_enter_time() called from the NFS client's nfsrpc_readdirplus()
function.
This is specific to an NFSv3 mount with the "rdirplus" mount
option. Unlike NFSv4, NFSv3 replies to ReaddirPlus
includes entries for the current directory.
This trivial patch avoids doing a cache_enter_time()
call for the current directory to avoid the panic.
MFC 9febbc454190:
Fix for natd(8) sending wrong sequence number after TCP retransmission,
terminating a TCP connection.
If a TCP packet must be retransmitted and the data length has changed in the
retransmitted packet, due to the internal workings of TCP, typically when ACK
packets are lost, then there is a 30% chance that the logic in GetDeltaSeqOut()
will find the correct length, which is the last length received.
This can be explained as follows:
If a "227 Entering Passive Mode" packet must be retransmittet and the length
changes from 51 to 50 bytes, for example, then we have three cases for the
list scan in GetDeltaSeqOut(), depending on how many prior packets were
received modulus N_LINK_TCP_DATA=3:
case 1: index 0: original packet 51
index 1: retransmitted packet 50
index 2: not relevant
case 2: index 0: not relevant
index 1: original packet 51
index 2: retransmitted packet 50
case 3: index 0: retransmitted packet 50
index 1: not relevant
index 2: original packet 51
This patch simply changes the searching order for TCP packets, always starting
at the last received packet instead of any received packet, in
GetDeltaAckIn() and GetDeltaSeqOut().
Martin Matuska [Wed, 3 Mar 2021 01:38:09 +0000 (02:38 +0100)]
zfs: cancel TRIM or initialize on FAULTED non-writeable vdevs
From the openzfs commit message:
When a device which is actively trimming or initializing becomes
FAULTED, and therefore no longer writable, cancel the active
TRIM or initialization. When the device is merely taken offline
with `zpool offline` then stop the operation but do not cancel it.
When the device is brought back online the operation will be
resumed if possible.
Martin Matuska [Wed, 3 Mar 2021 01:32:59 +0000 (02:32 +0100)]
zfs: fix assert in FreeBSD-specific dmu_read_pages
From the openzfs 2e160dee9 commit message:
The function has three similar pieces of code: for read-behind pages,
requested pages and read-ahead pages. All three pieces had an
assert to ensure that the page is not mapped. Later the assert was
relaxed to require that the page is not mapped for writing. But that
was done in two places out of three. This change fixes the third piece,
read-ahead.
Martin Matuska [Wed, 3 Mar 2021 01:28:56 +0000 (02:28 +0100)]
zfs: fix vdev_rebuild_thread deadlock
From the openzfs 8e43fa12c commit message:
The metaslab_disable() call may block waiting for a txg sync.
Therefore it's important that vdev_rebuild_thread release the
SCL_CONFIG read lock it is holding before this call. Failure
to do so can result in the txg_sync thread getting blocked
waiting for this lock which results in a deadlock.
Martin Matuska [Wed, 3 Mar 2021 01:25:03 +0000 (02:25 +0100)]
zfs: fix overly broad locking in spa_vdev_config_exit()
Resolves a deadlock which can occur when the ZED or zpool
command attaches a new device.
From the openzfs 75a089ed3 commit message:
Calling vdev_free() only requires the we acquire the spa config
SCL_STATE_ALL locks, not the SCL_ALL locks. In particular, we need
need to avoid taking the SCL_CONFIG lock (included in SCL_ALL) as a
writer since this can lead to a deadlock. The txg_sync_thread() may
block in spa_txg_history_init_io() when taking the SCL_CONFIG lock
as a reading when it detects there's a pending writer.
Mark Johnston [Wed, 24 Feb 2021 02:15:50 +0000 (21:15 -0500)]
rmlock: Add a required compiler membar to the rlock slow path
The tracker flags need to be loaded only after the tracker is removed
from its per-CPU queue. Otherwise, readers may fail to synchronize with
pending writers attempting to propagate priority to active readers, and
readers and writers deadlock on each other. This was observed in a
stable/12-based armv7 kernel where the compiler had reordered the load
of rmp_flags to before the stores updating the queue.
linux: fix handling of flags for 32 bit send(2) syscall
Previously the flags were passed as-is, which could resulted
in spurious EAGAIN returned for non-blocking sockets, which
broke some Steam games.
PR: 248065
Reported By: Alex S <iwtcex@gmail.com>
Tested By: Alex S <iwtcex@gmail.com>
Reviewed By: emaste
MFC After: 3 days
Sponsored By: The FreeBSD Foundation
Use compat.linux.emul_path instead of hardcoded path in /etc/rc.d/linux
In /etc/rc.d/linux the mounting paths of procfs, sysfs and devfs
are hardcoded to "/compat/linux". Switching to the content of
compat.linux.emul_path sysctl would allow to switch linuxulator
to different place.
Submitted by: freebsdnewbie_freenet.de
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D27807
update the SACK loss recovery to RFC6675, with the following new features:
- improved pipe calculation which does not degrade under heavy loss
- engaging in Loss Recovery earlier under adverse conditions
- Rescue Retransmission in case some of the trailing packets of a request got lost
All above changes are toggled with the sysctl "rfc6675_pipe" (disabled by default).
Kristof Provost [Mon, 22 Feb 2021 07:19:43 +0000 (08:19 +0100)]
arp/nd: Cope with late calls to iflladdr_event
When tearing down vnet jails we can move an if_bridge out (as
part of the normal vnet_if_return()). This can, when it's clearing out
its list of member interfaces, change its link layer address.
That sends an iflladdr_event, but at that point we've already freed the
AF_INET/AF_INET6 if_afdata pointers.
In other words: when the iflladdr_event callbacks fire we can't assume
that ifp->if_afdata[AF_INET] will be set.
Kristof Provost [Sun, 21 Feb 2021 20:20:32 +0000 (21:20 +0100)]
bridge: Remove members when assigned to a new vnet
When the bridge is moved to a different vnet we must remove all of its
member interfaces (and span interfaces), because we don't know if those
will be moved along with it. We don't want to hold references to
interfaces not in our vnet.
Kristof Provost [Sat, 20 Feb 2021 09:11:30 +0000 (10:11 +0100)]
bridge: Support STP on VLAN devices
VLAN devices have type IFT_L2VLAN, so the STP code mistakenly believed
they couldn't be used for STP. That's not the case, so add the
ITF_L2VLAN to the check.
Emmanuel Vadot [Thu, 25 Feb 2021 15:34:28 +0000 (16:34 +0100)]
mkimg: We always want the last block of the last inserted partition
Even with an absolute offset we want to know the last block the partition
otherwise we endup with an image the size of the metadata.
This allow to create image with the ESP placed at a specific position which
is useful on arm/arm64 where u-boot have always a hard time to read the ESP
if it's not aligned on 512k.
mkimg -v -o sdcard -s gpt -p efi::54M:1M -p freebsd-ufs::1G
now works.
Michael Tuexen [Sun, 14 Feb 2021 11:10:31 +0000 (12:10 +0100)]
tcp: improve behaviour when using TCP_NOOPT
Use ISS for SEG.SEQ when sending a SYN-ACK segment in response to
an SYN segment received in the SYN-SENT state on a socket having
the IPPROTO_TCP level socket option TCP_NOOPT enabled.
Reviewed by: rscheff
Sponsored by: Netflix, Inc.
Differential Revision: https://reviews.freebsd.org/D28656
Michael Tuexen [Tue, 9 Feb 2021 22:35:55 +0000 (23:35 +0100)]
libsysdecode: fix decoding of TCP_NOPUSH and TCP_MD5SIG
TCP_FASTOPEN_MIN_COOKIE_LEN was incorrectly registered as a name of
a IPPROTO_TCP level socket option, which overwrote TCP_NOPUSH.
TCP_FASTOPEN_PSK_LEN was incorrectly registered as a name of an
IPPROTO_TCP level socket option, which overwrote TCP_MD5SIG.
Michael Tuexen [Sun, 31 Jan 2021 22:43:15 +0000 (23:43 +0100)]
sctp: improve input validation
Improve the handling of INIT chunks in specific szenarios and
report and appropriate error cause.
Thanks to Anatoly Korniltsev for reporting the issue for the
userland stack.
MFC note: this plus the merge of two preliminary removal of __NO_TLS
definitions for mips and risc-v break ABI. It was decided that doing
ABI break on tier 2 platforms at this stage of 13.0 release process is
better than drag on __NO_TLS presence for the 13.x branch lifetime.
Rick Macklem [Mon, 15 Feb 2021 02:16:58 +0000 (18:16 -0800)]
getdirentries.2: fix for NFS mounts
It was reported that getdirentries(2) was
returning dirents with d_off set to 0 for an NFS
mount.
This is believed to be correct behaviour at
this time (it may change for some NFS mounts
in the future), but is inconsistent with what the
getdirentries(2) man page says.
Toomas Soome [Sun, 21 Feb 2021 10:32:18 +0000 (12:32 +0200)]
loader: autoload_font will hung loader when there is no local console
If we start with console set to comconsole, the local
console (vidconsole, efi) is never initialized and attempt to
use the data can render the loader hung.
Mark Johnston [Mon, 22 Feb 2021 15:03:37 +0000 (10:03 -0500)]
m_uiotombuf_nomap(): Stop clearing PG_ZERO in newly allocated pages
The caller should not be passing M_ZERO in the first place, so PG_ZERO
will not be preserved by the page allocator and clearing it accomplishes
nothing.
Reviewed by: gallatin, jhb
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D28808
Mark Johnston [Thu, 25 Feb 2021 23:49:47 +0000 (18:49 -0500)]
pmap: Fix largemap restart checks in the kernel_maps sysctl handler
The purpose of these checks is to ensure that the address of the
next-level page table page is valid, since nothing is synchronizing with
a concurrent update of the large map and large map PTPs are freed to the
system. However, if PG_PS is set, there is no next level.
Reported by: rpokala
Reviewed by: kib
Tested by: rpokala
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D28922
Address two incorrect calculations and enhance readability of PRR code
- address second instance of cwnd potentially becoming zero
- fix sublte bug due to implicit int to uint typecase in max()
- fix bug due to typo in hand-coded CEILING() function by using howmany() macro
- use int instead of long, and add a missing long typecast
- replace if conditionals with easier to read imax/imin (as in pseudocode)
Reviewed By: #transport, kbowling
MFC after: 3 days
Sponsored by: NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D28813
Mark Johnston [Wed, 24 Feb 2021 15:08:53 +0000 (10:08 -0500)]
iflib: Avoid double counting in rxeof
iflib_rxeof() was counting everything twice. This was introduced when
pfil hooks were added to the iflib receive path. We want to count rx
packets/bytes before the pfil hooks are executed, so remove the counter
adjustments that are executed after.
PR: 253583
Reviewed by: gallatin, erj
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D28900
Lutz Donnerhacke [Wed, 27 Jan 2021 20:19:14 +0000 (21:19 +0100)]
netgraph/ng_car: Add color marking code
Chained policing should be able to reuse the classification of
traffic. A new mbuf_tag type is defined to handle gereral QoS
marking. A new subtype is defined to track the color marking.
Handle partial data re-sending on ktls/sendfile on FreeBSD
Add a handler for EBUSY sendfile error in addition to
EAGAIN. With EBUSY returned the data still can be partially
sent and user code has to be notified about it, otherwise it
may try to send data multiple times.
Robert Watson [Tue, 16 Feb 2021 15:19:05 +0000 (15:19 +0000)]
Reimplemen FreeBSD/arm64 dtrace_gethrtime() to use the system timer.
dtrace_gethrtime() is the high-resolution nanosecond timestemp used
for the DTrace 'timestamp' built-in variable. The new implementation
uses the EL0 cycle counter and frequency registers in ARMv8-A. This
replaces a previous implementation that relied on an
instrumentation-safe implementation of getnanotime(), which provided
only timer resolution.
Robert Wing [Wed, 17 Feb 2021 09:22:23 +0000 (00:22 -0900)]
automount(8): fix absolute path when creating a mountpoint
When executing automount(8), it will attempt to create the directory where an
autofs filesystem is to be mounted. Explicity set the root path for this
directory to "/".
This fixes the issue where the directory being created was being treated as a
relative path instead of an absolute path (as expected).
Mark Johnston [Thu, 18 Feb 2021 15:50:57 +0000 (10:50 -0500)]
arm64: Include NUMA locality info in the CPU topology
The scheduler uses this topology to try and preserve locality when
migrating threads between CPUs and when performing work stealing.
Ensure that on NUMA systems it will at least take the NUMA topology into
account.
Mark Johnston [Mon, 22 Feb 2021 20:50:09 +0000 (15:50 -0500)]
vm_kern: Avoid sign extension in the KVA_QUANTUM definition
Otherwise, on a powerpc64 NUMA system with hashed page tables, the
first-level superpage reservation size is large enough that the value of
the kernel KVA arena import quantum, KVA_NUMA_IMPORT_QUANTUM, is
negative and gets sign-extended when passed to vmem_set_import(). This
results in a boot-time hang on such platforms.
Lutz Donnerhacke [Tue, 26 Jan 2021 15:50:04 +0000 (16:50 +0100)]
netgraph/ng_vlan_rotate: IEEE 802.1ad VLAN manipulation netgraph type
This node is part of an A10-NSP (L2-BSA) development.
Carrier networks tend to stack three or more tags for internal
purposes and therefore hiding the service tags deep inside of the
stack. When decomposing such an access network frame, the processing
order is typically reversed: First distinguish by service, than by
other means.
This new netgragh node allows to bring the relevant VLAN in front (to
the out-most position). This way other netgraph nodes (like ng_vlan)
can operate on this specific type.
The previous implementation was reported to try to coalesce packets
in situations when it should not, that resulted in assertion later.
This implementation better checks the first packet of the chain for
the coallescing elligibility.
Dimitry Andric [Mon, 22 Feb 2021 20:01:09 +0000 (21:01 +0100)]
Fix possibly unitialized variables in __cxa_demangle_gnu3()
After 0ee0dbfb0d26cf4bc37f24f12e76c7f532b0f368 where I imported a more
recent libcxxrt snapshot, the variables 'rtn' and 'has_ret' could in
some cases be used while still uninitialized. Most obviously this would
lead to a jemalloc complaint about a bad free(), aborting the program.
Fix this by initializing a bunch variables in their declarations. This
change has also been sent upstream, with some additional changes to be
used in their testing framework.
Warner Losh [Thu, 11 Feb 2021 23:06:30 +0000 (16:06 -0700)]
efibootmgr: Check for efi supported after parsing args
Move the check for efi variables being supported to after parsing the args. This
allows '-h' to produce both as a normal user as well as on all systems.
Warner Losh [Wed, 17 Feb 2021 22:08:19 +0000 (15:08 -0700)]
uart: only use MSI on devices that advertise 1 MSI vector
This updates r311987/fb1d9b7f4113d which allowed any number of vectors to be
used. Since we're just attaching one instance, the meaning of more than one
vector is not clear and seems to cause problems. Fall back to old methods for
these cards.
Warner Losh [Fri, 19 Feb 2021 22:34:25 +0000 (15:34 -0700)]
boot: remove gptboot.efifat, it never should have been
conical hat reduction: Make sure we also remove gotboot.efifat. It was created,
briefly, and shouldn't have existed in the first place. Kill it at the same
place we kill boot1.efifat.
Mark Johnston [Fri, 19 Feb 2021 22:08:34 +0000 (17:08 -0500)]
iflib: Fix detach of pseudo interfaces
In commit 38bfc6dee33b we added an IFDI_DETACH() call to
iflib_pseudo_deregister() since it looked like it was missing. One is
present in the error-handling path of iflib_pseudo_register(). However,
the detach actually comes from the DEVICE_DETACH() method for the
above-mentioned device_t, so now we're calling IFDI_DETACH() twice when
destroying a pseudo interface.
Fix the problem by not calling IFDI_DETACH() from the device detach
routine. This way we can ensure that iflib de-initialization always
happens in a consistent order. It also ensures that you can't do silly
things like "devctl detach <pseudo ifnet>", which would previously
detach the driver without tearing down the corresponding ifnet.
PR: 253541
Reviewed by: erj
Fixes: 38bfc6dee33b
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D28774