David Bright [Wed, 15 Jul 2020 17:34:08 +0000 (17:34 +0000)]
MFC r362634:
Add CAP_EVENT to pidfiles.
CAP_EVENT was omitted on pidfiles (in
pidfile_open()). There seems no reason why a process that creates
and writes a pidfile cannot monitor events on that file. This mod adds
the capability.
Fix invalid VHDX generation for image larger than 4Gb
- Part of BAT payload location was lost due to invalid
BAT entry encoding type (32 bits instead of 64 bits)
- The sequence of PB/SB entries in BAT was broken due to
off-by-one index check. It worked for smaller than
4Gb because there were no SB entries in BAT.
MFC r362829:
The "pid" field in the LinuxKPI task struct is typically set to the thread ID
and not the process ID. Make sure the linux_task_exiting() function uses tdfind()
to lookup the BSD procedure structure pointer by the "pid" field, and only
fallback to pfind() when no match is found! This makes linux_task_exiting()
in line with the rest of the code.
MFC r362953:
Infiniband clients must be attached and detached in a specific order in ibcore.
Currently the linking order of the infiniband, IB, modules decide in which
order the clients are attached and detached. For example one IB client may
use resources from another IB client. This can lead to a potential deadlock
at shutdown. For example if the ipoib is unregistered after the ib_multicast
client is detached, then if ipoib is using multicast addresses a deadlock may
happen, because ib_multicast will wait for all its resources to be freed before
returning from the remove method.
Fix this by using module_xxx_order() instead of module_xxx().
- Fix formatting issues such as:
- Use Ql instead of Dq Li as Li is deprecated
- Address some mandoc warnings
- Add arguments missing from the list of options (i.e., document "-k keep"
instead of just "-k").
- Document that -k and -s can be specified multiple times
- Use sshd instead of named for the example in the BUGS section, as named
is not in the base system. Also, use Nm instead of Xr there as it is not
the sshd binary that is required to be running, but the service.
- Use Sy instead of Cm for KEYWORDS. Cm is reserved for command-line
modifiers of the CLI.
- Add an EXAMPLES section
- Cross-reference service(8).
Merge commit 065fc1eafe7c from llvm git (by Richard Smith):
PR45521: Preserve the value kind when performing a standard
conversion sequence on a glvalue expression.
If the sequence is supposed to perform an lvalue-to-rvalue
conversion, then one will be specified as the first conversion in the
sequence. Otherwise, one should not be invented.
This should fix clang crashing with "can't implicitly cast lvalue to
rvalue with this cast kind", followed by "UNREACHABLE executed at
/usr/src/contrib/llvm-project/clang/lib/Sema/Sema.cpp:538!", when
building recent versions of Ceph, and the CPAN module SYBER/Date-5.2.0.
Reported by: Willem Jan Withagen <wjw@digiware.nl>, eserte12@yahoo.de
PR: 245530, 247812
When job control is not enabled, the shell ignores SIGINT while waiting for
a foreground process unless that process exits on SIGINT. In this case, the
foreground process is sleep and it does not exit on SIGINT because the
signal is only sent to the shell. Depending on order of events, this could
cause the SIGINT to be unexpectedly ignored.
On lightly loaded bare metal, the chance of this happening tends to be less
than 0.01% but with higher loads and/or virtualization it becomes more
likely.
Starting the sleep in background and using the wait builtin ensures SIGINT
will not be ignored.
Brooks Davis [Thu, 9 Jul 2020 16:58:53 +0000 (16:58 +0000)]
MFC r362979:
Fix a -Wvoid-pointer-to-enum-cast warning missed in r359978.
This pattern is used in callbacks with void * data arguments and seems
both relatively uncommon and relatively harmless. Silence the warning
by casting through uintptr_t.
Ryan Moeller [Thu, 9 Jul 2020 09:33:32 +0000 (09:33 +0000)]
MFC r362824:
libifconfig: Add function to get bridge status
The new function operates similarly to ifconfig_lagg_get_lagg_status and
likewise is accompanied by a function to free the bridge status data structure.
I have included in this patch the relocation of some strings describing STP
parameters and the PV2ID macro from ifconfig into net/if_bridgevar.h as they
are useful for consumers of libifconfig.
MFC r361798, r361800: vfs: default disallow read(2) of a directory
This MFC is in accordance with the original MFC plan outlined in the commit
message for r361798, appearing in full (with exception to metadata) below.
To summarize: this MFC only merges back the sysctl with a default disallow
policy, as in head, to ensure we hit any issues quickly but in a fashion
that end users can easily revert. Interested parties can flip the
security.bsd.allow_read_dir sysctl back to 1 to fully honor the previous
behavior of allowing read(2) of any dir, filesystem permitting.
r361798:
vfs: add restrictions to read(2) of a directory [1/2]
Historically, we've allowed read() of a directory and some filesystems will
accommodate (e.g. ufs/ffs, msdosfs). From the history department staffed by
Warner: <<EOF
pdp-7 unix seemed to allow reading directories, but they were weird, special
things there so I'm unsure (my pdp-7 assembler sucks).
1st Edition's sources are lost, mostly. The kernel allows it. The
reconstructed sources from 2nd or 3rd edition read it though.
V6 to V7 changed the filesystem format, and should have been a warning, but
reading directories weren't materially changed.
4.1b BSD introduced readdir because of UFS. UFS broke all directory reading
programs in 1983. ls, du, find, etc all had to be rewritten. readdir() and
friends were introduced here.
SysVr3 picked up readdir() in 1987 for the AT&T fork of Unix. SysVr4 updated
all the directory reading programs in 1988 because different filesystem
types were introduced.
In the 90s, these interfaces became completely ubiquitous as PDP-11s running
V7 faded from view and all the folks that initially started on V7 upgraded
to SysV. Linux never supported this (though I've not done the software
archeology to check) because it has always had a pathological diversity of
filesystems.
EOF
Disallowing read(2) on a directory has the side-effect of masking
application bugs from relying on other implementation's behavior
(e.g. Linux) of rejecting these with EISDIR across the board, but allowing
it has been a vector for at least one stack disclosure bug in the past[0].
By POSIX, this is implementation-defined whether read() handles directories
or not. Popular implementations have chosen to reject them, and this seems
sensible: the data you're reading from a directory is not structured in some
unified way across filesystem implementations like with readdir(2), so it is
impossible for applications to portably rely on this.
With this patch, we will reject most read(2) of a dirfd with EISDIR. Users
that know what they're doing can conscientiously set
bsd.security.allow_read_dir=1 to allow read(2) of directories, as it has
proven useful for debugging or recovery. A future commit will further limit
the sysctl to allow only the system root to read(2) directories, to make it
at least relatively safe to leave on for longer periods of time.
While we're adding logic pertaining to directory vnodes to vn_io_fault, an
additional assertion has also been added to ensure that we're not reaching
vn_io_fault with any write request on a directory vnode. Such request would
be a logical error in the kernel, and must be debugged rather than allowing
it to potentially silently error out.
Commented out shell aliases have been placed in root's chsrc/shrc to promote
awareness that grep may become noisy after this change, depending on your
usage.
A tentative MFC plan has been put together to try and make it as trivial as
possible to identify issues and collect reports; note that this will be
strongly re-evaluated. Tentatively, I will MFC this knob with the default as
it is in HEAD to improve our odds of actually getting reports. The future
priv(9) to further restrict the sysctl WILL NOT BE MERGED BACK, so the knob
will be a faithful reversion on stable/12. We will go into the merge
acknowledging that the sysctl default may be flipped back to restore
historical behavior at *any* point if it's warranted.
MFC r362577: TCP: make after-idle work for transactional sessions.
The use of t_rcvtime as proxy for the last transmission
fails for transactional IO, where the client requests
data before the server can respond with a bulk transfer.
Set aside a dedicated variable to actually track the last
locally sent segment going forward.
There are cases when gif_interfaces cannot be replaced
with cloned_interfaces, such as tunnels with external IPv6 addresses
and internal IPv4 or vice versa. Such configuration requires
extra invocation of ifconfig(8) and supported with gif_interfaces only.
- Mention option's arguments in the list of options (so that now we mention
"-N system" instead of just "-N").
- Stylize signals and other constants like O_APPEND with Dv.
- Sort options.
- Change indentation width for readability.
- Fix a couple of typos.
- Sort symbols list.
- Use Sy instead of Cm for symbols. They are not command modifiers.
- Use Ex -std in the EXIT STATUS section for consistency with other manual
pages.
- Use Ql instead of Dq Li for inline code examples as Li has recently been
deprecated by mdoc.
- Update synopsis to present all available arguments.
- Consistently call the argument specifying an arbitrary directory a
"directory".
- Do not put macros into -width argument to Bl. They do not expand there.
- Stylize command modifiers like "daily" with Cm instead of Pa. While
technically periodic(8) operates on directories with such names, it is
confusing from the perspective of the manual page reader as Pa and Ar are
stylized the same way. Also, I cannot recall a single manual page where
Pa would be used to describe the syntax of command-line arguments.
Michael Tuexen [Wed, 1 Jul 2020 23:47:51 +0000 (23:47 +0000)]
MFC r349893:
This commit updates rack to what is basically being used at NF as
well as sets in some of the groundwork for committing BBR. The
hpts system is updated as well as some other needed utilities
for the entrance of BBR. This is actually part 1 of 3 more
needed commits which will finally complete with BBRv1 being
added as a new tcp stack.
Merge conflics were manually resolved.
Sponsored by: Netflix, Inc.
Differential Revision: https://reviews.freebsd.org/D20834
Michael Tuexen [Wed, 1 Jul 2020 22:22:26 +0000 (22:22 +0000)]
MFC r356663:
Fix race when accepting TCP connections.
When expanding a SYN-cache entry to a socket/inp a two step approach was
taken:
1) The local address was filled in, then the inp was added to the hash
table.
2) The remote address was filled in and the inp was relocated in the
hash table.
Before the epoch changes, a write lock was held when this happens and
the code looking up entries was holding a corresponding read lock.
Since the read lock is gone away after the introduction of the
epochs, the half populated inp was found during lookup.
This resulted in processing TCP segments in the context of the wrong
TCP connection.
This patch changes the above procedure in a way that the inp is fully
populated before inserted into the hash table.
Thanks to Paul <devgs@ukr.net> for reporting the issue on the net@
mailing list and for testing the patch!
Reviewed by: rrs@
Sponsored by: Netflix, Inc.
Differential Revision: https://reviews.freebsd.org/D22971
Michael Tuexen [Wed, 1 Jul 2020 21:52:14 +0000 (21:52 +0000)]
MFC r356270:
Improve input validation of the spp_pathmtu field in the
SCTP_PEER_ADDR_PARAMS socket option. The code in the stack assumes
sane values for the MTU.
This issue was found by running an instance of syzkaller.
MFC r356271:
Remove empty line which was added in previous commit by accident.
Michael Tuexen [Wed, 1 Jul 2020 20:45:26 +0000 (20:45 +0000)]
MFC r355268:
Add a description for the TCP sysctl variable rfc6675_pipe.
It was introduced by r290122, but no documentation was provided.
This is taken from https://reviews.freebsd.org/D21798, since it
is not related to the feature added there.
Michael Tuexen [Wed, 1 Jul 2020 20:41:23 +0000 (20:41 +0000)]
MFC r355266:
In order for the TCP Handshake to support ECN++, and further ECN-related
improvements, the ECN bits need to be exposed to the TCP SYNcache.
This change is a minimal modification to the function headers, without any
functional change intended.
Michael Tuexen [Wed, 1 Jul 2020 20:37:06 +0000 (20:37 +0000)]
MFC r355264:
Update the hostcache also for PTB messages received for SCTP/IPv6.
The corresponding code for SCTP/IPv4 was introduced in
https://svnweb.freebsd.org/base?view=revision&revision=317597
Submitted by: Julius Flohr
Differential Revision: https://reviews.freebsd.org/D22605
Michael Tuexen [Wed, 1 Jul 2020 20:32:51 +0000 (20:32 +0000)]
MFC r355172:
Really ignore the SCTP association identifier on 1-to-1 style sockets
as requiresd by the socket API specification.
Thanks to Inaki Baz Castillo, who found this bug running the userland
stack with valgrind and reported the issue in
https://github.com/sctplab/usrsctp/issues/408
Michael Tuexen [Wed, 1 Jul 2020 20:26:35 +0000 (20:26 +0000)]
MFC r355135: Plug two mbuf leaks during INIT-ACK handling.
One leak happens when there is not enough memory to allocate the
the resources for streams. The other leak happens if the are
unknown parameters in the received INIT-ACK chunk which require
reporting and the INIT-ACK requires sending an ABORT due to illegal
parameter combinations.
Hopefully this fixes
https://bugs.chromium.org/p/oss-fuzz/issues/detail?id=19083
Michael Tuexen [Wed, 1 Jul 2020 19:50:03 +0000 (19:50 +0000)]
MFC r354773: Improve TCP CUBIC specific after idle reaction.
The adjustments are inspired by the Linux stack, which has had a
functionally equivalent implementation for more than a decade now.
MFC r356224: Add curly braces missed in above change
Submitted by: rscheff
Reviewed by: Cheng Cui
Differential Revision: https://reviews.freebsd.org/D18982
if_vtnet: let vtnet_rx_vq_intr() and vtnet_rxq_tq_intr() share code
Since the two functions are similar, introduce a common function
(vtnet_rx_vq_process()) to share common code.
This also improves locking, by ensuring vrxs_rescheduled is accessed
under the RXQ lock, and taskqueue_enqueue() is not called under the
lock (therefore avoiding a spurious duplicate lock warning).
Michael Tuexen [Wed, 1 Jul 2020 19:23:15 +0000 (19:23 +0000)]
MFC r354772: Implement a TCP CUBIC-specific after idle reaction.
This patch addresses a very common case of frequent application stalls,
where TCP runs idle and looses the state of the network.
Submitted by: rscheff
Reviewed by: Cheng Cui
Differential Revision: https://reviews.freebsd.org/D18954
Michael Tuexen [Wed, 1 Jul 2020 18:03:38 +0000 (18:03 +0000)]
MFC r353480: Use event handler in SCTP
Use an event handler to notify the SCTP about IP address changes
instead of calling an SCTP specific function from the IP code.
This is a requirement of supporting SCTP as a kernel loadable module.
This patch was developed by markj@, I tweaked a bit the SCTP related
code.
MFC r362006: Prevent TCP Cubic to abruptly increase cwnd after app-limited
Cubic calculates the new cwnd based on absolute time
elapsed since the start of an epoch. A cubic epoch is
started on congestion events, or once the congestion
avoidance phase is started, after slow-start has
completed.
When a sender is application limited for an extended
amount of time and subsequently a larger volume of data
becomes ready for sending, Cubic recalculates cwnd
with a lingering cubic epoch. This recalculation
of the cwnd can induce a massive increase in cwnd,
causing a burst of data to be sent at line rate by
the sender.
This adds a flag to reset the cubic epoch once a
session transitions from app-limited to cwnd-limited
to prevent the above effect.
MFC r361987: Prevent TCP Cubic to abruptly increase cwnd after slow-start
Introducing flags to track the initial Wmax dragging and exit
from slow-start in TCP Cubic. This prevents sudden jumps in the
caluclated cwnd by cubic, especially when the flow is application
limited during slow start (cwnd can not grow as fast as expected).
The downside is that cubic may remain slightly longer in the
concave region before starting the convex region beyond Wmax again.
MFC r361806: Add O_DIRECT flag to DD for cache bypass
FreeBSD DD utility has not had support for the O_DIRECT flag, which
is useful to bypass local caching, e.g. for unconditionally issuing
NFS IO requests during testing.