Mateusz Guzik [Tue, 18 May 2021 19:07:19 +0000 (21:07 +0200)]
lockprof: add contested-only profiling
This allows tracking all wait times with much smaller runtime impact.
For example when doing -j 104 buildkernel on tmpfs:
no profiling: 2921.70s user 282.72s system 6598% cpu 48.562 total
all acquires: 2926.87s user 350.53s system 6656% cpu 49.237 total
contested only: 2919.64s user 290.31s system 6583% cpu 48.756 total
This change makes the TCP LRO code more generic and flexible with regards
to supporting multiple different TCP encapsulation protocols and in general
lays the ground for broader TCP LRO support. The main job of the TCP LRO code is
to merge TCP packets for the same flow, to reduce the number of calls to upper
layers. This reduces CPU and increases performance, due to being able to send
larger TSO offloaded data chunks at a time. Basically the TCP LRO makes it
possible to avoid per-packet interaction by the host CPU.
Because the current TCP LRO code was tightly bound and optimized for TCP/IP
over ethernet only, several larger changes were needed. Also a minor bug was
fixed in the flushing mechanism for inactive entries, where the expire time,
"le->mtime" was not always properly set.
To avoid having to re-run time consuming regression tests for every change,
it was chosen to squash the following list of changes into a single commit:
- Refactor parsing of all address information into the "lro_parser" structure.
This easily allows to reuse parsing code for inner headers.
- Speedup header data comparison. Don't compare field by field, but
instead use an unsigned long array, where the fields get packed.
- Refactor the IPv4/TCP/UDP checksum computations, so that they may be computed
recursivly, only applying deltas as the result of updating payload data.
- Make smaller inline functions doing one operation at a time instead of
big functions having repeated code.
- Refactor the TCP ACK compression code to only execute once
per TCP LRO flush. This gives a minor performance improvement and
keeps the code simple.
- Use sbintime() for all time-keeping. This change also fixes flushing
of inactive entries.
- Try to shrink the size of the LRO entry, because it is frequently zeroed.
- Removed unused TCP LRO macros.
- Cleanup unused TCP LRO statistics counters while at it.
- Try to use __predict_true() and predict_false() to optimise CPU branch
predictions.
Bump the __FreeBSD_version due to adding new member to the "lro_ctrl" structure.
Propagate down USB explore error codes, so that failures to enumerate USB HUBs
behind USB HUBs are detected and the USB reset counter logic will kick in
preventing enumeration of continuously failing ports.
Update USB_PORT_RESET_RECOVERY to comply with the USB 2.0 specification which
says it should be max 10 milliseconds.
This may fix some USB enumeration issues:
> usbd_req_re_enumerate: addr=3, set address failed! (USB_ERR_IOERROR, ignored)
> usbd_setup_device_desc: getting device descriptor at addr 3 failed,
Found by: Zhichao1.Li@dell.com
Sponsored by: Mellanox Technologies // NVIDIA Networking
Lutz Donnerhacke [Sun, 23 May 2021 12:43:00 +0000 (14:43 +0200)]
tests/libalias: Test LibAliasIn and redirection
Rework the tests to check the correct layer in a single test. Factor
out tests for reuse in other modules. Extend the test suite for
libalias(3) to incoming connections. Test the various types of
redirections.
gettimeofday(3) is almost as expensive as the calls to libalias.
So the call frequency for this call is reduced by a factor of 1000 in
order to neglect it's influence.
Using NAT entries became more realistic: A communication of a random
length of up to 150 packets (10% outgoing, 90% incoming) is applied
for each entry.
Add port forwardings to the performance tests. This will cause random
incoming packets to match the random port forwardings opends beforehand.
After a long test run, a lot of ressouces have been allocated.
Measure the time tot free them.
Dmitry Chagin [Wed, 19 May 2021 21:08:25 +0000 (00:08 +0300)]
tcsh: cleanup source tree to reduce diff size.
Remove makefiles, configure files and unused at build time files
to reduce the diff size. Otherwise the diff contains a lot of
unnecessary lines what makes reviewing and merging proccess so hard,
especially for re@.
Kristof Provost [Tue, 18 May 2021 07:24:50 +0000 (09:24 +0200)]
pf: Move nvlist conversion functions to pf_nv
Separate the conversion functions (between kernel structs and nvlists)
to pf_nv. This reduces the size of pf_ioctl.c, which is already quite
large and complex, a good bit. It also keeps all the fairly
straightforward conversion code together.
This update fixes the initialization of "scale" to 20 if started with
-l and the initial statement leads to an error (e.g. contains a syntax
error). Scale was initialized to 0 in that case.
Another change is the support of job control in interactive mode with
line editing enabled. The control characters have been interpreted as
editing commands only, prior to this version.
Rick Macklem [Tue, 18 May 2021 22:53:54 +0000 (15:53 -0700)]
nfsd: Add support for CLAIM_DELEG_CUR_FH to the NFSv4.1/4.2 Open
The Linux NFSv4.1/4.2 client now uses the CLAIM_DELEG_CUR_FH
variant of the Open operation when delegations are recalled and
the client has a local open of the file. This patch adds
support for this variant of Open to the NFSv4.1/4.2 server.
This patch only affects mounts from Linux clients when delegations
are enabled on the server.
Rick Macklem [Tue, 18 May 2021 23:17:58 +0000 (16:17 -0700)]
nfsd: Reduce the callback timeout to 800msec
Recent discussion on the nfsv4@ietf.org mailing list confirmed
that an NFSv4 server should reply to an RPC in less than 1second.
If an NFSv4 RPC requires a delegation be recalled,
the server will attempt a CB_RECALL callback.
If the client is not responsive, the RPC reply will be delayed
until the callback times out.
Without this patch, the timeout is set to 4 seconds (set in
ticks, but used as seconds), resulting in the RPC reply taking over 4sec.
This patch redefines the constant as being in milliseconds and it
implements that for a value of 800msec, to ensure the RPC
reply is sent in less than 1second.
This patch only affects mounts from clients when delegations
are enabled on the server and the client is unresponsive to callbacks.
tcp: Use local CC data only in the correct context
Most CC algos do use local data, and when calling
newreno_cong_signal from there, the latter misinterprets
the data as its own struct, leading to incorrect behavior.
Reported by: chengc_netapp.com
Reviewed By: chengc_netapp.com, tuexen, #transport
MFC after: 3 days
Sponsored By: NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D30470
Rick Macklem [Sun, 16 May 2021 23:40:01 +0000 (16:40 -0700)]
NFSv4 server: Re-establish the delegation recall timeout
Commit 7a606f280a3e allowed the server to do retries of CB_RECALL
callbacks every couple of seconds. This was needed to allow the
Linux client to re-establish the back channel.
However this patch broke the delegation timeout check, such that
it would just keep retrying CB_RECALLS.
If the client has crashed or been network patitioned from the
server, this continues until the client TCP reconnects to
the server and re-establishes the back channel.
This patch modifies the code such that it still times out the
delegation recall after some minutes, so that the server will
allow the conflicting client request once the delegation times out.
This patch only affects the NFSv4 server when delegations are
enabled and a NFSv4 client that holds a delegation has crashed
or been network partitioned from the server for at least several
minutes when a delegation needs to be recalled.
update_rtm_from_rc() calls update_rtm_from_info() internally.
The latter one may update provided prtm pointer with a new rtm.
Reassign rtm from prtm afeter calling update_rtm_from_info() to
avoid touching the freed rtm.
Fix panic when trying to delete non-existent gateway in multipath route.
IF non-existend gateway was specified, the code responsible for calculating
an updated nexthop group, returned the same already-used nexthop group.
After the route table update, the operation result contained the same
old & new nexthop groups. Thus, the code responsible for decomposing
the notification to the list of simple nexthop-level notifications,
was not able to find any differences. As a result, it hasn't updated any
of the "simple" notification fields, resulting in empty rtentry pointer.
This empty pointer was the direct reason of a panic.
Fix the problem by returning ESRCH when the new nexthop group is the same
as the old one after applying gateway filter.
Reported by: Michael <michael.adm at gmail.com>
PR: 255665
Lutz Donnerhacke [Sun, 16 May 2021 21:37:37 +0000 (23:37 +0200)]
test/libalias: Tests for instantiation and outgoing NAT
In order to modify libalias for performance, the existing
functionality must not change. Enforce this.
Testing LibAliasOut functionality. This concentrates the typical use
case of initiating data transfers from the inside. Provide a
exhaustive test for the data structure in order to check for
performance improvements.
In order to compare upcoming changes for their effectivness, measure
performance by counting opertions and the runtime of each operation
over the time. Accumulate all tests in a single instance, so make it
complicated over the time. If you wait long enough, you will notice
the expiry of old flows.
Lutz Donnerhacke [Fri, 14 May 2021 13:08:08 +0000 (15:08 +0200)]
libalias: Style cleanup
libalias is a convolut of various coding styles modified by a series
of different editors enforcing interesting convetions on spacing and
comments.
This patch is a baseline to start with a perfomance rework of
libalias. Upcoming patches should be focus on the code, not on the
style. That's why most annoying style errors should be fixed
beforehand.
Jessica Clarke [Sat, 29 May 2021 03:36:36 +0000 (04:36 +0100)]
.github: Attempt to un-break Clang 9 action
GitHub removed Clang 9 from the 20.04 image[1], breaking this build.
Thus, manually add the specific versioned packages we need for the
Ubuntu jobs to ensure they're installed. Note that we don't do the same
for macOS, as Homebrew does not allow multiple llvm@N to co-exist,
giving an error if you attempt to install a second one. In practice we
don't actually use the compiler field here for anything other than the
build name, it's only the cross-bindir that matters, so when it
eventually moves to 12 the name will get confusing but the job will
still work.
Mark Johnston [Fri, 28 May 2021 14:41:43 +0000 (10:41 -0400)]
libradius: Fix attribute length validation in rad_get_attr(3)
The length of the attribute header needs to be excluded when comparing
the attribute length against the length of the packet. Otherwise,
validation may incorrectly fail when fetching the final attribute in a
message.
Fixes: 8d5c78130 ("libradius: Fix input validation bugs")
Reported by: Peter Eriksson
Tested by: Peter Eriksson
Sponsored by: The FreeBSD Foundation
Nathan Whitehorn [Fri, 14 May 2021 12:30:41 +0000 (08:30 -0400)]
Fix scripted installs on EFI systems using ZFS root with zfsboot.
Unlike attended installations, scripted installs did not mount non-ZFS
partitions when ZFS root (via zfsboot) was selected. Since this included
the ESP, the EFI loader was not installed. Copy logic from the
attended-install path to make this work.
Kevin Bowling [Wed, 21 Apr 2021 02:35:14 +0000 (19:35 -0700)]
ixgbe: Improve device name strings
This is just clerical work to ease bug triage and may be used to set
expectations around the ability for anyone in the community to perform
testing and development on older parts.
Kevin Bowling [Sun, 25 Apr 2021 08:22:23 +0000 (01:22 -0700)]
e1000: Rework em_msi_link interrupt filter
* Fix 82574 Link Status Changes, carrying the OTHER mask bit around as
needed.
* Move igb-class LSC re-arming out of FAST back into the handler.
* Clarify spurious/other interrupt re-arms in FAST.
In MSI-X mode, 82574 and igb-class devices use an interrupt filter to
handle Link Status Changes. We want to do LSC re-arms in the handler
to take advantage of autoclear (EIAC) single shot behavior.
82574 uses 'Other' in ICR and IMS for LSC interrupt types when in MSI-X
mode, so we need to set and re-arm the 'Other' bit during attach and
after ICR reads in the FAST handler if not an LSC or after handling on
LSC due to autoclearing.
This work was primarily done to address the referenced PR, but inspired
some clarification and improvement for igb-class devices once the
intentions of previous bug fix attempts became clearer.
Kevin Bowling [Wed, 21 Apr 2021 05:27:48 +0000 (22:27 -0700)]
e1000: Improve device name strings
This is just clerical work to ease bug triage and may be used to set
expectations around the ability for anyone in the community to perform
testing and development on older parts (this driver covers over 20 years
of silicon)
Reviewed by: erj
Approved by: markj
Sponsored by: Pink Floyd - Any Colour You Like (in kind)
Differential Revision: https://reviews.freebsd.org/D29872
Kevin Bowling [Fri, 16 Apr 2021 23:15:50 +0000 (16:15 -0700)]
e1000: Correct promisc multicast filter handling
There are a number of issues in the e1000 multicast filter handling
that have been present for a long time. Take the updated approach from
ixgbe(4) which does not have the issues.
The issues are outlined in the PR, in particular this solves crossing
over and under the hardware's filter limit, not programming the
hardware filter when we are above its limit, disabling SBP (show bad
packets) when the tunable is enabled and exiting promiscuous mode, and
an off-by-one error in the em_copy_maddr function.
Don Morris [Thu, 20 May 2021 14:54:38 +0000 (10:54 -0400)]
ufs: Avoid M_WAITOK allocations when building a dirhash
At this point the directory's vnode lock is held, so blocking while
waiting for free pages makes the system more susceptible to deadlock in
low memory conditions. This is particularly problematic on NUMA systems
as UMA currently implements a strict first-touch policy.
ufsdirhash_build() already uses M_NOWAIT for other allocations and
already handled failures for the block array allocation, so just convert
to M_NOWAIT.
Lutz Donnerhacke [Thu, 11 Feb 2021 22:59:11 +0000 (23:59 +0100)]
netgraph/ng_bridge: Avoid cache thrashing
Hint the compiler, that this update is needed at most once per second.
Only in this case the memory line needs to be written. This will
reduce the amount of cache trashing during forward of most frames.
Lutz Donnerhacke [Tue, 12 Jan 2021 21:28:29 +0000 (22:28 +0100)]
netgraph/ng_bridge: become SMP aware
The node ng_bridge underwent a lot of changes in the last few months.
All those steps were necessary to distinguish between structure
modifying and read-only data transport paths. Now it's done, the node
can perform frame forwarding on multiple cores in parallel.
Use the new control message to move ethernet addresses from a link to
a new link in ng_bridge(4). Send this message instead of doing the
work directly requires to move the loop detection into the control
message processing. This will delay the loop detection by a few
frames.
This decouples the read-only activity from the modification under a
more strict writer lock.
netgraph/ng_bridge: learn MACs via control message
Add a new control message to move ethernet addresses to a given link
in ng_bridge(4). Send this message instead of doing the work directly.
This decouples the read-only activity from the modification under a
more strict writer lock.
Decoupling the work is a prerequisite for multithreaded operation.
Kristof Provost [Mon, 24 May 2021 06:32:16 +0000 (08:32 +0200)]
pf: fix ioctl() memory leak
When we create an nvlist and insert it into another nvlist we must
remember to destroy it. The nvlist_add_nvlist() function makes a copy,
just like nvlist_add_string() makes a copy of the string. If we don't
we're leaking memory on every (nvlist-based) ioctl() call.
While here remove two redundant 'break' statements.
Kristof Provost [Tue, 18 May 2021 13:03:01 +0000 (15:03 +0200)]
pfctl: Fix crash on ALTQ configuration
The following config could crash pfctl:
altq on igb0 fairq bandwidth 1Gb queue { qLink }
queue qLink fairq(default)
That happens because when we're parsing the parent queue (on igb0) it
doesn't have a parent, and the check in eval_pfqueue_fairq() checks
pa->parent rather than parent.
Kristof Provost [Thu, 13 May 2021 07:51:28 +0000 (09:51 +0200)]
pf: Support killing floating states by interface
Floating states get assigned to interface 'all' (V_pfi_all), so when we
try to flush all states for an interface states originally created
through this interface are not flushed. Only if-bound states can be
flushed in this way.
Given that we track the original interface we can check if the state's
interface is 'all', and if so compare to the orig_if instead.
If copyin family of routines fault, kernel does clear PSL.AC on the
fault entry, but the AC flag of the faulted frame is kept intact. Since
onfault handler is effectively jump, AC survives until syscall exit.
Reported by: m00nbsd, via Sony
Reviewed by: markj
Sponsored by: The FreeBSD Foundation
admbugs: 975
Mark Johnston [Wed, 12 May 2021 13:39:36 +0000 (09:39 -0400)]
Fix mbuf leaks in various pru_send implementations
The various protocol implementations are not very consistent about
freeing mbufs in error paths. In general, all protocols must free both
"m" and "control" upon an error, except if PRUS_NOTREADY is specified
(this is only implemented by TCP and unix(4) and requires further work
not handled in this diff), in which case "control" still must be freed.
This diff plugs various leaks in the pru_send implementations.
Reviewed by: tuexen
Sponsored by: The FreeBSD Foundation
Kristof Provost [Sun, 16 May 2021 06:51:54 +0000 (08:51 +0200)]
pf tests: More set skip on <ifgroup> tests
Test the specific case reported in PR 255852. Clearing the skip flag
on groups was broken because pfctl couldn't work out if a kif was a
group or not, because the kernel no longer set the pfik_group pointer.
Kristof Provost [Sun, 16 May 2021 06:50:17 +0000 (08:50 +0200)]
pf: Set the pfik_group for userspace
Userspace relies on this pointer to work out if the kif is a group or
not. It can't use it for anything else, because it's a pointer to a
kernel address. Substitute 0xfeedc0de for 'true', so that we don't leak
kernel memory addresses to userspace.
Alexander Motin [Sat, 24 Apr 2021 03:18:01 +0000 (23:18 -0400)]
mpr/mps(4): Make device mapping some more robust.
Allow new enclosure to replace previously existing one if there is
no completely unused table entry, same as it is done for devices.
If we can not process DPM due to corruption -- wipe it and restart
from scratch. Otherwise I don't see a way to recover persistence if
something go wrong and there is no BIOS to recover it for us.
Together this solves a problem that appeared when 9300-8i firmware
update to 16.00.10.00 somehow switched its mapping mode from Device
Persistence to Enclosure/Slot without wiping the DPM table. It made
HBA completely unusable, since overflowed and conflicting mapping
table was unable to map any of enclosures and so devices.
Also while there make some enclosure mapping errors more informative.