tcp: LRO code to deal with all 12 TCP header flags
TCP per RFC793 has 4 reserved flag bits for future use. One
of those bits may be used for Accurate ECN.
This patch is to include these bits in the LRO code to ease
the extensibility if/when these bits are used.
TLS RX support is modeled after TLS TX support. The basic structures and layouts
are almost identical, except that the send tag created filters RX traffic and
not TX traffic.
The TLS RX tag keeps track of past TLS records up to a certain limit,
approximately 1 Gbyte of TCP data. TLS records of same length are joined
into a single database record.
Regularly the HW is queried for TLS RX progress information. The TCP sequence
number gotten from the HW is then matches against the database of TLS TCP
sequence number records and lengths. If a match is found a static params WQE
is queued on the IQ and the hardware should immediately resume decrypting TLS
data until the next non-sequential TCP packet arrives.
Offloading TLS RX data is supported for untagged, prio-tagged, and
regular VLAN traffic.
Currently, unicast/multicast loopback raw ethernet (non-RDMA) packets
are sent back to the vport. A unicast loopback packet is the packet
with destination MAC address the same as the source MAC address. For
multicast, the destination MAC address is in the vport's multicast
filter list.
Moreover, the local loopback is not needed if there is one or none
user space context.
After this patch, the raw ethernet unicast and multicast local
loopback are disabled by default. When there is more than one user
space context, the local loopback is enabled.
Note that when local loopback is disabled, raw ethernet packets are
not looped back to the vport and are forwarded to the next routing
level (eswitch, or multihost switch, or out to the wire depending on
the configuration).
mlx5: Implement flow steering helper functions for TCP sockets.
This change adds convenience functions to setup a flow steering rule based on
a TCP socket. The helper function gets all the address information from the
socket and returns a steering rule, to be used with HW TLS RX offload.
mlx5en: Create and destroy all flow tables and rules when the network interface attaches and detaches.
Previously flow steering tables and rules were only created and destroyed
at link up and down events, respectivly. Due to new requirements for adding
TLS RX flow tables and rules, the main flow steering table must always be
available as there are permanent redirections from the TLS RX flow table
to the vlan flow table.
mlx5en: Force all packets through the indirection table.
All packets must go through the indirection table, RQT,
because it is not possible to modify the RQN of the TIR
for direct dispatchment after it is created, typically
when the link goes up and down.
mlx5en: Implement support for internal queues, IQ.
Internal send queues are regular sendqueues which are reserved for WQE commands
towards the hardware and firmware. These queues typically carry resync
information for ongoing TLS RX connections and when changing schedule queues
for rate limited connections.
The internal queue, IQ, code is more or less a stripped down copy
of the existing SQ managing code with exception of:
1) An optional single segment memory buffer which can be read or
written as a whole by the hardware, may be provided.
2) An optional completion callback for all transmit operations, may
be provided.
3) Does not support mbufs.
mlx5en: Make the receive packet indirection table, RQT, static instead of dynamic.
Allocate the RQT once, pointing all initial entries to the drop RQN.
When opening the channels simplify modify the RQT, directing all traffic
to the new RQNs. Similarly when closing the channels point all RQT entries
back to the so-called drop RQN.
mlx5en: Implement dummy receive queue, RQ, for dropping packets.
What is a drop RQ and why is it needed?
The RSS indirection table, also called the RQT, selects the
destination RQ based on the receive queue number, RQN. The RQT is
frequently referred to by flow steering rules to distribute traffic
among multiple RQs. The problem is that the RQs cannot be destroyed
before the RQT referring them is destroyed too. Further, TLS RX
rules may still be referring to the RQT even if the link went
down. Because there is no magic RQN for dropping packets, we create
a dummy RQ, also called drop RQ, which sole purpose is to drop all
received packets. When the link goes down this RQN is filled in all
RQT entries, of the main RQT, so the real RQs which are about to be
destroyed can be released and the TLS RX rules can be sustained.
mlx5en: Patch to inhibit transmit doorbell writes during packet reception.
During packet reception the network stack frequently transmit data in
response to TCP window updates. To reduce the number of transmit doorbells
needed, inhibit all transmit doorbells designated for the same channel until
after the reception of packets for the given channel is completed.
While at it slightly refactor the mlx5e_tx_notify_hw() function:
1) The doorbell information is always stored into sq->doorbell.d64 .
No need to pass a separate pointer to this variable.
2) Move checks for skipping doorbell writes inside this function.
mlx5en: Use a UMA cache zone for managing TLS send tags
Instead of allocating directly from a normal zone. This way
import and release are guaranteed to process all allocated and then
deallocated items. Also, the release occurs in a sleepable context when
caller of uma_zfree() or uma_zdestroy() can sleep itself.
Andrew Turner [Tue, 1 Feb 2022 11:43:13 +0000 (11:43 +0000)]
Add the Arm SPE interrupt to acpidump
To support the Arm Statistical Profiling Extension (SPE) ACPI 6.3 added
a place to hold the SPE interrupt. Add to acpidump to show when printing
the Arm Generic Interrupt data.
ufs, msdosfs: do not record witness order when creating vnode
When allocating new vnode, we need to lock it exclusively before
making it externally visible. Since other threads cannot observe the
vnode yet, current lock order cannot create LoR conditions.
Reviewed by: mckusick
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D34126
It contains assert-related definitions previously provided by
sys/systm.h. The new header is leaner than whole systm.h.
Include kassert.h from systm.h for compatibility.
John Baldwin [Tue, 1 Feb 2022 01:33:31 +0000 (17:33 -0800)]
fstyp: Remove __packed from struct exfat_de_label.
This fixes a -Waddress-of-packed-member warning about a possibly
unaligned pointer from GCC 9 when calling convert_label().
__packed has to be removed from struct exfat_dirent as well to fix an
alignment warning when casting from a struct exfat_dirent pointer to a
struct exfat_de_label pointer.
Reviewed by: cem
Differential Revision: https://reviews.freebsd.org/D32144
John Baldwin [Tue, 1 Feb 2022 01:11:27 +0000 (17:11 -0800)]
hyperv storvsc: Don't abuse struct sglist to hold virtual addresses.
struct sglist is intended for holding S/G lists of physical address
ranges, not virtual address ranges. GCC 9.x issues several warnings
due to casts between pointers and integers of different sizes as a
result (vm_paddr_t is 64-bits on i386). Instead, add a local 'struct
hv_sglist' which uses an array of 'struct iovec' to hold the S/G list
of virtual address ranges.
John Baldwin [Tue, 1 Feb 2022 00:40:04 +0000 (16:40 -0800)]
tcp_ratelimit: Handle some edge cases with TLS + RL send tags.
- After a connection has fallen back from NIC TLS to SW TLS, any
pacing rate changes should modify the inpcb send tag even though
SB_TLS_IFNET is set.
- If a connection tries to modify the pacing rate before the send
tag has been converted from plain TLS to TLS + RL, don't fail
the rate request set but let it fall through to setting the rate
on the non-TLS inpcb RL tag.
John Baldwin [Tue, 1 Feb 2022 00:39:21 +0000 (16:39 -0800)]
ktls: Try to enable TOE TLS after marking existing data not ready.
At the moment this is mostly a no-op but in the future there will be
in-flight encrypted data which requires software decryption. This
same setup is also needed for NIC TLS RX.
Note that this does break TOE TLS RX for AES-CBC ciphers since there
is no software fallback for AES-CBC receive. This will be resolved
one way or another before 14.0 is released.
Mark Johnston [Mon, 31 Jan 2022 21:14:00 +0000 (16:14 -0500)]
pf: Initialize pf_kpool mutexes earlier
There are some error paths in ioctl handlers that will call
pf_krule_free() before the rule's rpool.mtx field is initialized,
causing a panic with INVARIANTS enabled.
Fix the problem by introducing pf_krule_alloc() and initializing the
mutex there. This does mean that the rule->krule and pool->kpool
conversion functions need to stop zeroing the input structure, but I
don't see a nicer way to handle this except perhaps by guarding the
mtx_destroy() with a mtx_initialized() check.
Constify some related functions while here and add a regression test
based on a syzkaller reproducer.
Reported by: syzbot+77cd12872691d219c158@syzkaller.appspotmail.com
Reviewed by: kp
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D34115
Kristof Provost [Mon, 31 Jan 2022 17:31:53 +0000 (18:31 +0100)]
libpfctl: fix pfctl_kill_states()
735748f30a changed the output of the states so that the creator id
endianness would be consistent. This means that we need to convert the
host endianness creatorid back to big-endian before we give it to the
kernel.
Kornel Duleba [Thu, 27 Jan 2022 10:25:25 +0000 (11:25 +0100)]
enetc: Wait for pending transmissions before disabling TX queues
According to the RM it's not safe to disable a TX ring while it is busy
transmitting frames.
In order to be safe wait until the ring is empty. (cidx==pidx)
Use this opportunity to remove a set-but-unused variable.
Obtained from: Semihalf
Sponsored by: Alstom Group
Kornel Duleba [Thu, 27 Jan 2022 09:24:26 +0000 (10:24 +0100)]
enetc: Simply TX ring credits counting logic
According to the RM rings can hold at most ring_size - 1 descriptors at any time.
No additional logic is needed since iflib already respects this constrain.
Thanks to that the pidx == cidx situation is not ambiguous and indicates an
empty ring.
Use that to simplify the logic that calculates the amount of processed frames.
Obtained from: Semihalf
Sponsored by: Alstom Group
Kornel Duleba [Thu, 27 Jan 2022 08:26:07 +0000 (09:26 +0100)]
enetc: Disable HW IP packet alignment
The NIC can IP align received packets.
It was observed that it caused some rare stalls, that required full board reset.
Disable this feature for now. It doesn't provide any significant performance
improvement anyway.
Obtained from: Semihalf
Sponsored by: Alstom Group
ufs: be more persistent with finishing some operations
when the vnode is doomed after relock. The mere fact that the vnode is
doomed does not prevent us from doing UFS operations on it while it is
still belongs to UFS, which is determined by non-NULL v_data. Not
finishing some operations, e.g. not syncing the inode block only because
the vnode started reclamation, is not correct.
Add macro IS_UFS() which incapsulates the v_data != NULL, and use it
instead of VN_IS_DOOMED() for places where the operation completion is
important.
Reviewed by: markj, mckusick
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D34072
syncer VOP_FSYNC(): unlock syncer vnode around call to VFS_SYNC()
The lock is unneccessary since the mount point is busied, which prevents
unmount and syncer vnode deallocation. Having the vnode locked causes
innocent LoRs and complicates debugging.
Also stop starting write accounting around it. Any caller of
VOP_FSYNC() must do it already, and sync_vnode() does.
Reported and tested by: pho
Reviewed by: markj, mckusick
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D34072
The buffer must not be accessed by any other thread, it is freshly
allocated. As such, LK_NOWAIT should be nop but also it prevents
recording the order between the buffer lock and any other locks we might
own in the call to getnewbuf(). In particular, if we own FFS snap lock,
it should avoid triggering false positive warning.
Reviewed by: markj, mckusick
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D34072
Kirk McKusick [Mon, 31 Jan 2022 01:20:10 +0000 (17:20 -0800)]
In GEOM debugging output, show consumer for cloned and duplicated bio's.
When using bio's created by g_clone_bio() or g_duplicate_bio()
their consumer device (the device to which their I/O requests
are sent) is listed by the geom debugging facility as [unknown].
If available, this update lists the consumer associated with
the bio's parent.
Dimitry Andric [Sun, 30 Jan 2022 20:41:24 +0000 (21:41 +0100)]
Apply clang fix for assertion failure compiling science/chrono
Merge commit 6b0f35931a44 from llvm git (by Jennifer Yu):
Fix signal during the call to checkOpenMPLoop.
The root problem is a null pointer is accessed during the call to
checkOpenMPLoop, because loop up bound expr is an error expression
due to error diagnostic was emit early.
To fix this, in setLCDeclAndLB, setUB and setStep instead return false,
return true when LB, UB or Step contains Error, so that the checking is
stopped in checkOpenMPLoop.
Note this only fixes the assertion reported in bug 261567; some other
tweaks for port dependencies are probably still required to make it
build to completion.
For historical reasons, the integer is stored with an offset of plus 14.
That means, for a given max path length of 1024 the valid values
are -1009 .. 1037 and not -1023 .. 1023
Alexander Motin [Sun, 30 Jan 2022 02:59:03 +0000 (21:59 -0500)]
GEOM: Set G_CF_DIRECT_SEND/RECEIVE for taste consumers.
All I/O requests through the taste consumers are synchronous, done
with g_read_data() and without any locks held. It makes no sense
to delegate the I/O to g_down/g_up threads.
This removes many of context switches during disk retaste.
If the NVMe Controller doesn't support Namespace Management, it should
return "Invalid Namespace or Format" when the Host request Identify
Namespace with the global NSID value.
Chuck Tuffli [Sun, 30 Jan 2022 07:10:59 +0000 (23:10 -0800)]
bhyve nvme: Fix Set Features, AEN
NVMe Controllers which do not support Endurance Groups must return an
error when the Endurance Group Event Aggregate Log Change Notices bit is
set in Set Features, Asynchronous Event Configuration.
Chuck Tuffli [Sun, 30 Jan 2022 07:09:57 +0000 (23:09 -0800)]
bhyve nvme: Fix LBA out-of-range calculation
The function which checks for a valid LBA range mistakenly named an
input value as NLB ("Number of Logical Blocks") instead of "number of
blocks". The NVMe specification defines NLB as a zero-based value (i.e.
NLB=0x0 represents 1 block, 0x1 is 2 blocks, etc.), but the passed
parameter is a 1's-based value.
Fix is to rename the variable to avoid future confusion.
While in the neighborhood, also check that the starting LBA is less than
the size of the backing storage to avoid an integer overflow.
Chuck Tuffli [Sun, 30 Jan 2022 07:09:10 +0000 (23:09 -0800)]
bhyve nvme: Update v1.4 Identify Controller data
Compliant v1.4 Controllers must report a Controller Type (CNTRLTYPE).
Also, do not advertise secure erase functionality in the Format NVM
Attributes field of the Identify Controller data structure as the
Controller does not implement secure erase.
Chuck Tuffli [Sun, 30 Jan 2022 07:08:47 +0000 (23:08 -0800)]
bhyve nvme: Add Temperature Threshold support
This adds the ability for a guest OS to send Set / Get Feature,
Temperature Threshold commands. The implementation assumes a constant
temperature and will generate an Asynchronous Event Notification if the
specified threshold is above/below this value. Although the
specification allows 9 temperature values, this implementation only
implements the Composite Temperature.
While in the neighborhood, move the clear of the CSTS register in the
reset function after all other cleanup. This avoids a race with the
guest thinking the reset is complete (i.e. CSTS.RDY = 0) before the NVMe
emulation is actually complete with the reset.
Chuck Tuffli [Sun, 30 Jan 2022 07:07:29 +0000 (23:07 -0800)]
bhyve nvme: Remove redundant AER Limit checks
The NVMe emulation checked if the Asynchronous Event Request Limit
(a.k.a AERL) would be exceeded in pci_nvme_aer_add(), but this function
is only called from nvme_opc_async_event_req() which also checks for
exceeding the AERL.