Roy Marples [Wed, 20 Oct 2021 15:47:29 +0000 (11:47 -0400)]
net: Allow binding of unspecified address without address existance
Previously in_pcbbind_setup returned EADDRNOTAVAIL for empty
V_in_ifaddrhead (i.e., no IPv4 addresses configured) and in6_pcbbind
did the same for empty V_in6_ifaddrhead (no IPv6 addresses).
An equivalent test has existed since 4.4-Lite. It was presumably done
to avoid extra work (assuming the address isn't going to be found
later).
In normal system operation *_ifaddrhead will not be empty: they will
at least have the loopback address(es). In practice no work will be
avoided.
Further, this case caused net/dhcpd to fail when run early in boot
before assignment of any addresses. It should be possible to bind the
unspecified address even if no addresses have been configured yet, so
just remove the tests.
The now-removed "XXX broken" comments were added in 59562606b9d3,
which converted the ifaddr lists to TAILQs. As far as I (emaste) can
tell the brokenness is the issue described above, not some aspect of
the TAILQ conversion.
Bjoern A. Zeeb [Tue, 9 Nov 2021 22:11:15 +0000 (22:11 +0000)]
arm64/gicv3: improve a panic message
Print the device/unit in the panic message for which we cannot get
the MSI device ID to have a clue where to start looking.
While here use __func__ instead of hardcoding the function name.
Dmitry Salychev [Sat, 7 Aug 2021 17:17:57 +0000 (17:17 +0000)]
Parse named nodes from IORT ACPI on arm64
Add the ability to map named components from IORT to their
SMMU or ITS node in order to setup interrupts.
It is now possible to find a node by its name (substring) and
resource ID similar to PCI nodes.
This is needed by work on a driver for NXP's Second Generation
Data Path Acceleration Architecture (DPAA2).
bhyve: net_backends, automatically IFF_UP tap devices
If you want communications with the outside world and tell bhyve to
create an interfaces then it should be usable as well.
Rather than relying on the sysctl net.link.tap.up_on_open automatically
try to IFF_UP the opened tap device.
Bjoern A. Zeeb [Mon, 25 Oct 2021 16:18:11 +0000 (16:18 +0000)]
mlx4: rename conflicting netdev_priv() to mlx4_netdev_priv()
netdev_priv() is a LinuxKPI function which was used with the old ifnet
linux/netdevice.h implementation which was not adaptable to modern
Linux drviers unless rewriting them for ifnet in first place which
defeats the purpose.
Rename the netdev_priv() calls in mlx4 to mlx4_netdev_priv()
returning the ifnet softc to avoid conflicting symbol names
with different implementations in the future.
Bjoern A. Zeeb [Mon, 25 Oct 2021 17:06:09 +0000 (17:06 +0000)]
LinuxKPI: pci.h make pci_dev argument const for pci_{read,write}_config*()
Make the struct pci_dev argument to the pci_{read,write}_config*()
functions "const" to match the Linux definition as some drivers
try to pass in a const argument which we currently fail to honor.
Bjoern A. Zeeb [Mon, 25 Oct 2021 17:15:01 +0000 (17:15 +0000)]
LinuxKPI: pci.h / linux_pci.c rename pci_driver field
Rename the struct pci_driver {} field to the list_head from links
to node as a driver is actually initialsing this to {} which seems
questionable but it will at least make us match the Linux structure
field name.
Bjoern A. Zeeb [Sun, 6 Jun 2021 21:14:28 +0000 (21:14 +0000)]
net80211/drivers: improve ieee80211_rx_stats for band
While IEEE80211_R_BAND was defined, there was no place to store the
band. Add a field for that, adjust ieee80211_lookup_channel_rxstatus()
to require it, and update drivers passing "R_{FREQ|IEEE}" in already to
provide the band as well. For the moment keep the fall-back code
requiring all three fields.
net80211: correct input_sta length checks and control frame handling
Correct input_sta "assertion" checks. CTS/ACK CTRL frames are shorter
then sizeof(struct ieee80211_frame_min) and were thus running into the
is_rx_tooshort error case.
Use ieee80211_anyhdrsize() to handle this better but make sure we do
at least have the first 2 octets needed for that.
While here move the safety checks before any code which may not obey
them later, just for good style.
The non-scanning check further down assumes a frame format also not
matching control frames. For now skip the checks for control frames
which allows us to deal with some of them at least now.
Sponsored by: The FreeBSD Foundation
Obtained from: 20210906 wireless v0.91 code drop
Bjoern A. Zeeb [Wed, 6 Oct 2021 18:09:39 +0000 (18:09 +0000)]
net80211: correct length check in ieee80211_ies_expand()
In ieee80211_ies_expand() we are looping over Elements
(also known as Information Elements or IEs).
The comment suggests that we assume well-formedness of
the IEs themselves.
Checking the buffer length being least 2 (1 byte Element ID and
1 byte Length fields) rather than just 1 before accessing ie[1]
is still good practise and can prevent and out-of-bounds read in
case the input is not behaving according to the comment.
Bjoern A. Zeeb [Wed, 6 Oct 2021 18:41:37 +0000 (18:41 +0000)]
net80211: proper ssid length check in setmlme_assoc_adhoc()
A user supplied SSID length is used without proper checks in
setmlme_assoc_adhoc() which can lead to copies beyond the end
of the user supplied buffer.
The ssid is a fixed size array for the ioctl and the argument
to setmlme_assoc_adhoc().
In addition to an ssid_len check of 0 also error in case the
ssid_len is larger than the size of the ssid array to prevent
problems.
Mathy Vanhoef [Sun, 6 Jun 2021 22:10:56 +0000 (22:10 +0000)]
net80211: prevent plaintext injection by A-MSDU RFC1042/EAPOL frames
No longer accept plaintext A-MSDU frames that start with an RFC1042
header with EtherType EAPOL. This is done by only accepting EAPOL
packets that are included in non-aggregated 802.11 frames.
Note that before this patch, FreeBSD also only accepted EAPOL frames
that are sent in a non-aggregated 802.11 frame due to bugs in
processing EAPOL packets inside A-MSDUs. In other words,
compatibility with legitimate devices remains the same.
This relates to section 6.5 in the 2021 Usenix "FragAttacks" (Fragment
and Forge: Breaking Wi-Fi Through Frame Aggregation and Fragmentation)
paper.
Mathy Vanhoef [Sun, 6 Jun 2021 22:10:52 +0000 (22:10 +0000)]
net80211: mitigation against A-MSDU design flaw
Mitigate A-MSDU injection attacks by detecting if the destination address
of a subframe equals an RFC1042 (i.e., LLC/SNAP) header, and if so
dropping the complete A-MSDU frame. This mitigates known attacks,
although new (unknown) aggregation-based attacks may remain possible.
This defense works because in A-MSDU aggregation injection attacks, a
normal encrypted Wi-Fi frame is turned into an A-MSDU frame. This means
the first 6 bytes of the first A-MSDU subframe correspond to an RFC1042
header. In other words, the destination MAC address of the first A-MSDU
subframe contains the start of an RFC1042 header during an aggregation
attack. We can detect this and thereby prevent this specific attack.
This relates to section 7.2 in the 2021 Usenix "FragAttacks" (Fragment
and Forge: Breaking Wi-Fi Through Frame Aggregation and Fragmentation)
paper.
ieee80211_defrag() accepts fragmented 802.11 frames in a protected Wi-Fi
network even when some of the fragments are not encrypted.
Track whether the fragments are encrypted or not and only accept
successive ones if they match the state of the first fragment.
This relates to section 6.3 in the 2021 Usenix "FragAttacks" (Fragment
and Forge: Breaking Wi-Fi Through Frame Aggregation and Fragmentation)
paper.
Coherent is lower 32bit only by default in Linux and our only default
dma mask is 64bit currently which violates expectations unless
dma_set_coherent_mask() was called explicitly with a different mask.
Implement coherent by creating a second tag, and storing the tags in the
objects and use the tag from the object wherever possible.
This currently does not update the scatterlist or pool (both could be
converted but S/G cannot be MFCed as easily).
There is a 2nd change embedded in the updated logic of
linux_dma_alloc_coherent() to always zero the allocation as
otherwise some drivers get cranky on uninialised garbage.
LinuxKPI: dma-mapping.h unify "mask" and "dma_mask"
In some places we are using "mask" and others "dma_mask" for the
same thing. Harmonize the various places to "dma_mask" as used in
linux_pci.c. For the declaration remove the argument names to
avoid the entire problem.
This is in preparation for an upcoming change.
No functional changes intended.
As reported by multiple people testing iwlwifi, device_release_driver()
can lead to a panic on secondary errors (usually during attach).
Disable device_release_driver() for the short-term to prevent the panic
but leave it in place so it can be re-worked and fixed properly for
the long-term more easily.
net80211: add func/line information to IEEE80211_DISCARD* macros
While debugging is very good in net80211, some log messages are
repeated in multiple places 1:1. In order to distinguish where the
discard happened and to speed up analysis, add __func__:__LINE__
information to all these messages.
Add fsleep() function now required by rtw88. This seems to be
making a decision depending on time to sleep on how to sleep.
Given our compat framework already is lenient on how long to sleep,
this is a cut down version.
LinuxKPI: add module_pci_driver() and pci_alloc_irq_vectors()
Add the two new functions needed by rtw88 to register the driver and
handle the module bits as well as a version of pci_alloc_irq_vectors()
for what is needed.
Mark Johnston [Wed, 3 Nov 2021 16:28:48 +0000 (12:28 -0400)]
kasan: Disable validation of function parameters passed by value
It appears that the emitted code in the caller does not update shadow
state for values passed on the stack to the callee, which it seemingly
ought to do after pushing values on the stack and prior to the call
itself. This leaves open a window where an interrupt handler can cause
regions of the stack containing these values to be poisoned, resulting
in rare false positive reports. This happens particularly in the amd64
TLB invalidation code, where we liberally pass cpuset_t's around by
value.
LLVM has a flag to disable validation of accesses of function parameters
passed by value. Such validation is itself a relatively new feature.
Turn it off for now.
Reported by: pho, syzkaller
Sponsored by: The FreeBSD Foundation
Rick Macklem [Wed, 3 Nov 2021 21:25:44 +0000 (14:25 -0700)]
nfscl: Fix forced dismount from looping on commit
When a forced dismount is in progress, it is possible to
end up looping, retrying commits that fail.
This patch fixes the problem by pretending
that commits succeeded when a forced dismount is in prgress.
Rick Macklem [Wed, 3 Nov 2021 19:15:40 +0000 (12:15 -0700)]
nfscl: Fix use after free for forced dismount
When a forced dismount is done and delegations are being
issued by the server (disabled by default for FreeBSD
servers), the delegation structure is free'd before the
loop calling vflush(). This could result in a use after
free of the delegation structure.
This patch changes the code so that the delegation
structures are not free'd until after the vflush()
loop for forced dismounts.
Found during a recent IETF NFSv4 working group testing event.
Rick Macklem [Wed, 3 Nov 2021 00:28:13 +0000 (17:28 -0700)]
nfscl: Check for a forced dismount in nfscl_getref()
The nfscl_getref() function is called within nfscl_doiods() when
the NFSv4.1/4.2 pNFS client is doing I/O on a DS. As such,
nfscl_getref() needs to check for a forced dismount.
This patch adds that check.
Found during a recent IETF NFSv4 working group testing event.
Rick Macklem [Tue, 2 Nov 2021 00:21:31 +0000 (17:21 -0700)]
nfscl: Use a smaller initial delay time for NFSERR_DELAY
For NFS RPCs that receive a NFSERR_DELAY reply, the delay time
is initially 1sec and then increases exponentially to NFS_TRYLATERDEL.
It was found that this delay time is excessive for some NFSv4
servers, which work well with a 1msec delay.
A 1sec delay resulted in very slow performance for Remove and
Rename when delegations and pNFS were enabled.
This patch decreases the initial delay time to 1msec.
Found during a recent IETF NFSv4 working group testing event.
Mark Johnston [Sat, 16 Oct 2021 17:13:26 +0000 (13:13 -0400)]
bhyve: Fix the WITH_BHYVE_SNAPSHOT build
Note, this breaks compatibility with snapshots generated by older builds
of bhyve(8).
Fixes: 7fa233534736 ("bhyve: Map the MSI-X table unconditionally for passthrough")
Reported by: Greg V <greg@unrelenting.technology>
Reviewed by: grehan, bz
Sponsored by: The FreeBSD Foundation
Mark Johnston [Sat, 9 Oct 2021 15:36:19 +0000 (11:36 -0400)]
bhyve: Map the MSI-X table unconditionally for passthrough
It is possible for the PBA to reside in the same page as the MSI-X
table. And, while devices are not supposed to do this, at least some
Intel wifi devices place registers in a page shared with the MSI-X
table. To handle the first case we currently map the PBA page using
/dev/mem, and the second case is not handled.
Kill two birds with one stone: map the MSI-X table BAR using the
PCIOCBARMMAP ioctl instead of /dev/mem, and map the entire table so that
accesses beyond the bounds of the table can be emulated. Regions of the
BAR not containing the table are left unmapped.
PR: 251046
Reviewed by: bz, grehan, jhb
Sponsored by: The FreeBSD Foundation
Mark Johnston [Tue, 9 Nov 2021 18:07:57 +0000 (13:07 -0500)]
pci: Implement pci_bar_enabled() for SR-IOV VFs
In a VF's configuration space, "memory space enable" is hard-wired to 0,
so the existing implementation always returns false. We need to read
the SR-IOV control register from the PF device to get the value of the
MSE bit.
Fix pci_bar_enabled() to read this register instead for VFs. I don't
see any way to access the PF's config space without a backpointer in the
pci device ivars, so I added one.
This fixes a regression where bhyve(8) fails to map the MSI-X table
after commit 7fa233534736 ("bhyve: Map the MSI-X table unconditionally
for passthrough") when a VF is passed through, since with that commit we
use PCIOCBARMMAP to map the table and that ioctl always fails for VFs
without this change. As a bonus, pciconf(8) now correctly reports the
enablement of BARs for VFs.
Reported and tested by: Raúl Muñoz <raul.munoz@custos.es>
Reviewed by: rstone, jhb
Sponsored by: The FreeBSD Foundation
Mitchell Horne [Mon, 8 Nov 2021 19:33:25 +0000 (15:33 -0400)]
hwpmc: initialize arm64 counter/interrupt state
Performance counters and overflow interrupts are assumed to be disabled
by default, but this is not guaranteed. Ensure we disable both during
per-cpu initialization, before enabling the PMU. Otherwise, some systems
(such as the Ampere eMAG) would experience an interrupt storm upon
loading the hwpmc module.
Reviewed by: br
MFC after: 5 days
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D32854
Mark Johnston [Mon, 1 Nov 2021 13:27:52 +0000 (09:27 -0400)]
uma: Fix handling of reserves in zone_import()
Kegs with no items reserved have uk_reserve = 0. So the check
keg->uk_reserve >= dom->ud_free_items will be true once all slabs are
depleted. Then, rather than go and allocate a fresh slab, we return to
the cache layer.
The intent was to do this only when the keg actually has a reserve, so
modify the check to verify this first. Another approach would be to
make uk_reserve signed and set it to -1 until uma_zone_reserve() is
called, but this requires a few casts elsewhere.
Fixes: 1b2dcc8c54a8 ("uma: Avoid depleting keg reserves when filling a bucket")
Sponsored by: The FreeBSD Foundation
Mark Johnston [Mon, 1 Nov 2021 13:27:35 +0000 (09:27 -0400)]
uma: Improve M_USE_RESERVE handling in keg_fetch_slab()
M_USE_RESERVE is used in a couple of places in the VM to avoid unbounded
recursion when the direct map is not available, as is the case on 32-bit
platforms or when certain kernel sanitizers (KASAN and KMSAN) are
enabled. For example, to allocate KVA, the kernel might allocate a
kernel map entry, which might require a new slab, which requires KVA.
For these zones, we use uma_prealloc() to populate a reserve of items,
and then in certain serialized contexts M_USE_RESERVE can be used to
guarantee a successful allocation. uma_prealloc() allocates the
requested number of items, distributing them evenly among NUMA domains.
Thus, in a first-touch zone, to satisfy an M_USE_RESERVE allocation we
might have to check the slab lists of other domains than the current one
to provide the semantics expected by consumers.
So, try harder to find an item if M_USE_RESERVE is specified and the keg
doesn't have anything for current (first-touch) domain. Specifically,
fall back to a round-robin slab allocation. This change fixes boot-time
panics on NUMA systems with KASAN or KMSAN enabled.[1]
Alternately we could have uma_prealloc() allocate the requested number
of items for each domain, but for some existing consumers this would be
quite wasteful. In general I think keg_fetch_slab() should try harder
to find free slabs in other domains before trying to allocate fresh
ones, but let's limit this to M_USE_RESERVE for now.
Also fix a separate problem that I noticed: in a non-round-robin slab
allocation with M_WAITOK, rather than sleeping after a failed slab
allocation we simply try again. Call vm_wait_domain() before retrying.
Reported by: mjg, tuexen [1]
Reviewed by: alc
Sponsored by: The FreeBSD Foundation
Rick Macklem [Sat, 30 Oct 2021 03:35:02 +0000 (20:35 -0700)]
nfscl: Use NFSMNTP_DELEGISSUED in two more functions
Commit 5e5ca4c8fc53 added a NFSMNTP_DELEGISSUED flag to indicate when
a delegation has been issued to the mount. For the common case
where an NFSv4 server is not issuing delegations, this flag
can be checked to avoid acquisition of the NFSCLSTATEMUTEX.
This patch adds checks for NFSMNTP_DELEGISSUED being set
to two more functions.
This change appears to be performance neutral for a small number
of opens, but should reduce lock contention for a large number of opens
for the common case where server is not issuing delegations.
Rick Macklem [Sat, 30 Oct 2021 23:35:02 +0000 (16:35 -0700)]
nfscl: Fix race between Lookup and Setattr of size
PR#259071 provides a test program that fails for the NFS client.
Testing with it, there appears to be a race between Lookup
and VOPs like Setattr-of-size, where Lookup ends up loading
stale attributes (including what might be the wrong file size)
into the NFS vnode's attribute cache.
The race occurs when the modifying VOP (which holds a lock
on the vnode), blocks the acquisition of the vnode in Lookup,
after the RPC (with now potentially stale attributes).
Here's what seems to happen:
Child Parent
does stat(), which does
VOP_LOOKUP(), doing the Lookup
RPC with the directory vnode
locked, acquiring file attributes
valid at this point in time
blocks waiting for locked file does ftruncate(), which
vnode does VOP_SETATTR() of Size,
changing the file's size
while holding an exclusive
lock on the file's vnode
releases the vnode lock
acquires file vnode and fills in
now stale attributes including
the old wrong Size
does a read() which returns
wrong data size
This patch fixes the problem by saving a timestamp in the NFS vnode
in the VOPs that modify the file (Setattr-of-size, Allocate).
Then lookup/readdirplus compares that timestamp with the time just
before starting the RPC after it has acquired the file's vnode.
If the modifying RPC occurred during the Lookup, the attributes
in the RPC reply are discarded, since they might be stale.
With this patch the test program works as expected.
Note that the test program does not fail on a July stable/12,
although this race is in the NFS client code. I suspect a
fairly recent change to the name caching code exposed this
bug.
Rick Macklem [Sun, 31 Oct 2021 00:08:28 +0000 (17:08 -0700)]
nfscl: Add setting n_localmodtime to the Write RPC code
Similar to commit 2be417843a, I believe there could be a race between
the NFS client VOP_LOOKUP() and file Writing that could result in stale
file attributes being loaded into the NFS vnode by VOP_LOOKUP().
I have not been able to reproduce a failure due to this race, but
I believe that there are two possibilities:
The Lookup RPC happens while VOP_WRITE() is being executed and loads
stale file attributes after VOP_WRITE() returns when it has already
completed the Write/Commit RPC(s).
--> For this case, setting the local modify timestamp at the end of
VOP_WRITE() should ensure that stale file attributes are not loaded.
The Lookup RPC occurs after VOP_WRITE() has returned, while
asynchronous Write/Commit RPCs are in progress and then is
blocked by the vnode held by VOP_OPEN/VOP_CLOSE/VOP_FSYNC which
will flush writes via ncl_flush() or ncl_vinvalbuf(), clearing the
NMODIFIED flag (which indicates Writes-in-progress). The VOP_LOOKUP()
then acquires the NFS vnode lock and fills in stale file attributes.
--> Setting the local modify timestamp in ncl_flsuh() and ncl_vinvalbuf()
when they clear NMODIFIED should ensure that stale file attributes
are not loaded.
Rick Macklem [Sun, 31 Oct 2021 23:31:31 +0000 (16:31 -0700)]
nfscl: Do pNFS layout return_on_close synchronously
For pNFS servers that specify that Layouts are to be returned
upon close, they may expect that LayoutReturn to happen before
the associated Close.
This patch modifies the NFSv4.1/4.2 pNFS client so that this
is done. This only affects a pNFS mount against a non-FreeBSD
NFSv4.1/4.2 server that specifies return_on_close in LayoutGet
replies.
Found during a recent IETF NFSv4 working group testing event.
pf: ensure we have the correct source/destination IP address in ICMP errors
When we route-to a packet that later turns out to not fit in the
outbound interface MTU we generate an ICMP error.
However, if we've already changed those (i.e. we've passed through a NAT
rule) we have to undo the transformation first.
Andriy Gapon [Thu, 4 Nov 2021 11:56:12 +0000 (13:56 +0200)]
htu21: allow configuration via hints on FDT-based systems
On-board devices should be configured via the FDT and overlays.
Hints are primarily useful for external and temporarily attached devices.
Adding hints is much easier and faster than writing and compiling
an overlay.
Andriy Gapon [Sat, 6 Nov 2021 16:47:32 +0000 (18:47 +0200)]
htu21: don't needlessly bother hardware when measurements are not needed
sysctl(8) first queries a sysctl to get a size of its value even if the
sysctl is of a fixed size, e.g. it has an integer type.
Only after that sysctl(8) queries an actual value of the sysctl.
Previosuly the driver would needlessly read a sensor in the first step.
Peter Grehan [Mon, 1 Nov 2021 13:35:43 +0000 (23:35 +1000)]
igc: Use hardware routine for PHY reset
Summary:
The previously used software reset routine wasn't sufficient
to reset the PHY if the bootloader hadn't left the device in
an initialized state. This was seen with the onboard igc port
on an 11th-gen Intel NUC.
The software reset isn't used in the Linux driver so all related
code has been removed.
Tested on: Netgate 6100 onboard ports, a discrete PCIe I225-LM card,
and an 11th-gen Intel NUC.
ip_divert: calculate delayed checksum for IPv6 adress family
Before passing an IPv6 packet to application apply delayed checksum
calculation. Mbuf flags will be lost when divert listener will return a
packet back, so we will not be able to do delayed checksum calculation
later. Also an application will get a packet with correct checksum.
Felix Johnson [Mon, 8 Nov 2021 06:14:58 +0000 (01:14 -0500)]
find(1): Update date format reference and remove cvs(1) references
cvs(1) is not installed by default. Change the date format reference to
note that find(1) understands ISO8601 and RFC822 date formats. Also
remove references to cvs(1).
Mark Johnston [Wed, 3 Nov 2021 19:09:17 +0000 (15:09 -0400)]
scsi_cd: Improve TOC access validation
1. During CD probing, we read the TOC header to find the number of
entries, then read the TOC itself. The header determines the number
of entries, which determines the amount of data to read from the
device into the softc in the CD_STATE_MEDIA_TOC_FULL state. We
hard-code a limit of 99 tracks (plus one for the lead-out) in the
softc, but were not validating that the size reported by the media
would fit in this hard-coded limit. Kernel memory corruption could
occur if not.[1] Add validation to check this, and refuse to cache
the TOC if it would not fit.
2. The CDIOCPLAYTRACKS ioctl uses caller provided track numbers to index
into the TOC, but we only validate the starting index. Add
validation of the ending index.
Also, raise the hard-coded limit from 100 tracks to 170, per a
suggestion from Ken.
Reported by: C Turt <ecturt@gmail.com> [1]
Reviewed by: ken, avg
Sponsored by: The FreeBSD Foundation
Rick Macklem [Tue, 26 Oct 2021 02:09:14 +0000 (19:09 -0700)]
nfscl: Add a missing delegation lock release
There was a case in nfscl_doiods() where the function would return
without releasing the delegation shared lock, if it was aquired by
the call to nfscl_getstateid(). This patch adds that release.
I have never observed a failure due to this missing release, so I
do not know if it ever happens in practice. However, since the pNFS
client is not yet heavily used, it might be the case.
Found by code inspection during a recent NFSv4 IETF working group
testing event.