Kristof Provost [Wed, 12 May 2021 17:13:40 +0000 (19:13 +0200)]
tests: Only log critical errors from scapy
Since 2.4.5 scapy started issuing warnings about a few different
configurations during our tests. These are harmless, but they generate
stderr output, which upsets atf_check.
Configure scapy to only log critical errors (and thus not warnings) to
fix these tests.
IEEE Std 802.1D-2004 Section 17.14 defines permitted ranges for timers.
Incoming BPDU messages should be checked against the permitted ranges.
The rest of 17.14 appears to be enforced already.
sbin/ipfw: Fix parsing error in table based forward
The argument parser does not recognise the optional port for an
"tablearg" argument. Fix simplifies the code by make the internal
representation expicit for the parser. Includes the fix from D30208.
Mark Johnston [Mon, 3 May 2021 16:51:04 +0000 (12:51 -0400)]
Add missing sockaddr length and family validation to various protocols
Several protocol methods take a sockaddr as input. In some cases the
sockaddr lengths were not being validated, or were validated after some
out-of-bounds accesses could occur. Add requisite checking to various
protocol entry points, and convert some existing checks to assertions
where appropriate.
Reported by: syzkaller+KASAN
Reviewed by: tuexen, melifaro
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D29519
Kristof Provost [Tue, 4 May 2021 17:23:15 +0000 (19:23 +0200)]
in6_mcast: Return EADDRINUSE when we've already joined the group
Distinguish between truly invalid requests and those that fail because
we've already joined the group. Both cases fail, but differentiating
them allows userspace to make more informed decisions about what the
error means.
For example. radvd tries to join the all-routers group on every SIGHUP.
This fails, because it's already joined it, but this failure should be
ignored (rather than treated as a sign that the interface's multicast is
broken).
This puts us in line with OpenBSD, NetBSD and Linux.
AIM (adaptive interrupt moderation) was part of BSD11 driver. Upon IFLIB
migration, AIM feature got lost. Re-introducing AIM back into IFLIB
based IXGBE driver.
One caveat is that in BSD11 driver, a queue comprises both Rx and Tx
ring. Starting from BSD12, Rx and Tx have their own queues and rings.
Also, IRQ is now only configured for Rx side. So, when AIM is
re-enabled, we should now consider only Rx stats for configuring EITR
register in contrast to BSD11 where Rx and Tx stats were considered to
manipulate EITR register.
Rick Macklem [Sun, 2 May 2021 23:04:27 +0000 (16:04 -0700)]
copy_file_range(2): improve copying of a large hole to EOF
PR#255523 reported that a file copy for a file with a large hole
to EOF on ZFS ran slowly over NFSv4.2.
The problem was that vn_generic_copy_file_range() would
loop around reading the hole's data and then see it is all
0s. It was coded this way since UFS always allocates a data
block near the end of the file, such that a hole to EOF never exists.
This patch modifies vn_generic_copy_file_range() to check for a
ENXIO returned from VOP_IOCTL(..FIOSEEKDATA..) and handle that
case as a hole to EOF. asomers@ confirms that it works for his
ZFS test case.
Navdeep Parhar [Sat, 1 May 2021 23:53:50 +0000 (16:53 -0700)]
cxgbe(4): Use ifaddr_event_ext instead of ifaddr_event for CLIP management.
The _ext event notification includes the address being added/removed and
that gives the driver an easy way to ignore non-IPv6 addresses. Remove
'tom' from the handler's name while here, it was moved out of t4_tom a
long time ago.
cxgbe(4): Do not panic when tx is called with invalid checksum requests.
There is no need to panic in if_transmit if the checksums requested are
inconsistent with the frame being transmitted. This typically indicates
that the kernel and driver were built with different INET/INET6 options,
or there is some other kernel bug. The driver should just throw away
the requests that it doesn't understand and move on.
cxgbe(4): Add flag to reliably stop the driver from accessing hw stats.
There are two kinds of routines in the driver that read statistics from
the hardware: the cxgbe_* variants read the per-port MPS/MAC registers
and the vi_* variants read the per-VI registers. They can be called
from the 1Hz callout or if_get_counter. All stats collection now takes
place under the callout lock and there is a new flag to indicate that
these routines should not access any hardware register.
Navdeep Parhar [Tue, 30 Mar 2021 04:35:05 +0000 (21:35 -0700)]
cxgbe/t4_tom: restore socket's protosw before entering TIME_WAIT.
This fixes a panic due to stale so->so_proto if t4_tom is unloaded and
one or more connections that were previously offloaded are still around
in TIME_WAIT state.
Navdeep Parhar [Wed, 24 Mar 2021 01:01:01 +0000 (18:01 -0700)]
cxgbe(4): Allow a T6 adapter to switch between TOE and NIC TLS mode.
The hw.cxgbe.kern_tls tunable was used for this in the past and if it
was set then all T6 adapters would be configured for NIC TLS operation
and could not be reconfigured for TOE without a reload. With this
change ifconfig can be used to manipulate toe and txtls caps like any
other caps. hw.cxgbe.kern_tls continues to work as usual but its
effects are not permanent any more.
* Enable nic_ktls_ofld in the default configuration file and use the
firmware instead of direct register manipulation to apply/rollback
NIC TLS configuration. This allows the driver to switch the hardware
between TOE and NIC TLS mode in a safe manner. Note that the
configuration is adapter-wide and not per-port.
* Remove the kern_tls config file as it works with 100G T6 cards only
and leads to firmware crashes with 25G cards. The configurations
included with the driver (with the exception of the FPGA configs) are
supposed to work with all adapters.
Navdeep Parhar [Fri, 19 Mar 2021 20:28:11 +0000 (13:28 -0700)]
cxgbe(4): create a separate helper routine to write the global RSS key.
While here, make sure only the PF driver attempts to program the global
RSS key (with options RSS). The VF driver doesn't have access to those
device registers.
Navdeep Parhar [Fri, 19 Mar 2021 19:30:57 +0000 (12:30 -0700)]
cxgbe(4): make it safe to call setup_memwin repeatedly.
A repeat call will recreate the memory windows in the hardware and move
them to their last-known positions without repeating any of the software
initialization.
Navdeep Parhar [Fri, 5 Mar 2021 19:28:18 +0000 (11:28 -0800)]
cxgbe(4): Fix an assertion that is not valid during attach.
Firmware access from t4_attach takes place without any synchronization.
The driver should not panic (debug kernels) if something goes wrong in
early communication with the firmware. It should still load so that
it's possible to poke around with cxgbetool.
Navdeep Parhar [Fri, 19 Feb 2021 22:18:08 +0000 (14:18 -0800)]
cxgbe(4): Use the correct filter width for T5+.
T5 and above have extra bits for the optional filter fields. This is a
correctness issue and not just a waste because a filter mode valid on a
T4 (36b) may not be valid on a T5+ (40b).
Navdeep Parhar [Fri, 19 Feb 2021 21:47:18 +0000 (13:47 -0800)]
cxgbe(4): Add a driver ioctl to set the filter mask.
Allow the filter mask (aka the hashfilter mode when hashfilters are
in use) to be set any time it is safe to do so. The requested mask
must be a subset of the filter mode already. The driver will not change
the mode or ingress config just to support a new mask.
Navdeep Parhar [Fri, 19 Feb 2021 21:05:19 +0000 (13:05 -0800)]
cxgbe(4): Use firmware commands to get/set filter configuration.
1. Query the firmware for filter mode, mask, and related ingress config
instead of trying to figure them out from hardware registers. Read
configuration from the registers only when the firmware does not
support this query.
2. Use the firmware to set the filter mode. This is the correct way to
do it and is more flexible as well. The filter mode (and associated
ingress config) can now be changed any time it is safe to do so.
The user can specify a subset of a valid mode and the driver will
enable enough bits to make sure that the mode is maxed out -- that
is, it is not possible to set another bit without exceeding the
total width for optional filter fields. This is a hardware
requirement that was not enforced by the driver previously.
Navdeep Parhar [Thu, 18 Feb 2021 09:15:46 +0000 (01:15 -0800)]
cxgbe(4): Break up t4_read_chip_settings.
Read the PF-only hardware settings directly in get_params__post_init.
Split the rest into two routines used by both the PF and VF drivers: one
that reads the SGE rx buffer configuration and another that verifies
miscellaneous hardware configuration.
Alexander Motin [Sun, 2 May 2021 23:35:28 +0000 (19:35 -0400)]
Improve UMA cache reclamation.
When estimating working set size, measure only allocation batches, not free
batches. Allocation and free patterns can be very different. For example,
ZFS on vm_lowmem event can free to UMA few gigabytes of memory in one call,
but it does not mean it will request the same amount back that fast too, in
fact it won't.
Update working set size on every reclamation call, shrinking caches faster
under pressure. Lack of this caused repeating vm_lowmem events squeezing
more and more memory out of real consumers only to make it stuck in UMA
caches. I saw ZFS drop ARC size in half before previous algorithm after
periodic WSS update decided to reclaim UMA caches.
Introduce voluntary reclamation of UMA caches not used for a long time. For
each zdom track longterm minimal cache size watermark, freeing some unused
items every UMA_TIMEOUT after first 15 minutes without cache misses. Freed
memory can get better use by other consumers. For example, ZFS won't grow
its ARC unless it see free memory, since it does not know it is not really
used. And even if memory is not really needed, periodic free during
inactivity periods should reduce its fragmentation.
Reviewed by: markj, jeff (previous version)
MFC after: 2 weeks
Sponsored by: iXsystems, Inc.
Differential Revision: https://reviews.freebsd.org/D29790
Mark Johnston [Tue, 11 May 2021 21:36:12 +0000 (17:36 -0400)]
cryptodev: Fix some input validation bugs
- When we do not have a separate IV, make sure that the IV length
specified by the session is not larger than the payload size.
- Disallow AEAD requests without a separate IV. crp_sanity() asserts
that CRYPTO_F_IV_SEPARATE is set for AEAD requests, and some (but not
all) drivers require it.
- Return EINVAL for AEAD requests if an IV is specified but the
transform does not expect one.
Dmitry Wagin [Tue, 23 Mar 2021 16:01:15 +0000 (12:01 -0400)]
libc: Some enhancements to syslog(3)
- Defined MAXLINE constant (8192 octets by default instead 2048) for
centralized limit setting up. It sets maximum number of characters of
the syslog message. RFC5424 doesn't limit maximum size of the message.
Named after MAXLINE in syslogd(8).
- Fixed size of fmt_cpy buffer up to MAXLINE for rendering formatted
(%m) messages.
- Introduced autoexpansion of sending socket buffer up to MAXLINE.
Dmitry Wagin [Tue, 23 Mar 2021 16:15:28 +0000 (12:15 -0400)]
syslogd: Increase message size limits
Add a -M option to control the maximum length of forwarded messages.
syslogd(8) used to truncate forwarded messages to 1024 bytes, but after
commit 1a874a126a54 ("Add RFC 5424 syslog message output to syslogd.")
applies a more conservative limit of 480 bytes for IPv4 per RFC 5426
section 3.2. Restore the old default behaviour of truncating to 1024
bytes. RFC 5424 specifies no upper limit on the length of forwarded
messages, while for RFC 3164 the limit is 1024 bytes.
Increase MAXLINE to 8192 bytes to correspond to commit 672ef817a192.
Replaced bootfile[] size for MAXPATHLEN used in getbootfile(3) as a
returned value. Using (MAXLINE+1) as a size for bootfile[] is excessive.
This allows us to kill states created from a rule with route-to/reply-to
set. This is particularly useful in multi-wan setups, where one of the
WAN links goes down.
Submitted by: Steven Brown
Obtained from: https://github.com/pfsense/FreeBSD-src/pull/11/
MFC after: 1 week
Sponsored by: Rubicon Communications, LLC ("Netgate")
Differential Revision: https://reviews.freebsd.org/D30058
Glen Barber [Thu, 18 Feb 2021 04:00:03 +0000 (23:00 -0500)]
release: permanently remove the 'reldoc' target and associates
Following 7b1d1a1658ffb69eff93afc713f9e88ed8b20eac, the structure
for the reldoc target has significantly changed as result of the
ASCIIDoctor/Hugo migration. As the release notes related files
on the installation medium are inherently out of date, purge them
entirely.
Discussed within: re, doceng
No objection: re (silence), doceng (silence)
Timeout: 2 weeks
Sponsored by: Rubicon Communications, LLC ("Netgate")
Mark Johnston [Wed, 5 May 2021 21:06:23 +0000 (17:06 -0400)]
igmp: Avoid an out-of-bounds access when zeroing counters
When verifying, byte-by-byte, that the user-supplied counters are
zero-filled, sysctl_igmp_stat() would check for zero before checking the
loop bound. Perform the checks in the correct order.
Reported by: KASAN
Sponsored by: The FreeBSD Foundation
Rick Macklem [Wed, 28 Apr 2021 00:30:16 +0000 (17:30 -0700)]
nfscl: add check for NULL clp and forced dismounts to nfscl_delegreturnvp()
Commit aad780464fad added a function called nfscl_delegreturnvp()
to return delegations during the NFS VOP_RECLAIM().
The function erroneously assumed that nm_clp would
be non-NULL. It will be NULL for NFSV4.0 mounts until
a regular file is opened. It will also be NULL during
vflush() in nfs_unmount() for a forced dismount.
This patch adds a check for clp == NULL to fix this.
Also, since it makes no sense to call nfscl_delegreturnvp()
during a forced dismount, the patch adds a check for that
case and does not do the call during forced dismounts.
pf: Optionally attempt to preserve rule counter values across ruleset updates
Usually rule counters are reset to zero on every update of the ruleset.
With keepcounters set pf will attempt to find matching rules between old
and new rulesets and preserve the rule counters.
pf: Implement the NAT source port selection of MAP-E Customer Edge
MAP-E (RFC 7597) requires special care for selecting source ports
in NAT operation on the Customer Edge because a part of bits of the port
numbers are used by the Border Relay to distinguish another side of the
IPv4-over-IPv6 tunnel.
Alex Richardson [Mon, 19 Apr 2021 23:19:20 +0000 (00:19 +0100)]
libc/string/memset.c: Use unsigned long for stores
While most 64-bit architectures have an assembly implementation of this
file, RISC-V does not. As we now store 8 bytes instead of 4 it should speed
up RISC-V.
Alex Richardson [Mon, 19 Apr 2021 23:15:57 +0000 (00:15 +0100)]
libc/string/bcopy.c: Use intptr_t as the copy type
While most 64-bit architectures have an assembly implementation of this
file RISC-V does not. As we now copy 8 bytes instead of 4 it should speed
up RISC-V. Using intptr_t instead of int also allows using this file for
CHERI pure-capability code since trying to copy pointers using integer
loads/stores will invalidate pointers.
Alex Richardson [Tue, 20 Apr 2021 00:46:36 +0000 (01:46 +0100)]
bsd.compiler.mk: detect Apple Clang for cross-builds
Apple clang uses a different versioning scheme, so if we enable or
disable certain warnings for Clang 11+, those might not be supported
in Apple Clang 11+. This adds 'apple-clang' to COMPILER_FEATURES, so that
bootstrap tools Makefiles can avoid warnings on macOS.
Alex Richardson [Thu, 25 Mar 2021 11:12:17 +0000 (11:12 +0000)]
truss: improved support for decoding compat32 arguments
Currently running `truss -a -e` does not decode any
argument values for freebsd32_* syscalls (open/readlink/etc.)
This change checks whether a syscall starts with freebsd{32,64}_ and if
so strips that prefix when looking up the syscall information. To ensure
that the truss logs include the real syscall name we create a copy of
the syscall information struct with the updated.
The other problem is that when reading string array values, truss
naively iterates over an array of char* and fetches the pointer value.
This will result in arguments not being loaded if the pointer is not
aligned to sizeof(void*), which can happens in the compat32 case. If it
happens to be aligned, we would end up printing every other value.
To fix this problem, this changes adds a pointer_size member to the
procabi struct and uses that to correctly read indirect arguments
as 64/32 bit addresses in the the compat32 case (and also compat64 on
CheriBSD).
The motivating use-case for this change is using truss for 64-bit
programs on a CHERI system, but most of the diff also applies to 32-bit
compat on a 64-bit system, so I'm upstreaming this instead of keeping it
as a local CheriBSD patch.
Alex Richardson [Thu, 4 Mar 2021 18:28:25 +0000 (18:28 +0000)]
truss: split counting of syscalls and syscall calling convention
This change is a refactoring cleanup to improve support for compat32
syscalls (and compat64 on CHERI systems). Each process ABI now has it's
own struct sycall instead of using one global list. The list of all
syscalls is replaced with a list of seen syscalls. Looking up the syscall
argument passing convention now interates over the fixed-size array instead
of using a link-list that's populated on startup so we no longer need the
init_syscall() function.
The actual functional changes are in D27625.
Rick Macklem [Tue, 27 Apr 2021 00:48:21 +0000 (17:48 -0700)]
nfscl: fix the handling of NFSERR_DELAY for Open/LayoutGet RPCs
For a pNFS mount, the NFSv4.1/4.2 client uses compound RPCs that
have both Open and LayoutGet operations in them.
If the pNFS server were tp reply NFSERR_DELAY for one of these
compounds, the retry after a delay cannot be handled by
newnfs_request(), since there is a reference held on the open
state for the Open operation in them.
Fix this by adding these RPCs to the "don't do delay here"
list in newnfs_request().
This patch is only needed if the mount is using pNFS (the "pnfs"
mount option) and probably only matters if the MDS server
is issuing delegations as well as pNFS layouts.
Rick Macklem [Tue, 27 Apr 2021 22:32:35 +0000 (15:32 -0700)]
nfsd: fix a NFSv4.1 Linux client mount stuck in CLOSE_WAIT
It was reported that a NFSv4.1 Linux client mount against
a FreeBSD12 server was hung, with the TCP connection in
CLOSE_WAIT state on the server.
When a NFSv4.1/4.2 mount is done and the back channel is
bound to the TCP connection, the soclose() is delayed until
a new TCP connection is bound to the back channel, due to
a reference count being held on the SVCXPRT structure in
the krpc for the socket. Without the soclose() call, the socket
will remain in CLOSE_WAIT and this somehow caused the Linux
client to hang.
This patch adds calls to soshutdown(.., SHUT_WR) that
are performed when the server side krpc sees that the
socket is no longer usable. Since this can be done
before the back channel is bound to a new TCP connection,
it allows the TCP connection to proceed to CLOSED state.
Mark Johnston [Tue, 4 May 2021 12:53:57 +0000 (08:53 -0400)]
nfsclient: Copy only initialized fields in nfs_getattr()
When loading attributes from the cache, the NFS client is careful to
copy only the fields that it initialized. After fetching attributes
from the server, however, it would copy the entire vattr structure
initialized from the RPC response, so uninitialized stack bytes would
end up being copied to userspace. In particular, va_birthtime (v2 and
v3) and va_gen (v3) had this problem.
Use a common subroutine to copy fields provided by the NFS client, and
ensure that we provide a dummy va_gen for the v3 case.
Reviewed by: rmacklem
Reported by: KMSAN
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D30090
Allow up to 5 labels to be set on each rule.
This offers more flexibility in using labels. For example, it replaces
the customer 'schedule' keyword used by pfSense to terminate states
according to a schedule.
Add a test case where the pfctl optimizer will generate a table
automatically. These tables have long names, which we accidentally broke
in the nvlist ADDRULE ioctl.
When parsing the nvlist for a struct pf_addr_wrap we unconditionally
tried to parse "ifname". This broke for PF_ADDR_TABLE when the table
name was longer than IFNAMSIZ. PF_TABLE_NAME_SIZE is longer than
IFNAMSIZ, so this is a valid configuration.
Only parse (or return) ifname or tblname for the corresponding
pf_addr_wrap type.
This manifested as a failure to set rules such as these, where the pfctl
optimiser generated an automatic table:
pass in proto tcp to 192.168.0.1 port ssh
pass in proto tcp to 192.168.0.2 port ssh
pass in proto tcp to 192.168.0.3 port ssh
pass in proto tcp to 192.168.0.4 port ssh
pass in proto tcp to 192.168.0.5 port ssh
pass in proto tcp to 192.168.0.6 port ssh
pass in proto tcp to 192.168.0.7 port ssh
Rick Macklem [Mon, 26 Apr 2021 23:24:10 +0000 (16:24 -0700)]
nfsd: fix the slot sequence# when a callback fails
Commit 4281bfec3628 patched the server so that the
callback session slot would be free'd for reuse when
a callback attempt fails.
However, this can often result in the sequence# for
the session slot to be advanced such that the client
end will reply NFSERR_SEQMISORDERED.
To avoid the NFSERR_SEQMISORDERED client reply,
this patch negates the sequence# advance for the
case where the callback has failed.
The common case is a failed back channel, where
the callback cannot be sent to the client, and
not advancing the sequence# is correct for this
case. For the uncommon case where the client's
reply to the callback is lost, not advancing the
sequence# will indicate to the client that the
next callback is a retry and not a new callback.
But, since the FreeBSD server always sets "csa_cachethis"
false in the callback sequence operation, a retry
and a new callback should be handled the same way
by the client, so this should not matter.
Until you have this patch in your NFSv4.1/4.2 server,
you should consider avoiding the use of delegations.
Even with this patch, interoperation with the
Linux NFSv4.1/4.2 client in kernel versions prior
to 5.3 can result in frequent 15second delays if
delegations are enabled. This occurs because, for
kernels prior to 5.3, the Linux client does a TCP
reconnect every time it sees multiple concurrent
callbacks and then it takes 15seconds to recover
the back channel after doing so.
Rick Macklem [Fri, 23 Apr 2021 22:24:47 +0000 (15:24 -0700)]
nfsd: fix session slot handling for failed callbacks
When the NFSv4.1/4.2 server does a callback to a client
on the back channel, it will use a session slot in the
back channel session. If the back channel has failed,
the callback will fail and, without this patch, the
session slot will not be released.
As more callbacks are attempted, all session slots
can become busy and then the nfsd thread gets stuck
waiting for a back channel session slot.
This patch frees the session slot upon callback
failure to avoid this problem.
Without this patch, the problem can be avoided by leaving
delegations disabled in the NFS server.