Randall Stewart [Fri, 4 Jun 2021 09:26:43 +0000 (05:26 -0400)]
tcp: A better fix for the previously attempted fix of the ack-war issue with tcp.
So it turns out that my previous fix was not correct. It ended with us failing
some of the "improved" SYN tests, since we were not in the correct states.
With more digging I have figured out that the root of the problem is that when
we receive a SYN|FIN, the reassembly code creates a segq entry
to hold the FIN. In the established state, where we were not in order, this
would be correct, i.e. a 0 len segment with a FIN would need to be accepted. But
if we are in a front state, we need to strip the FIN so that we correctly handle
the ACK but ignore the FIN. This gets us into the proper states
and avoids the previous ack war.
I back out some of the previous changes but then add a new change
here in tcp_reass() that fixes the root cause of the issue. We still
leave the rack panic fixes in place however.
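The idea of stripping the FIN in a front state can be sketched in user-space C as follows. This is only an illustration and not the code in tcp_reass(): the SK_* names, the three-state enum and the helper are hypothetical, and "front state" is assumed here to mean any state before ESTABLISHED.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified stand-ins; not the kernel's definitions. */
#define SK_TH_FIN 0x01
#define SK_TH_ACK 0x10

enum sk_state { SK_SYN_SENT, SK_SYN_RECEIVED, SK_ESTABLISHED };

/*
 * Assumption: a "front state" is any state before ESTABLISHED.
 * In such a state the FIN is dropped so that only the ACK is acted on.
 * Once established, the flags are left alone, since a zero-length
 * out-of-order FIN may legitimately need to be queued.
 */
void
sk_strip_fin_in_front_state(enum sk_state s, uint8_t *flags)
{
	assert(flags != NULL);
	if (s < SK_ESTABLISHED)
		*flags &= (uint8_t)~SK_TH_FIN;
}
```

With this sketch, a FIN|ACK seen in SK_SYN_RECEIVED is reduced to a plain ACK, while the same flags in SK_ESTABLISHED pass through unchanged.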
Randall Stewart [Thu, 27 May 2021 14:50:32 +0000 (10:50 -0400)]
tcp: When we have an out-of-order FIN we do want to strip off the FIN bit.
The last set of commits fixed both a panic (in rack) and an ACK-war (in freebsd and bbr).
However, there was a missing case, i.e. where we get an out-of-order FIN by itself.
In such a case we don't want to leave the FIN bit set, otherwise we will do the
wrong thing and ack the FIN incorrectly. Instead we need to go through the
tcp_reass() code so that the FIN is stripped and all is well.
Randall Stewart [Wed, 26 May 2021 10:43:30 +0000 (06:43 -0400)]
tcp: Add a socket option to rack so we can test various changes to the slop value in timers.
Timer_slop, in TCP, has been 200ms for a long time. This value dates back
to a time when delayed ack timers were longer and links were slower. A
200ms timer slop allows 1 MSS to be sent over a 60kbps link. It's possible that
lowering this value to something more in line with today's delayed ack values (40ms)
might improve TCP. This bit of code makes it so rack can, via a socket option,
adjust the timer slop.
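The "1 MSS over a 60kbps link" figure can be checked with a little arithmetic. The helper below is a hypothetical user-space sketch, assuming 60kbps means 60,000 bits per second and ignoring framing overhead. It computes how many bytes a link can serialize during the slop interval.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Bytes that a link of link_bps bits per second can serialize in
 * slop_ms milliseconds (integer arithmetic, rounded down).
 */
uint64_t
sk_bytes_during_slop(uint64_t link_bps, uint64_t slop_ms)
{
	/* Guard the multiplication below against overflow. */
	assert(link_bps == 0 || slop_ms <= UINT64_MAX / link_bps);
	return (link_bps * slop_ms / 1000 / 8);
}
```

For 60,000 bits/s and 200ms this gives 1500 bytes, roughly one full-sized segment. With a 40ms slop the same link moves only 300 bytes, so the link would have to run at about 300kbps to carry one 1500-byte segment within the slop.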
Randall Stewart [Tue, 25 May 2021 17:23:31 +0000 (13:23 -0400)]
tcp: Fix bugs related to the PUSH bit and rack and an ack war
Michael's testing with UDP tunneling found an issue with the push bit, which was only partly fixed
in the last commit. The problem is that the left edge gets transmitted before the adjustments are made
to the send_map; this means that right edge bits must be considered to be added only if
the entire RSM is being retransmitted.
Syzkaller also continued to find a crash, for which Michael sent me the reproducer. It turns
out that the reproducer made the default (freebsd) stack get into an ack-war with itself.
After fixing the reference issues in rack, the same ack-war was found in rack (and bbr). Basically,
what happens is that we go into the reassembly code and lose the FIN bit. The trick here is that we
should not be going into the reassembly code if tlen == 0, i.e. the peer never sent us anything.
That then gets the proper action on the FIN bit, but then we end up in LAST_ACK with no
timers running. This is because the usrclosed function gets called and the FINs and such have
already been exchanged. So when we should be entering FIN_WAIT2 (or even FIN_WAIT1) we get
stuck in LAST_ACK. Fixing this means tweaking the usrclosed function so that we properly
recognize the condition and drop into FIN_WAIT2, where a timer will allow at least TP_MAXIDLE
before closing (to allow time for the peer to retransmit its FIN if the ack is lost). Setting the fast_finwait2
timer can speed this up in testing.
Randall Stewart [Mon, 24 May 2021 18:42:15 +0000 (14:42 -0400)]
tcp: Fix an issue with the PUSH bit as well as fill in the missing mtu change for fsb's
The push bit itself was also not actually being properly moved to
the right edge. The FIN bit was incorrectly on the left edge. We
fix these two issues as well as plumb in the mtu_change for
alternate stacks.
Michael Tuexen [Sat, 22 May 2021 12:35:09 +0000 (14:35 +0200)]
tcp: Handle stack switch while processing socket options
Handle the case where during socket option processing, the user
switches a stack such that processing the stack specific socket
option does not make sense anymore. Return an error in this case.
Michael Tuexen [Fri, 21 May 2021 07:45:00 +0000 (09:45 +0200)]
tcp: Fix sending of TCP segments with IP level options
When bringing in TCP over UDP support in
https://cgit.FreeBSD.org/src/commit/?id=9e644c23000c2f5028b235f6263d17ffb24d3605,
the length of IP level options was considered when locating the
transport header. This was incorrect and is fixed by this patch.
Randall Stewart [Thu, 13 May 2021 11:36:04 +0000 (07:36 -0400)]
tcp: Incorrect KASSERT causes a panic in rack
Syzkaller found an interesting panic in rack. When a SYN and FIN are
both sent together, a KASSERT gets tripped where it is validating that
an mbuf pointer is in the sendmap. But a SYN and FIN often will not
have an mbuf pointer. So the fix is twofold: a) make sure that the
SYN and FIN split the right way when cloning an RSM (SYN on the left
edge and FIN on the right), and b) make sure the KASSERT properly
accounts for the case where we have a SYN or FIN so we don't
panic.
Reviewed by: mtuexen
Sponsored by: Netflix Inc.
Differential Revision: https://reviews.freebsd.org/D30241
Randall Stewart [Tue, 11 May 2021 12:15:05 +0000 (08:15 -0400)]
tcp: In rack, we must only convert restored rtt when the hostcache does restore them.
After the previous commit, rack is very careful to translate any
value in the hostcache for srtt/rttvar into its proper format. However,
there is a snafu here: the HC will actually restore the srtt only
when tp->srtt is 0. We therefore need to convert
the srtt only when it has actually been restored. We do this by making
sure it was zero before the call to cc_conn_init and is non-zero
afterwards.
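The "zero before, non-zero afterwards" test can be sketched as below. Everything here is a hypothetical stand-in: struct sk_tcpcb, sk_conn_init() (playing the role of cc_conn_init) and the unit conversion (assumed to be milliseconds in the cache and microseconds in the stack) are illustrative only and do not reflect the kernel's actual formats.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical control block: srtt stays 0 until a value is installed. */
struct sk_tcpcb {
	uint32_t srtt;
};

/* Stand-in for cc_conn_init(): restores from the cache only if srtt is 0. */
void
sk_conn_init(struct sk_tcpcb *tp, uint32_t cached_srtt)
{
	if (tp->srtt == 0 && cached_srtt != 0)
		tp->srtt = cached_srtt;
}

/* Placeholder conversion; the units are an assumption of this sketch. */
uint32_t
sk_to_stack_units(uint32_t cache_ms)
{
	return (cache_ms * 1000);
}

/* Returns 1 if a restored value was converted, 0 if nothing was restored. */
int
sk_init_and_convert(struct sk_tcpcb *tp, uint32_t cached_srtt)
{
	int was_zero;

	assert(tp != NULL);
	was_zero = (tp->srtt == 0);
	sk_conn_init(tp, cached_srtt);
	if (was_zero && tp->srtt != 0) {
		/* The value came from the cache, so it needs converting. */
		tp->srtt = sk_to_stack_units(tp->srtt);
		return (1);
	}
	return (0);
}
```

A control block that already holds a measured srtt is left untouched, which avoids converting a value that is already in the stack's own format.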
Randall Stewart [Mon, 10 May 2021 15:25:51 +0000 (11:25 -0400)]
tcp: Host cache and rack ending up with incorrect values.
The hostcache up to now has been updated in the discard callback,
but without checking whether we are all done (the race where there is
more than one call and the counter has not yet reached zero). This
means that when the race occurs, we end up calling hc_update
more than once. Also, alternate stacks can keep their srtt/rttvar
in different formats (for example, rack keeps its values in microseconds).
Since we call hc_update *before* the stack fini(), the
values will be in the wrong format.
Rack, on the other hand, needs to convert items pulled from the
hostcache into its internal format, or else it may end up with
very incorrect values from the hostcache. In the process,
let's commonize the update mechanism for srtt/rttvar since we
now have more than one place that needs to call it.
Randall Stewart [Fri, 7 May 2021 21:32:32 +0000 (17:32 -0400)]
This takes Warner's suggested approach so that platforms that,
for whatever reason, cannot include the RATELIMIT option
can still work with rack. It adds two dummy functions that rack will
call to find out that the highest hw supported b/w is 0 (which
kinda makes sense and which rack is already prepared to handle).
Reviewed by: Michael Tuexen, Warner Losh
Sponsored by: Netflix Inc
Differential Revision: https://reviews.freebsd.org/D30163
Randall Stewart [Fri, 7 May 2021 18:06:43 +0000 (14:06 -0400)]
Fix a UDP tunneling issue with rack. Basically there are two
issues.
A) Not enough hdrlen was being calculated when a UDP tunnel is
in place.
and
B) Not enough memory is allocated in rack's fsb. We need to
overbook the fsb to include a udphdr just in case.
Submitted by: Peter Lei
Reviewed by: Michael Tuexen
Sponsored by: Netflix Inc
Differential Revision: https://reviews.freebsd.org/D30157
Randall Stewart [Thu, 6 May 2021 15:22:26 +0000 (11:22 -0400)]
This brings FreeBSD into sync with the Netflix versions of rack and bbr.
This fixes several breakages (panics) that have been reported since the
tcp_lro code was committed. Quite a few new features are
now in rack (perfecting of DGP -- Dynamic Goodput Pacing among the
largest). There is also support for ack-war prevention. Documents
coming soon on rack.
Martin Matuska [Tue, 8 Jun 2021 15:01:18 +0000 (17:01 +0200)]
zfs: merge openzfs/zfs@7d9f3ef0e (zfs-2.1-release) into stable/13
Notable upstream pull request merges:
#11710 Allow zfs to send replication streams with missing snapshots
#11786 Ratelimit deadman zevents as with delay zevents
#11813 Allow pool names that look like Solaris disk names
#11822 Atomically check and set dropped zevent count
#11822 Don't scale zfs_zevent_len_max by CPU count
#11837 zfs get -p only outputs 3 columns if "clones" property is empty
#11849 Use dsl_scan_setup_check() to setup a scrub
#11861 Improvements to the 'compatibility' property
#11862 cmd/zfs receive: allow dry-run (-n) to check property args
#11864 receive: don't fail inheriting (-x) properties on wrong dataset type
#11877 Combine zio caches if possible
#11881 FreeBSD: use vnlru_free_vfsops if available
#11883 FreeBSD: add support for lockless symlink lookup
#11884 FreeBSD: add missing seqc write begin/end around zfs_acl_chown_setattr
#11896 Fix crash in zio_done error reporting
#11905 zfs-send(8): Restore sorting of flags
#11926 FreeBSD: damage control racing .. lookups in face of mkdir/rmdir
#11938 Fix AVX512BW Fletcher code on AVX512-but-not-BW machines
#11966 Scale worker threads and taskqs with number of CPUs
#11997 FreeBSD: Don't force xattr mount option
#11997 FreeBSD: Use SET_ERROR to trace xattr name errors
#11998 Simplify/fix dnode_move() for dn_zfetch
#12003 FreeBSD: Initialize/destroy zp->z_lock
#12010 Fix dRAID self-healing short columns
#12033 Revert "Fix raw sends on encrypted datasets when copying back snapshots"
#12040 Reinstate the old zpool read label logic as a fallback
#12049 FreeBSD: avoid memory allocation in arc_prune_async
#12061 Fix dRAID sequential resilver silent damage handling
#12077 FreeBSD: Retry OCF ENOMEM errors.
#12088 Propagate vdev state due to invalid label corruption
#12097 FreeBSD: Update dataset_kstats for zvols in dev mode
Mark Johnston [Tue, 1 Jun 2021 23:38:22 +0000 (19:38 -0400)]
amd64: Clear the local TSS when creating a new thread
Otherwise it is copied from the creating thread. Then, if either thread
exits, the other is left with a dangling pointer, typically resulting in
a page fault upon the next context switch.
Reported by: syzkaller
Reviewed by: kib
Sponsored by: The FreeBSD Foundation
Markus Stoff [Tue, 18 May 2021 20:35:33 +0000 (22:35 +0200)]
ng_parse: IP address parsing in netgraph eating too many characters
Once the final component of the IP address has been parsed, the offset
on the input must not be advanced, as this would remove an unparsed
character from the input.
Submitted by: Markus Stoff
Reviewed by: donner
Differential Revision: https://reviews.freebsd.org/D26489
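The offset rule can be illustrated with a small stand-alone parser. This is a hypothetical sketch, not the ng_parse code: sk_parse_ipv4() and its interface are made up for illustration. The point is that a separator is consumed only between components, so after the fourth one the offset rests on the first unparsed character.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Parse "a.b.c.d" starting at s[*off].  On success store the four
 * octets, leave *off at the first unparsed character and return 0.
 * Return -1 on malformed input, in which case *off is unchanged.
 */
int
sk_parse_ipv4(const char *s, size_t *off, uint8_t out[4])
{
	size_t p;
	int i;

	assert(s != NULL && off != NULL && out != NULL);
	p = *off;
	for (i = 0; i < 4; i++) {
		unsigned int v = 0;
		int digits = 0;

		while (digits < 3 && isdigit((unsigned char)s[p])) {
			v = v * 10 + (unsigned int)(s[p] - '0');
			p++;
			digits++;
		}
		if (digits == 0 || v > 255)
			return (-1);
		out[i] = (uint8_t)v;
		if (i < 3) {
			/* A '.' is consumed only between components. */
			if (s[p] != '.')
				return (-1);
			p++;
		}
		/* After the final component p is deliberately not advanced. */
	}
	*off = p;
	return (0);
}
```

Parsing "10.0.0.1,next" from offset 0 leaves the offset at 8, pointing at the comma, so whatever parses the rest of the input still sees it.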
Randall Stewart [Tue, 26 Jan 2021 16:54:42 +0000 (11:54 -0500)]
This pulls over all the changes that are in the Netflix
tree that fix the ratelimit code. There were several bugs
in tcp_ratelimit itself and we needed further work to support
the multiple tag format coming for the joint TLS and Ratelimit dances.
Sponsored by: Netflix Inc.
Differential Revision: https://reviews.freebsd.org/D28357
Kristof Provost [Thu, 3 Jun 2021 13:22:19 +0000 (15:22 +0200)]
pf tests: Make killstate:match more robust
The killstate:match test starts nc as a background process. There was no
guarantee that the nc process would have connected by the time we check
for states, so this test occasionally failed without good reason.
Teach the test to wait for at least some states to turn up before
executing the critical checks.
MFC after: 3 days
Sponsored by: Rubicon Communications, LLC ("Netgate")
Lutz Donnerhacke [Sat, 15 May 2021 13:24:12 +0000 (15:24 +0200)]
libalias: Remove unused function LibAliasCheckNewLink
The functionality to detect a newly created link after processing a
single packet is decoupled from the packet processing. Every new
packet is processed asynchronously and will reset the indicator, hence
the function is unusable. I searched Google for third-party code
that uses the function and failed to find any.
That's why the function should be removed: it is unusable and unused.
A much simplified API/ABI will remain in anything below 14.
Mark Johnston [Tue, 1 Jun 2021 23:38:09 +0000 (19:38 -0400)]
amd64: Relax the assertion added in commit 4a59cbc12
We only need to ensure that interrupts are disabled when handling a
fault from iret. Otherwise it's possible to trigger the assertion
legitimately, e.g., by copying in from an invalid address.
Fixes: 4a59cbc12
Reported by: pho
Reviewed by: kib
Sponsored by: The FreeBSD Foundation
Mark Johnston [Mon, 31 May 2021 22:49:33 +0000 (18:49 -0400)]
amd64: Avoid enabling interrupts when handling kernel mode prot faults
When PTI is enabled, we may have been on the trampoline stack when iret
faults. So, we have to switch back to the regular stack before
re-entering trap().
trap() has the somewhat strange behaviour of re-enabling interrupts when
handling certain kernel-mode exceptions. In particular, it was doing
this for exceptions raised during execution of iret. When switching
away from the trampoline stack, however, the thread must not be migrated
to a different CPU. Fix the problem by simply leaving interrupts
disabled during the window.
Reported by: syzbot+6cfa544fd86ad4647ffc@syzkaller.appspotmail.com
Reported by: syzbot+cfdfc9e5a8f28f11a7f5@syzkaller.appspotmail.com
Reviewed by: kib
Sponsored by: The FreeBSD Foundation
Mark Johnston [Mon, 31 May 2021 22:51:14 +0000 (18:51 -0400)]
x86: Fix lapic_ipi_alloc() on i386
The loop which checks to see if "dynamic" IDT entries are allocated
needs to compare with the trampoline address of the reserved ISR.
Otherwise it will never succeed.
Reported by: Harry Schmalzbauer <freebsd@omnilan.de>
Tested by: Harry Schmalzbauer <freebsd@omnilan.de>
Reviewed by: kib
Sponsored by: The FreeBSD Foundation
Rich Ercolani [Wed, 2 Jun 2021 13:00:29 +0000 (13:00 +0000)]
vfs: fix MNT_SYNCHRONOUS check in vn_write
ca1ce50b2b5ef11d ("vfs: add more safety against concurrent forced
unmount to vn_write") has a side effect of only checking MNT_SYNCHRONOUS
if O_FSYNC is set.
Michael Tuexen [Sun, 2 May 2021 20:38:27 +0000 (22:38 +0200)]
sctp: improve error handling in INIT/INIT-ACK processing
When processing INIT and INIT-ACK information, also during
COOKIE processing, delete the current association, when it
would end up in an inconsistent state.
Michael Tuexen [Mon, 26 Apr 2021 08:38:05 +0000 (10:38 +0200)]
sctp: improve handling of illegal packets containing INIT chunks
Stop further processing of a packet when detecting that it
contains an INIT chunk that is too small or is not the only
chunk in the packet. Still allow the processing of chunks
before the INIT chunk to finish.
Thanks to Antoly Korniltsev and Taylor Brandstetter for reporting
an issue with the userland stack, which made me aware of this
issue.
Ed Maste [Sun, 2 May 2021 19:28:36 +0000 (15:28 -0400)]
Restore Cirrus-CI boot smoke test
This reverts commit a7d593dd1da27833b5384349700bc3c7bcae6aad.
We now use compute_engine_instance which allows us to specify a custom
disk size. Also go back to using the default qemu version (rather than
qemu42 or qemu-devel) as any issues were fixed some time ago.
Reviewed by: lwhsu, markj
MFC after: 1 month
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D30082
Merge commit f511dc75e4c1 from llvm git (by Mark Johnston):
[asan] Add an offset for the kernel address sanitizer on FreeBSD
This is based on a port of the sanitizer runtime to the FreeBSD kernel
that has been committed as https://cgit.freebsd.org/src/commit/?id=38da497a4dfcf1979c8c2b0e9f3fa0564035c147
and the following commits.
Reviewed By: emaste, dim
Differential Revision: https://reviews.llvm.org/D98285
cron: consume blanks in system crontabs before options
On system crontabs, multiple blanks are not being consumed after reading the
username. This change adds blank consumption before parsing any -[qn] options.
Without this change, an entry like:
* * * * * username  -n true # Two spaces between username and option.
will fail, as the shell will try to execute (' -n true'), while an entry like:
* * * * * username -n true # One space between username and option.
works as expected (executes 'true').
For user crontabs, this is not an issue as the preceding (day of week
or @shortcut) processing consumes any leading whitespace.
PR: 253699
Submitted by: Eric A. Borisch <eborisch@gmail.com>
MFC after: 1 week
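The blank handling described in this change can be sketched roughly as follows. This is not cron's parser: sk_parse_after_user() and the SK_OPT_* flags are hypothetical, and the function only models the text that follows the username field of a system crontab entry.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical flag bits for the -n and -q options. */
#define SK_OPT_N 0x1
#define SK_OPT_Q 0x2

/*
 * Given the text that follows the username field, skip all blanks,
 * collect any leading -n / -q options, and return a pointer to the
 * start of the command.
 */
const char *
sk_parse_after_user(const char *p, int *flags)
{
	assert(p != NULL && flags != NULL);
	*flags = 0;
	for (;;) {
		/* Consume every blank, not just the first one. */
		while (*p == ' ' || *p == '\t')
			p++;
		if (p[0] != '-' || (p[1] != 'n' && p[1] != 'q'))
			break;
		/* An option must be followed by a blank. */
		if (p[2] != ' ' && p[2] != '\t')
			break;
		*flags |= (p[1] == 'n') ? SK_OPT_N : SK_OPT_Q;
		p += 2;
	}
	return (p);
}
```

With two blanks before "-n true" the sketch still recognizes the option and returns a pointer to "true", which is the behaviour the change restores for system crontabs.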
Rick Macklem [Sat, 22 May 2021 21:51:38 +0000 (14:51 -0700)]
nfscl: Add hash lists for the NFSv4 opens
A problem was reported via email, where a large (130000+) accumulation
of NFSv4 opens on an NFSv4 mount caused significant lock contention
on the mutex used to protect the client mount's open/lock state.
Although the root cause for the accumulation of opens was not
resolved, it is obvious that the NFSv4 client is not designed to
handle 100000+ opens efficiently. When searching for an open,
usually for a match by file handle, a linear search of all opens
is done.
This patch adds a table of hash lists for the opens, hashed on
file handle. This table will be used by future commits to
search for an open based on file handle more efficiently.
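A table of hash lists keyed on file handle might look roughly like the sketch below. The structure names, the bucket count and the use of FNV-1a are assumptions made for illustration; the commit text does not say which hash function or table size the patch uses.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SK_HASH_SIZE 64		/* illustrative bucket count */
#define SK_FH_MAX    128	/* illustrative maximum handle size */

struct sk_open {
	struct sk_open *next;	/* chain within one bucket */
	size_t fhlen;
	uint8_t fh[SK_FH_MAX];
};

struct sk_open_table {
	struct sk_open *bucket[SK_HASH_SIZE];
};

/* 32-bit FNV-1a over the file handle bytes, reduced to a bucket index. */
size_t
sk_fh_hash(const uint8_t *fh, size_t len)
{
	uint32_t h = 2166136261u;
	size_t i;

	for (i = 0; i < len; i++) {
		h ^= fh[i];
		h *= 16777619u;
	}
	return (h % SK_HASH_SIZE);
}

/* Link an open at the head of the bucket selected by its file handle. */
void
sk_open_insert(struct sk_open_table *t, struct sk_open *op)
{
	size_t b;

	assert(t != NULL && op != NULL && op->fhlen <= SK_FH_MAX);
	b = sk_fh_hash(op->fh, op->fhlen);
	op->next = t->bucket[b];
	t->bucket[b] = op;
}

/* Search only the one bucket the file handle hashes to. */
struct sk_open *
sk_open_find(const struct sk_open_table *t, const uint8_t *fh, size_t len)
{
	struct sk_open *op;

	for (op = t->bucket[sk_fh_hash(fh, len)]; op != NULL; op = op->next)
		if (op->fhlen == len && memcmp(op->fh, fh, len) == 0)
			return (op);
	return (NULL);
}
```

A lookup then walks a single chain rather than every open on the mount, which is where the saving over the linear search would come from.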
Kristof Provost [Tue, 1 Jun 2021 14:05:47 +0000 (16:05 +0200)]
pf: Fix more ioctl memory leaks
We must also remember to free nvlists added to a parent nvlist with
nvlist_append_nvlist_array().
More importantly, when nvlist_pack() allocates memory for us it does so
in the M_NVLIST zone, so we must free it with free(.., M_NVLIST). Using
free(.., M_TEMP) as we did silently failed to free the memory.
Rick Macklem [Fri, 21 May 2021 01:37:40 +0000 (18:37 -0700)]
nfsd: Add support for CLAIM_DELEG_PREV_FH to the NFSv4.1/4.2 Open
Commit b3d4c70dc60f added support for CLAIM_DELEG_CUR_FH to Open.
While doing this, I noticed that CLAIM_DELEG_PREV_FH support
could be added the same way. Although I am not aware of any extant
NFSv4.1/4.2 client that uses this claim type, it seems prudent to add
support for this variant of Open to the NFSv4.1/4.2 server.
This patch does not affect mounts from extant NFSv4.1/4.2 clients,
as far as I know.
Mark Johnston [Fri, 21 May 2021 21:44:46 +0000 (17:44 -0400)]
Fix handling of errors from pru_send(PRUS_NOTREADY)
PRUS_NOTREADY indicates that the caller has not yet populated the chain
with data, and so it is not ready for transmission. This is used by
sendfile (for async I/O) and KTLS (for encryption). In particular, if
pru_send returns an error, the caller is responsible for freeing the
chain since other implicit references to the data buffers exist.
For async sendfile, it happens that an error will only be returned if
the connection was dropped, in which case tcp_usr_ready() will handle
freeing the chain. But since KTLS can be used in conjunction with the
regular socket I/O system calls, many more error cases - which do not
result in the connection being dropped - are reachable. In these cases,
KTLS was effectively assuming success.
So:
- Change sosend_generic() to free the mbuf chain if
pru_send(PRUS_NOTREADY) fails. Nothing else owns a reference to the
chain at that point.
- Similarly, in vn_sendfile() change the !async I/O && KTLS case to free
the chain.
- If async I/O is still outstanding when pru_send fails in
vn_sendfile(), set an error in the sfio structure so that the
connection is aborted and the mbuf chain is freed.
Reviewed by: gallatin, tuexen
Discussed with: jhb
Sponsored by: The FreeBSD Foundation
Mark Johnston [Fri, 21 May 2021 21:44:40 +0000 (17:44 -0400)]
tcp: Make error handling in tcp_usr_send() more consistent
- Free the input mbuf in a single place instead of in every error path.
- Handle PRUS_NOTREADY consistently.
- Flush the socket's send buffer if an implicit connect fails. At that
point the mbuf has already been enqueued but we don't want to keep it
in the send buffer.
Reviewed by: gallatin, tuexen
Discussed with: jhb
Sponsored by: The FreeBSD Foundation
Jessica Clarke [Fri, 28 May 2021 18:07:17 +0000 (19:07 +0100)]
aic7xxx: Fix re-building firmware with -fno-common
The generated C output for aicasm_scan.l defines yylineno already, so
references to it from other files should use an extern declaration.
The STAILQ_HEAD use in aicasm_symbol.h also provided an identifier,
causing it to both define the struct type and define a variable of that
struct type, causing any C file including the header to define the same
variable. This variable is not used (and confusingly clashes with a
field name just below) and was likely caused by confusion when switching
between defining fields using similar type macros and defining the type
itself.
PAPANI SRIKANTH [Fri, 28 May 2021 06:17:56 +0000 (00:17 -0600)]
Newly added features and bug fixes in latest Microchip SmartPQI driver
It includes:
1) Newly added TMF feature.
2) Added new Huawei and Inspur PCI IDs.
3) Fixed smartpqi driver hangs in Z-Pool while running on FreeBSD 12.1.
4) Fixed dmesg flooding while the controller is offline during ioctls.
5) Avoided unnecessary host memory allocation for rcb sg buffers.
6) Fixed race conditions while accessing the internal rcb structure.
7) Fixed logical volumes exposing two different names to the OS, which was caused by system memory being overwritten with stale DMA data.
8) Fixed dynamic unloading of the smartpqi driver.
9) Added a device_shutdown callback in place of the deprecated shutdown_final kernel event in the smartpqi driver.
10) Fixed an OS crash during physical drive hot removal under heavy IO.
11) Fixed an OS crash during controller lockup/offline under heavy IO.
12) Fixed Coverity issues in the smartpqi driver.
13) Fixed a system crash while creating and deleting a logical volume in a continuous loop.
14) Fixed the volume size not being exposed to the OS when the volume expands.
15) Added HC3 PCI IDs.
Kristof Provost [Thu, 27 May 2021 09:28:36 +0000 (11:28 +0200)]
libpfctl: fix memory leak
When we create an nvlist and insert it into another nvlist we must
remember to destroy it. The nvlist_add_nvlist() function makes a copy,
just like nvlist_add_string() makes a copy of the string.
Mark Johnston [Mon, 31 May 2021 22:53:34 +0000 (18:53 -0400)]
tcp, udp: Permit binding with AF_UNSPEC if the address is INADDR_ANY
Prior to commit f161d294b we only checked the sockaddr length, but now
we verify the address family as well. This breaks at least ttcp. Relax
the check to avoid breaking compatibility too much: permit AF_UNSPEC if
the address is INADDR_ANY.
Fixes: f161d294b
Reported by: Bakul Shah <bakul@iitbombay.org>
Reviewed by: tuexen
Sponsored by: The FreeBSD Foundation
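The relaxed check can be sketched in user-space C as below. The function name is hypothetical and this is not the in-kernel code. It only illustrates the rule stated above: the length must match, and the family must be AF_INET, or AF_UNSPEC with an address of INADDR_ANY.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>

/*
 * Return true if an IPv4 bind address passes the relaxed check:
 * correct length, and either AF_INET, or AF_UNSPEC with INADDR_ANY.
 */
bool
sk_bind_addr_ok(const struct sockaddr_in *sin, size_t len)
{
	assert(sin != NULL);
	if (len != sizeof(*sin))
		return (false);
	if (sin->sin_family == AF_INET)
		return (true);
	/* Compatibility case for old programs that never set the family. */
	return (sin->sin_family == AF_UNSPEC &&
	    sin->sin_addr.s_addr == htonl(INADDR_ANY));
}
```

An AF_UNSPEC address with any other IPv4 address is still rejected, so the relaxation stays limited to the wildcard bind.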
Mark Johnston [Thu, 27 May 2021 19:49:12 +0000 (15:49 -0400)]
ktrace: Handle negative array sizes in ktrstructarray
ktrstructarray() may be used to create copies of kevent(2) change and
event arrays. It is called before parameter validation is done and so
should check for bogus array lengths before allocating a copy.
Reported by: syzkaller
Reviewed by: kib
Sponsored by: The FreeBSD Foundation
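The general pattern, validating a caller-supplied element count before sizing an allocation from it, can be sketched as follows. sk_copy_array() is a hypothetical user-space helper. How ktrstructarray() itself treats a bogus length (for example, by treating it as empty) is not shown here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/*
 * Copy an array of n elements of elem_size bytes each.  Returns NULL
 * without allocating when n is not positive or when the total byte
 * count would overflow size_t.  The caller frees the result.
 */
void *
sk_copy_array(const void *src, int n, size_t elem_size)
{
	void *dst;

	assert(elem_size > 0);
	if (src == NULL || n <= 0)
		return (NULL);
	if ((size_t)n > SIZE_MAX / elem_size)
		return (NULL);
	dst = malloc((size_t)n * elem_size);
	if (dst != NULL)
		memcpy(dst, src, (size_t)n * elem_size);
	return (dst);
}
```

Checking the sign before the cast matters: a negative int converted to size_t becomes a very large value, which is the kind of bogus length that must not reach the allocator.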
Rick Macklem [Wed, 19 May 2021 21:52:56 +0000 (14:52 -0700)]
nfscl: Fix NFSv4.1/4.2 mount recovery from an expired lease
The most difficult NFSv4 client recovery case happens when the
lease has expired on the server. For NFSv4.0, the client will
receive a NFSERR_EXPIRED reply from the server to indicate this
has happened.
For NFSv4.1/4.2, most RPCs have a Sequence operation and, as such,
the client will receive a NFSERR_BADSESSION reply when the lease
has expired for these RPCs. The client will then call nfscl_recover()
to handle the NFSERR_BADSESSION reply. However, for the expired lease
case, the first reclaim Open will fail with NFSERR_NOGRACE.
This patch recognizes this case and calls nfscl_expireclient()
to handle the recovery from an expired lease.
This patch only affects NFSv4.1/4.2 mounts when the lease
expires on the server, due to a network partitioning that
exceeds the lease duration or similar.