This change touches both kernel and netstat(1), but either of the changes
will fix printing pcb addresses with -A.
The thing is that historically netstat(1) treated TCP differently, and
printed tcpcb address instead of inpcb address. This is not documented
anywhere! With e68b3792440 these two addresses became the same. It is
highly likely they will be the same for a long time, but it might be they
will start to differ again in a far future. My proposal is to stop
treating TCP differently with netstat(1) and right now is a good opportunity
to do that, since there will be no behavior change at all. The kernel
change to tcp_inptoxtp() will go into stable/14 to make it compatible with
netstat(1) binary from stable/13. We can drop it later, probably together
with in_ppcb pointer from inpcb. The in_ppcb in xinpcb will stay for size
compatibility.
Bjoern A. Zeeb [Tue, 14 Feb 2023 15:53:13 +0000 (15:53 +0000)]
dpaa2: add console support for FDT based systems
Add DPAA2 console support for MC and AIOP (latter untested) for FDT
systems. ACPI systems are prepared but need some proper bus function
in order to get the address from MC (and likely a file splitup then).
This will come at a later stage once other ACPI/FDT bus parts are
cleared up.
The work was originally done in July 2022 and finally switched to
bus_space[1] lately to be ready for main.
Revert "libc: Implement bsort(3) a bitonic type of sorting algorithm."
Some points for the future:
- libc is not the right place for sorting algorithms.
Probably libutil is better suited for this purpose or
a dedicated libsort. Should move all sorting algorithms
away from libc eventually.
- CheriBSD uses capabilities for memory access, and could
benefit from a standard memswap() function.
- Do something about qsort() in FreeBSD's libc like:
- Mark it deprecated on FreeBSD, as a first step,
due to missing limits on CPU time.
- Audit the use of qsort() in the FreeBSD base system
and consider swapping to other existing sorting
algorithms.
Simon J. Gerraty [Thu, 20 Apr 2023 17:05:43 +0000 (10:05 -0700)]
Fix building host tools for host
Several makefile depend on tools built for host.
At least when using DIRDEPS_BUILD we can build these for the
pseudo machine "host" to facilitate building on older host versions.
Ideally we would build these tools in their own directories to avoid
building more than needed.
For now, setting an appropriate default for BTOOLSPATH will suffice
Ed Maste [Tue, 18 Apr 2023 13:57:29 +0000 (09:57 -0400)]
makefs: set cd9660 Rock Ridge timestamps for . and ..
DOT and DOTDOT entries have special handling, and previously only Rock
Ridge PX (POSIX attributes) entries were attached. Add TF (timestamp)
entries as well.
PR: 203531
Reported by: Thomas Schmitt <scdbackup@gmx.net>
Reviewed by: markj
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D39662
Mark Johnston [Thu, 20 Apr 2023 15:48:33 +0000 (11:48 -0400)]
inpcb: Release the inpcb cred reference before freeing the structure
Now that the inp_cred pointer is accessed only while the inpcb lock is
held, we can avoid deferring a crfree() call when freeing an inpcb.
This fixes a problem introduced when inpcb hash tables started being
synchronized with SMR: the credential reference previously could not be
released until all lockless readers have drained, and there is no
mechanism to explicitly purge cached, freed UMA items. Thus, ucred
references could linger indefinitely, and since ucreds hold a jail
reference, the jail would linger indefinitely as well. This manifests
as jails getting stuck in the DYING state.
Mark Johnston [Thu, 20 Apr 2023 15:48:19 +0000 (11:48 -0400)]
inpcb: Avoid inp_cred dereferences in SMR-protected lookup
The SMR-protected inpcb lookup algorithm currently has to check whether
a matching inpcb belongs to a jail, in order to prioritize jailed
bound sockets. To do this it has to maintain a ucred reference, and for
this to be safe, the reference can't be released until the UMA
destructor is called, and this will not happen within any bounded time
period.
Changing SMR to periodically recycle garbage is not trivial. Instead,
let's implement SMR-synchronized lookup without needing to dereference
inp_cred. This will allow the inpcb code to free the inp_cred reference
immediately when a PCB is freed, ensuring that ucred (and thus jail)
references are released promptly.
Commit 220d89212943 ("inpcb: immediately return matching pcb on lookup")
gets us part of the way there. This patch goes further to handle
lookups of unconnected sockets. Here, the strategy is to maintain a
well-defined order of items within a hash chain so that a wild lookup
can simply return the first match and preserve existing semantics. This
makes insertion of listening sockets more complicated in order to make
lookup simpler, which seems like the right tradeoff anyway given that
bind() is already a fairly expensive operation and lookups are more
common.
In particular, when inserting an unconnected socket, in_pcbinhash() now
keeps the following ordering:
- jailed sockets before non-jailed sockets,
- specified local addresses before unspecified local addresses.
Most of the change adds a separate SMR-based lookup path for inpcb hash
lookups. When a match is found, we try to lock the inpcb and
re-validate its connection info. In the common case, this works well
and we can simply return the inpcb. If this fails, typically because
something is concurrently modifying the inpcb, we go to the slow path,
which performs a serialized lookup.
Note, I did not touch lbgroup lookup, since there the credential
reference is formally synchronized by net_epoch, not SMR. In
particular, lbgroups are rarely allocated or freed.
I think it is possible to simplify in_pcblookup_hash_wild_locked() now,
but I didn't do it in this patch.
Mark Johnston [Thu, 20 Apr 2023 15:48:01 +0000 (11:48 -0400)]
inpcb: Split PCB hash tables
Currently we use a single hash table per PCB database for connected and
bound PCBs. Since we started using net_epoch to synchronize hash table
lookups, there's been a bug, noted in a comment above in_pcbrehash():
connecting a socket can cause an inpcb to move between hash chains, and
this can cause a concurrent lookup to follow the wrong linkage pointers.
I believe this could cause rare, spurious ECONNREFUSED errors in the
worse case.
Address the problem by introducing a second hash table and adding more
linkage pointers to struct inpcb. Now the database has one table each
for connected and unconnected sockets.
When inserting an inpcb into the hash table, in_pcbinhash() now looks at
the foreign address of the inpcb to figure out which table to use. This
ensures that queue linkage pointers are stable until the socket is
disconnected, so the problem described above goes away. There is also a
small benefit in that in_pcblookup_*() can now search just one of the
two possible hash buckets.
I also made the "rehash" parameter of in(6)_pcbconnect() unused. This
parameter seems confusing and it is simpler to let the inpcb code figure
out what to do using the existing INP_INHASHLIST flag.
UDP sockets pose a special problem since they can be connected and
disconnected multiple times during their lifecycle. To handle this, the
patch plugs a hole in the inpcb structure and uses it to store an SMR
sequence number. When an inpcb is disconnected - an operation which
requires the global PCB database hash lock - the write sequence number
is advanced, and in order to reconnect, the connecting thread must wait
for readers to drain before reusing the inpcb's hash chain linkage
pointers.
raw_ip (ab)uses the hash table without using the corresponding
accessors. Since there are now two hash tables, it arbitrarily uses the
"connected" table for all of its PCBs. This will be addressed in some
way in the future.
inp interators which specify a hash bucket will only visit connected
PCBs. This is not really correct, but nothing in the tree uses that
functionality except raw_ip, which as mentioned above places all of its
PCBs in the "connected" table and so is unaffected.
Move a KASSERT out of a function and make it a CTASSERT with
appropriate comments.
Skeleton implement two tkip functions, still left TODO, initializing
variables with dummy values to quiten compiler warnings. It is
unclear to me if we should still ever properly implement TKIP
compat code at this point. If so the current code gives a good
idea what needs to be done in addition to allocating references
to real state along with keyconf.
Sponsored by: The FreeBSD Foundation
MFC after: 3 days
Quieten some more (valid) gcc warnings and disable dead code.
There are more warnings, some probably a compiler problem, the
other related to firmware structs which I do not want to adjust
just locally. Leave a comment to revisit after a next driver
update.
Sponsored by: The FreeBSD Foundation
MFC after: 3 days
network.subr: adjust regex for wlans_xxxxx rc.conf entries
Drivers like ath1[012]k will not match the current wlans_*-regex as
they have digits followed by letters. Adjust the regex to allow
this combination in order to be able to configure interfaces with
names like wlans_ath11k0="..."
MFC after: 3 days
Reviewed by: emaste
Differential Revision: https://reviews.freebsd.org/D39674
Rather than using ACCESS_ONCE() in READ_ONCE() add a missing cast
to const in order to satisfy -Wcast-equal by gcc.
Sadly we cannot do the same to WRITE_ONCE() which still is very
noisy.
Sponsored by: The FreeBSD Foundation
MFC after: 3 days
Reviewed by: hselasky
Differential Revision: https://reviews.freebsd.org/D39706
Harmonize sk_buff_head and sk_buff further and fix -Warray-bounds
warnings reports by gcc. At the same time simplify some code by
re-using other functions or factoring some code out.
Sponsored by: The FreeBSD Foundation
MFC after: 3 days
A one-bit wide bit-field can take only the values 0 and -1. Clang 16
introduced a warning that "implicit truncation from 'int' to a one-bit
wide bit-field changes value from 1 to -1". Fix by using c99 bool.
Randall Stewart [Wed, 19 Apr 2023 18:02:12 +0000 (14:02 -0400)]
tcp: rack the request level logging is a bit too noisy when doing point logging.
When doing request level BB logging the hybrid_bw_log() does not have proper screening to minimize logging
when point level logging is in use. Lets fix it properly so you have to have the proper knobs set to get the
more noisy logging.
loader: Change version calculation to be more consistent.
Use 1000 * major + minor when calculating the version number that
gets set in the Ficl environment or lua loader property. This allows
for more room if the minor number needs to go above 9.
Randall Stewart [Wed, 19 Apr 2023 17:17:04 +0000 (13:17 -0400)]
tcp: Rack can crash with the new non-TSO fix..
Turns out the location of the check to see if we can do output is in the wrong place. We need
to jump off to the compressed acks before handling that case since th is NULL in the
compressed ack case which is handled differently anyway.
Randall Stewart [Wed, 19 Apr 2023 16:54:25 +0000 (12:54 -0400)]
We have a TCP_LOG_CONNEND log that should come out at the very last log of every connection. This
holds some nice stats about why/how the connection ended. Though with the current code it does not
come out without accounting due to the placement of the ifdefs. Also we need to make sure the stacks
fini has ran before calling in from tcp_subr so we get all logs the stack may make at its ending.
cxgbe/iw_cxgbe: debug routines to dump STAG (steering tag) entries.
t4_dump_stag to dump hw state for a known STAG.
t4_dump_all_stag to dump hw state for all valid STAGs. This routine
walks the entire STAG region looking for valid entries and this can take
a while for some configurations.
Brooks Davis [Wed, 19 Apr 2023 15:58:46 +0000 (16:58 +0100)]
Cirrus-CI: Check that make sysent was run
Run the `make sysent` target and verify that the repo isn't modified
afterwards. This ensures that a pushed branch contains all the
required bits after a change to syscall.master.
struct dpaa2_cmd is no longer malloc'ed, but can be allocated on stack
and initialized with DPAA2_CMD_INIT() on demand. Drivers stopped caching
their DPAA2 command objects (and associated tokens) in the software
contexts in order to avoid using them concurrently.
When sorting, both the C11 standard (ISO/IEC 9899:2011, K.3.6.3.2) and
the ISO/IEC JTC1 SC22 WG14 N1172 standard, does not define objects of
zero size as undefined behaviour. However Microsoft's cpp-docs does.
Add proper checks for this. Found while working on bsort(3).
libc: Implement bsort(3) a bitonic type of sorting algorithm.
The bsort(3) algorithm works by swapping objects, similarly to qsort(3),
and does not require any significant amount of additional memory.
The bsort(3) algorithm doesn't suffer from the processing time issues
known the plague the qsort(3) family of algorithms, and is bounded by
a complexity of O(log2(N) * log2(N) * N), where N is the number of
elements in the sorting array. The additional complexity compared to
mergesort(3) is a fair tradeoff in situations where no memory may
be allocated.
The bsort(3) APIs are identical to those of qsort(3), allowing for
easy drop-in and testing.
The design of the bsort(3) algorithm allows for future parallell CPU
execution when sorting arrays. The current version of the bsort(3)
algorithm is single threaded. This is possible because fixed areas
of the sorting data is compared at a time, and can easily be divided
among different CPU's to sort large arrays faster.
Randall Stewart [Tue, 18 Apr 2023 16:21:56 +0000 (12:21 -0400)]
tcp: bbr.c is non-capable of doing ECN and sets an INP flag to fend off ECN however our syncache is not aware of that flag.
We need to make the syncache aware of the flag and not do ECN if its set. Note that this
is not 100% full proof but the best we can do (i.e. its still possible that you can get in a
situation where the peer try's to do ecn).
pf: change pf_rules_lock and pf_ioctl_lock to per-vnet locks
Both pf_rules_lock and pf_ioctl_lock only ever affect one vnet, so
there's no point in having these locks affect other vnets.
(In fact, the only lock in pf that can affect multiple vnets is
pf_end_lock.)
That's especially important for the rules lock, because taking the write
lock suspends all network traffic until it's released. This will reduce
the impact a vnet running pf can have on other vnets, and improve
concurrency on machines running multiple pf-enabled vnets.
The explanation from https://reviews.freebsd.org/D39637 by stevek:
The "use_xsave" variable is a global and that is only supposed to be
initialized early before scheduling gets started. However, with the way
the ifuncs for "fpusave" and "fpurestore" are implemented, the value
could be changed at runtime when scheduling is active if "use_xsave"
was set to 0 by the tunable. This leaves a window of opportunity where
"use_xsave" gets re-initialized to 1 and a context switch could occur
with a thread that was not set up to be able to use xsave functionality.
This can lead to an "privileged instruction fault".
The fix is to protect "use_xsave" from being initialized more than once.
Reported and reviewed by: stevek
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D39660
Since syncer vnode vector does not provide a fallback to the default
one, its VOP_GETWRITEMOUNT() implementation implicitly returned
EOPNOTSUPP, which means that syncer ignored suspension.
Reported and tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Skip Pass 5 in fsck_ffs(8) when corrupt cylinder groups remain unfixed.
Pass 1 of fsck_ffs checks the integrity of all the cylinder groups.
If any are found to have been corrupted it offers to rebuild them.
Pass 5 then makes a second pass over the cylinder groups to validate
their block and inode maps. Pass 5 assumes that the cylinder groups
are not corrupted and can segment fault if they are corrupted. Rather
than rerunning the corruption checks a second time in pass 5, this
fix keeps track whether any corrupt cylinder groups were found but not
fixed in pass 1 either due to running with the -n flag or by explicitly
answering `no' when asked whether to fix a corrupted cylinder group.
If any corrupted cylinder groups remain after pass 1, fsck_ffs will
decline to run pass 5. Instead it marks the filesystem as unclean
so that fsck_ffs will need to be run again before the filesystem can
be mounted.
This patch cleans up and documents the return value from check_cgmagic().
It also renames the variable / parameter "rebuildcg" to "rebuiltcg".
This parameter describes whether the cylinder group has been rebuilt
rather than whether it should be rebuilt.
Add `chdb' command to fsdb(8) to set direct block numbers.
Add the ability to set direct blocks numbers in inodes so that manual
corrections can be made. No checking of the values is attempted so
accidental or deliberate bad values can be set.
Warner Losh [Tue, 18 Apr 2023 21:29:45 +0000 (15:29 -0600)]
stand: Add a snarky note about the upstream ZFS situation
The latest import of openzfs broke the hacks that we used to omit the
special registers being used on arm64. Add snarky note documenting this
situation since it's a mess now since the hack was only partially
undone, leaving behind a mess.
Mark Johnston [Tue, 18 Apr 2023 18:32:04 +0000 (14:32 -0400)]
loader.efi: Fix some arm64 PE metadata
- Mark the file as an executable in the COFF header.
- Provide separate .text and .data sections.
- Provide sane file and section alignment values. These values are the
defaults defined in the PE specification.
- Set appropriate characteristics for each of .text and .data.
This is required for the MS devkit to load our UEFI image.