John Baldwin [Fri, 7 Oct 2022 17:14:03 +0000 (10:14 -0700)]
sys: Consolidate common implementation details of PV entries.
Add a <sys/_pv_entry.h> intended for use in <machine/pmap.h> to
define struct pv_entry, pv_chunk, and related macros and inline
functions.
Note that powerpc does not yet use this as while the mmu_radix pmap
in powerpc uses the new scheme (albeit with fewer PV entries in a
chunk than normal due to an used pv_pmap field in struct pv_entry),
the Book-E pmaps for powerpc use the older style PV entries without
chunks (and thus require the pv_pmap field).
Kristof Provost [Fri, 7 Oct 2022 16:14:02 +0000 (18:14 +0200)]
w: don't truncate if we're writing libxo json/xml
If we're writing structured output (i.e. json or xml) we shouldn't worry
about terminal width, and instead always output full width information.
This means that, for example, if we're called from crontab with 'w
--libxo json' we'll provide full the command field rather than
pointlessly truncating it.
Zhenlei Huang [Fri, 7 Oct 2022 11:37:12 +0000 (13:37 +0200)]
if_vxlan(4): Correct the statistic for output bytes
The vxlan interface encapsulates the Ethernet frame by prepending IP/UDP
and vxlan headers. For statistics, only the payload, i.e. the
encapsulated (inner) frame should be counted.
Add extra EINVAL information about wrong block size to read(2)/write(2)
The read system call will return EINVAL if the current file offset is
not a multiple of the block size. This also applies to write(2). Add an
entry for EINVAL about this error to both man pages.
Gleb Smirnoff [Fri, 7 Oct 2022 02:22:23 +0000 (19:22 -0700)]
tcp: remove INP_TIMEWAIT flag
Mechanically cleanup INP_TIMEWAIT from the kernel sources. After 0d7445193ab, this commit shall not cause any functional changes.
Note: this flag was very often checked together with INP_DROPPED.
If we modify in_pcblookup*() not to return INP_DROPPED pcbs, we
will be able to remove most of this checks and turn them to
assertions. Some of them can be turned into assertions right now,
but that should be carefully done on a case by case basis.
Gleb Smirnoff [Fri, 7 Oct 2022 02:22:23 +0000 (19:22 -0700)]
tcp: remove tcptw, the compressed timewait state structure
The memory savings the tcptw brought back in 2003 (see 340c35de6a2) no
longer justify the complexity required to maintain it. For longer
explanation please check out the email [1].
Surpisingly through almost 20 years the TCP stack functionality of
handling the TIME_WAIT state with a normal tcpcb did not bitrot. The
existing tcp_input() properly handles a tcpcb in TCPS_TIME_WAIT state,
which is confirmed by the packetdrill tcp-testsuite [2].
This change just removes tcptw and leaves INP_TIMEWAIT. The flag will
be removed in a separate commit. This makes it easier to review and
possibly debug the changes.
Mitchell Horne [Wed, 5 Oct 2022 16:14:36 +0000 (13:14 -0300)]
riscv: handle kernel PTE edge-case in pmap_enter_l2()
Page table pages are never freed from the kernel pmap, instead they are
zeroed when a range is unmapped. This allows future mappings to be
constructed more quickly. Detect this scenario in pmap_enter_l2(), so we
don't fail to create a superpage mapping when the 2MB range is actually
available.
Reviewed by: markj
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D36885
Mitchell Horne [Wed, 5 Oct 2022 17:11:02 +0000 (14:11 -0300)]
riscv: handle superpage in pmap_enter_quick_locked()
Previously, if pmap_enter_l2() was asked to re-map an existing superpage
(the result of madvise(MADV_WILLNEED) on a mapped range), it could
'fail' to do so, falling back to trying pmap_enter_quick_locked() for
each 4K virtual page. Because this function does not check if the l2
entry it finds is a superpage, it would proceed, sometimes resulting in
the creation of false PV entries.
If the relevant range was later munmap'ed, the system would panic during
the process' exit in pmap_remove_pages(), while attempting to clean up
the PV entries for mappings which no longer exist.
Instead, we should return early in the presence of an existing
superpage, as is done in other pmaps.
PR: 266108
Reviewed by: markj, alc
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D36563
Jessica Clarke [Thu, 6 Oct 2022 19:04:04 +0000 (20:04 +0100)]
bsdinstall: Fix race condition when shutting down after installation
Whilst reboot(8) will block whilst it runs, shutdown(8) does not,
daemonizing instead. This means that we must wait after running it,
otherwise we will exit and cause the system to attempt to go multi-user
in parallel with the shutdown daemon killing init. With the new
multi-console support in the installer, runconsoles will immediately
kill this daemon, racing with the daemon being able to signal init as
desired, and I have seen this race be lost in QEMU with a single CPU. In
the past this wasn't such an issue, since shutdown's daemon puts itself
in a new session group immediately after fork (and the parent doesn't
wait until that has happened, so whilst there's technically a race
condition in there where it could receive a SIGHUP from the death of the
parent's session leader, in practice this is very unlikely to be hit.
This means that the only consequence of this oversight before was that
you might get the beginnings of more console output on the way to
multi-user and thus the console would look a little confusing.
Reviewed by: gjb
Fixes: e4505364c087 ("release/rc.local: Provide option to shutdown after installation complete")
Fixes: a09af1b7fd95 ("bsdinstall release: Start installer on multiple consoles")
Differential Revision: https://reviews.freebsd.org/D36879
Cy Schubert [Thu, 6 Oct 2022 17:56:48 +0000 (10:56 -0700)]
nvmecontrol: Fix i386 build
Fix:
--- all_subdir_sbin ---
/opt/src/git-src/sbin/nvmecontrol/modules/samsung/samsung.c:149:64:
error: format specifies type 'unsigned long' but the argument has type
'uint64_t' (aka 'unsigned long long') [-Werror,-Wformat]
printf(" Read Reclaim Count : %lu\n",
le64dec(&temp->rrc));
~~~
^~~~~~~~~~~~~~~~~~~
%llu
/opt/src/git-src/sbin/nvmecontrol/modules/samsung/samsung.c:150:64:
error: forma t specifies type 'unsigned long' but the argument has type
'uint64_t' (aka 'unsigned long long') [-Werror,-Wformat]
printf(" Lifetime Uncorrectable ECC Count : %lu\n",
le64dec(&temp->lueccc));
~~~
^~~~~~~~~~~~~~~~~~~~~~
%llu
2 errors generated.
Navdeep Parhar [Wed, 5 Oct 2022 18:05:12 +0000 (11:05 -0700)]
cxgbe/cxgbei: Do not validate the hardware iSCSI tag mask.
This was added in 7cba15b16eb2 in 2016 and firmwares at that time were
already setting up the iSCSI tag mask properly. Since then it has also
become possible to split the iSCSI region between multiple PCIE PFs but
the driver's calculation takes only its own PF's allocation into account
and that means this code is incorrect and not just a harmless no-op.
Ignore IPv6 NA and drop IPv6 NS when BACKUP CARP address is used
When system acts as CARP BACKUP ignore received IPv6 Neighbor Advertisements
to ensure that neighbor cache will not be changed.
Also do not send IPv6 Neighbor Solicitation from CARP BACKUP source address.
Such packets can confuse network switch and it detects MAC addresses
flapping.
Alexander Motin [Thu, 6 Oct 2022 16:15:25 +0000 (12:15 -0400)]
vmd: Bypass MSI/MSI-X remapping when possible.
By default all VMD devices remap children MSI/MSI-X interrupts into their
own. It creates additional isolation, but also complicates things due to
sharing, etc. Fortunately some VMD devices can bypass the remapping.
Add tunable to control it for remap testing or if something go wrong.
Kornel Dulęba [Thu, 6 Oct 2022 14:25:47 +0000 (16:25 +0200)]
uart_dev_snps: Fix device probing
The "uart_bus_probe" function is used as a generic part of uart probe
logic. It returns a driver priority(negative number) if successful and
an error code otherwise.
Fix the error checking condition to account for that.
Also, while here return "BUS_PROBE_VENDOR", instead of "0".
This fixes uart on clearfog pro with recent DT.
Kornel Dulęba [Thu, 6 Oct 2022 14:20:58 +0000 (16:20 +0200)]
test/sys/opencrypto: Fix NIST KAT parser iterator
When yield a.k.a "generator" iterator is used we need to return all
data using "yield", before returning from the function.
Because of that only encryption tests were run for AES-CBC, other modes
were affected as well.
Add one more loop to the iterator "next" routine to fix that.
This unveiled a problem in the GCM AEAD parser logic, which didn't
correctly handle tests cases with empty plaintext, i.e. AAD only.
Include the fix in this patch as it's a rather trivial one.
John Baldwin [Wed, 5 Oct 2022 23:48:05 +0000 (16:48 -0700)]
rs: Fix some pointer arith UB.
If the next column was blank, then the length of the following entry
was computed as the end of the following entry minus a global variable
"blank" which is not in the same string or allocation. Instead, save
the start value of 'p' explicitly instead of abusing '*ep'. Possibly
we should just increment p before saving it in sp in the 'blank' case,
but at worst that would just mean maxlen might be one char too large
which should be harmless.
John Baldwin [Wed, 5 Oct 2022 23:47:40 +0000 (16:47 -0700)]
rs: Fix a use after free.
Using a pointer passed to realloc() after realloc() even for pointer
arithmetic is UB. It also breaks in practice on CHERI systems as
the updated value of 'sp' in this case would have had the bounds from
the old allocation.
This would be much cleaner if elem were a std::vector<char *>.
John Baldwin [Wed, 5 Oct 2022 23:46:01 +0000 (16:46 -0700)]
msk: Use a void cast to mark values of dummy reads as unused.
Note that this required adding missing ()'s around the outermost level
of MSK_READ_MIB*. Otherwise, the void cast was only applied to the
first register read. This also meant that MSK_READ_MIB64 was pretty
broken as the uint64_t cast only applied to the first 16-bit register
read in each MSK_READ_MIB32 invocation and the 32-bit shift was only
applied to the second register read of the pair.
Mark Johnston [Wed, 5 Oct 2022 19:12:46 +0000 (15:12 -0400)]
vm_page: Fix a logic error in the handling of PQ_ACTIVE operations
As an optimization, vm_page_activate() avoids requeuing a page that's
already in the active queue. A page's location in the active queue is
mostly unimportant.
When a page is unwired and placed back in the page queues,
vm_page_unwire() avoids moving pages out of PQ_ACTIVE to honour the
request, the idea being that they're likely mapped and so will simply
get bounced back in to PQ_ACTIVE during a queue scan.
In both cases, if the page was logically in PQ_ACTIVE but had not yet
been physically enqueued (i.e., the page is in a per-CPU batch), we
would end up clearing PGA_REQUEUE from the page. Then, batch processing
would ignore the page, so it would end up unwired and not in any queues.
This can arise, for example, when a page is allocated and then
vm_page_activate() is called multiple times in quick succession. The
result is that the page is hidden from the page daemon, so while it will
be freed when its VM object is destroyed, it cannot be reclaimed under
memory pressure.
Fix the bug: when checking if a page is in PQ_ACTIVE, only perform the
optimization if the page is physically enqueued.
Brooks Davis [Wed, 5 Oct 2022 16:26:30 +0000 (17:26 +0100)]
Rename MACHINE_ABI and TARGET_ABI
The MACHINE_ABI and TARGET_ABI variables are used to set the middle of
the target triple (e.g., "-unknown-" or "-gnueabihf-"). They are not set
by any tool in the base system and I've only found the latter mentioned
in one review online. As such, rename them to to MACHINE_TRIPLE_ABI and
TARGET_TRIPLE_ABI to clear the way to use MACHINE_ABI as a supplement to
MACHINE_CPU, etc.
Use time_t rather than uint32_t to represent the timestamps. That means
we have 64 bits rather than 32 on all platforms except i386, avoiding
the Y2K38 issues on most platforms.
Kristof Provost [Wed, 5 Oct 2022 10:11:07 +0000 (12:11 +0200)]
dhclient-script: cope with /32 address leases
On certain cloud platforms (Google Cloud, Packet.net and others) the
DHCP server offers a /32 address. This makes adding the default route
fail since it is not reachable via any interface. Linux's
dhclient-script seem to usually have a special case for that and
explicitly adds an interface route to the router's address.
FreeBSD's dhclient-script already has a special case for when the router
address is the same as the leased address. Now also add one for when
it's a different address that doesn't fall in the interface's subnet.
PR: 241792
Event: Aberdeen hackathon 2022
Submitted by: sigsys@gmail.com
Reviewed by: dch, kp, bz (+1 on the idea, not reviewed), thj
MFC after: 1 week
Mark Johnston [Tue, 4 Oct 2022 16:54:36 +0000 (12:54 -0400)]
dtrace: Add a "regs" variable
This allows invop-based providers (i.e., fbt and kinst) to expose the
register file of the CPU at the point where the probe fired. It does
not work for SDT providers because their probes are implemented as plain
function calls and so don't save registers. It's not clear what
semantics "regs" should have for them anyway.
This is akin to "uregs", which nominally provides access to the
userspace registers. In fact, DIF already had a DIF_VAR_REGS variable
defined, it was simply unimplemented.
Usage example: print the contents of %rdi upon each call to
amd64_syscall():
Mark Johnston [Tue, 4 Oct 2022 16:46:39 +0000 (12:46 -0400)]
makefs: Plug a memory leak
nvlist_find_string() would return a copy of the found value, but callers
assumed they would have to make their own copy. It's simpler to change
nvlist_find_string() than it is to change callers, so do that.
Andrew Turner [Tue, 4 Oct 2022 11:46:24 +0000 (12:46 +0100)]
Clear the indirect flag in the GICv3 ITS driver
Summary:
The indirect flag tells the hardware to use a flat or two level table.
As we only support using the flat table ensure the flag that marks
which is in use is set correctly.
We can't rely on this being set correctly as some firmware may set the
indirect flag, e.g. booting from LinuxBoot.
Reviewed by: imp
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D36873
Alexander Motin [Tue, 4 Oct 2022 14:34:15 +0000 (10:34 -0400)]
pci: Disable Electromechanical Interlock.
Add sysctl/tunable to control Electromechanical Interlock support.
Disable it by default since Linux does not do it either and it seems
the number of systems having it broken is higher than having working.
This fixes NVMe backplane operation on ASUS RS500A-E11-RS12U server
with AMD EPYC 7402 CPU, where attempts to control reported interlock
for some reason end up in PCIe link loss, while interlock status does
not change (it is not really there).
time(3): Increase precision of time conversion functions by using gcd.
When converting times to and from units which have many leading zeros,
it pays off to compute the greatest common divisor first, and then do the
scaling product. This way all time unit conversion comes down to scaling a
signed or unsigned 64-bit value by a fraction represented by two signed
or unsigned 32-bit integers.
SBT_1S is defined as 2^32 . When scaling using powers of 10 above 1,
the gcd of SBT_1S and 10^N is always greater than or equal to 4,
when N is greater or equal to 2.
Scaling a sbt value to milliseconds is then done by multiplying by
(1000 / 8) and dividing by (2^32 / 8).
This trick allows for higher precision at very little additional CPU cost.
It shall also be noted that the Xtosbt() functions prior to this patch,
sometimes were off-by-one:
For example when converting 1 / 8 of a second to sbt as 125ms the old sbt
conversion function would compute 0x20000001 while the new function computes
0x20000000 which multiplied by 8 becomes SBT_1S, which is the correct value.
Randall Stewart [Tue, 4 Oct 2022 11:09:01 +0000 (07:09 -0400)]
tcp idle reduce does not work for a server.
TCP has an idle-reduce feature that allows a connection to reduce its
cwnd after it has been idle more than an RTT. This feature only works
for a sending side connection. It does this by at output checking the
idle time (t_rcvtime vs ticks) to see if its more than the RTO timeout.
The problem comes if you are a web server. You get a request and
then send out all the data.. then go idle. The next time you would
send is in response to a request from the peer asking for more data.
But the thing is you updated t_rcvtime when the request came in so
you never reduce.
The fix is to do the idle reduce check also on inbound.
Gleb Smirnoff [Tue, 4 Oct 2022 03:53:04 +0000 (20:53 -0700)]
netinet*: remove PRC_ constants and streamline ICMP processing
In the original design of the network stack from the protocol control
input method pr_ctlinput was used notify the protocols about two very
different kinds of events: internal system events and receival of an
ICMP messages from outside. These events were coded with PRC_ codes.
Today these methods are removed from the protosw(9) and are isolated
to IPv4 and IPv6 stacks and are called only from icmp*_input(). The
PRC_ codes now just create a shim layer between ICMP codes and errors
or actions taken by protocols.
- Change ipproto_ctlinput_t to pass just pointer to ICMP header. This
allows protocols to not deduct it from the internal IP header.
- Change ip6proto_ctlinput_t to pass just struct ip6ctlparam pointer.
It has all the information needed to the protocols. In the structure,
change ip6c_finaldst fields to sockaddr_in6. The reason is that
icmp6_input() already has this address wrapped in sockaddr, and the
protocols want this address as sockaddr.
- For UDP tunneling control input, as well as for IPSEC control input,
change the prototypes to accept a transparent union of either ICMP
header pointer or struct ip6ctlparam pointer.
- In icmp_input() and icmp6_input() do only validation of ICMP header and
count bad packets. The translation of ICMP codes to errors/actions is
done by protocols.
- Provide icmp_errmap() and icmp6_errmap() as substitute to inetctlerrmap,
inet6ctlerrmap arrays.
- In protocol ctlinput methods either trust what icmp_errmap() recommend,
or do our own logic based on the ICMP header.
Gleb Smirnoff [Tue, 4 Oct 2022 03:53:04 +0000 (20:53 -0700)]
netipsec: move specific ipsecmethods declarations to ipsec_support.h
where struct ipsec_methods is defined. Not a functional change.
Allows further modification of method prototypes without breaking
compilation of other ipsec compilation units.
Gleb Smirnoff [Tue, 4 Oct 2022 03:53:04 +0000 (20:53 -0700)]
netinet*: remove dead code from TCP, UDP, SCTP control input
Now these functions are called only from icmp*_input(). The pointer
to the ICMP data is never NULL and cmd has a limited set of values.
In the past the functions were demultiplexing control messages from
ICMP layer, as well as internally generated events. In the latter
case the the pointer to IP would be NULL.