Brooks Davis [Wed, 3 Jan 2024 16:39:53 +0000 (16:39 +0000)]
posixshm largepage_mmap: fix a racy test
You can't ever safely map a single page and then map a superpage sized
mapping over it with MAP_FIXED. Even in a single-threaded program, ASLR
might mean you land too close to another mapping and on CheriBSD we
don't allow the initial reservation to grow because doing so requires
program changes that are hard to automate.
To avoid this, map the entire region we want to use upfront.
vm_page_reclaim_contig(): update comment to chase recent changes
Commit 2619c5ccfe ("Avoid waiting on physical allocations that can't
possibly be satisfied") changed the return value from bool to errno.
Adjust the function description to match reality.
John Baldwin [Tue, 2 Jan 2024 21:15:13 +0000 (13:15 -0800)]
i386: Always bounce DMA requests above 4G for !PAE kernels
i386 kernels without 'options PAE' will still use PAE page tables if
the CPU supports PAE both to support larger amounts of RAM and for
PG_NX permissions. However, to avoid changing the API, bus_addr_t and
related constants (e.g. BUS_SPACE_MAXADDR) are still limited to
32 bits.
To cope with this, the x86 bus_dma code included an extra check to
bounce requests for addresses above BUS_SPACE_MAXADDR. This check was
elided (probably because it looks always-true on its face and had no
comment explaining its purpose) in recent refactoring. To fix,
restore a custom address-validation function for i386 kernels without
options PAE that includes this check.
Gleb Smirnoff [Tue, 2 Jan 2024 21:05:46 +0000 (13:05 -0800)]
netlink: refactor control data generation for recvmsg(2)
Netlink should return a very simple control data on every recvmsg(2)
syscall. This data is associated with a syscall, not with an nlmsg,
neither with internal our internal representation (nl_bufs). There is
no need to pre-allocate it in non-sleepable context and attach to
nl_buf. Allocate right in the syscall with M_WAITOK. This also
shaves lots of code and simplifies things.
Gleb Smirnoff [Tue, 2 Jan 2024 21:05:25 +0000 (13:05 -0800)]
netlink: improve nl_soreceive()
The previous commit conservatively mimiced operation of soreceive_generic().
The new code does two things:
- parses Netlink message headers and always returns at least one full nlmsg
- hides nl_buf boundaries from the userland, copying out several at once
More details can be found in the large comment block added.
Gleb Smirnoff [Tue, 2 Jan 2024 21:04:01 +0000 (13:04 -0800)]
netlink: use protocol specific receive buffer
Implement Netlink socket receive buffer as a simple TAILQ of nl_buf's,
same part of struct sockbuf that is used for send buffer already.
This shaves a lot of code and a lot of extra processing. The pcb rids
of the I/O queues as the socket buffer is exactly the queue. The
message writer is simplified a lot, as we now always deal with linear
buf. Notion of different buffer types goes away as way as different
kinds of writers. The only things remaining are: a socket writer and
a group writer.
The impact on the network stack is that we no longer use mbufs, so
a workaround from d18715475071 disappears.
Note on message throttling. Now the taskqueue throttling mechanism
needs to look at both socket buffers protected by their respective
locks and on flags in the pcb that are protected by the pcb lock.
There is definitely some room for optimization, but this changes tries
to preserve as much as possible.
Note on new nl_soreceive(). It emulates soreceive_generic(). It
must undergo further optimization, see large comment put in there.
Note on tests/sys/netlink/test_netlink_message_writer.py. This test
boiled down almost to nothing with mbufs removed. However, I left
it with minimal functionality (it basically checks that allocating N
bytes we get N bytes) as it is one of not so many examples of ktest
framework that allows to test KPIs with python.
Note on Linux support. It got much simplier: Netlink message writer
loses notion of Linux support lifetime, it is same regardless of
process ABI. On socket write from Linux process we perform
conversion immediately in nl_receive_message() and on an output
conversion to Linux happens in in nl_send_one(). XXX: both
conversions use M_NOWAIT allocation, which used to be the case
before this change, too.
Gleb Smirnoff [Tue, 2 Jan 2024 21:03:49 +0000 (13:03 -0800)]
tests/netlink: add netlink socket buffer test
With upcoming protocol specific socket buffer for Netlink we need some
additional tests that cover basic socket operations, w/o much of actual
Netlink knowledge. Following tests are performed:
1) Overflow. If an application keeps sending messages to the kernel,
but doesn't read out the replies, then first the receive buffer shall
fill and after that further messages from applications will be queued
on the send buffer until it is filled. After that socket operations
should block. However, reading from the receive buffer some data should
wake up the taskqueue and the send buffer should start draining again.
2) Peek & trunc. Check that socket correctly reports amount of readable
data with MSG_PEEK & MSG_TRUNC. This is typical pattern of Netlink apps.
3) Sizes. Check that zero size read doesn't affect the socket, undersize
read will return one truncated message and the message is removed from
the buffer. Check that large buffer will be filled in one read, without
any boundaries imposed by internal representation of the buffer. Check
that any meaningful read is amended with control data if requested so.
Gleb Smirnoff [Tue, 2 Jan 2024 21:03:40 +0000 (13:03 -0800)]
netlink: uninline some KPI functions that work with struct nl_writer
These functions work with a buffer embedded into nl_writer, which
is going to go opaque with upcoming changes. Make them private to
the netlink module. No functional change intended.
Gleb Smirnoff [Tue, 2 Jan 2024 21:03:21 +0000 (13:03 -0800)]
netlink: use domain specific send buffer
Instead of using generic socket code, create Netlink specific socket
buffer. It is a simple TAILQ of writes that came from userland. This
saves us one memory allocation that could fail and one memory copy.
Alex Richardson [Tue, 2 Jan 2024 19:06:51 +0000 (11:06 -0800)]
kldxref: fix bootstrapping on Linux with Clang 16
The glibc fts_open() callback type does not have the second const
qualifier and it appears that Clang 16 errors by default for mismatched
function pointer types. Add an ifdef to handle this case.
Richard Kümmel [Fri, 15 Dec 2023 11:49:45 +0000 (12:49 +0100)]
Fix udp IPv4-mapped address
Do not use the cached route if the destination isn't the same.
This fix a problem where an UDP packet will be sent via the wrong route
and interface if a previous one was sent via them.
Mark Johnston [Mon, 1 Jan 2024 18:54:15 +0000 (13:54 -0500)]
zfs: Fix SPA sysctl handlers
sbuf_cpy() resets the sbuf state, which is wrong for sbufs allocated by
sbuf_new_for_sysctl(). In particular, this code triggers an assertion
failure in sbuf_clear().
Simplify by just using sysctl_handle_string() for both reading and
setting the tunable.
Apply to FreeBSD directly since this bug causes "sysctl -a" to crash the
kernel.
Warner Losh [Mon, 1 Jan 2024 06:12:54 +0000 (23:12 -0700)]
test-includes: Add -ansi to the compile line to catch problems
We support C89 files, but compile everything in the tree with C99 or
newer. By compiling these -ansi, that will force C89 which doesn't
understand inline. All our header files must use __inline instead of
inline when they define inline functions.
Rick Macklem [Sun, 31 Dec 2023 23:55:24 +0000 (15:55 -0800)]
vfs_vnops.c: Fix vn_generic_copy_file_range() for truncation
When copy_file_range(2) was first being developed,
*inoffp + len had to be <= infile_size or an error was
returned. This semantic (as defined by Linux) changed
to allow *inoffp + len to be greater than infile_size and
the copy would end at *inoffp + infile_size.
Unfortunately, the code that decided if the outfd should
be truncated in length did not get updated for this
semantics change.
As such, if a copy_file_range(2) is done, where infile_size - *inoffp
is less that outfile_size but len is large, the outfd file is truncated
when it should not be. (The semantics for this for Linux is to not
truncate outfd in this case.)
This patch fixes the problem. I believe the calculation is safe
for all non-negative values of outsize, *outoffp, *inoffp and insize,
which should be ok, since they are all guaranteed to be non-negative.
Note that this bug is not observed over NFSv4.2, since it truncates
len to infile_size - *inoffp.
Mark Johnston [Sun, 31 Dec 2023 16:36:12 +0000 (11:36 -0500)]
gtaskqueue: Fix a typo
This is a no-op in practice since gtaskqueue_thread_enqueue() and
taskqueue_thread_enqueue() are identical, and while _gtaskqueue_create()
compares the enqueue callback pointer with gtaskqueue_thread_enqueue(),
the result has no effect since TQ_FLAGS_UNLOCKED_ENQUEUE was copied
directly from subr_taskqueue.c and is unused in the gtaskqueue code.
Fix it anyway since it's a bug. More generally we really need to
consolidate subr_taskqueue.c and subr_gtaskqueue.c.
Mark Johnston [Sun, 31 Dec 2023 16:15:48 +0000 (11:15 -0500)]
frag6: Reduce code duplication
The code which removes a fragment queue from the per-VNET hash table was
duplicated three times. Factor it out into a function. No functional
change intended.
Michael Osipov [Fri, 24 Nov 2023 09:26:41 +0000 (10:26 +0100)]
periodic: Make daily diff(1) output as small is possible
Make, by default, daily diff(1) ignore whitespace changes and the unified output
a context of zero (0) lines. This reduces output of unrelated lines in e-mails
delivered to root.
Michael Osipov [Fri, 24 Nov 2023 09:26:41 +0000 (10:26 +0100)]
periodic: Make security diff(1) output as small is possible
Make, by default, security diff(1) produce a unified output with a context of
zero (0) lines. This reduces output of unrelated lines in e-mails delivered
to root.
Simon J. Gerraty [Sat, 30 Dec 2023 17:10:03 +0000 (09:10 -0800)]
bsd.man.mk allow staging compressed pages
In the DIRDEPS_BUILD we use staging.
The staging logic in bsd.man.mk was in the wrong place, shift it
and add compressed man pages to the stage set if appropriate.
The kern.pid_max_limit will hold the PID_MAX value the kernel was
compiled with. The existing kern.pid_max sysctl can be modified and
doesn't really represent maximum PID number in the system, as there
may still be processes created with higher PIDs before kern.pid_max
was lowered.
Rick Macklem [Fri, 29 Dec 2023 22:59:00 +0000 (14:59 -0800)]
copy_file_range.2: Clarify that only regular files work
PR#273962 reported that copy_file_range(2) did not work
on shared memory objects and returned EINVAL.
Although the reporter felt this was incorrect, it is what
the Linux copy_file_range(2) syscall does.
Since there was no collective agreement that the FreeBSD
semantics should be changed to no longer be Linux compatible,
copy_file_range(2) still works on regular files only.
This man page update clarifies that. If, someday, copy_file_range(2)
is changed to support non-regular files, then the man page will
need to be updated to reflect that.
Mateusz Guzik [Fri, 29 Dec 2023 18:51:56 +0000 (18:51 +0000)]
llvm: Support: don't block signals around close if it can be avoided
Signal blocking originally showed up in 51c2afc4b65b2782 ("Support:
Don't call close again if we get EINTR"), but it was overzealous --
there are systems where the error is known to be fine.
This commit elides signal blocking for said systems (the list is
incomplete though).
Note close() can still fail for other reasons (like ENOSPC), in which
case an error will be returned while the fd slot is cleared up.
Reviewed by: dim
Differential Revision: https://reviews.freebsd.org/D42984
After ad874544d9f018bf8eef4053b5ca7b856c4674cb, interface name
validation has been removed, resulting in two unit tests failures.
Drop the failing tests since they no longer apply.
Kenneth D. Merry [Thu, 28 Dec 2023 21:23:16 +0000 (16:23 -0500)]
camcontrol: Add a sense subcommand
As the name suggests, this sends a SCSI REQUEST SENSE to a device,
and prints out decoded sense information. It can also print out a
hexdump of the sense data.
sbin/camcontrol/camcontrol.c:
Add the new sense subcommand.
John Baldwin [Thu, 28 Dec 2023 19:17:22 +0000 (11:17 -0800)]
mbuf.9: Document mtodo
mtodo() accepts an mbuf and offset and returns a void * pointer to the
requested offset into the mbuf's associated data. Similar to mtod(),
no bounds checking is performed.
Joerg Pulz [Fri, 27 Oct 2023 15:27:37 +0000 (17:27 +0200)]
isp(4): Rework firmware handling/loading
Correctly identify the active firmware in flash on adapters with
primary and secondary firmware region in flash.
Correctly identify the active NVRAM on adapters with primary
and secondary NVRAM region in flash.
Loading ispfw(4) moved from isp_pci_attach() to isp_reset().
Drop the reference to ispfw(4) after using it so one can kldunload(8) it.
New isp_load_ram() function to load either ispfw(4) or flash firmware
into RISC's RAM.
New functions to read data from flash. The old ones will be removed later.
A bunch of new helper functions to identify and validate active flash
regions for firmware, auxiliary and NVRAM.
Overhaul ISP_FW_* macros and make use of it when comparing firmware
versions. We can handle firmware versions up to 255.255.255.
Firmware load priority slightly changed:
For 27xx and newer adapters:
- load ispfw(4) firmware
- request (active) flash firmware information
- compare version numbers of ispfw(4) and flash firmware
- load firmware with highest version into RISC's RAM
- if loading ispfw(4) is disabled or failed - load firmware from flash
- if everything else fails use MBOX_LOAD_FLASH_FIRMWARE as fallback
For 26xx and older adapters nothing changed:
- load ispfw(4) firmware and load it into RISC's RAM
- if loading ispfw(4) is disabled or failed use MBOX_EXEC_FIRMWARE
- for 26xx a preceding MBOX_LOAD_FLASH_FIRMWARE is used
New read only sysctl(8)'s:
dev.isp.N.fw_version_run: the firmware version actually running
dev.isp.N.fw_version_ispfw: the firmware version provided by ispfw(4)
dev.isp.N.fw_version_flash: the (active) firmware version in flash
While here:
- firmware attribute handling/parsing reworked
+ renamed defines from ISP2400_FW_ATTR_* to ISP_FW_ATTR_*
+ changed values to match new handling/parsing
+ added some more attributes
- enable FLT support on 26xx based adapters
- log level adjustments
- new function return status codes (some for now, some for later use)
- some minor style changes
Tested and approved to work on real hardware with:
- Qlogic ISP 2532 (QLogic QLE2560 8Gb FC Adapter)
- Qlogic ISP 2031 (QLogic QLE2662 16Gbit 2Port FC Adapter)
- Qlogic ISP 2722 (QLogic QLE2690 16Gb FC Adapter)
- Qlogic ISP 2812 (QLogic QLE2772 32Gbit 2Port FC Adapter)
PR: 273263
Reviewed by: mav
Pull Request: https://github.com/freebsd/freebsd-src/pull/877
MFC after: 1 month
Sponsored by: Technical University of Munich
Mark Johnston [Thu, 28 Dec 2023 17:08:04 +0000 (12:08 -0500)]
cam: Let cam_periph_unmapmem() return an error
As of commit b059686a71c8, cam_periph_unmapmem() can legitimately fail
if the copyout() operation fails. However, this failure was never
signaled to upper layers. In practice it is unlikely to occur
since cap_periph_mapmem() would most likely fail in such
circumstances anyway, but an error is nonetheless possible.
However, some code reading revealed a few paths where the return value
of cam_periph_mapmem() is not checked, and this is definitely a bug.
Add error checking there and let cam_periph_unmapmem() return errors
from copyout().
This commit introduces SRD metrics through sysctl.
The metrics can be queried using the following sysctl node:
sysctl dev.ena.<device index>.ena_srd_info
This commit adds sysctl support for customer metrics.
Different customer metrics can be found in the following sysctl node:
sysctl dev.ena.<device index>.customer_metrics
ena: Introduce shared sample interval for all stats
Rename sample_interval node to stats_sample_interval and move
it up in the sysctl tree to make it clear that it's relevant for
all the stats and not only ENI metrics (Currently, sample interval node
is found under eni_metrics node).
Path to node:
dev.ena.<device_index>.stats_sample_interval
Once this parameter is set it will set the sample interval for all the
stats node including SRD/customer metrics.
Osama Abboud [Mon, 30 Oct 2023 11:27:03 +0000 (11:27 +0000)]
ena: Add sysctl support for spreading IRQs
This commit allows spreading IO IRQs over different CPUs through sysctl.
Two sysctl nodes are introduced:
1- base_cpu: servers as the first CPU to which the first IO IRQ
will be bound.
2- cpu_stride: sets the distance between every two CPUs to which every
two consecutive IO IRQs are bound.
For example for doing the following IO IRQs / CPU binding:
Run the following commands:
sysctl dev.ena.<device index>.irq_affinity.base_cpu=0
sysctl dev.ena.<device_index>.irq_affinity.cpu_stride=2
Also introduced rss_enabled field, which is intended to replace
'#ifdef RSS' in multiple places, in order to prevent code duplication.
We want to bind interrupts to CPUs in case of rss set OR in case
the newly defined sysctl paremeter is set. This requires to remove a
couple of '#ifdef RSS' as well in the structs, since we'll be using the
relevant parameters in the CPU binding code.
Osama Abboud [Thu, 28 Dec 2023 13:25:43 +0000 (13:25 +0000)]
ena: Upgrade ena-com to freebsd v2.7.0
This commit introduces a number of infrastructures in ena-com, some of
which are being used as of ENA v2.7.0 while other certain infrastructure
assets have been made available for potential future application.
Upgrade ena-com to include the following changes:
* Introduce customer metrics infrastructures
* Introduce SRD metrics infrastructures
* Remove unused fields from ena_com_io_cq and ena_com_io_sq structs
* Minor rework of ena_com_fill_hash_function
* Introduce PHC infrastructures
* Update the licenses for ena-com files
* Delete duplicate *_defs.h found in ena-com and ena_defs directories
* Add validation for completion descriptors consistency
* Move ena_fbsd_log.h file to ena_plat.h
Approved by: cperciva (mentor)
Sponsored by: Amazon, Inc.
Dimitry Andric [Thu, 28 Dec 2023 12:57:41 +0000 (13:57 +0100)]
Reorganize libclang_rt Makefile and make more lib/arch combos available
Upstream has made more clang runtime libraries available for more
architectures, so add them. To make this easier, split up subdir lists
into functional parts (asan, tsan, etc), and put each architecture into
its own .if block.
Effectively, this adds the following libraries for aarch64: asan, cfi,
fuzzer, msan, safestack, stats, tsan, ubsan, xray.
Warner Losh [Thu, 21 Dec 2023 20:36:12 +0000 (13:36 -0700)]
vtnet: Better adjust for ethernet alignment.
Move adjustment of the mbuf from where we allocate it to where we are
about to queue it to the device. Do this only on those platforms that
require it. This allows us to receive an entire jumbo frame on other
platforms. It also doesn't make the adjustment on subsequent frames when
we queue mulitple mbufs for LRO operations.
For the normal use case on armv7, there's no difference because we only
ever allocate one mbuf. However, for the LRO cases it increases what's
available in LRO. It also ensure that we get enough mbufs in those cases
as well (though I have no ability to test this on a LRO scenario with
armv7).
This has the side effect of reverting 527b62e37e68.
Jose Luis Duran [Thu, 28 Dec 2023 05:26:23 +0000 (22:26 -0700)]
mtree: Update mtree flags in README file
- Add -b (suppress blank lines before directories).
- The equivalent of `-i` in fmtree is `-j` in mtree (nmtree) (indent the
output 4 spaces).
- Add `-F freebsd9` compatibility flavor (print the closing `..` at the
end).