While here, add device_printf()'s to all failure points. Also fix an
existing bug where we'd unlock an already unlocked channel, in case we
went to "out" (now "out2") before locking the channel.
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Reviewed by: dev_submerge.ch
Differential Revision: https://reviews.freebsd.org/D44993
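The unlock fix described in this entry comes down to giving the pre-lock failure path its own label. A minimal sketch of that pattern follows; `struct chan`, `chan_lock`, `chan_unlock` and `chan_setup` are hypothetical names for illustration and are not taken from the driver.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a lockable channel; not the real driver type. */
struct chan {
	bool locked;
	int bad_unlocks;	/* counts unlocks of an already unlocked channel */
};

void
chan_lock(struct chan *c)
{
	assert(!c->locked);
	c->locked = true;
}

void
chan_unlock(struct chan *c)
{
	if (!c->locked)
		c->bad_unlocks++;	/* the kind of bug the entry describes */
	c->locked = false;
}

/*
 * Failures before the lock is taken jump to "out2", which skips the
 * unlock; failures after it jump to "out", which releases the lock.
 */
int
chan_setup(struct chan *c, bool early_fail, bool late_fail)
{
	int error = 0;

	if (early_fail) {
		error = -1;
		goto out2;	/* lock not held yet */
	}
	chan_lock(c);
	if (late_fail) {
		error = -2;
		goto out;	/* lock held, must be dropped */
	}
out:
	chan_unlock(c);
out2:
	return (error);
}
```

With a single label, the early failure path would fall into `chan_unlock()` without holding the lock; the second label keeps the two exit paths apart.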
John Baldwin [Mon, 6 May 2024 17:49:04 +0000 (10:49 -0700)]
git-arc: Add list mode support for the update command
This can be particularly useful to do bulk-updates of multiple commits
using the same message, e.g.
git arc update -lm "Move function xyz to libfoo" main..myfeature
Similar to the list mode for the create command, git arc will list all
the candidate revisions with a single prompt. Once that is confirmed,
all the revisions are updated without showing the diffs or pausing for
further prompts.
Warner Losh [Mon, 6 May 2024 15:10:46 +0000 (09:10 -0600)]
endian.h: Define uint{16,32,64}_t
The draft POSIX Issue 8 standard requires that these be defined. Define
them in the usual way that lets multiple headers define them. Opted to
not just use #include <stdint.h>, allowed by the draft, to be
conservative. Add notes about how we comply with Issue 8, and that we've
opted to define these only as macros, though the standard allows
functions, macros or both.
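The "usual way that lets multiple headers define them" is a per-type guard macro around the typedef. A rough sketch of that idea follows, with placeholder names (`ex_uint16_t`, `EX_UINT16_T_DECLARED`, `ex_bswap16`) so that it compiles outside the system headers. It assumes `unsigned short` is 16 bits wide.

```c
#include <assert.h>

/* First hypothetical header: declare the type and mark it as declared. */
#ifndef EX_UINT16_T_DECLARED
typedef unsigned short ex_uint16_t;	/* assumed to be 16 bits wide */
#define EX_UINT16_T_DECLARED
#endif

/* Second hypothetical header: the same block is skipped by the guard. */
#ifndef EX_UINT16_T_DECLARED
typedef unsigned short ex_uint16_t;
#define EX_UINT16_T_DECLARED
#endif

static_assert(sizeof(ex_uint16_t) == 2, "ex_uint16_t must be 16 bits");

/* A byte-swap helper provided only as a macro, one of the allowed forms. */
#define ex_bswap16(x) \
	((ex_uint16_t)((((ex_uint16_t)(x)) << 8) | (((ex_uint16_t)(x)) >> 8)))
```

Whichever header is included first supplies the typedef, and later ones see the guard and skip it, so the inclusion order does not matter.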
adduser: Fix confusion between `uclass` and `_class`.
This caused adduser to produce an invalid `pw(8)` command line. Due to
bugs in `pw(8)`, the command line was silently accepted and led to the
user being created, but locked out and with no home directory.
Also fix the default value for the “Another user?” prompt.
Kristof Provost [Mon, 6 May 2024 09:39:08 +0000 (11:39 +0200)]
if: guard against if_ioctl being NULL
There are situations where a struct ifnet has a NULL if_ioctl pointer.
For example, e6000sw creates such struct ifnets for each of its ports so it can
call into the MII code.
If there is then a link state event this calls do_link_state_change()
-> rtnl_handle_ifevent() -> dump_iface() -> get_operstate() ->
get_operstate_ether(). That wants to know if the link is up or down, so it tries
to ioctl(SIOCGIFMEDIA), which doesn't go well if if_ioctl is NULL.
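The guard can be pictured as a thin wrapper that checks the function pointer before calling through it. This is only a sketch: `struct ex_ifnet`, `ex_if_ioctl` and `ex_ok_ioctl` are hypothetical, and returning EOPNOTSUPP for a missing handler is an assumption, not necessarily what the commit does.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical, heavily reduced stand-in for struct ifnet. */
struct ex_ifnet {
	int (*if_ioctl)(struct ex_ifnet *, unsigned long, void *);
};

/*
 * Call the interface's ioctl handler only if one is installed.
 * The EOPNOTSUPP return value is an assumption made for this sketch.
 */
int
ex_if_ioctl(struct ex_ifnet *ifp, unsigned long cmd, void *data)
{
	assert(ifp != NULL);
	if (ifp->if_ioctl == NULL)
		return (EOPNOTSUPP);
	return (ifp->if_ioctl(ifp, cmd, data));
}

/* Trivial handler used to show the non-NULL path. */
int
ex_ok_ioctl(struct ex_ifnet *ifp, unsigned long cmd, void *data)
{
	(void)ifp;
	(void)cmd;
	(void)data;
	return (0);
}
```

Callers such as a link-state query then receive an error code they can handle instead of jumping through a NULL pointer.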
Randall Stewart [Sun, 5 May 2024 13:08:47 +0000 (09:08 -0400)]
TCP can be subject to SACK attacks; let's fix this issue.
There is a type of attack that a TCP peer can launch on a connection. It certainly affects Rack and BBR, and probably the default stack as well if it uses lists in SACK processing. The idea is that the attacker drives you to look at hundreds of SACK blocks that each update only 1 byte. For example, if you have bytes 1 - 10,000 outstanding, the attacker sends in something like:
ACK 0 SACK(1-512) SACK(1024 - 1536), SACK(2048-2536), SACK(4096 - 4608), SACK(8192-8704)
This first sack looks fine but then the attacker sends
ACK 0 SACK(1-512) SACK(1025 - 1537), SACK(2049-2537), SACK(4097 - 4609), SACK(8193-8705)
ACK 0 SACK(1-512) SACK(1027 - 1539), SACK(2051-2539), SACK(4099 - 4611), SACK(8195-8707)
...
These blocks make you hunt across your linked list and split things up so that you have an entry for every other byte. As your list grows, you spend more and more CPU running through the lists. The attacker chooses entries as far apart as possible so that you have to run through the whole list. This example is small, but in theory, if the window is open to, say, 1 MB, you could end up with hundreds of thousands of linked-list entries.
To combat this we introduce three things:
- When the peer requests a very small MSS, we stop processing SACKs from them. This prevents a malicious peer from just using a small MSS to do the same thing.
- Any time we get a SACK block, we use the sack-filter to remove SACKs that are smaller than the smallest v4 MSS (minus 40 for max TCP options), unless the block ties up to snd_max (since that is legal). All other SACKs should in theory be at least an MSS. Against such an attacker, this means we basically skip all but MSS-sized SACK blocks.
- The sack filter used to throw away data when its bounds were exceeded. Instead, we now increase its size to 15 and throw away SACKs if the filter is overrun. This keeps a malicious attacker from overrunning the sack filter and thereby making us process the blocks anyway.
The default stack will need to start using the sack-filter, which we have talked about in past conference calls, to take full advantage of the protections it offers (and to reduce CPU consumption when processing SACKs).
After this set of changes is in, Rack can drop its SAD detection completely.
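The size check in the second measure can be sketched as below. This is not the kernel's sack filter: `struct ex_sackblk`, `ex_sack_keep` and `ex_sack_filter` are hypothetical names, and the minimum length is left as a parameter rather than being derived from the smallest v4 MSS.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* One SACK block: [start, end) in TCP sequence space. */
struct ex_sackblk {
	uint32_t start;
	uint32_t end;
};

/*
 * Keep a block if it spans at least min_len bytes, or if it reaches
 * snd_max (a short block at the tail of the sent data is legal).
 * Unsigned subtraction keeps the length correct across sequence wrap.
 */
bool
ex_sack_keep(const struct ex_sackblk *b, uint32_t snd_max, uint32_t min_len)
{
	if (b->end == snd_max)
		return (true);
	return ((uint32_t)(b->end - b->start) >= min_len);
}

/* Compact the array in place, returning how many blocks survive. */
size_t
ex_sack_filter(struct ex_sackblk *blks, size_t n, uint32_t snd_max,
    uint32_t min_len)
{
	size_t i, kept = 0;

	assert(blks != NULL || n == 0);
	for (i = 0; i < n; i++) {
		if (ex_sack_keep(&blks[i], snd_max, min_len))
			blks[kept++] = blks[i];
	}
	return (kept);
}
```

Under this rule, one-byte blocks scattered across the window are dropped before they can split list entries, while a short block that ends at snd_max is still accepted.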
Colin Percival [Sun, 5 May 2024 05:31:19 +0000 (22:31 -0700)]
release: Use qemu when cross-building vm images
For a bit over 5 years, we have used qemu when cross-building cloudware
images; in particular, it's necessary when installing packages which
might include post-install scripts.
Use qemu in the vm-images target too; while "generic" vm images don't
install packages, they still run newaliases and /etc/rc.d/ldconfig,
both of which fail without appropriate emulation.
Rick Macklem [Sat, 4 May 2024 21:30:07 +0000 (14:30 -0700)]
nfsd: Fix Link conformance with RFC8881 for delegations
RFC8881 specifies that, when a Link operation occurs on an
NFSv4 file, delegations for that file issued to other clients
must be recalled. Discovered during a recent discussion on nfsv4@ietf.org.
Although I have not observed a problem caused by not doing
the required delegation recall, it is definitely required
by the RFC, so this patch makes the server do the recall.
Tested during a recent NFSv4 IETF Bakeathon event.
Apr 22, 2024:
fixed regex engine gototab reallocation issue that was
introduced during the Nov 24 rewrite. Thanks to Arnold Robbins.
Fixed a scan bug in split in the case the separator is a single
character. thanks to Oguz Ismail for spotting the issue.
Mar 10, 2024:
fixed use-after-free bug in fnematch due to adjbuf invalidating
the pointers to buf. thanks to github user caffe3 for spotting
the issue and providing a fix, and to Miguel Pineiro Jr.
for the alternative fix.
MAX_UTF_BYTES in fnematch has been replaced with awk_mb_cur_max.
thanks to Miguel Pineiro Jr.
Note: This brings in the matchop-deref.* files that were missing (but in
FreeBSD already) and adds system-status.ok2. The latter has been deleted
in FreeBSD since it does not fit ATF well. Care must be taken to remove it
before the merge this time.
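The fnematch fix above concerns pointers that go stale when a buffer is reallocated. A generic sketch of the usual remedy is to carry an offset across the reallocation and rebuild the pointer afterwards. This is not the awk code, and `ex_grow_buf` is a hypothetical helper.

```c
#include <assert.h>
#include <stdlib.h>

/*
 * Grow *bufp to newsize bytes and fix up *curp, a pointer into the old
 * buffer, by carrying its offset across realloc(). Any pointer that is
 * not fixed up this way may dangle after the call.
 * Returns 0 on success, -1 if realloc() fails (buffer left untouched).
 */
int
ex_grow_buf(char **bufp, size_t newsize, char **curp)
{
	size_t off;
	char *nbuf;

	assert(*curp >= *bufp);
	off = (size_t)(*curp - *bufp);	/* remember position, not address */
	nbuf = realloc(*bufp, newsize);
	if (nbuf == NULL)
		return (-1);
	*bufp = nbuf;
	*curp = nbuf + off;		/* rebuild the pointer */
	return (0);
}
```

Every other pointer into the old buffer needs the same treatment, or must be re-derived after the call.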
Lexi Winter [Sat, 4 May 2024 16:42:40 +0000 (10:42 -0600)]
rc.conf.5: modernise network_interfaces
It's not 1996 anymore, and we use CIDR nowadays. Update the various
ifconfig_ examples to use CIDR notation instead of netmasks, and also
add an example of a basic ifconfig_ entry that most users will be
interested in.
HP van Braam [Sat, 4 May 2024 14:40:15 +0000 (08:40 -0600)]
aic7xxx: make target mode enable a device hint
Previously it was only possible to enable target mode for these drivers
by rebuilding the kernel with AHC_TMODE_ENABLE or AHD_TMODE_ENABLE and a
bitmask of which units to statically enable for target mode.
There are no space savings in the driver from not having AHC_TMODE_ENABLE
set, so in addition to the compile-time option let's also introduce some
tunables:
HP van Braam [Sat, 4 May 2024 14:36:47 +0000 (08:36 -0600)]
aic7xxx: aicasm correct include file
aicasm just puts the value of the include file passed with "-i" into the
generated file with quotes around it. This means that there are manual
edits made to aic7xxx_reg_print.c and aic79xx_reg_print.c.
Now we check whether the value passed to '-i' starts with a '<'; if it
does, we don't output the quotes.
Signed-off-by: HP van Braam <hp@tmm.cx>
Reviewed by: imp (minor code simplification)
Pull Request: https://github.com/freebsd/freebsd-src/pull/1209
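The quoting rule can be sketched as a small formatting helper. `ex_format_include` is a hypothetical function written for illustration, not the aicasm source, and the header names in the usage note are made up.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Format an #include line for a generated file. A name that already
 * starts with '<' is emitted as given; anything else is quoted.
 * Returns the snprintf() result.
 */
int
ex_format_include(char *buf, size_t len, const char *name)
{
	assert(buf != NULL && name != NULL);
	if (strncmp(name, "<", 1) == 0)
		return (snprintf(buf, len, "#include %s\n", name));
	return (snprintf(buf, len, "#include \"%s\"\n", name));
}
```

Passing `<dev/ex/ex_reg.h>` yields an angle-bracket include unchanged, while passing `ex_reg.h` yields a quoted include, so the generated files no longer need editing by hand.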
When checking for the destructor pointer belonging to some still
loaded dso, do not limit the possible dso to the one that instantiated the
destructor. For instance, a dso could set up the dtr pointer to a function
from libcxx.
* VERSION (_MAKE_VERSION): 20240430
Merge with NetBSD make, pick up
o main.c: ensure '.include <makefile>' respects MAKESYSPATH.
Dir_FindFile will search .CURDIR first unless ".DOTLAST" is seen.
2024-04-28 Simon J Gerraty <sjg@beast.crufty.net>
* VERSION (_MAKE_VERSION): 20240428
Merge with NetBSD make, pick up
o simplify freeing of lists
o arch.c: trim pointless comments
o var.c: delay variable assignments until actually needed
don't reallocate memory after evaluating an expression, result is
almost always short-lived.
2024-04-26 Simon J Gerraty <sjg@beast.crufty.net>
* VERSION (_MAKE_VERSION): 20240426
Merge with NetBSD make, pick up
o job.c: in debug output, print the directory in which a job
failed at same time as failed target so it is more easily found in
build log.
2024-04-24 Simon J Gerraty <sjg@beast.crufty.net>
* VERSION (_MAKE_VERSION): 20240424
Merge with NetBSD make, pick up
o clean up comments, code and tests
2024-04-23 Simon J Gerraty <sjg@beast.crufty.net>
* VERSION (_MAKE_VERSION): 20240422
Merge with NetBSD make, pick up
o var.c: avoid LazyBuf for :*time modifiers.
LazyBuf's are not nul terminated so not suitable for passing to
functions that expect that. These modifiers are used sparingly so
an extra allocation is not a problem.
2024-04-20 Simon J Gerraty <sjg@beast.crufty.net>
* VERSION (_MAKE_VERSION): 20240420
Merge with NetBSD make, pick up
o provide more context information for parse/evaluate errors
2024-04-14 Simon J Gerraty <sjg@beast.crufty.net>
* VERSION (_MAKE_VERSION): 20240414
Merge with NetBSD make, pick up
o parse.c: print -dp debug info earlier so we see which
.if or .for line is being parsed.
2024-04-04 Simon J Gerraty <sjg@beast.crufty.net>
* VERSION (_MAKE_VERSION): 20240404
Merge with NetBSD make, pick up
o fix some unit tests for Cygwin
o parse.c: exit immediately after reading a null byte from a makefile
* fix generation of bmake.cat1
2024-03-19 Simon J Gerraty <sjg@beast.crufty.net>
* VERSION (_MAKE_VERSION): 20240314
Add/Improve support for Cygwin
o uname -s output isn't useful so allow configure to
set FORCE_MAKE_OS - to force the value of .MAKE.OS
and use Cygwin which matches uname -o
o fix some unit-tests for Cygwin
* configure.in: use_makefile=no for Cygwin et al.
NOTE: bmake does not support Cygwin and likely never will,
* init.mk: allow for _ as well as . to join V
and Q from QUALIFIED_VAR_LIST and VAR_QUALIFIER_LIST.
* progs.mk: avoid overlap between PROG_VARS and
init.mk's QUALIFIED_VAR_LIST since PROG would also
match its VAR_QUALIFIER_LIST,
libs.mk does not have the same issue.
* subdir.mk: _SUBDIRUSE for realinstall should run install
remove include of ${.CURDIR}/Makefile.inc that can be done via
local.subdir.mk where needed
Justin Hibbits [Mon, 13 Nov 2023 16:33:44 +0000 (11:33 -0500)]
tpm: Refactor TIS and add a SPI attachment
Summary:
Though mostly used in x86 devices, TPM can be used on others, with a
direct SPI attachment. Refactor the TPM 2.0 driver set to use an
attachment interface, and implement a SPI bus interface.
Test Plan:
Tested on a Raspberry Pi 4, with a GeeekPi TPM2.0 module (SLB9670
TPM) using security/tpm2-tools tpm2_getcaps for very light testing against the
spibus attachment.
Notable upstream pull request merges:
#15839 c3f2f1aa2 vdev probe to slow disk can stall mmp write checker
#15888 5044c4e3f Fast Dedup: ZAP Shrinking
#15996 db499e68f Overflowing refreservation is bad
#16118 67d13998b Make more taskq parameters writable
#16128 21bc066ec Fix updating the zvol_htable when renaming a zvol
#16130 645b83307 Improve write issue taskqs utilization
#16131 8fd3a5d02 Slightly improve dnode hash
#16134 a6edc0adb zio: try to execute TYPE_NULL ZIOs on the current task
#16141 b28461b7c Fix arcstats for FreeBSD after zfetch support
Warner Losh [Fri, 3 May 2024 15:08:03 +0000 (09:08 -0600)]
MINIMAL: Grow minimal to support ata, scsi and nvme
Until the boot loader automatically loads these things (including the
CAM dependency), we need to have them in the minimal kernel since they
are needed to boot. These aren't strictly required to be in the kernel,
since modules work, but are high enough demand items that until we sort
out boot loader automation, I'm adding them here. These devices are also
common in vm environments. The delta is relatively small in size. Once
the boot loader automation arrives, these and a lot of other things can
be trimmed. It's less than ideal, but is a good middle ground for the
moment.
Gleb Smirnoff [Fri, 3 May 2024 14:45:07 +0000 (07:45 -0700)]
tests/sendfile: test operation on unix/stream socket
Although there are already multiple tests in the tests collection
that utilize sendfile(2) support over unix/stream sockets, none of
them exercises the asynchronous part of the operation. This test
framework, however, uses a trick to toggle true async operation and
guarantee that pr_ready method of unix/stream is also tested.
Tijl Coosemans [Fri, 3 May 2024 13:27:29 +0000 (15:27 +0200)]
linuxkpi: Fix set_memory_*
set_memory_* is currently implemented using PHYS_TO_DMAP but not all
architectures have a DMAP. Looking at how this function is used, the
given address isn't physical but virtual, so the PHYS_TO_DMAP call can
simply be removed.
Also cast numpages before shifting it to avoid overflow.
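The overflow that the cast avoids is easy to see in isolation. In this sketch `EX_PAGE_SHIFT` and `ex_pages_to_bytes` are hypothetical, and a 4 KiB page size is assumed.

```c
#include <assert.h>
#include <stdint.h>

#define EX_PAGE_SHIFT 12	/* assumed 4 KiB pages for this sketch */

/*
 * Convert a page count to a byte length. Widening numpages before the
 * shift keeps the arithmetic out of int, where it could overflow.
 */
uint64_t
ex_pages_to_bytes(int numpages)
{
	assert(numpages >= 0);
	return ((uint64_t)numpages << EX_PAGE_SHIFT);
}
```

Without the cast, `numpages << EX_PAGE_SHIFT` is evaluated as an `int`, so a count of 2^20 pages (4 GiB) would no longer fit.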
Shawn Bayern [Fri, 3 May 2024 07:46:18 +0000 (00:46 -0700)]
Tighten boundary check in split(1) to prevent a potential buffer overflow.
Before increasing sufflen, make sure the current name plus two (including
the terminating NUL character and the to-be-added character) does not
exceed the fixed buffer length, and stop immediately if this would occur.
In the worst-case scenario the code would write a NUL character beyond the
boundary, however it would be caught by open(2) and based on the memory
layout, we do not believe this would constitute a security vulnerability.
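The "current name plus two" rule can be written as a one-line predicate. `ex_suffix_can_grow` is a hypothetical helper for illustration, not the code in split(1).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Return true if a name of cur_len characters can grow by one suffix
 * character and still fit, with its terminating NUL, in bufsize bytes.
 */
bool
ex_suffix_can_grow(size_t cur_len, size_t bufsize)
{
	assert(cur_len < bufsize);	/* current name already fits */
	return (cur_len + 2 <= bufsize);
}
```

For an 8-byte buffer, a 6-character name can still grow (6 + 1 + NUL = 8), but a 7-character name cannot.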
John Baldwin [Thu, 2 May 2024 23:35:40 +0000 (16:35 -0700)]
nvmfd: A simple userspace daemon for the NVMe over Fabrics controller
This daemon can operate as a purely userspace controller exporting one
or more simulated RAM disks or local block devices as NVMe namespaces
to a remote host. In this case the daemon provides a discovery
controller with a single entry for an I/O controller.
nvmfd can also offload I/O controller queue pairs to the nvmft.ko
in-kernel Fabrics controller when -K is passed. In this mode, nvmfd
still accepts connections and performs initial transport-specific
negotiation in userland. The daemon still provides a userspace-only
discovery controller with a single entry for an I/O controller.
However, queue pairs for the I/O controller are handed off to the CTL
NVMF frontend.
Eventually ctld(8) should be refactored to provide an abstraction
for the frontend protocol and the discovery and the kernel mode of
this daemon should be merged into ctld(8). At that point this daemon
can be moved to tools/tools/nvmf as a debugging tool (mostly as sample
code for a userspace controller using libnvmf).
John Baldwin [Thu, 2 May 2024 23:35:32 +0000 (16:35 -0700)]
nvmfdd: A simple userspace NVMe over Fabrics host
This program uses libnvmf to connect to a remote Fabrics controller
and perform a single read or write operation. The write command reads
data from stdin to construct one or more NVM Write commands sent to
the remote namespace. The read command uses one or more NVM Read
commands to read blocks from a remote namespace writing the data to
stdout.
John Baldwin [Thu, 2 May 2024 23:34:45 +0000 (16:34 -0700)]
nvmft: The in-kernel NVMe over Fabrics controller
This is the server (target in SCSI terms) for NVMe over Fabrics.
Userland is responsible for accepting a new queue pair and receiving
the initial Connect command before handing the queue pair off via an
ioctl to this CTL frontend.
This frontend exposes CTL LUNs as NVMe namespaces to remote hosts.
Users can add LUNs to CTL that can be shared via either iSCSI or
NVMeoF.
John Baldwin [Thu, 2 May 2024 23:34:26 +0000 (16:34 -0700)]
ctl: Add NVMF port type and ioctls
- Add CTL_PORT_NVMF as a new port type.
- Define a new CTL_NVMF ioctl for NVMF-specific operations similar to
CTL_ISCSI. This ioctl supports a command to handoff a single
queue pair, a command to enumerate active associations, and a
command to disconnect one or more active associations.
John Baldwin [Thu, 2 May 2024 23:33:50 +0000 (16:33 -0700)]
ctl_backend_ramdisk: Add support for NVMe
One known caveat is that the support for WRITE_UNCORRECTABLE is not
quite correct as reads from LBAs after a WRITE_UNCORRECTABLE will
return zeroes rather than an error. Fixing this would likely require
special handling for PG_ANCHOR for NVMe requests (or adding a new
PG_UNCORRECTABLE).
John Baldwin [Thu, 2 May 2024 23:32:41 +0000 (16:32 -0700)]
ctl: Add helper routines to populate NVMe namespace data IDs for a LUN
These will be used by the backends to populate the unique ID fields
like EUI64 in the NVMe namespace data (CNS == 0) and namespace
identification descriptor list (CNS == 3).
John Baldwin [Thu, 2 May 2024 23:32:09 +0000 (16:32 -0700)]
ctl: Support for NVMe commands
- Add support for queueing and executing NVMe admin and NVM commands
via ctl_run and ctl_queue. This requires fixing a few places that
were SCSI-specific to add NVMe logic.
- NVMe has much simpler command ordering requirements than SCSI. In
particular, the HBA is not required to enforce any specific ordering
for requests with overlapping LBAs. The host is required to manage
that ordering. However, fused commands (currently only COMPARE and
WRITE NVM commands can be fused) are required to be executed
atomically.
To support fused commands, make the second half of a fused command
block on the first half, and have commands submitted after a fused
command pair block on the second half.
- Add handlers and command tables for admin and NVM commands that
operate on individual namespaces and will be passed down from an
NVMe over Fabrics controller to a CTL LUN.
John Baldwin [Thu, 2 May 2024 23:31:59 +0000 (16:31 -0700)]
ctl: Add assertions in SCSI-only paths
Assert that only SCSI I/O requests are passed in various places
that assume a SCSI I/O request (that is, places that access fields
in io->scsiio directly).
John Baldwin [Thu, 2 May 2024 23:31:44 +0000 (16:31 -0700)]
ctl: Update some core data paths to be protocol agnostic
- Add wrapper routines for invoking the be_move_done and io_continue
callbacks in SCSI and NVMe I/O requests.
- Use wrapper routines for access to shared fields between SCSI and
NVMe I/O requests.
- ctl_config_write_done is not fully updated since it resubmits SCSI
commands via ctl_scsiio. This will be completed in a subsequent
commit when ctl_nvmeio is added.
John Baldwin [Thu, 2 May 2024 23:30:20 +0000 (16:30 -0700)]
ctl: Avoid an upcast for calling ctl_scsi_path_string
Change the first argument of ctl_scsi_path_string to be the embedded
header structure instead of the union. Currently union ctl_io and
struct ctl_scsiio have the same alignment, but this changes on i386 if
a new union member is added that contains a uint64_t member (such as
an embedded struct nvme_command for NVMeoF). In that case, union
ctl_io requires stronger alignment, so the upcast from struct
ctl_scsiio to union ctl_io in ctl_scsi_sense_sbuf raises an increasing
alignment warning on i386.
Avoid the warning by passing struct ctl_io_hdr as the first argument
to ctl_scsi_path_string instead.
John Baldwin [Thu, 2 May 2024 23:30:10 +0000 (16:30 -0700)]
nvmecontrol: New commands to support Fabrics hosts
- discover: Connects to a remote Discovery controller, fetches its
Discovery Log Page, and enumerates the remote controllers described
in the log page.
The -v option can be used to display the Identify Controller data
structure for the Discovery controller. This is only really useful
for debugging.
- connect: Connects to a remote I/O controller and establishes an
association of an admin queue and a single I/O queue. The
association is handed off to the in-kernel host to create a new
nvmeX device.
- connect-all: Connects to a Discovery controller and attempts to
create an association with each I/O controller enumerated in the
Discovery controller's Discovery Log Page.
- reconnect: Establishes a new association with a remote I/O
controller for an existing nvmeX device. This can be used to
restore access to a remote I/O controller after the loss of a prior
association due to a transport error, controller reboot, etc.
- disconnect: Deletes one or more nvmeX devices after detaching their
namespaces and terminating any active associations. The devices to
delete can be identified by either a nvmeX device name or the NQN of
the remote controller.
- disconnect-all: Deletes all active associations with remote
controllers.
John Baldwin [Thu, 2 May 2024 23:29:37 +0000 (16:29 -0700)]
nvmf: The in-kernel NVMe over Fabrics host
This is the client (initiator in SCSI terms) for NVMe over Fabrics.
Userland is responsible for creating a set of queue pairs and then
handing them off via an ioctl to this driver, e.g. via the 'connect'
command from nvmecontrol(8). An nvmeX new-bus device is created
at the top-level to represent the remote controller similar to PCI
nvmeX devices for PCI-express controllers.
As with nvme(4), namespace devices named /dev/nvmeXnsY are created and
pass through commands can be submitted to either the namespace devices
or the controller device. For example, 'nvmecontrol identify nvmeX'
works for a remote Fabrics controller the same as for a PCI-express
controller.
nvmf exports remote namespaces via nda(4) devices using the new NVMF
CAM transport. nvmf does not support nvd(4), only nda(4).
John Baldwin [Thu, 2 May 2024 23:28:47 +0000 (16:28 -0700)]
nvmf_tcp: Add a TCP transport for NVMe over Fabrics
Structurally this is very similar to the TCP transport for iSCSI
(icl_soft.c). One key difference is that NVMeoF transports use a more
abstract interface working with NVMe commands rather than transport
PDUs. Thus, the data transfer for a given command is managed entirely
in the transport backend.
Similar to icl_soft.c, separate kthreads are used to handle transmit
and receive for each queue pair. On the transmit side, when a capsule
is transmitted by an upper layer, it is placed on a queue for
processing by the transmit thread. The transmit thread converts
command response capsules into suitable TCP PDUs where each PDU is
described by an mbuf chain that is then queued to the backing socket's
send buffer. Command capsules can embed data along with the NVMe
command.
On the receive side, a socket upcall notifies the receive kthread when
more data arrives. Once enough data has arrived for a PDU, the PDU is
handled synchronously in the kthread. PDUs such as R2T or data
related PDUs are handled internally, with callbacks invoked if a data
transfer encounters an error, or once the data transfer has completed.
Received capsule PDUs invoke the upper layer's capsule_received
callback.
struct nvmf_tcp_command_buffer manages a TCP command buffer for data
transfers that do not use in-capsule-data as described in the NVMeoF
spec. Data related PDUs such as R2T, C2H, and H2C are associated with
a command buffer except in the case of the send_controller_data
transport method which simply constructs one or more C2H PDUs from the
caller's mbuf chain.
John Baldwin [Thu, 2 May 2024 23:28:32 +0000 (16:28 -0700)]
nvmf: Add infrastructure kernel module for NVMe over Fabrics
nvmf_transport.ko provides routines for managing NVMeoF queue pairs
and capsules. It provides a glue layer between transports (such as
TCP or RDMA) and an NVMeoF host (initiator) and controller (target).
Unlike the synchronous API exposed to the host and controller by
libnvmf, the kernel's transport layer uses an asynchronous API built
on callbacks. Upper layers provide callbacks on queue pairs that are
invoked for transport errors (error_cb) or anytime a capsule is
received (receive_cb).
Data transfers for a command are usually associated with a callback
that is invoked once a transfer has finished either due to an error
or successful completion.
For an upper layer that is a host, command capsules are allocated and
populated with an NVMe SQE by calling nvmf_allocate_command. A data
buffer (described by a struct memdesc) can be associated with a
command capsule before it is transmitted via nvmf_capsule_append_data.
This function accepts a direction (send vs receive) as well as the
data transfer callback. The host then transmits the command via
nvmf_transmit_capsule. The host must ensure that the data buffer
described by the 'struct memdesc' remains valid until the data
transfer callback is called. The queue pair's receive_cb callback
should match received response capsules up with previously transmitted
commands.
For the controller, incoming commands are received via the queue
pair's receive_cb callback. nvmf_receive_controller_data is used to
retrieve any data from a command (e.g. the data for a WRITE command).
It can be called multiple times to split the data transfer into
smaller sizes. This function accepts an I/O completion callback that
is invoked once the data transfer has completed.
nvmf_send_controller_data is used to send data to a remote host in
response to a command. In this case a callback function is not used
but the status is returned synchronously. Finally, the controller can
allocate a response capsule via nvmf_allocate_response populated with
a supplied CQE and send the response via nvmf_transmit_capsule.
John Baldwin [Thu, 2 May 2024 23:28:16 +0000 (16:28 -0700)]
libnvmf: Add internal library to support NVMe over Fabrics
libnvmf provides APIs for transmitting and receiving Command and
Response capsules along with data associated with NVMe commands.
Capsules are represented by 'struct nvmf_capsule' objects.
Capsules are transmitted and received on queue pairs represented by
'struct nvmf_qpair' objects.
Queue pairs belong to an association represented by a 'struct
nvmf_association' object.
libnvmf provides additional helper APIs to assist with constructing
command capsules for a host, response capsules for a controller,
connecting queue pairs to a remote controller and optionally
offloading connected queues to an in-kernel host, accepting queue pair
connections from remote hosts and optionally offloading connected
queues to an in-kernel controller, constructing controller data
structures for local controllers, etc.
libnvmf also includes an internal transport abstraction as well as an
implementation of a userspace TCP transport.
libnvmf is primarily intended for ease of use and low-traffic use cases
such as establishing connections that are handed off to the kernel.
As such, it uses a simple API built on blocking I/O.
For a host, a consumer first populates a 'struct
nvmf_association_params' with a set of parameters shared by all queue
pairs for a single association such as whether or not to use SQ flow
control and header and data digests and creates a 'struct
nvmf_association' object. The consumer is responsible for
establishing a TCP socket for each queue pair. This socket is
included in the 'struct nvmf_qpair_params' passed to 'nvmf_connect' to
complete transport-specific negotiation, send a Fabrics Connect
command, and wait for the Connect reply. Upon success, a new 'struct
nvmf_qpair' object is returned. This queue pair can then be used to
send and receive capsules. A command capsule is allocated, populated
with an SQE and optional data buffer, and transmitted via
nvmf_host_transmit_command. The consumer can then wait for a reply
via nvmf_host_wait_for_response. The library also provides some
wrapper functions such as nvmf_read_property and nvmf_write_property
which send a command and wait for a response synchronously.
For a controller, a consumer uses a single association for a set of
incoming connections. A consumer can choose to use multiple
associations (e.g. a separate association for connections to a
discovery controller listening on a different port than I/O
controllers). The consumer is responsible for accepting TCP sockets
directly, but once a socket has been accepted it is passed to
nvmf_accept to perform transport-specific negotiation and wait for the
Connect command. Similar to nvmf_connect, nvmf_accept returns a newly
constructed nvmf_qpair. However, in contrast to nvmf_connect,
nvmf_accept does not complete the Fabrics negotiation. The consumer
must explicitly send a response capsule before waiting for additional
command capsules to arrive. In particular, in the kernel offload
case, the Connect command and data are provided to the kernel
controller and the Connect response capsule is sent by the kernel once
it is ready to handle the new queue pair.
For userspace controller command handling, the consumer uses
nvmf_controller_receive_capsule to wait for a command capsule.
nvmf_receive_controller_data is used to retrieve any data from a
command (e.g. the data for a WRITE command). It can be called
multiple times to split the data transfer into smaller sizes.
nvmf_send_controller_data is used to send data to a remote host in
response to a command. It also sends a response capsule indicating
success, or an error if an internal error occurs. nvmf_send_response
is used to send a response without associated data. There are also
several convenience wrappers such as nvmf_send_success and
nvmf_send_generic_error.
John Baldwin [Thu, 2 May 2024 23:27:53 +0000 (16:27 -0700)]
nvmft: Add NVMeoF controller routines shared between kernel and userland
This includes functions to validate NVMe Qualified Names, compute an
initial value of the CAP property, validate changes to the CC
property, and populate the Identify Controller data structure for an
I/O controller.
John Baldwin [Thu, 2 May 2024 23:26:16 +0000 (16:26 -0700)]
nvmf_proto.h: NVMe over Fabrics protocol definitions
This is a copy of spdk/include/spdk/nvmf_spec.h as of commit 470e851852bb948334a272c9f8de495020fa082f from Intel's SPDK.
Subsequent commits will modify it to be a suitable header for the
kernel, but importing the stock file first makes it easier to see
how the resulting header is derived from the original.
Rob N [Thu, 2 May 2024 22:18:35 +0000 (08:18 +1000)]
vdev_disk: disable flushes if device does not support it
If the underlying device doesn't have a write-back cache, the kernel
will just return a successful response. This doesn't hurt anything, but
it's unnecessary extra work on the IO taskqs. So, detect this
when we open the device for the first time.
Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Closes #16148
Warner Losh [Thu, 2 May 2024 21:58:55 +0000 (15:58 -0600)]
cam/iosched: Document latency buckets correctly.
Document how latency buckets are actually computed: They are a doubling
from 20us to 10.485s by default, but based at
kern.cam.iosched.bucket_base_us and increase with a ratio of
kern.cam.iosched.bucket_ratio / 100 from one to the next.
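The default figures can be checked with a few lines of arithmetic: starting at 20us and doubling, the 10.485s boundary is reached after 19 steps (20us x 2^19 = 10,485,760us). The sketch below uses a hypothetical function, `ex_bucket_bound_us`, whose parameters stand in for the base and ratio settings. It is not the cam_iosched code.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Boundary, in microseconds, of latency bucket idx when boundaries
 * start at base_us and each one is ratio_pct / 100 times the previous.
 */
uint64_t
ex_bucket_bound_us(uint64_t base_us, unsigned int ratio_pct, unsigned int idx)
{
	uint64_t bound = base_us;
	unsigned int i;

	assert(ratio_pct >= 100);
	for (i = 0; i < idx; i++)
		bound = bound * ratio_pct / 100;
	return (bound);
}
```

A ratio setting of 200 corresponds to the doubling described above; a value such as 150 would grow each boundary by half instead.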