Eric Joyner [Tue, 20 Aug 2019 20:15:32 +0000 (20:15 +0000)]
MFC various iflib fixes from head
Included revisions:
r347418 - iflib: use default ntxd and rxd when user value is not power of 2
r348372 - iflib: provide probe wrapper for vendor drivers
r350306 - iflib: fix dangling device softc pointer
r350507 - iflib: remove kobject class reference increment
r350509 - iflib: Prevent kernel panic caused by loading driver with a specific interrupt configuration
r351152 - iflib: add iflib_deregister to help cleanup on exit
John Baldwin [Tue, 20 Aug 2019 01:30:35 +0000 (01:30 +0000)]
MFC 348876: Add warnings to /dev/crypto for deprecated algorithms.
These algorithms are deprecated algorithms that will have no in-kernel
consumers in FreeBSD 13. Specifically, deprecate the following
algorithms:
- ARC4
- Blowfish
- CAST128
- DES
- 3DES
- MD5-HMAC
- Skipjack
John Baldwin [Tue, 20 Aug 2019 00:50:17 +0000 (00:50 +0000)]
MFC 348875:
Add warnings for Kerberos GSS algorithms deprecated in RFCs 6649 and 8429.
All of these algorithms are explicitly marked SHOULD NOT in one of these
RFCs.
Specifically, RFC 6649 deprecates all algorithms using DES as well as
the "export-friendly" variant of RC4. RFC 8429 deprecates Triple DES
and the remaining RC4 algorithms.
John Baldwin [Mon, 19 Aug 2019 22:31:04 +0000 (22:31 +0000)]
MFC 349467: Hold an explicit reference on the socket for the aiotx task.
Previously, the aiotx task relied on the aio jobs in the queue to hold
a reference on the socket. However, when the last job is completed,
there is nothing left to hold a reference to the socket buffer lock
used to check if the queue is empty. In addition, if the last job on
the queue is cancelled, the task can run with no queued jobs holding a
reference to the socket buffer lock the task uses to notice the queue
is empty.
Fix these races by holding an explicit reference on the socket when
the task is queued and dropping that reference when the task
completes.
John Baldwin [Mon, 19 Aug 2019 21:59:02 +0000 (21:59 +0000)]
MFC 348874: Remove an overly-aggressive assertion.
While it is true that the new vmspace passed to vmspace_switch_aio
will always have a valid reference due to the AIO job or the extra
reference on the original vmspace in the worker thread, it is not true
that the old vmspace being switched away from will have more than one
reference.
Specifically, when a process with queued AIO jobs exits, the exit hook
in aio_proc_rundown will only ensure that all of the AIO jobs have
completed or been cancelled. However, the last AIO job might have
completed and woken up the exiting process before the worker thread
servicing that job has switched back to its original vmspace. In that
case, the process might finish exiting dropping its reference to the
vmspace before the worker thread resulting in the worker thread
dropping the last reference.
Emmanuel Vadot [Fri, 16 Aug 2019 21:40:39 +0000 (21:40 +0000)]
MFC r343952, r344003, r344219, r344343, r344456
r343952 by ganbold:
Enable necessary bits when activating interrupts. This allows
reading some events from the interrupt status registers. These events
are reported to devd via system "PMU" and subsystem "Battery", "AC"
and "USB" such as plugged/unplugged, absent, charged and charging.
Reviewed by: manu
Differential Revision: https://reviews.freebsd.org/D19116
r344003 by ganbold:
Add sensors support for AXP803/AXP813. Sensor values such as
battery charging, charge state, voltage, charging current, discharging current,
battery capacity etc. can be obtained via sysctl.
Reviewed by: manu
Differential Revision: https://reviews.freebsd.org/D19145
r344219 by ganbold:
Add sysctl for setting battery charging current.
The charging current can be set using steps
from 0: 200mA to 13: 2800mA (200mA/step).
While there, fix battery charging current related
sensor descriptions.
Reviewed by: manu
Differential Revision: https://reviews.freebsd.org/D19212
r344343 by ganbold:
Clarify notifications when battery capacity ratio
reaches warning and shutdown thresholds.
r344456 by ganbold:
Add base to the warning threshold.
Kyle Evans [Fri, 16 Aug 2019 21:14:27 +0000 (21:14 +0000)]
MFC r351078, r351085, r351088: mostly a nop (two commits + revert of two)
This commit is mostly a nop, but ends up renumbering #4 clause to #3 in one
copy of quad.h... this is OK; stand/ situation in stable/11 is pretty murky
and the commit that renumbered the clause got lost somewhere. quad.h will be
disappearing in a not-so-distant future MFC.
r351078:
stand: kick out quad.h
Use quad.h from libc instead for the time being. This reduces the number of
nearly-identical-quad.h we have in tree to two with only minor changes.
Prototypes for some *sh*di3 have been added to match the copy in libkern.
The differences between the two are likely few enough that they can perhaps
be merged with little additional effort to bring us down to 1.
r351085:
libc quad.h: one last _STANDALONE correction
Kyle Evans [Fri, 16 Aug 2019 21:01:35 +0000 (21:01 +0000)]
MFC r350464: kern_shm_open: push O_CLOEXEC into caller control
The motivation for this change is to allow wrappers around shm to be written
that don't set CLOEXEC. kern_shm_open currently accepts O_CLOEXEC but sets
it unconditionally. kern_shm_open is used by the shm_open(2) syscall, which
is mandated by POSIX to set CLOEXEC, and CloudABI's sys_fd_create1().
Presumably O_CLOEXEC is intended in the latter caller, but it's unclear from
the context.
sys_shm_open() now unconditionally sets O_CLOEXEC to meet POSIX
requirements, and a comment has been dropped in to kern_fd_open() to explain
the situation and add a pointer to where O_CLOEXEC setting is maintained for
shm_open(2) correctness. CloudABI's sys_fd_create1() also unconditionally
sets O_CLOEXEC to match previous behavior.
This also has the side-effect of making flags correctly reflect the
O_CLOEXEC status on this fd for the rest of kern_shm_open(), but a
glance-over leads me to believe that it didn't really matter.
Emmanuel Vadot [Fri, 16 Aug 2019 20:56:35 +0000 (20:56 +0000)]
MFC r349596 by ganbold:
Extend simple_mfd driver to expose a syscon interface if
that node is also compatible with syscon. For instance,
Rockchip RK3399's GRF (General Register Files) is compatible
with simple-mfd as well as syscon and has devices like
usb2-phy, emmc-phy and pcie-phy etc. under it.
Pedro F. Giffuni [Fri, 16 Aug 2019 04:51:04 +0000 (04:51 +0000)]
MFC r350969:
Add deprecation notice to snd_ds1(4).
As suggested in:
https://wiki.freebsd.org/WhatsGoing/FreeBSD13
We will be dropping the snd_ds1 driver. The driver is known to be buggy
and no one has been working on it for years now.
Users of old Yamaha cards may have luck with the OSS drivers instead.
Kyle Evans [Thu, 15 Aug 2019 17:40:48 +0000 (17:40 +0000)]
MFC r350576: ipfw: fix jail option after r348215
r348215 changed jail_getid(3) to validate passed-in jids as active jails
(as the function is documented to return -1 if the jail does not exist).
This broke the jail option (in some cases?) as the jail historically hasn't
needed to exist at the time of rule parsing; jids will get stored and later
applied.
Fix this caller to attempt to parse *av as a number first and just use it
as-is to match historical behavior. jail_getid(3) must still be used in
order for name arguments to work, but it's strictly a fallback in case we
weren't given a number.
John Baldwin [Wed, 14 Aug 2019 23:31:53 +0000 (23:31 +0000)]
MFC 348695: Support MSI-X for passthrough devices with a separate PBA BAR.
pci_alloc_msix() requires both the table and PBA BARs to be allocated
by the driver. ppt was only allocating the table BAR so would fail
for devices with the PBA in a separate BAR. Fix this by allocating
the PBA BAR before pci_alloc_msix() if it is stored in a separate BAR.
While here, release BARs after calling pci_release_msi() instead of
before. Also, don't call bus_teardown_intr() in error handling code
if bus_setup_intr() has just failed.
John Baldwin [Wed, 14 Aug 2019 23:28:43 +0000 (23:28 +0000)]
MFC 348694: Don't simulate PBA access if the PBA is in a separate BAR.
bhyve has to virtualize the MSI-X table to trap reads and writes to
that table and map those to virtual interrupts that it maps real host
interrupts on to. For the pending-bit-array (PBA), bhyve passes
accesses from the guest directly to the hardware.
bhyve's virtualization of the MSI-X table is done by intercepting all
reads and writes to the BAR holding the MSI-X table. However, if the
PBA is stored in the same BAR as the MSI-X table, accesses to the PBA
portion of this BAR have to be forwarded to the real BAR.
However, in the case that the PBA was stored in a separate BAR and
it's offset in that separate BAR overlapped with the portion of the
MSI-X table BAR that the table used, the handlers for the table BAR
would incorrectly think that some accesses were PBA reads and writes.
This caused a crash in bhyve when it indirected a NULL pointer. Fix
this case by never trying to handle PBA access if the PBA lives in a
separate BAR.
John Baldwin [Wed, 14 Aug 2019 23:25:58 +0000 (23:25 +0000)]
MFC 347465: Apply r280991 to ip6_fragment.
This uses m_dup_pkthdr() to copy all of the metadata about a packet to
each of its fragments including VLAN tags, mbuf tags, etc. instead of
hand-copying a few fields.
John Baldwin [Wed, 14 Aug 2019 23:05:57 +0000 (23:05 +0000)]
MFC 346360: Push down INP_WLOCK slightly in tcp_ctloutput.
The inp lock is not needed for testing the V6 flag as that flag is set
once when the inp is created and never changes. For non-TCP socket
options the lock is immediately dropped after checking that flag.
This just pushes the lock down to only be acquired for TCP socket
options.
This isn't a hot-path, more a cosmetic cleanup I noticed while reading
the code.
Dimitry Andric [Wed, 14 Aug 2019 19:21:26 +0000 (19:21 +0000)]
MFC r350697:
Fix a possible segfault in wcsxfrm(3) and wcsxfrm_l(3).
If the length of the source wide character string, passed in via the
"size_t n" parameter, is set to zero, the function should only return
the required length for the destination wide character string. In this
case, it should *not* attempt to write to the destination, so the "dst"
parameter is permitted to be NULL.
However, when the internally called _collate_wxfrm() function returns an
error, such as when using the "C" locale, as a fallback wcscpy(3) or
wcsncpy(3) are used. But if the input length is zero, wcsncpy(3) will
be called with a length of -1! If the "dst" parameter is NULL, this
will immediately result in a segfault, or if "dst" is a valid pointer,
it will most likely result in unexpectedly overwritten memory.
Fix this by explicitly checking for an input length greater than zero,
before calling wcsncpy(3).
Note that a similar situation does not occur in strxfrm(3), the plain
character version of this function, as it uses strlcpy(3) for the error
case. The strlcpy(3) function does not write to the destination if the
input length is zero.
Andrew Turner [Wed, 14 Aug 2019 17:02:36 +0000 (17:02 +0000)]
MFC r350112, r350241, and r350242:
r350166:
arm64: Implement HWCAP
Add HWCAP support for arm64.
defines are the same as in Linux and a userland program can use
elf_aux_info to get the data.
We only save the common denominator for all cores in case the
big and little cluster have different support (this is known to
exists even if we don't support those SoCs in FreeBSD)
r350241:
Ensure the arm64 ID register fields are 64 bit types.
Previously only some of the ID register fields were 64 bit. To allow
for a script to generate these mark them all 64 bit. To allow for their
use in assembly we need to use the UINT64_C macro via a new UL macro
to stop the lines from being too long.
Andrew Turner [Wed, 14 Aug 2019 16:54:51 +0000 (16:54 +0000)]
MFC r345510:
Sort printing of the ID registers on arm64 to be identical to the
documentation. This will simplify checking new fields when they are added.
Alan Somers [Mon, 12 Aug 2019 20:31:12 +0000 (20:31 +0000)]
MFC r349248, r349391, r350088
r349248:
fcntl: fix overflow when setting F_READAHEAD
VOP_READ and VOP_WRITE take the seqcount in blocks in a 16-bit field.
However, fcntl allows you to set the seqcount in bytes to any nonnegative
31-bit value. The result can be a 16-bit overflow, which will be
sign-extended in functions like ffs_read. Fix this by sanitizing the
argument in kern_fcntl. As a matter of policy, limit to IO_SEQMAX rather
than INT16_MAX.
Also, fifos have overloaded the f_seqcount field for a completely different
purpose ever since r238936. Formalize that by using a union type.
Reviewed by: cem
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D20710
r349391:
fcntl: style changes to r349248
Reported by: bde
MFC-With: 349248
Sponsored by: The FreeBSD Foundation
r350088:
F_READAHEAD: Fix r349248's overflow protection, broken by r349391
I accidentally broke the main point of r349248 when making stylistic changes
in r349391. Restore the original behavior, and also fix an additional
overflow that was possible when uio->uio_resid was nearly SSIZE_MAX.
Reported by: cem
Reviewed by: bde
MFC-With: 349248
Sponsored by: The FreeBSD Foundation
Alan Somers [Mon, 12 Aug 2019 20:21:36 +0000 (20:21 +0000)]
MFC r349231, r349233, r349280, r349478
r349231:
Add FIOBMAP2 ioctl
This ioctl exposes VOP_BMAP information to userland. It can be used by
programs like fragmentation analyzers and optimized cp implementations. But
I'm using it to test fusefs's VOP_BMAP implementation. The "2" in the name
distinguishes it from the similar but incompatible FIBMAP ioctls in NetBSD
and Linux. FIOBMAP2 differs from FIBMAP in that it uses a 64-bit block
number instead of 32-bit, and it also returns runp and runb.
Reviewed by: mckusick
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D20705
r349233:
#include <sys/types.h> from sys/filio.h
This fixes world build after r349231
Reported by: Jenkins
MFC-With: 349231
Sponsored by: The FreeBSD Foundation
r349280:
Reduce namespace pollution from r349233
Define __daddr_t in _types.h and use it in filio.h
Reported by: ian, bde
Reviewed by: ian, imp, cem
MFC-With: 349233
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D20715
r349478:
FIOBMAP2: inline vn_ioc_bmap2
Reported by: kib
Reviewed by: kib
MFC-With: 349238
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D20783
Alexander Motin [Mon, 12 Aug 2019 19:44:57 +0000 (19:44 +0000)]
MFC r350652 (by imp): Fix mismerge.
I merged passthru.c from the wrong branch (it was a branch that went further in
a direction I wound up not taking). Fix the mismerge and turn passthru on.
Alexander Motin [Mon, 12 Aug 2019 19:43:25 +0000 (19:43 +0000)]
MFC r350563: Add `nvmecontrol sanitize` command.
It allows to delete all user data from NVM subsystem in one of 3 methods.
It is a close equivalent of SCSI SANITIZE command of `camcontrol sanitize`,
so I tried to keep arguments as close as possible.
While there, fix supported sanitize methods reporting in `identify`.
Alexander Motin [Mon, 12 Aug 2019 19:39:31 +0000 (19:39 +0000)]
MFC r350523, r350524: Add IOCTL to translate nvdX into nvmeY and NSID.
While very useful by itself, it also makes `nvmecontrol` not depend on
hardcoded device names parsing, that in its turn makes simple to take
nvdX (and potentially any other) device names as arguments.
Also added IOCTL bypass from nvdX to respective nvmeYnsZ makes them
interchangeable for management purposes.
This adds several previously missed but important subcommands to list
namespaces and controllers. It also fixes few previously added but
just found with real testing to be broken subcommands.
Also while there, add possibility to explicitly specify nsid for
`nvmecontrol identify` subcommand. It may be useful to specify nsids
not having own devices, for example 0xffffffff, or just newly created
ones.
Alexander Motin [Mon, 12 Aug 2019 18:57:46 +0000 (18:57 +0000)]
MFC r350333 (by imp): Widen the type for to.
The timeout field in the CAPS register is defined to be 8 bits, so its type was
uint8_t. We recently started adding 1 to it to cope with rogue devices that
listed 0 timeout time (which is impossible). However, in so doing, other devices
that list 0xff (for a 2 minute timeout) were broken when adding 1
overflowed. Widen the type to be uint32_t like its source register to avoid the
issue.
Alexander Motin [Mon, 12 Aug 2019 18:56:46 +0000 (18:56 +0000)]
MFC r350311 (by imp):
Fix the fix to the logic bug. Upon further testing, the bug is that we shadoow
opt.vendor with vendor. We shouldn't. Delete the latter and use the former
everywhere and restore the prior logic which is now correct.
Alexander Motin [Mon, 12 Aug 2019 18:55:36 +0000 (18:55 +0000)]
MFC r350147 (by imp): Keep track of the number of commands that exhaust their retry limit.
While we print failure messages on the console, sometimes logs are lost or
overwhelmed. Keeping a count of how many times we've failed retriable commands
helps get a magnitude of the problem.
Alexander Motin [Mon, 12 Aug 2019 18:54:58 +0000 (18:54 +0000)]
MFC r350146 (by imp): Keep track of the number of retried commands.
Retried commands can indicate a performance degredation of an nvme drive. Keep
track of the number of retries and report it out via sysctl, just like number of
commands an interrupts.
Alexander Motin [Mon, 12 Aug 2019 18:53:53 +0000 (18:53 +0000)]
MFC r350118 (by imp): Provide new tunable hw.nvme.verbose_cmd_dump
The nvme drive dumps only the most relevant details about a command when it
fails. However, there are times this is not sufficient (such as debugging weird
issues for a new drive with a vendor). Setting hw.nvme.verbose_cmd_dump=1
in loader.conf will enable more complete debugging information about each
command that fails.
Alexander Motin [Mon, 12 Aug 2019 18:53:01 +0000 (18:53 +0000)]
MFC r350114 (by imp):
Provide macros to extract the sub-fields of the CAP_LO and CAP_HI registers.
These macros make places where we extract these easier to read. The shift and
mask stuff is also a bit tedious and error prone. Start with the CAP_LO and
CAP_HI registers since their scope is somewhat constrained. This is style
chagne only, no functional changes.
Alexander Motin [Mon, 12 Aug 2019 18:51:35 +0000 (18:51 +0000)]
MFC r350068 (by imp): Assume that the timeout value from the capacity is 1-based
Neither the 1.3 or 1.4 standards say this number is 1's based, but adding 1
costs little and copes with those NVMe drives that report '0' in this field
cheaply. This is consistent with what the Linux driver does as well.
These are mostly compatible with Linux, with three exceptions.
1. We don't do metadata segment stuff. Our passthrough interface
doesn't cope. The code is there, but generates an error.
2. Linux lets you specify a namespace ID for the command. We current
do not: we get ours from the namespace device, or pass in a generic
one. Generally, this will lead to the same command, but FreeBSD's
is safer since you can't specify the wrong id.
3. --show-command outputs to stderr instead of stdout so you can both
see your command, and capture its output with a simple redirect.
Create a set of routines and structures to hold the data for the args
for a command. Use them to generate help and to parse args. Convert
all the current commands over to the new format. "comnd" is a hat-tip
to the TOPS-20 %COMND JSYS that (very) loosely inspired much of the
subsequent command line notions in the industry, but this is far
simpler (the %COMND man page is longer than this code) and not in the
kernel... Also, it implements today's de-facto
command [verb]+ [opts]* [args]*
format rather than the old, archaic TOPS-20 command format :)
This is a snapshot of a work in progress to get the nvme passthru
stuff committed. In time it will become a private library and used
by some other programs in the tree that conform to the above pattern.
Alexander Motin [Mon, 12 Aug 2019 18:49:32 +0000 (18:49 +0000)]
MFC r348495 (by imp):
Since a fatal trap can happen at aribtrary times, don't panic when the
completions are not in a consistent state. Cope with the different
places the normal I/O completion polling thread can be interrupted and
then re-entered during a kernel panic + dump.
Alexander Motin [Mon, 12 Aug 2019 18:47:40 +0000 (18:47 +0000)]
MFC r344955 (by imp):
Don't print all the I/O we abort on a reset, unless we're out of
retries.
When resetting the controller, we abort I/O. Prior to this fix, we
printed a ton of abort messages for I/O that we're going to
retry. This imparts no useful information. Stop printing them unless
our retry count is exhausted. Clarify code for when we don't retry,
and remove useless arg to a routine that's always called with it
as 'true'. All the other debug is still printed (including multiple
reset messages if we have multiple timeouts before the taskqueue
runs the actual reset) so that we know when we reset.
Move from using a linker set to a constructor function that's
called. This simplifies the code and is slightly more obvious. We now
keep a list of page decoders rather than having an array we managed
before. Commands will move to something similar in the future.
Alexander Motin [Mon, 12 Aug 2019 17:54:28 +0000 (17:54 +0000)]
MFC r342358 (by imp): Try the first 256 units with nvmecontrol devlist.
The nvmecontrol code that did the devlist assumed that we had a
tightly-packed allocation of units. Since pci writing exists, this
isn't the case. Loop over the first 256 units, which is a reasonable
number of possible units.
o Dynamically load all the .so files found in /libexec/nvmecontrol and
/usr/local/libexec/nvmecontrol.
o Link nvmecontrol -rdynamic so that its symbols are visible to the
libraries we load.
o Create concatinated linker sets that we dynamically expand.
o Add the linked-in top and logpage linker sets to the mirrors for them
and add those sets to the mirrors when we load a new .so.
o Add some macros to help hide the names of the linker sets.
o Update the man page.
Alexander Motin [Mon, 12 Aug 2019 17:41:57 +0000 (17:41 +0000)]
MFC r341415 (by imp): Delete the undocumented alias 'wds'.
This was a typo for wdc. Eliminate it since it was in error. People
should use either 'wdc' or 'hgst' for the vendor from now on. 'hgst'
works for all versions this functionality is present for.