hrs [Thu, 15 Mar 2018 13:46:28 +0000 (13:46 +0000)]
Make getnameinfo(3) salen requirement less strict and
document details of salen in getnameinfo(3) manual page.
getnameinfo(3) returned EAI_FAIL when salen was not equal to
the length corresponding to the value specified by sa->sa_family.
However, POSIX or RFC 3493 does not require it and RFC 4038
Sec.6.2.3 shows an example passing sizeof(struct sockaddr_storage)
to salen.
This change makes the requirement less strict by accepting
salen up to sizeof(struct sockaddr_storage). It also includes
two more changes: one is to fix return values because both SUSv4
and RFC 3493 require EAI_FAMILY when the address length is invalid,
another is to fix sa_len dependency in PF_LOCAL.
avg [Thu, 15 Mar 2018 12:47:34 +0000 (12:47 +0000)]
zfs test suite: support device paths with intermediate directories
The code assumed that disks (devices) used for testing are always named
like /dev/foo, but there is no reason for that restriction and we can
easily support paths like /dev/stripe/bar.
avg [Thu, 15 Mar 2018 12:40:43 +0000 (12:40 +0000)]
zfs test suite: align zfs_destroy_005_neg: with upstream
The change is to account for a different order in which the recursive
destroy may be attempted. If we first try a dataset that can be destroyed
then it will be destroyed, but if we first try a dataset that cannot be
destroyed then we will not attempt to destroy the other dataset.
avg [Thu, 15 Mar 2018 09:16:10 +0000 (09:16 +0000)]
g_access: deal with races created by geoms that drop the topology lock
The problem is that g_access() must be called with the GEOM topology
lock held. And that gives a false impression that the lock is indeed
held across the call. But this isn't always true because many classes,
ZVOL being one of the many, need to drop the lock. It's either to
perform an I/O on the first open or to acquire a different lock (like in
g_mirror_access).
That, of course, can break many assumptions. For example,
g_slice_access() adds an extra exclusive count on the first open. As
described above, an underlying geom may drop the topology lock and that
would open a race with another thread that would also request another
extra exclusive count. In general, two consumers may be granted
incompatible accesses.
To avoid this problem the code is changed to mark a geom with special
flag before calling its access method and clear the flag afterwards. If
another thread sees that flag, then it means that the topology lock has
been dropped (either by the geom in question or downstream from it), so
it is not safe to make another access call. So, the second thread would
use g_topology_sleep() to wait until the flag is cleared and only then
would it proceed with the access.
Also see http://docs.freebsd.org/cgi/mid.cgi?809d9254-ee56-59d8-69a4-08838e985cea
https://www.illumos.org/issues/9164
This issue has been reported by Alan Somers as
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=225877
dmu_objset_refresh_ownership() first disowns a dataset (and releases
it) and then owns it again. There is an assert that the new dataset
object is the same as the old dataset object. When running ZFS Test
Suite on FreeBSD we see this panic from zpool_upgrade_007_pos test:
panic: solaris assert: newds == os->os_dsl_dataset (0xfffff80045f4c000
== 0xfffff80021ab4800)
I see that the old dataset has dsl_dataset_evict_async() pending in
ds_dbu.dbu_tqent and its ds_dbuf is NULL.
Reviewed by: Matt Ahrens <mahrens@delphix.com>
Reviewed by: Don Brady <don.brady@delphix.com>
Approved by: Richard Lowe <richlowe@richlowe.net>
Author: Andriy Gapon <avg@FreeBSD.org>
imp [Wed, 14 Mar 2018 23:01:18 +0000 (23:01 +0000)]
When tearing down a queue pair, also delete the queue entries.
The NVME standard has required in section 7.2.6, since at least 1.1,
that a clean shutdown is signalled by deleting the subission and the
completion queues before setting the shutdown bit in CC. The 1.0
standard, apparently, did not and many of the early Intel cards didn't
care. Some newer cards care, at least one whose beta firmware can
scramble the card on an unclean shutdown. Linux has done this for some
time. To make it possible to move forward with an evaluation of this
pre-release card with wonky firmware, delete the queues on the card
when we delete the qpair structures.
cem [Wed, 14 Mar 2018 22:11:45 +0000 (22:11 +0000)]
vfs_bio.c: Apply cleanups motivated by Coverity analysis
It is believed that the conditions Coverity indicated were actually
impossible to hit. So this patch just adds a cleanup to only compute
v_mount once in brelse(), and in vfs_bio_getpages() always initializes error
to zero to appease the static analyzer.
smh [Wed, 14 Mar 2018 21:32:23 +0000 (21:32 +0000)]
Fix mps deadlock when handling panic
During shutdown mps waits for its SSU requests to complete however when
performing a reboot after handling a panic the scheduler is stopped so
getmicrotime which is used can be non-functional.
Switch to using the same method as shutdown_panic to ensure we actually
complete.
In addition reduce the timeout when RB_NOSYNC is set in howto as we expect
this to fail.
jhb [Wed, 14 Mar 2018 20:49:51 +0000 (20:49 +0000)]
Fix the check for an empty send socket buffer on a TOE TLS socket.
Compare sbavail() with the cached sb_off of already-sent data instead of
always comparing with zero. This will correctly close the connection and
send the FIN if the socket buffer contains some previously-sent data but
no unsent data.
jhb [Wed, 14 Mar 2018 20:46:25 +0000 (20:46 +0000)]
Remove TLS-related inlines from t4_tom.h to fix iw_cxgbe(4) build.
- Remove the one use of is_tls_offload() and the function. AIO special
handling only needs to be disabled when a TOE socket is actively doing
TLS offload on transmit. The TOE socket's mode (which affects receive
operation) doesn't matter, so remove the check for the socket's mode and
only check if a TOE socket has TLS transmit keys configured to determine
if an AIO write request should fall back to the normal socket handling
instead of the TOE fast path.
- Move can_tls_offload() into t4_tls.c. It is not used in critical paths,
so inlining isn't that important. Change return type to bool while here.
dteske [Wed, 14 Mar 2018 19:23:17 +0000 (19:23 +0000)]
Fix bad error messages from dpv(3)
Before = dpv: <__func__>: posix_spawnp(3): No such file or directory
After = dpv: <path/cmd>: No such file or directory
Most notably, show the 2nd argument being passed to posix_spawnp(3)
so we know what path/cmd failed.
Also, we don't need to have "posix_spawnp(3)" in the error message
nor the function because that can [a] change and [b] traversed using
a debugger if necessary.
nwhitehorn [Wed, 14 Mar 2018 18:07:40 +0000 (18:07 +0000)]
Fix fat-fingering ("optional standard") and move all the OF code to
being marked "standard", which is less confusing than having it conditional
on AIM CPUs here, and then picked up through options FDT from conf/files
on Book-E.
imp [Wed, 14 Mar 2018 17:53:37 +0000 (17:53 +0000)]
Create a sysctl kern.cam.{,a,n}da.X.invalidate
kern.cam.{,a,n}da.X.invalidate=1 forces *daX to detach by calling
cam_periph_invalidate on the underlying periph. This is for testing
purposes only. Include only with options CAM_TEST_FAILURE and rename
the former [AN]DA_TEST_FAILURE, and fix nda to compile with it set.
We're using it at work to harden geom and the buffer cache to be
resilient in the face of drive failure. Today, it far too often
results in a panic. While much work was done on SIM initiated removal
for the USB thumnb drive removal work, little has been done for periph
initiated removal. This simulates what *daerror() does for some errors
nicely: we get the same panics with it that we do with failing drives.
imp [Wed, 14 Mar 2018 16:44:50 +0000 (16:44 +0000)]
Implement trim collapsing in nda
When multiple trims are in the queue, collapse them as much as
possible. At present, this usually results in only a few trims being
collapsed together, but more work on that will make it possible to do
hundreds (up to some configurable max).
imp [Wed, 14 Mar 2018 16:44:16 +0000 (16:44 +0000)]
Allow NULL ccb to cam_iosched_bio_complete
When the ccb is NULL to cam_iosched_bio_complete, just update the
other statistics, but not the time. If many operations are collapsed
together, this is needed to keep stats properly for the grouped bp.
This should fix trim accounting.
nwhitehorn [Wed, 14 Mar 2018 16:16:25 +0000 (16:16 +0000)]
The expression (aim | fdt) is always true on PowerPC. The last PowerPC
platform that can run without a device tree (PS3) still uses the OF_*()
functions to check if one exists and OF_* is used unconditionally in
core parts of the system like powerpc/machdep.c. Reflect this reality
in files.powerpc, for example by changing occurrences of aim | fdt to
standard.
kevans [Wed, 14 Mar 2018 14:45:57 +0000 (14:45 +0000)]
pkgbase: Fix post-install script for kernel packages
kernel.ucl uses a hardcoded boot/kernel for kldxref, which is the incorrect
directory when we're installing extra kernels that aren't the "default"
kernel (placed at boot/kernel).
Fix this by instead using a new %KERNELDIR% that we now replace in
Makefile.inc1 with "kernel" for the default kernel and "kernel.${_kernel}"
for these extra kernels so that, e.g. /boot/kernel.SHIVA, will get properly
kldxref'd upon update and avoid outdated linker.hints.
cem [Wed, 14 Mar 2018 03:00:17 +0000 (03:00 +0000)]
Update to Zstandard 1.3.3
Includes patch to conditionalize use of __builtin_clz(ll) on __has_builtin().
The issue is tracked upstream at https://github.com/facebook/zstd/pull/884 .
Otherwise, these are vanilla Zstandard 1.3.3 files.
Note that the 1.3.4 release should be due out soon.
kevans [Wed, 14 Mar 2018 02:35:49 +0000 (02:35 +0000)]
ubldr: Bump heap size from 512K to 1M
lualoader in itself only uses another ~200K, but there seems to be no reason
not to bump it a little higher to give us some more wiggle room.
With this, I can boot using a menu-enabled lualoader, no problem and
reasonably fast. Some heap usage datapoints from the review:
forthloader, no menus, kernel loaded:
heap base at 0x1203d5b0, top at 0x1208e000, used 330320
lualoader, no menus, kernel loaded:
heap base at 0x42050028, top at 0x420ab000, used 372696
lualoader, menus, kernel loaded:
heap base at 0x42050028, top at 0x420d5000, used 544728
Since then, the no menu case for lualoader should have decreased slightly as
I've made some changes to make sure that it no longer loads any of th emenu
bits with beastie disabled.
While here, split heap size out into a HEAP_SIZE macro.
jhb [Tue, 13 Mar 2018 23:05:51 +0000 (23:05 +0000)]
Support for TLS offload of TOE connections on T6 adapters.
The TOE engine in Chelsio T6 adapters supports offloading of TLS
encryption and TCP segmentation for offloaded connections. Sockets
using TLS are required to use a set of custom socket options to upload
RX and TX keys to the NIC and to enable RX processing. Currently
these socket options are implemented as TCP options in the vendor
specific range. A patched OpenSSL library will be made available in a
port / package for use with the TLS TOE support.
TOE sockets can either offload both transmit and reception of TLS
records or just transmit. TLS offload (both RX and TX) is enabled by
setting the dev.t6nex.<x>.tls sysctl to 1 and requires TOE to be
enabled on the relevant interface. Transmit offload can be used on
any "normal" or TLS TOE socket by using the custom socket option to
program a transmit key. This permits most TOE sockets to
transparently offload TLS when applications use a patched SSL library
(e.g. using LD_LIBRARY_PATH to request use of a patched OpenSSL
library). Receive offload can only be used with TOE sockets using the
TLS mode. The dev.t6nex.0.toe.tls_rx_ports sysctl can be set to a
list of TCP port numbers. Any connection with either a local or
remote port number in that list will be created as a TLS socket rather
than a plain TOE socket. Note that although this sysctl accepts an
arbitrary list of port numbers, the sysctl(8) tool is only able to set
sysctl nodes to a single value. A TLS socket will hang without
receiving data if used by an application that is not using a patched
SSL library. Thus, the tls_rx_ports node should be used with care.
For a server mostly concerned with offloading TLS transmit, this node
is not needed as plain TOE sockets will fall back to software crypto
when using an unpatched SSL library.
New per-interface statistics nodes are added giving counts of TLS
packets and payload bytes (payload bytes do not include TLS headers or
authentication tags/MACs) offloaded via the TOE engine, e.g.:
TLS transmit work requests are constructed by a new variant of
t4_push_frames() called t4_push_tls_records() in tom/t4_tls.c.
TLS transmit work requests require a buffer containing IVs. If the
IVs are too large to fit into the work request, a separate buffer is
allocated when constructing a work request. This buffer is associated
with the transmit descriptor and freed when the descriptor is ACKed by
the adapter.
Received TLS frames use two new CPL messages. The first message is a
CPL_TLS_DATA containing the decryped payload of a single TLS record.
The handler places the mbuf containing the received payload on an
mbufq in the TOE pcb. The second message is a CPL_RX_TLS_CMP message
which includes a copy of the TLS header and indicates if there were
any errors. The handler for this message places the TLS header into
the socket buffer followed by the saved mbuf with the payload data.
Both of these handlers are contained in tom/t4_tls.c.
A few routines were exposed from t4_cpl_io.c for use by t4_tls.c
including send_rx_credits(), a new send_rx_modulate(), and
t4_close_conn().
TLS keys for both transmit and receive are stored in onboard memory
in the NIC in the "TLS keys" memory region.
In some cases a TLS socket can hang with pending data available in the
NIC that is not delivered to the host. As a workaround, TLS sockets
are more aggressive about sending CPL_RX_DATA_ACK messages anytime that
any data is read from a TLS socket. In addition, a fallback timer will
periodically send CPL_RX_DATA_ACK messages to the NIC for connections
that are still in the handshake phase. Once the connection has
finished the handshake and programmed RX keys via the socket option,
the timer is stopped.
A new function select_ulp_mode() is used to determine what sub-mode a
given TOE socket should use (plain TOE, DDP, or TLS). The existing
set_tcpddp_ulp_mode() function has been renamed to set_ulp_mode() and
handles initialization of TLS-specific state when necessary in
addition to DDP-specific state.
Since TLS sockets do not receive individual TCP segments but always
receive full TLS records, they can receive more data than is available
in the current window (e.g. if a 16k TLS record is received but the
socket buffer is itself 16k). To cope with this, just drop the window
to 0 when this happens, but track the overage and "eat" the overage as
it is read from the socket buffer not opening the window (or adding
rx_credits) for the overage bytes.
jhb [Tue, 13 Mar 2018 21:42:38 +0000 (21:42 +0000)]
Simplify error handling in t4_tom.ko module loading.
- Change t4_ddp_mod_load() to return void instead of always returning
success. This avoids having to pretend to have proper support for
unloading when only part of t4_tom_mod_load() has run.
- If t4_register_uld() fails, don't invoke t4_tom_mod_unload() directly.
The module handling code in the kernel invokes MOD_UNLOAD on a module
whose MOD_LOAD fails with an error already.
Reviewed by: np (part of a larger patch)
MFC after: 1 month
Sponsored by: Chelsio Communications
brooks [Tue, 13 Mar 2018 20:39:06 +0000 (20:39 +0000)]
Don't overflow the kernel struct mdio in the MDIOCLIST ioctl.
Always terminate the list with -1 and document the ioctl behavior.
This preserves existing behavior as seen from userspace with the
addition of the unconditional termination which will not be seen by
working consumers of MDIOCLIST.
Because this ioctl can only be performed by root (in default
configurations) and is not used in the base system this bug is not
deemed to warrant either a security advisory or an eratta notice.
brooks [Tue, 13 Mar 2018 19:56:10 +0000 (19:56 +0000)]
Fix ISP_FC_LIP and ISP_RESCAN on big-endian 64-bit systems.
For _IO() ioctls, addr is a pointer to uap->data which is a caddr_t.
When the caddr_t stores an int, dereferencing addr as an (int *) results
in truncation on little-endian 64-bit systems and corruption (owing to
extracting top bits) on big-endian 64-bit systems. In practice the
value of chan was probably always zero on systems of the latter type as
all such FreeBSD platforms use a register-based calling convention.
kib [Tue, 13 Mar 2018 18:27:23 +0000 (18:27 +0000)]
Revert the chunk from r330410 in vm_page_reclaim_run().
There, the pages freed might be managed but the page's lock is not
owned. For KPI correctness, the page lock is requried around the call
to vm_page_free_prep(), which is asserted. Reclaim loop already did
the work which could be done by vm_page_free_prep(), so the lock is
not needed and the only consequence of not owning it is the assert
trigger.
Instead of adding the locking to satisfy the assert, revert to the
code that calls vm_page_free_phys() directly.
Reported by: pho
Discussed with: jeff
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
kevans [Tue, 13 Mar 2018 17:10:52 +0000 (17:10 +0000)]
EFIRT: SetVirtualAddressMap with 1:1 mapping after exiting boot services
This fixes a problem encountered on the Lenovo Thinkpad X220/Yoga 11e where
runtime services would try to inexplicably jump to other parts of memory
where it shouldn't be when attempting to enumerate EFI vars, causing a
panic.
The virtual mapping is enabled by default and can be disabled by setting
efi_disable_vmap in loader.conf(5).
imp [Tue, 13 Mar 2018 16:33:00 +0000 (16:33 +0000)]
Prefer uintXX_t to u_intXX_t
A foolish consistency is the hobgoblin of little minds, adored by
little statesmen and philosophers and divines. With consistency a
great soul has simply nothing to do. -- Ralph Waldo Emerson
nwhitehorn [Tue, 13 Mar 2018 15:03:58 +0000 (15:03 +0000)]
Execute PowerPC64/AIM kernel from direct map region when possible.
When the kernel can be in real mode in early boot, we can execute from
high addresses aliased to the kernel's physical memory. If that high
address has the first two bits set to 1 (0xc...), those addresses will
automatically become part of the direct map. This reduces page table
pressure from the kernel and it sets up the kernel to be used with
radix translation, for which it has to be up here.
This is accomplished by exploiting the fact that all PowerPC kernels are
built as position-independent executables and relocate themselves
on start. Before this patch, the kernel runs at 1:1 VA:PA, but that
VA/PA is random and set by the bootloader. Very early, it processes
its ELF relocations to operate wherever it happens to find itself.
This patch uses that mechanism to re-enter and re-relocate the kernel
a second time witha new base address set up in the early parts of
powerpc_init().
kevans [Tue, 13 Mar 2018 15:01:23 +0000 (15:01 +0000)]
efirtc: Pass a dummy tmcap pointer to efi_get_time_locked
As noted in the comment, UEFI spec claims the capabilities pointer is
optional, but some implementations will choke and attempt to dereference it
without checking. This specific problem was found on a Lenovo Thinkpad X220
that would panic in efirtc_identify.
royger [Tue, 13 Mar 2018 09:42:33 +0000 (09:42 +0000)]
at_rtc: check in ACPI FADT boot flags if the RTC is present
Or else disable the device. Note that the detection can be bypassed by
setting the hw.atrtc.enable option in the loader configuration file.
More information can be found on atrtc(4).
Sponsored by: Citrix Systems R&D
Reviewed by: ian
Differential revision: https://reviews.freebsd.org/D14399
brooks [Mon, 12 Mar 2018 23:02:01 +0000 (23:02 +0000)]
Reject ioctls to SCSI enclosures from 32-bit compat processes.
The ioctl objects contain pointers and require translation and some
refactoring of the infrastructure to work. For now prevent opertion
on garbage values. This is very slightly overbroad in that ENCIOC_INIT
is safe.
brooks [Mon, 12 Mar 2018 22:58:07 +0000 (22:58 +0000)]
Reject CAMIOGET and CAMIOQUEUE ioctl's on pass(4) in 32-bit compat mode.
These take a union ccb argument which is full of kernel pointers.
Substantial translation efforts would be required to make this work.
By rejecting the request we avoid processing or returning entierly
wrong data.
imp [Mon, 12 Mar 2018 21:39:49 +0000 (21:39 +0000)]
Use the actual struct devdesc at the start of all *_devdesc structs
The current system is fragile and requires very careful layout of all
*_devdesc structures. It also makes it hard to change the base
devdesc. Take a page from CAM and put the 'header' in all the derived
classes and adjust the code to match.
For OFW, move the iHandle h_handle out of a slot conflicting with
d_opendata. Due to quirks in the alignment rules, this worked.
However changing the code to use d_opendata storage now that it's a
pointer is hard, so just have a separate field for it.
All other cleanups were to make the *_devdesc structures match where
they'd taken some liberties that were none-the-less compatible enough
to work.
imp [Mon, 12 Mar 2018 21:39:38 +0000 (21:39 +0000)]
We can't use d_opendata for blkio storage.
open_disk uses d_opendata for it's own purpse. We can't store blkio
there. Fortunately, blkio is stored elsewhere and we never actually
retrieve blkio from d_opendata. Eliminate it as a source of confusion.
Eliminate all stores of d_opendata in efi since this layer doesn't own
that field.
imp [Mon, 12 Mar 2018 21:39:27 +0000 (21:39 +0000)]
Minor cosmetic changes.
Make sure { on the same line as struct for all struct *devdesc. Move
some type definitions to next to the dv_type define, since that's what
sets the d_type.
tsoome [Mon, 12 Mar 2018 17:05:53 +0000 (17:05 +0000)]
e1000g: this statement may fall through
The gcc 7 does check for switch statement fall through cases, and if legit,
such complaint can besilenced by /* FALLTHROUGH */ comment. Unfortunately
such comment is quite limited, but will still notify the reader.
This patch is backport from illumos, see
https://www.illumos.org/rb/r/941/
imp [Mon, 12 Mar 2018 15:17:16 +0000 (15:17 +0000)]
Tighten up periph lock to avoid some races
Make sure the periph lock is held around rmw access to softc data,
espeically flags, including work flags in iosched.
Add asserts for the periph lock where it should be held.