MFC r344936: MFV/ZoL: Disable LBA weighting on files and SSDs
The LBA weighting makes sense on rotational media where the outer tracks
have twice the bandwidth of the inner tracks. However, it is detrimental
on nonrotational media such as solid state disks, where the only effect
is to ensure that metaslabs enter the best-fit allocation behavior
sooner, which is detrimental to performance. It also makes no sense on
files where the underlying filesystem can arrange things however it
wants.
Author: Richard Yao <ryao@gentoo.org>
Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3712
zfsonlinux/zfs@fb40095f5f0853946f8150481ca22602d1334dfe
To reduce code divergence this merge replaces equivalent but different
FreeBSD code detecting non-rotating medium vdevs.
MFC r344934, r345014: Add separate aggregation limit for non-rotating media.
Before the sequential scrub patches, ZFS never aggregated I/Os above 128KB.
Sequential scrub bumped that to 1MB. I understand the motivation for
spinning disks, since it should reduce the number of head seeks. But for
SSDs it makes much less sense to me, especially on FreeBSD, where due
to the MAXPHYS limitation the device will likely still see a bunch of 128KB
I/Os instead of one large one. A stricter aggregation limit avoids
allocating a large memory buffer and the memcpy to/from it, which is
a serious problem when bandwidth reaches a few GB/s.
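A minimal sketch of the idea, not the actual vdev queue code: choose a smaller aggregation cap when the device is non-rotating. The helper name and both constants are hypothetical; the 1MB and 128KB figures are simply the sizes mentioned above.

```c
#include <stdbool.h>
#include <stdint.h>

/* Illustrative values only, taken from the sizes quoted in the text. */
#define AGG_LIMIT_ROTATING	(1024u * 1024u)	/* 1MB for spinning disks */
#define AGG_LIMIT_NONROTATING	(128u * 1024u)	/* 128KB for SSDs */

/* Hypothetical helper: pick the aggregation cap from the media type. */
static uint32_t
agg_limit_for(bool nonrotational)
{
	return (nonrotational ? AGG_LIMIT_NONROTATING : AGG_LIMIT_ROTATING);
}
```

With a separate cap, an SSD-backed vdev never needs the large bounce buffer that a 1MB aggregate would require.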
Update the bounds checking for zfs_vdev_aggregation_limit so that
it has a floor of zero and a maximum value of the supported block
size for the pool.
Additionally add an early return when zfs_vdev_aggregation_limit
equals zero to disable aggregation. For very fast solid state or
memory devices it may be more expensive to perform the aggregation
than to issue the IO immediately.
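The bounds check and the early return described above might look roughly like this. It is a sketch under assumptions: both function names are hypothetical, and the maximum block size is passed in rather than read from the pool.

```c
#include <stdint.h>

/* Hypothetical helper: clamp the tunable into [0, max_blocksize]. */
static uint64_t
clamp_agg_limit(int64_t tunable, uint64_t max_blocksize)
{
	if (tunable < 0)
		return (0);
	if ((uint64_t)tunable > max_blocksize)
		return (max_blocksize);
	return ((uint64_t)tunable);
}

/* Returns 1 if aggregation should be attempted, 0 to issue the I/O as-is. */
static int
should_aggregate(int64_t tunable, uint64_t max_blocksize)
{
	/* A limit of zero disables aggregation entirely (early return). */
	return (clamp_agg_limit(tunable, max_blocksize) != 0);
}
```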
Commit 8542ef8 allowed optional IOs to be aggregated beyond
the specified aggregation limit. Since the aggregation limit
was also used to enforce the maximum block size, setting
`zfs_vdev_aggregation_limit=16777216` could result in an
attempt to allocate an ABD larger than 16M.
Author: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #6259
Closes #6270
zfsonlinux/zfs@2d678f779aba26a93314c8ee1142c3985fa25cb6
mm [Wed, 10 Apr 2019 21:46:06 +0000 (21:46 +0000)]
MFC r345497:
Sync libarchive with vendor.
Relevant vendor changes:
PR #1153: fixed 2 bugs in ZIP reader [1]
PR #1143: ensure archive_read_disk_entry_from_file() uses ARCHIVE_READ_DISK
Changes to file flags code, support more file flags on FreeBSD:
UF_OFFLINE, UF_READONLY, UF_SPARSE, UF_REPARSE, UF_SYSTEM
UF_ARCHIVE is not supported by intention (yet)
MFC r344161: stand: dev_net: correct net_open's interpretation of params
net_open previously cast the first vararg to a char * and this was
half-OK: at first, it is passed to netif_open, which would cast it back to
the struct devdesc * that it really is and use it properly. It is then
strdup()d and used as the netdev_name, which is objectively wrong.
Correct it so that the first vararg is properly cast to a struct devdesc *
and the netdev_name gets set properly to make it more clear at a glance that
it's not doing something horribly wrong.
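As an illustration of the corrected pattern, the sketch below pulls the first vararg out with its real pointer type and only then reads a name from it. The struct layout, its field names, and the function name are all hypothetical; this is not the loader's actual `struct devdesc` or `net_open`.

```c
#include <stdarg.h>
#include <stddef.h>

/* Hypothetical stand-in for the loader's device descriptor. */
struct devdesc_sketch {
	const char *name;
	int unit;
};

/* Hypothetical open routine: the first vararg is a descriptor pointer. */
static const char *
open_sketch(void *f, ...)
{
	va_list args;
	struct devdesc_sketch *dev;

	va_start(args, f);
	/* Fetch the argument as the type that was actually passed. */
	dev = va_arg(args, struct devdesc_sketch *);
	va_end(args);

	(void)f;
	return (dev->name);
}
```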
freebsd32: fix padding of computed control message length for recvmsg()
Each control message region must be aligned on a 4-byte boundary on 32-bit
architectures. The 32-bit compat shim for recvmsg() gets the actual layout
right, but doesn't pad the payload length when computing msg_controllen for
the output message header. If a control message contains an unaligned
payload, such as the 1-byte TTL field in the example attached to PR 236737,
this can produce control message payload boundaries that extend beyond
the boundary reported by msg_controllen.
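The arithmetic behind the fix can be sketched as follows. The helper names are hypothetical and the header length is passed in as a parameter, so this is not the compat shim's real macro set.

```c
#include <stddef.h>

/* Round len up to the next multiple of 4 (the 32-bit alignment). */
static size_t
align4(size_t len)
{
	return ((len + 3) & ~(size_t)3);
}

/* Hypothetical: space used by one control message, header plus padded payload. */
static size_t
cmsg_space32(size_t hdrlen, size_t payload)
{
	return (align4(hdrlen) + align4(payload));
}
```

For a 1-byte payload after a 12-byte header, the padded total is 16 bytes rather than 13, so summing padded lengths keeps msg_controllen at or beyond the last payload boundary.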
Backport fixes from FreeBSD-12 to help the random(4) device thread
not overwhelm the OS:
a) Use the correct symbolic constant when calculating tenths of a
second. This means that expensive reseeds happen at only 1/10 Hz,
not some kHz.
b) Rate limit internal high-rate harvesting efforts. This stops the
harvesting thread from totally overdoing the high-grade
entropy-gathering work, while still being very conservatively safe.
PR: 230808
Reported by: danilo,eugen
Tested by: eugen
Approved by: so (blanket permission granted as I am the author of this code)
Relnotes: Yes
MFC r344243, r345517-r345518: lualoader: More intelligent screen clearing
r344243:
lualoader: only clear the screen before first password prompt
This was previously an unconditional screen clear, regardless of whether or
not we would be prompting for any passwords. This is pointless, given that
the screen clear is only there to put our screen into a consistent state
before we draw the prompts and do cursor manipulation.
This is also the only screen clear besides that to draw the menu. One can
now see early pre-loader and loader output with the menu disabled, which may
be useful for diagnostics.
r345517:
lualoader: Clear the screen before prompting for password
Assuming that the autoboot sequence was interrupted, we've done enough
cursor manipulation that the prompt for the password will be sufficiently
obscured a couple of lines up. Clear the screen and reset the cursor
position here, too.
r345518:
lualoader: Fix up some luacheck concerns
- Garbage collect an unused (removed because it was useless) constant
- Don't bother with vararg notation if args will not be used
Highlights:
- Bugfix for order in which /delete-node/ and /delete-property/ are
processed [0]
- /omit-if-no-ref/ support has been added (used only by U-Boot at this
point, in theory)
- GPL dtc compat version bumped to 1.4.7
- Various small fixes and compatibility improvements
MFC r344677: patch(1): Exit successfully if we're fed a 0-length patch
This change is made in the name of GNU patch compatibility. If GNU patch is
fed a zero-length patch, it will exit successfully with no output. This is
used in at least one port to date (comms/wsjtx), and we break on this usage.
It seems unlikely that anyone relies on patch(1) calling their completely
empty patch garbage and failing, and GNU compatibility is a plus if it helps
with porting, so make the switch.
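One way to detect the zero-length case is sketched below. This is an assumption about how such a check could be written, not the code patch(1) actually uses, and the function name is hypothetical.

```c
#include <stdio.h>

/*
 * Hypothetical check: returns 1 if the stream holds no bytes at all,
 * 0 otherwise. A byte that was read is pushed back for later parsing.
 */
static int
patch_is_empty(FILE *fp)
{
	int c = getc(fp);

	if (c == EOF)
		return (1);
	ungetc(c, fp);
	return (0);
}
```

A caller could then exit with status 0 and no output when the check returns 1, matching the GNU patch behaviour described above.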
Teach jedec_dimm(4) to be more forgiving of non-fatal errors.
It looks like some DIMMs claim to have a TSOD, but actually don't. Some
claim they weren't able to change the SPD page, but they did. Neither of
those should be fatal errors.
Add descriptions for sysctls in kern_mib.c and sysctl.3 which lack them.
r343532 noted the difference between "hw.realmem" and "hw.physmem", which I
was previously unaware of. I discovered that neither sysctl had a
description visible via `sysctl -d', so I found where they were defined and
added suitable descriptions. While in the file, I went ahead and added
descriptions for all the others which lacked them. I also updated sysctl.3
accordingly.
MFC r345292:
Convert allocation of bpf_if in bpfattach2 from M_NOWAIT to M_WAITOK
and remove possible panic condition.
It is already allowed to sleep in bpfattach[2], since BPF_LOCK was
converted to SX lock in r332388. Also move KASSERT() to the top of
function and make full initialization before bpf_if will be linked
to BPF's list of interfaces.
kp [Fri, 29 Mar 2019 14:34:50 +0000 (14:34 +0000)]
MFC r345177:
pf: Use counter(9) in pf tables.
The counters of pf tables are updated outside the rule lock. That means state
updates might overwrite each other. Furthermore allocation and
freeing of counters happens outside the lock as well.
Use counter(9) for the counters, and always allocate the counter table
element, so that the race condition cannot happen any more.
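The following userland sketch illustrates the per-CPU idea behind counter(9): each CPU updates its own slot, and a reader sums the slots. It is not the kernel API (which provides counter_u64_alloc(), counter_u64_add() and counter_u64_fetch()); the struct, the fixed slot count, and the function names here are hypothetical.

```c
#include <stdint.h>

#define NSLOTS_SKETCH	4	/* stand-in for the number of CPUs */

/* Hypothetical per-CPU counter: one slot per CPU, no shared hot word. */
struct pcpu_counter_sketch {
	uint64_t slot[NSLOTS_SKETCH];
};

/* Each CPU only ever touches its own slot, so updates cannot collide. */
static void
pcpu_add(struct pcpu_counter_sketch *c, unsigned cpu, uint64_t n)
{
	c->slot[cpu % NSLOTS_SKETCH] += n;
}

/* Readers sum all slots to obtain the total. */
static uint64_t
pcpu_fetch(const struct pcpu_counter_sketch *c)
{
	uint64_t sum = 0;

	for (unsigned i = 0; i < NSLOTS_SKETCH; i++)
		sum += c->slot[i];
	return (sum);
}
```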
kp [Fri, 29 Mar 2019 11:59:54 +0000 (11:59 +0000)]
MFC r345178:
bridge: Fix panic if the STP root is removed
If the spanning tree root interface is removed from the bridge we panic
on the next 'ifconfig'.
While the STP code is notified whenever a bridge member interface is
removed from the bridge it does not clear the bs_root_port. This means
bs_root_port can still point at a bridge_iflist which has been free()d.
The next access to it will panic.
Explicitly check if the interface we're removing in bstp_destroy() is
the root, and if so re-assign the roles, which clears bs_root_port.
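The dangling-pointer pattern and its fix can be sketched as below. The structs and the function are hypothetical simplifications; the real code re-assigns STP roles, which has the side effect of clearing bs_root_port, whereas this sketch clears the pointer directly.

```c
#include <stddef.h>

/* Hypothetical, simplified stand-ins for the bridge STP structures. */
struct port_sketch {
	int id;
};

struct bstp_sketch {
	struct port_sketch *root_port;	/* may point at a member port */
};

/* Called before the port is freed: drop any reference to it. */
static void
remove_port(struct bstp_sketch *bs, struct port_sketch *bp)
{
	if (bs->root_port == bp)
		bs->root_port = NULL;	/* real code re-assigns roles here */
}
```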
Note that this requires a modified OpenSSL library.
330040:
Fetch TLS key parameters from the firmware.
The parameters describe how much of the adapter's memory is reserved for
storing TLS keys. The 'meminfo' sysctl now lists this region of adapter
memory as 'TLS keys' if present.
330041:
Move ccr_aes_getdeckey() from ccr(4) to the cxgbe(4) driver.
This routine will also be used by the TOE module to manage TLS keys.
330079:
Move #include for rijndael.h out of x86-specific region.
The #include was added inside of the conditional by accident and the lack
of it broke non-x86 builds.
330884:
Support for TLS offload of TOE connections on T6 adapters.
The TOE engine in Chelsio T6 adapters supports offloading of TLS
encryption and TCP segmentation for offloaded connections. Sockets
using TLS are required to use a set of custom socket options to upload
RX and TX keys to the NIC and to enable RX processing. Currently
these socket options are implemented as TCP options in the vendor
specific range. A patched OpenSSL library will be made available in a
port / package for use with the TLS TOE support.
TOE sockets can either offload both transmit and reception of TLS
records or just transmit. TLS offload (both RX and TX) is enabled by
setting the dev.t6nex.<x>.tls sysctl to 1 and requires TOE to be
enabled on the relevant interface. Transmit offload can be used on
any "normal" or TLS TOE socket by using the custom socket option to
program a transmit key. This permits most TOE sockets to
transparently offload TLS when applications use a patched SSL library
(e.g. using LD_LIBRARY_PATH to request use of a patched OpenSSL
library). Receive offload can only be used with TOE sockets using the
TLS mode. The dev.t6nex.0.toe.tls_rx_ports sysctl can be set to a
list of TCP port numbers. Any connection with either a local or
remote port number in that list will be created as a TLS socket rather
than a plain TOE socket. Note that although this sysctl accepts an
arbitrary list of port numbers, the sysctl(8) tool is only able to set
sysctl nodes to a single value. A TLS socket will hang without
receiving data if used by an application that is not using a patched
SSL library. Thus, the tls_rx_ports node should be used with care.
For a server mostly concerned with offloading TLS transmit, this node
is not needed as plain TOE sockets will fall back to software crypto
when using an unpatched SSL library.
New per-interface statistics nodes are added giving counts of TLS
packets and payload bytes (payload bytes do not include TLS headers or
authentication tags/MACs) offloaded via the TOE engine, e.g.:
TLS transmit work requests are constructed by a new variant of
t4_push_frames() called t4_push_tls_records() in tom/t4_tls.c.
TLS transmit work requests require a buffer containing IVs. If the
IVs are too large to fit into the work request, a separate buffer is
allocated when constructing a work request. This buffer is associated
with the transmit descriptor and freed when the descriptor is ACKed by
the adapter.
Received TLS frames use two new CPL messages. The first message is a
CPL_TLS_DATA containing the decrypted payload of a single TLS record.
The handler places the mbuf containing the received payload on an
mbufq in the TOE pcb. The second message is a CPL_RX_TLS_CMP message
which includes a copy of the TLS header and indicates if there were
any errors. The handler for this message places the TLS header into
the socket buffer followed by the saved mbuf with the payload data.
Both of these handlers are contained in tom/t4_tls.c.
A few routines were exposed from t4_cpl_io.c for use by t4_tls.c
including send_rx_credits(), a new send_rx_modulate(), and
t4_close_conn().
TLS keys for both transmit and receive are stored in onboard memory
in the NIC in the "TLS keys" memory region.
In some cases a TLS socket can hang with pending data available in the
NIC that is not delivered to the host. As a workaround, TLS sockets
are more aggressive about sending CPL_RX_DATA_ACK messages anytime that
any data is read from a TLS socket. In addition, a fallback timer will
periodically send CPL_RX_DATA_ACK messages to the NIC for connections
that are still in the handshake phase. Once the connection has
finished the handshake and programmed RX keys via the socket option,
the timer is stopped.
A new function select_ulp_mode() is used to determine what sub-mode a
given TOE socket should use (plain TOE, DDP, or TLS). The existing
set_tcpddp_ulp_mode() function has been renamed to set_ulp_mode() and
handles initialization of TLS-specific state when necessary in
addition to DDP-specific state.
Since TLS sockets do not receive individual TCP segments but always
receive full TLS records, they can receive more data than is available
in the current window (e.g. if a 16k TLS record is received but the
socket buffer is itself 16k). To cope with this, just drop the window
to 0 when this happens, but track the overage and "eat" the overage as
it is read from the socket buffer, not opening the window (or adding
rx_credits) for the overage bytes.
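The overage bookkeeping described above can be modelled as follows. This is a sketch of the accounting only; the struct, its fields, and the function names are hypothetical and do not correspond to the driver's toepcb state.

```c
#include <stdint.h>

/* Hypothetical receive-window state for one TLS socket. */
struct rxwin_sketch {
	uint32_t window;	/* bytes the peer may still send */
	uint32_t overage;	/* bytes received beyond the window */
};

/* A full TLS record of len bytes arrives at once. */
static void
rx_record(struct rxwin_sketch *w, uint32_t len)
{
	if (len > w->window) {
		w->overage += len - w->window;
		w->window = 0;
	} else {
		w->window -= len;
	}
}

/* The application reads n bytes; returns the credits to hand back. */
static uint32_t
rx_read(struct rxwin_sketch *w, uint32_t n)
{
	uint32_t eat = (n < w->overage) ? n : w->overage;

	w->overage -= eat;	/* overage bytes do not reopen the window */
	w->window += n - eat;
	return (n - eat);
}
```

Reads first pay down the overage, and only the remainder reopens the window.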
330946:
Remove TLS-related inlines from t4_tom.h to fix iw_cxgbe(4) build.
- Remove the one use of is_tls_offload() and the function. AIO special
handling only needs to be disabled when a TOE socket is actively doing
TLS offload on transmit. The TOE socket's mode (which affects receive
operation) doesn't matter, so remove the check for the socket's mode and
only check if a TOE socket has TLS transmit keys configured to determine
if an AIO write request should fall back to the normal socket handling
instead of the TOE fast path.
- Move can_tls_offload() into t4_tls.c. It is not used in critical paths,
so inlining isn't that important. Change return type to bool while here.
330947:
Fix the check for an empty send socket buffer on a TOE TLS socket.
Compare sbavail() with the cached sb_off of already-sent data instead of
always comparing with zero. This will correctly close the connection and
send the FIN if the socket buffer contains some previously-sent data but
no unsent data.
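In sketch form, the corrected test compares the bytes available against the offset of data already sent. The function names are hypothetical and the two values are passed in directly rather than read from a socket buffer.

```c
#include <stdbool.h>
#include <stddef.h>

/* Old check: only true when the buffer holds nothing at all. */
static bool
empty_old(size_t avail)
{
	return (avail == 0);
}

/* New check: true when everything in the buffer was already sent. */
static bool
no_unsent_data(size_t avail, size_t sent_off)
{
	return (avail == sent_off);
}
```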
331649:
Use the offload transmit queue to set flags on TLS connections.
Requests to modify the state of TLS connections need to be sent on the
same queue as TLS record transmit requests to ensure ordering.
However, in order to use the offload transmit queue in t4_set_tcb_field(),
the function needs to be updated to do proper flow control / credit
management when queueing a request to an offload queue. This required
passing a pointer to the toepcb itself to this function, so while here
remove the 'tid' and 'iqid' parameters and obtain those values from the
toepcb in t4_set_tcb_field() itself.
333068:
Use the correct key address when renegotiating the transmit key.
Previously, get_keyid() was returning the address of the receive key
instead of the transmit key when renegotiating the transmit key. This
could either hang the card (if a connection was only offloading TLS TX
and thus had a receive key address of -1) or cause the connection to
fail by overwriting the wrong key (if both RX and TX TLS were
offloaded).
333810:
Be more robust against garbage input on a TOE TLS TX socket.
If a socket is closed or shutdown and a partial record (or what
appears to be a partial record) is waiting in the socket buffer,
discard the partial record and close the connection rather than
waiting forever for the rest of the record.
337722:
Whitespace nit in t4_tom.h
340466:
Move the TLS key map into the adapter softc so non-TOE code can use it.
340468:
Change the quantum for TLS key addresses to 32 bytes.
The addresses passed when reading and writing keys are always shifted
right by 5 as the memory locations are addressed in 32-byte chunks, so
the quantum needs to be 32, not 8.
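A short sketch of why the quantum must match the shift: an address is only usable if shifting it right by 5 loses no bits. The macro and function names are hypothetical.

```c
#include <stdint.h>

#define TLS_KEY_QUANTUM_SKETCH	32u	/* bytes per addressable chunk */

/* An address is usable only if it sits on a 32-byte boundary. */
static int
key_addr_valid(uint32_t addr)
{
	return ((addr % TLS_KEY_QUANTUM_SKETCH) == 0);
}

/* The value handed to the hardware is the address in 32-byte units. */
static uint32_t
key_addr_to_unit(uint32_t addr)
{
	return (addr >> 5);
}
```

With a quantum of 8, an allocator could hand out address 8, which the shift would silently turn into unit 0.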
340469:
Remove bogus roundup2() of the key programming work request header.
The key context is always placed immediately after the work request
header. The total work request length has to be rounded up by 16
however.
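The length calculation can be sketched like this: round the sum of header and key context up to 16, instead of rounding the header alone. The macro mirrors the shape of FreeBSD's roundup2(); the function name and the sample lengths are hypothetical.

```c
#include <stddef.h>

/* Same shape as roundup2(): y must be a power of two. */
#define ROUNDUP2_SKETCH(x, y)	(((x) + ((y) - 1)) & ~((y) - 1))

/* Hypothetical: total work request length, header followed by key context. */
static size_t
key_wr_len(size_t hdr_len, size_t ctx_len)
{
	return (ROUNDUP2_SKETCH(hdr_len + ctx_len, (size_t)16));
}
```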
340473:
Restore the <sys/vmem.h> header to fix build of cxgbe(4) TOM.
vmems are not just used for TLS memory in TOM and the #include actually
predates the TLS code so should not have been removed when the TLS vmem
moved in r340466.
avos [Thu, 28 Mar 2019 09:50:25 +0000 (09:50 +0000)]
MFC r344990:
Fix ieee80211_radiotap(9) usage in wireless drivers:
- Alignment issues:
* Add missing __packed attributes + padding across all drivers; in
most places there was an assumption that padding will be always
minimally suitable; in few places - e.g., in urtw(4) / rtwn(4) -
padding was just missing.
* Add __aligned(8) attribute for all Rx radiotap headers since they can
contain 64-bit TSF timestamp; it cannot appear in Tx radiotap headers, so
just drop the attribute here. Refresh ieee80211_radiotap(9) man page
accordingly.
- Since net80211 automatically updates channel frequency / flags in
ieee80211_radiotap_chan_change() drop duplicate setup for these fields
in drivers.
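A sketch of an Rx radiotap header laid out along these lines is shown below. The driver-specific struct and its field selection are hypothetical; the attributes are written in GCC/Clang syntax, which is what FreeBSD's __packed and __aligned(8) macros expand to on those compilers.

```c
#include <stddef.h>
#include <stdint.h>

/* Common radiotap header fields (8 bytes). */
struct rt_hdr_sketch {
	uint8_t		it_version;
	uint8_t		it_pad;
	uint16_t	it_len;
	uint32_t	it_present;
} __attribute__((packed));

/* Hypothetical driver Rx header carrying a 64-bit TSF timestamp. */
struct rx_radiotap_sketch {
	struct rt_hdr_sketch	hdr;
	uint64_t		tsf;
	uint8_t			flags;
	uint8_t			rate;
	uint16_t		chan_freq;
	uint16_t		chan_flags;
} __attribute__((packed, aligned(8)));

/* The timestamp must land on an 8-byte boundary. */
_Static_assert(offsetof(struct rx_radiotap_sketch, tsf) == 8,
    "tsf must follow the 8-byte header directly");
_Static_assert(_Alignof(struct rx_radiotap_sketch) == 8,
    "Rx header must be 8-byte aligned");
```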
ngie [Wed, 27 Mar 2019 19:59:36 +0000 (19:59 +0000)]
MFC r344066:
Add rc.resume(8) alias for rc(8) to fix the manpage cross references
This issue was noticed when running `make manlint` as part of MFCing r342597 to
^/stable/11:
```
$ make -C share/man/man8 rc.8lint
mandoc -Tascii -Tlint rc.8
mandoc: rc.8:548:6: STYLE: referenced manual not found: Xr rc.resume 8
$
```
wulf [Wed, 27 Mar 2019 19:17:42 +0000 (19:17 +0000)]
MFC: r344982, r345022
atrtc(4): install ACPI RTC/CMOS operation region handler
FreeBSD base system does not provide an ACPI handler for the PC/AT RTC/CMOS
device with PnP ID PNP0B00; on some HP laptops, the absence of this handler
causes suspend/resume and poweroff(8) to hang or fail [1], [2]. On these
laptops EC _REG method queries the RTC date/time registers via ACPI
before suspending/powering off. The handler should be registered before
acpi_ec driver is loaded.
This change adds a handler to access the CMOS RTC operation region described in
section 9.15 of ACPI-6.2 specification [3]. It is installed only for ACPI
version of atrtc(4) so it should not affect old ACPI-less i386 systems.
It is possible to disable the handler with loader tunable:
debug.acpi.disabled=atrtc
Informational debugging printf can be enabled by setting hw.acpi.verbose=1
in loader.conf
PR: 207419, 213039
Submitted by: Anthony Jenkins <Scoobi_doo@yahoo.com>
Reviewed by: ian
Discussed on: acpi@, 2013-2015, several threads
Differential Revision: https://reviews.freebsd.org/D19314
jilles [Tue, 26 Mar 2019 22:34:07 +0000 (22:34 +0000)]
MFC r344502: sh: Add set -o pipefail
The pipefail option allows checking the exit status of all commands in a
pipeline more easily, at a limited cost of complexity in sh itself. It works
similarly to the option in bash, ksh93 and mksh.
Like ksh93 and unlike bash and mksh, the state of the option is saved when a
pipeline is started. Therefore, even in the case of commands like
A | B &
a later change of the option does not change the exit status, the same way
(A | B) &
works.
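As a small C model of the rule (not sh's implementation): with pipefail set, the pipeline's status is the last non-zero status of any command, or zero if all succeeded; without it, only the final command counts. The function name and the array representation are hypothetical.

```c
#include <stddef.h>

/*
 * Hypothetical model: st[] holds the exit status of each command in
 * pipeline order; pipefail is the option state saved at pipeline start.
 */
static int
pipeline_status(const int *st, size_t n, int pipefail)
{
	if (n == 0)
		return (0);
	if (!pipefail)
		return (st[n - 1]);
	/* Scan from the right for the last non-zero status. */
	for (size_t i = n; i > 0; i--) {
		if (st[i - 1] != 0)
			return (st[i - 1]);
	}
	return (0);
}
```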
Since SIGPIPE is not handled specially, more work in the script is required
for a proper exit status for pipelines containing commands such as head that
may terminate successfully without reading all input. This can be something
like
(
cmd1
r=$?
if [ "$r" -gt 128 ] && [ "$(kill -l "$r")" = PIPE ]; then
exit 0
else
exit "$r"
fi
) | head
hselasky [Tue, 26 Mar 2019 13:41:27 +0000 (13:41 +0000)]
MFC r345010:
Improve support for switching to and from command polling mode in mlx4core.
Make sure the enter and leave polling routines can be called multiple times
with the same setting. Ignore setting polling or event mode twice. This fixes a
deadlock during shutdown if polling mode was already selected.
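The idempotent switch can be sketched as a guard on the current mode. The enum, struct, and function are hypothetical and leave out the locking and semaphore handling of the real driver.

```c
/* Hypothetical command-interface modes. */
enum cmd_mode_sketch {
	CMD_MODE_EVENTS,
	CMD_MODE_POLLING
};

struct cmd_sketch {
	enum cmd_mode_sketch mode;
	int switches;		/* counts real transitions, for illustration */
};

/* Switching to the mode already in effect is a no-op. */
static void
set_cmd_mode(struct cmd_sketch *c, enum cmd_mode_sketch m)
{
	if (c->mode == m)
		return;
	c->mode = m;
	c->switches++;
}
```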
hselasky [Tue, 26 Mar 2019 13:38:49 +0000 (13:38 +0000)]
MFC r344920:
Teardown ifnet after stopping port in the mlx4en(4) driver.
mlx4_en_stop_port() calls mlx4_en_put_qp() which can refer to the link level
address of the network interface, which in turn will be freed by the
network interface detach function. Make sure the port is stopped
before detaching the network interface.
dab [Mon, 25 Mar 2019 17:04:14 +0000 (17:04 +0000)]
MFC r345009:
Fix a scribbler in the PMS driver.
The ESGL bit was left uninitialized when executing the REPORT LUNS
ioctl. This could allow a zeroed data buffer to be treated as a
scatter/gather list. The firmware would eventually walk past the end
of the data buffer, potentially find what looked like a valid
address/length pair, and write the result to semi-random memory.
marcel [Sat, 23 Mar 2019 03:10:23 +0000 (03:10 +0000)]
MFC 344790:
Revert revision 254095
In revision 254095, gpt_entries is not set to match the on-disk
hdr_entries, but rather is computed based on available space.
There are 2 problems with this:
1. The GPT backend respects hdr_entries and only reads and writes
that number of partition entries. On top of that, CRC32 is
computed over the table that has hdr_entries elements. When
the common code works on what is possibly a larger number, the
behaviour becomes inconsistent and problematic. In particular,
it would be possible to add a new partition that on a reboot
isn't there anymore.
2. The calculation of gpt_entries is based on flawed assumptions.
The GPT specification does not dictate that sectors are laid
out in a particular way that the available space can be
determined by looking at LBAs. In practice, implementations
do the same thing, because there's no reason to do it any
other way. Still, GPT allows certain freedoms that can be
exploited in some form or shape if the need arises.
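To illustrate the first problem, the sketch below computes the table checksum over exactly hdr_entries entries. The CRC routine is the standard bitwise CRC-32 with reflected polynomial 0xEDB88320; the helper names are hypothetical and this is not the g_part_gpt code.

```c
#include <stddef.h>
#include <stdint.h>

/* Bitwise CRC-32 (reflected polynomial 0xEDB88320). */
static uint32_t
crc32_sketch(const void *buf, size_t len)
{
	const uint8_t *p = buf;
	uint32_t crc = 0xFFFFFFFFu;

	for (size_t i = 0; i < len; i++) {
		crc ^= p[i];
		for (int bit = 0; bit < 8; bit++)
			crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
	}
	return (~crc);
}

/* Hypothetical helper: checksum exactly hdr_entries entries, no more. */
static uint32_t
table_crc(const uint8_t *table, uint32_t hdr_entries, uint32_t entry_size)
{
	return (crc32_sketch(table, (size_t)hdr_entries * entry_size));
}
```

Code that works with a different entry count than hdr_entries would checksum a different byte range, which is the inconsistency the revert removes.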
Eliminate trailing whitespace on inet, inet6, and groups lines. I think the
"list txpower" command will still show some, but I'm not able to test that.
PR: 153731
Reported-by: Nikolay Denev <ndenev@gmail.com>
Differential Revision: https://reviews.freebsd.org/D19004
asomers [Thu, 21 Mar 2019 22:23:52 +0000 (22:23 +0000)]
MFC r341390, r341392, r341667
r341390:
Remove some dead code from the geli tests
This is detritus in the Makefile, leftover from 327662.
r341392:
Unbreak geli/gmirror testcases if their geom classes cannot be loaded
The problem with the logic prior to this commit was twofold:
1. The wrong set of idioms (TAP-compatible) were being applied to the ATF
testcases when run, resulting in confusing ATF failure results on setup.
2. The cleanup subroutines were broken when the geom classes could not be
loaded as they exited with 0 unexpectedly.
This commit changes the test code to source the class-specific configuration
(conf.sh) once globally, instead of sourcing it per testcase and per cleanup
subroutine, and to call the ATF-specific setup subroutine(s) inline in
the testcases.
The refactoring done is effectively a no-op for the TAP testcases, modulo
any refactoring done to create common code between the ATF and TAP
testcases.
This unbreaks the geli testcases converted to ATF in r327662 and r327683,
and the gmirror testcases added in r327780, respectively, when the geom
class could not be loaded.
tests/sys/geom/class/mirror/...
While here, ignore errors when turning debug failpoint sysctl off, which
could occur if the gmirror class was not loaded.
r341667:
geom tests: Fix cleanup of ATF tests since r341392
r341392 changed common test cleanup routines in a way that allowed them to
be used by TAP tests as well as ATF tests. However, a late change made
during code review resulted in cleanup being broken for ATF tests, which
source geom_subr.sh separately during the body and cleanup phases of the
test. The result was that md(4) devices wouldn't get cleaned up.
asomers [Thu, 21 Mar 2019 22:18:22 +0000 (22:18 +0000)]
MFC r340988:
vfs_aio.c: rename "physio" symbols to "bio".
aio has two paths: an asynchronous "physio" path and a synchronous path.
Confusingly, physio(9) isn't actually used by the "physio" path, and never
has been. In fact, it may even be called by the synchronous path! Rename
the "physio" path to the "bio" path to reflect what it actually does:
directly compose BIOs and send them to character devices.