Alexander Motin [Thu, 11 Apr 2019 13:20:48 +0000 (13:20 +0000)]
MFC r344936: MFV/ZoL: Disable LBA weighting on files and SSDs
The LBA weighting makes sense on rotational media where the outer tracks
have twice the bandwidth of the inner tracks. However, it is detrimental
on nonrotational media such as solid state disks, where the only effect
is to ensure that metaslabs enter the best-fit allocation behavior
sooner, which is detrimental to performance. It also makes no sense on
files where the underlying filesystem can arrange things however it
wants.
Author: Richard Yao <ryao@gentoo.org>
Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3712
zfsonlinux/zfs@fb40095f5f0853946f8150481ca22602d1334dfe
To reduce code divergence this merge replaces equivalent but different
FreeBSD code detecting non-rotating medium vdevs.
Alexander Motin [Thu, 11 Apr 2019 13:19:26 +0000 (13:19 +0000)]
MFC r344934, r345014: Add separate aggregation limit for non-rotating media.
Before the sequential scrub patches, ZFS never aggregated I/Os above 128KB.
Sequential scrub bumped that to 1MB, whose motivation I understand for
spinning disks, since it should reduce the number of head seeks. But for
SSDs it makes much less sense to me, especially on FreeBSD, where due
to the MAXPHYS limitation the device will likely still see a bunch of 128KB
I/Os instead of one large one. A stricter aggregation limit avoids
allocating a large memory buffer and the memcpy to/from it, which is
a serious problem when bandwidth reaches a few GB/s.
Update the bounds checking for zfs_vdev_aggregation_limit so that
it has a floor of zero and a maximum value of the supported block
size for the pool.
Additionally add an early return when zfs_vdev_aggregation_limit
equals zero to disable aggregation. For very fast solid state or
memory devices it may be more expensive to perform the aggregation
than to issue the IO immediately.
Commit 8542ef8 allowed optional IOs to be aggregated beyond
the specified aggregation limit. Since the aggregation limit
was also used to enforce the maximum block size, setting
`zfs_vdev_aggregation_limit=16777216` could result in an
attempt to allocate an ABD larger than 16M.
Author: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #6259
Closes #6270
zfsonlinux/zfs@2d678f779aba26a93314c8ee1142c3985fa25cb6
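The bounds checking and early return described in this message can be illustrated with a minimal sketch. This is not the ZFS code: SKETCH_MAX_BLOCKSIZE, sketch_effective_limit() and sketch_may_aggregate() are hypothetical names, and the 16M ceiling is an assumption borrowed from the example value quoted above.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the pool's maximum supported block size (16M). */
#define SKETCH_MAX_BLOCKSIZE (16ULL << 20)

/*
 * Clamp a user-supplied limit into [0, max_blocksize].
 * A result of 0 means "do not aggregate at all".
 */
static uint64_t
sketch_effective_limit(int64_t tunable, uint64_t max_blocksize)
{
	if (tunable <= 0)
		return (0);
	if ((uint64_t)tunable > max_blocksize)
		return (max_blocksize);
	return ((uint64_t)tunable);
}

/* Decide whether an aggregate of agg_size bytes may be built. */
static bool
sketch_may_aggregate(int64_t tunable, uint64_t max_blocksize,
    uint64_t agg_size)
{
	uint64_t limit = sketch_effective_limit(tunable, max_blocksize);

	if (limit == 0)
		return (false);	/* early return: aggregation disabled */
	return (agg_size <= limit);
}
```

With the clamp applied before every size check, an oversized tunable can no longer push an aggregate past the ceiling, and a zero tunable short-circuits before any buffer would be allocated.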
Guangyuan Yang [Thu, 11 Apr 2019 00:41:07 +0000 (00:41 +0000)]
MFC r345887:
Rewrite intro(4) man page.
- Remove issues that no longer apply thanks to devfs
- Add language pointing out devfs's role and referencing its config
- Add a "historical notes" section and move discussion of block vs character devs to it, including pointing out the removal of block devs
- Modernize some examples
Martin Matuska [Wed, 10 Apr 2019 21:45:23 +0000 (21:45 +0000)]
MFC r345497:
Sync libarchive with vendor.
Relevant vendor changes:
PR #1153: fixed 2 bugs in ZIP reader [1]
PR #1143: ensure archive_read_disk_entry_from_file() uses ARCHIVE_READ_DISK
Changes to file flags code, support more file flags on FreeBSD:
UF_OFFLINE, UF_READONLY, UF_SPARSE, UF_REPARSE, UF_SYSTEM
UF_ARCHIVE is not supported by intention (yet)
Enji Cooper [Tue, 9 Apr 2019 16:35:23 +0000 (16:35 +0000)]
MFC r344662:
Remove references to pdwait4(2) and `CAP_PDWAIT` from rights(4)
@cem removed references to pdwait4(2) (a nonexistent syscall) in
r320058.
This change removes references to pdwait4(2) and `CAP_PDWAIT` in
rights(4) to not mislead the user into thinking that pdwait4(2)/`CAP_PDWAIT` is
actually implemented in the stock FreeBSD kernel.
The goal of this functionality was to simplify monitoring/manipulating
processes started with `pdfork`, et al, and avoid races with waiting on pids.
The syscall was never completed though--just discussed on the capsicum mailing
list back in 2015:
https://lists.cam.ac.uk/pipermail/cl-capsicum-discuss/2015-May/msg00012.html
That being said, there are members of the project (@rwatson, etc) who
have longterm goals to implement this syscall to better secure pdfork(2)
calls.
MFC r344161: stand: dev_net: correct net_open's interpretation of params
net_open previously cast the first vararg to a char * and this was
half-OK: at first, it is passed to netif_open, which would cast it back to
the struct devdesc * that it really is and use it properly. It is then
strdup()d and used as the netdev_name, which is objectively wrong.
Correct it so that the first vararg is properly cast to a struct devdesc *
and the netdev_name gets set properly to make it more clear at a glance that
it's not doing something horribly wrong.
r343065:
With the sync from Dragonfly BSD in r318216 a bug slipped in (also still present
upstream it seems).
The tlv variable was changed to a pointer but the advancement of the data pointer
was left as sizeof(tlv). While the size of the (now) pointer equals the
size of 2 x uint32_t (the size of the struct) on 64-bit platforms, on 32-bit platforms
the data pointer was advanced by the wrong amount, leading to
firmware load issues.
Correctly advance the data pointer by the size of the structure and not by
the size of a pointer.
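The pointer-versus-struct size mix-up can be shown with a small, self-contained sketch. struct sketch_tlv and sketch_count_tlvs() are hypothetical names, not the driver's; the layout of two uint32_t fields follows the description above, and the sketch assumes payloads are stored without extra padding.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical TLV header: two 32-bit fields, 8 bytes in total. */
struct sketch_tlv {
	uint32_t type;
	uint32_t length;	/* payload bytes following the header */
};

/* Count the complete TLV records found in data[0..len). */
static size_t
sketch_count_tlvs(const uint8_t *data, size_t len)
{
	size_t count = 0;

	while (len >= sizeof(struct sketch_tlv)) {
		struct sketch_tlv hdr;

		memcpy(&hdr, data, sizeof(hdr));	/* avoid unaligned reads */
		/*
		 * Advance by the size of the structure. Using the size of a
		 * pointer to it would be 4 on 32-bit platforms, not 8.
		 */
		data += sizeof(struct sketch_tlv);
		len -= sizeof(struct sketch_tlv);
		if (hdr.length > len)
			break;		/* truncated payload */
		data += hdr.length;
		len -= hdr.length;
		count++;
	}
	return (count);
}
```

Writing sizeof(struct sketch_tlv), or sizeof(*ptr) for a pointer variable, keeps the step correct regardless of the platform's pointer width.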
r343373:
if_iwm - Update firmware rs table, instead of indexing the table in tx cmds.
* Rather than providing a non-zero index into the firmware RS table,
we should always use index 0 and update the firmware RS table whenever
our chosen tx rate for data-frames changes.
* Send IWM_LQ_CMD updates when the tx rate gets updated by the net80211
rate control (which is after we tell the tx status to the net80211
rate-control in iwm_mvm_rx_tx_cmd_single()).
* Disregard frames transferred with a different tx rate than the currently
selected rate for the rate-control calculations. This way we avoid
counting management frames (which are sent at a slow, and fixed rate),
as well as frames we added to the tx queue just before a new IWM_LQ_CMD
update took effect.
r343374:
if_iwm - The iwm_prepare_card_hw() in iwm_attach() is only needed on 8K hw.
* Doing the iwm_prepare_card_hw() call in iwm_attach() only on Family 8000
hardware matches the code in Linux iwlwifi.
* While there remove DEFAULT_MAX_TX_POWER definition which is unused, and
has a value different from IWL_DEFAULT_MAX_TX_POWER in iwlwifi.
r343375:
if_iwm - Move iwm_read_firmware() call into iwm_attach().
* We should load the firmware exactly once before the driver really
initializes the hardware the first time, and unload it at detach time.
There is no need to retrieve the firmware during execution of
iwm_mvm_load_ucode_wait_alive(), we should make sure we already have the
firmware data at hand before that.
* The existing sc_preinit_hook code fails to deal with the case where
if_iwm is loaded by the loader (or is statically linked) and the
firmware needs to be loaded from disk. So we can just call
iwm_read_firmware() from iwm_attach() directly.
* A separate solution will have to be added to properly defer the firmware
loading during bootup, until the necessary filesystem is mounted.
r343376:
if_iwm - Check sc->sc_attached flag in suspend/resume callbacks.
* There is (almost) nothing to do in suspend/resume if if_iwm has failed
during initialization (e.g. because of firmware load failure) and was
already uninitialized by iwm_detach_local().
r343377:
iwm - Reduce gratuitous differences with Linux iwlwifi in struct naming.
* Rename some structs and struct members for firmware handling.
r343380:
if_iwm - Add firmware API definitions for TX power commands.
* While there remove unused IWM_UCODE_TLV_CAPA_LMAC_UPLOAD definition,
which isn't defined in iwlwifi.
Taken-From: Linux iwlwifi
r343381:
iwm - Track firmware state better, and improve handling in iwm_newstate().
* This avoids firmware resets in all the cases in iwm_newstate(). Instead
iwm_bring_down_firmware() is called, which tears down all the STA
connection state, according to the sc->sc_firmware_state value.
* Improve the behaviour of the LED blinking a bit, so it only blinks when
there really is a wireless scan going on.
* Print the newstate arg in debug output of iwm_newstate(), to help in
debugging.
r343477:
Fix logic errors in iwm_pcie_load_firmware_chunk introduced in r314065.
* There's no reason to have a while() loop here, because:
- if msleep returns 0, that means we were woken up by the interrupt handler,
and we are going to exit immediately as sc_fw_chunk_done will now be 1
(there is nothing else that sleeps on sc_fw.)
- if msleep doesn't return 0 (i.e. it returned ETIMEDOUT) then we will
exit immediately because of the if-test.
So, just use a single msleep() and then check sc_fw_chunk_done as before.
* The comment said we were sleeping for 5 seconds, but the msleep was only
for 1. Before r314065, this was 1 second and so was the comment,
and in that commit the comment was changed and the function call wasn't.
Possibly fixes failures to initialize uCode on certain devices.
MFC r343255: awg: fix soft reset failure with no link
U-Boot will leave the ephy reset de-asserted and the MAC soft reset will
fail on these boards with internal PHY and no link established. Toggle reset
again before proceeding to attach/init.
Previously, we directly used libzfs_core's lzc_receive to import to a
temporary snapshot, then cloned the snapshot and setup the properties. This
failed when attempting to import replication streams with questionable
error.
libzfs's zfs_receive is a much better fit here, so we now use it instead
with the destination dataset and let libzfs take care of the dirty details.
be_import is greatly simplified as a result.
r343977:
libbe(3): Add a destroy option for removing the origin
Currently origin snapshots are left behind when a BE is destroyed, whether
it was an auto-created snapshot or explicitly specified via, for example,
`bectl create -e be@mysnap ...`.
Removing it automatically could be argued as a POLA violation in some
circumstances, so provide a flag to be_destroy for it. An accompanying
option will be added to bectl(8) to utilize this.
Some minor style/consistency nits in the affected areas also addressed.
r343993:
bectl(8): Add -o flag to destroy to clean up the origin snapshot of BE
We can't predict when destruction of origin is needed, and currently we have
a precedent for not prompting for things. Leave the decision up to the user
of bectl(8) if they want the origin snapshot to be destroyed or not.
Emits a warning when -o isn't used and an origin snapshot is left to be
cleaned up, for the time being. This is handy when one drops the -o flag but
really did want to clean up the origin.
A couple of -e ignore's have been sprinkled around the test suite for places
that we don't care that the origin's not been cleaned up. -o functionality
tests will be added in the future, but are omitted for now to reduce
conflicts with work in flight to fix bits of the tests.
r343994:
bectl(8): commit missing test modifications from r343993
r344034:
libbe(3): Belatedly note the BE_DESTROY_ORIGIN option added in r343977
r344084:
libbe(3): Fix be_destroy behavior w.r.t. deep BE snapshots and -o
be_destroy is documented to recursively destroy a boot environment. In the
case of snapshots, one would take this to mean that these are also
recursively destroyed. However, this was previously not the case.
be_destroy would descend into the be_destroy callback and attempt to
zfs_iter_children on the top-level snapshot, which is bogus.
Our alternative approach is to take note of the snapshot name and iterate
through all of the fs children of the BE to try destruction in the children.
The -o option is also fixed to work properly with deep BEs. If the BE was
created with `bectl create -e otherDeepBE newDeepBE`, for instance, then a
recursive snapshot of otherDeepBE would have been taken for construction of
newDeepBE but a subsequent destroy with BE_DESTROY_ORIGIN set would only
clean up the snapshot at the root of otherDeepBE: ${BEROOT}/otherDeepBE@...
The most recent iteration instead pretends not to know how these things
work, verifies that the origin is another BE and then passes that back
through be_destroy to DTRT when snapshots and deep BEs may be in play.
r345302:
bectl(8): change jail command to execute jail(8)
The jail(8) command provides a variety of jail pseudo-parameters that are
useful to consumers of bectl, mount.devfs being the most-often-requested
parameter by bectl users.
command, exec.start, nopersist, and persist may not be specified via -o to
bectl. The command/exec.start remains passed as it always has at the end of
bectl, and persistence is dictated by -b/-U bectl jail arguments.
'be_destroy' can destroy a boot environment (by name) or a given snapshot.
If the target to be destroyed is a dataset, check if it's mounted. We don't
want to check if the origin dataset is mounted when destroying a snapshot.
MFC r345848: libbe(3): Add a serial to the generated snapshot names
To use bectl in an example, when one creates a new boot environment with
either `bectl create <be>` or `bectl create -e <otherbe> <be>`, libbe will
take a snapshot of the original boot environment to clone. Previously, this
used %F-%T date format as the snapshot name, but this has some limitations:
attempting to create multiple boot environments in quick succession may
collide if done within the same second.
Tack a serial onto it to reduce the chances of a collision... we could still
collide if multiple processes/threads are creating boot environments at the
same time, but this is likely not a big concern as this has only been
reported as occurring in freebsd-ci setup.
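A rough sketch of a timestamp-plus-serial name is given here for illustration. sketch_snap_name() is a hypothetical helper, and the "-%d" suffix layout is an assumption; the message only states that a serial is tacked onto the %F-%T timestamp.

```c
#include <stdio.h>
#include <time.h>

/*
 * Format "<%F-%T>-<serial>" into buf.
 * Returns 0 on success, -1 if formatting fails or buf is too small.
 */
static int
sketch_snap_name(char *buf, size_t buflen, const struct tm *tm, int serial)
{
	char stamp[32];
	int n;

	/* %F is %Y-%m-%d and %T is %H:%M:%S. */
	if (strftime(stamp, sizeof(stamp), "%F-%T", tm) == 0)
		return (-1);
	n = snprintf(buf, buflen, "%s-%d", stamp, serial);
	return ((n < 0 || (size_t)n >= buflen) ? -1 : 0);
}
```

Two calls within the same second produce distinct names as long as the caller passes different serials; choosing the serial safely across concurrent processes is outside the scope of this sketch.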
freebsd32: fix padding of computed control message length for recvmsg()
Each control message region must be aligned on a 4-byte boundary on 32-bit
architectures. The 32-bit compat shim for recvmsg() gets the actual layout
right, but doesn't pad the payload length when computing msg_controllen for
the output message header. If a control message contains an unaligned
payload, such as the 1-byte TTL field in the example attached to PR 236737,
this can produce control message payload boundaries that extend beyond
the boundary reported by msg_controllen.
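The arithmetic behind the padding fix can be sketched as follows. The macro and function names are hypothetical, and the 12-byte header size is an assumption (three 4-byte fields) rather than a value taken from the compat shim.

```c
#include <stddef.h>

/* Round len up to the next multiple of 4, the 32-bit alignment named above. */
#define SKETCH_ALIGN32(len) (((size_t)(len) + 3u) & ~(size_t)3u)

/* Hypothetical 32-bit control message header size: three 4-byte fields. */
#define SKETCH_HDR32 12u

/* Space one control message occupies, with header and payload each padded. */
static size_t
sketch_cmsg_space32(size_t payload_len)
{
	return (SKETCH_ALIGN32(SKETCH_HDR32) + SKETCH_ALIGN32(payload_len));
}

/* Sum the padded space of n messages to get the reported control length. */
static size_t
sketch_controllen32(const size_t *payload_lens, size_t n)
{
	size_t total = 0;

	for (size_t i = 0; i < n; i++)
		total += sketch_cmsg_space32(payload_lens[i]);
	return (total);
}
```

Under these assumptions a 1-byte payload occupies 16 bytes rather than 13, so a reported length built from unpadded payloads would fall short of where the next message actually ends.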
Fix regression in top(1) after r344381, causing informational messages
to no longer be displayed. This was because the reimplementation of
setup_buffer() did not copy the previous contents into any reallocated
buffer.
Reported by: James Wright <james.wright@jigsawdezign.com>
PR: 236947
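The class of bug described here, a buffer regrown without carrying over its old contents, can be illustrated with a short sketch. sketch_grow_buffer() is a hypothetical helper and makes no claim about how top(1)'s setup_buffer() is actually written.

```c
#include <stdlib.h>
#include <string.h>

/*
 * Return a zero-filled buffer of new_len bytes that starts with the first
 * min(old_len, new_len) bytes of old. On success old is freed; on failure
 * NULL is returned and old is left untouched.
 */
static char *
sketch_grow_buffer(char *old, size_t old_len, size_t new_len)
{
	char *buf = calloc(new_len, 1);

	if (buf == NULL)
		return (NULL);
	if (old != NULL) {
		/* The step that must not be skipped: keep prior contents. */
		memcpy(buf, old, old_len < new_len ? old_len : new_len);
		free(old);
	}
	return (buf);
}
```

Omitting the memcpy() still yields a valid buffer, which is why this kind of regression tends to show up only as missing output rather than as a crash.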
MFC r344243, r345517-r345518: lualoader: More intelligent screen clearing
r344243:
lualoader: only clear the screen before first password prompt
This was previously an unconditional screen clear, regardless of whether or
not we would be prompting for any passwords. This is pointless, given that
the screen clear is only there to put our screen into a consistent state
before we draw the prompts and do cursor manipulation.
This is also the only screen clear besides that to draw the menu. One can
now see early pre-loader and loader output with the menu disabled, which may
be useful for diagnostics.
r345517:
lualoader: Clear the screen before prompting for password
Assuming that the autoboot sequence was interrupted, we've done enough
cursor manipulation that the prompt for the password will be sufficiently
obscured a couple of lines up. Clear the screen and reset the cursor
position here, too.
r345518:
lualoader: Fix up some luacheck concerns
- Garbage collect an unused (removed because it was useless) constant
- Don't bother with vararg notation if args will not be used
Highlights:
- Bugfix for order in which /delete-node/ and /delete-property/ are
processed [0]
- /omit-if-no-ref/ support has been added (used only by U-Boot at this
point, in theory)
- GPL dtc compat version bumped to 1.4.7
- Various small fixes and compatibility improvements
MFC r344677: patch(1): Exit successfully if we're fed a 0-length patch
This change is made in the name of GNU patch compatibility. If GNU patch is
fed a zero-length patch, it will exit successfully with no output. This is
used in at least one port to date (comms/wsjtx), and we break on this usage.
It seems unlikely that anyone relies on patch(1) calling their completely
empty patch garbage and failing, and GNU compatibility is a plus if it helps
with porting, so make the switch.
Ed Maste [Wed, 3 Apr 2019 13:19:47 +0000 (13:19 +0000)]
MFC r343764 (jchandra): arm, acpi: increase size of memory region arrays
Bump up MAX_HWCNT and MAX_EXCNT to 32 when ACPI is enabled. These are
the sizes of the hwregions and exregions arrays respectively. ACPI
firmware typically has more memory regions and the current value of
16 is not sufficient for some platforms.
This commit fixes a failure seen with AMI firmware on Cavium's Sabre
ThunderX2 reference platform. This platform needs 21 physical memory
regions and 18 excluded regions to boot correctly with the current
firmware release.
Ravi Pokala [Wed, 3 Apr 2019 06:36:41 +0000 (06:36 +0000)]
MFC r345611:
Teach jedec_dimm(4) to be more forgiving of non-fatal errors.
It looks like some DIMMs claim to have a TSOD, but actually don't. Some
claim they weren't able to change the SPD page, but they did. Neither of
those should be fatal errors.
Ravi Pokala [Wed, 3 Apr 2019 03:30:14 +0000 (03:30 +0000)]
MFC r345457:
Add descriptions for sysctls in kern_mib.c and sysctl.3 which lack them.
r343532 noted the difference between "hw.realmem" and "hw.physmem", which I
was previously unaware of. I discovered that neither sysctl had a
description visible via `sysctl -d', so I found where they were defined and
added suitable descriptions. While in the file, I went ahead and added
descriptions for all the others which lacked them. I also updated sysctl.3
accordingly.
MFC r345292:
Convert allocation of bpf_if in bpfattach2 from M_NOWAIT to M_WAITOK
and remove possible panic condition.
It is already allowed to sleep in bpfattach[2], since BPF_LOCK was
converted to SX lock in r332388. Also move KASSERT() to the top of
function and make full initialization before bpf_if will be linked
to BPF's list of interfaces.
Mark Johnston [Mon, 1 Apr 2019 14:19:09 +0000 (14:19 +0000)]
Fix if_(m)addr_rlock().
The use of a per-ifnet epoch context meant that these KPIs were not
reentrant. This was fixed in head in r340413, but the change cannot
be MFCed because it breaks the KBI by modifying struct thread. This
is a direct commit to stable/12 which uses a per-CPU mutex to fix
the problem without changing the KBI.
PR: 236846
Submitted by: hselasky
Reported and tested by: Viktor Dukhovni <ietf-dane@dukhovni.org>
Reviewed by: hselasky (previous version)
Differential Revision: https://reviews.freebsd.org/D19764
Some applications forward from/to host rings most or all the
traffic received or sent on a physical interface. In these
cases it is desirable to have more than a pair of RX/TX host
rings, and use multiple threads to speed up forwarding.
This change adds support for multiple host rings. On registering
a netmap port, the user can specify the number of desired receive
and transmit host rings in the nr_host_tx_rings and nr_host_rx_rings
fields of the nmreq_register structure.
Kristof Provost [Fri, 29 Mar 2019 14:34:51 +0000 (14:34 +0000)]
MFC r345177:
pf: Use counter(9) in pf tables.
The counters of pf tables are updated outside the rule lock. That means state
updates might overwrite each other. Furthermore allocation and
freeing of counters happens outside the lock as well.
Use counter(9) for the counters, and always allocate the counter table
element, so that the race condition cannot happen any more.
Kristof Provost [Fri, 29 Mar 2019 11:59:53 +0000 (11:59 +0000)]
MFC r345178:
bridge: Fix panic if the STP root is removed
If the spanning tree root interface is removed from the bridge we panic
on the next 'ifconfig'.
While the STP code is notified whenever a bridge member interface is
removed from the bridge it does not clear the bs_root_port. This means
bs_root_port can still point at a bridge_iflist which has been free()d.
The next access to it will panic.
Explicitly check if the interface we're removing in bstp_destroy() is
the root, and if so re-assign the roles, which clears bs_root_port.
MFC r344990:
Fix ieee80211_radiotap(9) usage in wireless drivers:
- Alignment issues:
* Add missing __packed attributes + padding across all drivers; in
most places there was an assumption that padding will always be
minimally suitable; in a few places - e.g., in urtw(4) / rtwn(4) -
padding was just missing.
* Add __aligned(8) attribute for all Rx radiotap headers since they can
contain 64-bit TSF timestamp; it cannot appear in Tx radiotap headers, so
just drop the attribute here. Refresh ieee80211_radiotap(9) man page
accordingly.
- Since net80211 automatically updates channel frequency / flags in
ieee80211_radiotap_chan_change() drop duplicate setup for these fields
in drivers.
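The combination of packing, explicit padding and 8-byte alignment can be illustrated with a hypothetical Rx header. struct sketch_rx_radiotap and its field names are invented for this sketch and do not reproduce any driver's header; the attribute spelling assumes a GCC- or Clang-compatible compiler.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical Rx capture header. Packing removes compiler-chosen padding,
 * so every gap is spelled out by hand, and aligned(8) keeps the 64-bit
 * timestamp naturally aligned when the struct itself is placed in memory.
 */
struct sketch_rx_radiotap {
	uint8_t		version;	/* offset 0 */
	uint8_t		pad0;		/* offset 1: explicit padding */
	uint16_t	len;		/* offset 2 */
	uint32_t	present;	/* offset 4 */
	uint64_t	tsf;		/* offset 8: 64-bit timestamp */
	uint8_t		flags;		/* offset 16 */
	uint8_t		rate;		/* offset 17 */
	uint16_t	chan_freq;	/* offset 18 */
	uint16_t	chan_flags;	/* offset 20 */
	int8_t		dbm_signal;	/* offset 22 */
	uint8_t		pad1;		/* offset 23: explicit padding */
} __attribute__((packed, aligned(8)));
```

A Tx header without a 64-bit field would not need the aligned(8) part, which matches the distinction drawn in the message.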
Toomas Soome [Thu, 28 Mar 2019 08:38:31 +0000 (08:38 +0000)]
MFC: r344248,r344387
cd9660: dirmatch fails to unmatch when name is prefix for directory record
The loader fails to properly match the file name in the directory record
and opens the file based on a prefix match.
cd9660_open() passes the whole path to dirmatch(), but we need to
compare only the current path component, not the full path.
Additionally, skip over duplicate / (if any) and check whether the last
component in the path was meant to be a directory (having a trailing /).
If it is in fact a file, error out.
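Component-wise matching, as opposed to prefix matching against the whole path, might look roughly like the sketch below. All three helpers are hypothetical and compare plain bytes; any ISO 9660 name handling done by the real dirmatch() is deliberately left out.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Length of the leading path component: bytes up to the next '/' or NUL. */
static size_t
sketch_component_len(const char *path)
{
	return (strcspn(path, "/"));
}

/*
 * True only if the first component of path equals the directory record
 * name exactly. Comparing lengths first rejects the case where one name
 * is merely a prefix of the other.
 */
static bool
sketch_component_matches(const char *path, const char *name, size_t name_len)
{
	size_t clen = sketch_component_len(path);

	return (clen == name_len && memcmp(path, name, clen) == 0);
}

/* Skip any run of '/' separators, including duplicates such as "a//b". */
static const char *
sketch_skip_slashes(const char *path)
{
	while (*path == '/')
		path++;
	return (path);
}
```

A caller would match one component, advance past it, call sketch_skip_slashes(), and treat an empty remainder after a trailing slash as a request for a directory.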
r343264:
cxgbe(4): Use a truncated firmware header for version checks. All the
version numbers are towards the beginning of the header.
Sponsored by: Chelsio Communications
r343269:
cxgbe(4): Allow negative values in hw.cxgbe.fw_install and take them to
mean that the driver should taste the firmware in the KLD and use that
firmware's version for all its fw_install checks.
The driver gets firmware version information from compiled-in values by
default and this change allows custom (or older/newer) firmware modules
to be used with the stock driver.
There is no change in default behavior.
Sponsored by: Chelsio Communications
r345083:
cxgbe(4): Update T4/5/6 firmwares to 1.23.0.0.
Navdeep Parhar [Wed, 27 Mar 2019 22:21:09 +0000 (22:21 +0000)]
MFC r344524:
cxgbe(4): Updates to the default and hashfilter configurations.
- Do not use nvf = 4 as it is not really supported by the firmware.
Firmwares 1.23.3.0 and above will ignore it silently.
- Increase PF4's share of the VIs and let it use all of the RSS table.
Navdeep Parhar [Wed, 27 Mar 2019 21:50:07 +0000 (21:50 +0000)]
MFC r343233:
cxgbe(4): Clear the reply-pending status of a hashfilter when the reply
indicates an error. Also, do not remove it twice from the hf list in
this case.