pfg [Wed, 28 Feb 2018 02:37:59 +0000 (02:37 +0000)]
MFC r329846:
getpeereid(3): Fix behavior on failure to match documentation.
According to the getpeereid(3) documentation, on failure the value -1 is
returned and the global variable errno is set to indicate the error. We
were returning the error instead.
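A minimal sketch of the documented convention, assuming a hypothetical helper named error_to_retval() (not the libc code): an errno-style error code is turned into a -1 return with errno set, rather than being returned directly.

```c
#include <assert.h>
#include <errno.h>

/*
 * Hypothetical helper, not taken from libc: map an errno-style
 * error code (0 means success) onto the documented convention of
 * returning -1 and setting errno.
 */
static int
error_to_retval(int error)
{
	assert(error >= 0);
	if (error != 0) {
		errno = error;
		return (-1);
	}
	return (0);
}
```

A caller can then test for -1 and consult errno, as the manual page describes.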
rpokala [Wed, 28 Feb 2018 00:29:52 +0000 (00:29 +0000)]
MFC r329682:
mountd: Return proper errno values in a few error paths
When attempting to mount a non-directory which exists, return ENOTDIR
instead of ENOENT. If stat() or statfs() failed, don't pass part of the
invalid (struct statfs) to ex_search(). In that same case, preserve the
value of "bad" rather than overwriting with EACCES.
kevans [Tue, 27 Feb 2018 19:24:06 +0000 (19:24 +0000)]
MFC r318304: getusershell: don't write past end of buffer reading shells
_local_initshells did not reset cp to the beginning of the line buffer for
every iteration that it called fgets(3), leading to writing past the end of
line with fairly long /etc/shells or excessively long line lengths. Correct
this by properly resetting cp.
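The pattern can be sketched as follows, assuming a hypothetical read_shells() function and a 64-byte line buffer (both illustrative, not the libc code). The point is that cp is set back to the start of the buffer before every fgets(3) call, even though it is advanced while parsing the previous line.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical stand-in for the shells-file loop: store each path that
 * starts with '/' into shells[], skipping comments and blank lines.
 */
static int
read_shells(FILE *fp, char shells[][64], int max)
{
	char line[64];
	char *cp;
	int n = 0;

	assert(fp != NULL);
	while (n < max) {
		cp = line;	/* reset before every fgets(3) call */
		if (fgets(cp, sizeof(line), fp) == NULL)
			break;
		/* Advance cp to the first '/' unless a comment starts first. */
		while (*cp != '#' && *cp != '/' && *cp != '\0')
			cp++;
		if (*cp != '/')
			continue;
		cp[strcspn(cp, " \t\n#")] = '\0';
		strcpy(shells[n++], cp);
	}
	return (n);
}
```

Without the reset, the advanced cp would be handed to fgets(3) together with the full buffer size, which is how a write past the end of the buffer becomes possible.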
jhb [Tue, 27 Feb 2018 01:28:19 +0000 (01:28 +0000)]
MFC 328134: Update various statements in vmstat(8) to match reality.
- The process stats are actually thread counts rather than process
counts.
- Simplify various descriptions to remove mention of stats that are
updated every 5 seconds (all VM related stats are now "instant",
only the load average is updated every 5 seconds).
- Don't make any mention of special treatment for processes that have
been active in the last 20 seconds. We don't track that stat.
- Rework the description of active virtual memory. Call it mapped
virtual memory and explicitly point out it is not the same as the
active page queue (which corresponds to "Active" in top(1)), and
also hint at the possible bogusness of the value (e.g. if a process
maps a single page out of a multiple GB file, the entire file's size
is considered mapped).
- Simplify a few descriptions that implied their output was a value
per interval. All of the "rate" values are per-second rates scaled
across the interval.
- Update a few comments for 'struct vmtotal' along similar lines.
hselasky [Sun, 25 Feb 2018 10:48:52 +0000 (10:48 +0000)]
MFC r329703:
Allow LinuxKPI character devices to receive mmap() calls from the Linux
binary mode user-space emulation layer. This is a regression issue after
r328436, when LinuxKPI character devices started to use DTYPE_DEV in
the "f_type" field of the associated file structure(s).
Found by: Johannes Lundberg <johalun0@gmail.com>
Sponsored by: Mellanox Technologies
hselasky [Sun, 25 Feb 2018 10:44:47 +0000 (10:44 +0000)]
MFC r329509:
Update the ktime type in the LinuxKPI to be a signed 64-bit integer similarly
to Linux, to avoid compilation issues. Implement ktime_get_real_seconds().
hselasky [Sun, 25 Feb 2018 10:40:41 +0000 (10:40 +0000)]
MFC r329825:
Return correct error code to user-space when a system call receives a
signal in the LinuxKPI.
The read(), write() and mmap() system calls can return either EINTR or
ERESTART upon receiving a signal. Add code to figure out the correct
return value by temporarily storing the return code from the relevant
FreeBSD kernel APIs in the Linux task structure.
hselasky [Sun, 25 Feb 2018 10:37:07 +0000 (10:37 +0000)]
MFC r329519:
Implement support for radix_tree_for_each_slot() and radix_tree_exception()
in the LinuxKPI and use unsigned long type for the radix tree index.
hselasky [Sun, 25 Feb 2018 10:30:36 +0000 (10:30 +0000)]
MFC r329510:
Refactor dentry structure into its own header file in the LinuxKPI similarly
to Linux. No functional change. Implement d_inode() helper function.
hselasky [Sun, 25 Feb 2018 10:26:44 +0000 (10:26 +0000)]
MFC r329470:
Add support for printk_ratelimit() function macro and improve the existing
printk_ratelimited() function macro to return a boolean stating if there
was a printout, true, or not, false.
Submitted by: Johannes Lundberg <johalun0@gmail.com>
Sponsored by: Mellanox Technologies
hselasky [Sun, 25 Feb 2018 10:22:27 +0000 (10:22 +0000)]
MFC r329464:
Add checks for valid IRQ tag before setting up or tearing down an interrupt
handler in the LinuxKPI. This is needed when the interrupt handler is disabled
before freeing the interrupt.
Submitted by: Johannes Lundberg <johalun0@gmail.com>
Sponsored by: Mellanox Technologies
cy [Sat, 24 Feb 2018 18:16:28 +0000 (18:16 +0000)]
MFC r329361:
Document memset_s(3). memset_s(3) is defined in
C11 standard (ISO/IEC 9899:2011) K.3.7.4.1 The memset_s function
(p: 621-622)
Fix memset(3) portion of the man page by replacing the first argument
(destination) "b" with "dest", which is more descriptive than "b".
This also makes it consistent with the term used in the memset_s()
portion of the man page.
See also http://en.cppreference.com/w/c/string/byte/memset.
rpokala [Fri, 23 Feb 2018 16:45:59 +0000 (16:45 +0000)]
MFC r323508:
When doing a non-interactive installation, don't display an interactive
warning about a filesystem which doesn't have a mountpoint. Presumably, the
person who wrote the install script knew what they were doing.
rpokala [Thu, 22 Feb 2018 19:39:44 +0000 (19:39 +0000)]
MFC r329295:
Panasas discovered that ioctl(SIOCGLAGGPORT) returns ENOTTY for mxge(4) when
the NIC is not a member of a lagg. This came as a surprise, because the
SIOCGLAGGPORT handler in if_lagg.c only returns ENOENT (if run against the
laggX interface, rather than a physical port) or EINVAL (if run against a
non-member physical port). This behavior was not seen with other drivers,
such as bge(4), igb(4), and cxl(4). When I compared their respective ioctl
handlers, I found that they all called ether_ioctl() for the default (i.e.
unhandled) case; by contrast, mxge(4) only calls ether_ioctl() for two
specific cases, and returns ENOTTY for the default case.
Remove the two cases which explicitly call ether_ioctl(), and let the
default case call it instead. This matches what the vast majority of the NIC
drivers do.
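The resulting handler shape can be sketched as below. All names (sk_generic_ioctl(), sk_driver_ioctl(), the SK_CMD_* values) are hypothetical stand-ins and not the mxge(4) or ether_ioctl() code; the sketch only shows the default case deferring to the generic handler instead of returning ENOTTY itself.

```c
#include <assert.h>
#include <errno.h>

#define SK_CMD_SETMTU	1	/* handled by the driver itself */
#define SK_CMD_GENERIC	2	/* handled by the generic layer */

/* Hypothetical generic handler, playing the role of ether_ioctl(). */
static int
sk_generic_ioctl(int cmd)
{
	return (cmd == SK_CMD_GENERIC ? 0 : EINVAL);
}

/* Hypothetical driver handler: unhandled commands fall through. */
static int
sk_driver_ioctl(int cmd)
{
	assert(cmd >= 0);
	switch (cmd) {
	case SK_CMD_SETMTU:
		return (0);
	default:
		/* Let the generic layer pick the error code. */
		return (sk_generic_ioctl(cmd));
	}
}
```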
asomers [Thu, 22 Feb 2018 02:16:44 +0000 (02:16 +0000)]
MFC r328605:
zfsd: Don't spare a vdev that's being replaced
If a zfs pool contains a replacing vdev (either created manually by "zpool
replace" or by zfsd(8) via autoreplace by physical path) and then new spares
get added to the pool, zfsd shouldn't use one to replace the drive that is
already being replaced. That's a waste of resources that just slows down
the rebuild.
asomers [Thu, 22 Feb 2018 02:14:43 +0000 (02:14 +0000)]
MFC r328266:
mlock(2): correct documentation for error conditions.
The man page is years out of date regarding errors. Our implementation _does_
allow unaligned addresses, and it _does_not_ check for negative lengths,
because the length is unsigned. It checks for overflow instead.
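Why an overflow check replaces a negative-length check can be sketched like this, assuming a hypothetical helper range_wraps() (not the kernel code): with an unsigned length, the only way the range can be malformed is for addr + len to wrap around.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: report whether [addr, addr + len) wraps around
 * the end of the address space. A length can never be negative here
 * because size_t is unsigned.
 */
static bool
range_wraps(uintptr_t addr, size_t len)
{
	uintptr_t end;

	assert(sizeof(uintptr_t) >= sizeof(size_t));
	end = addr + len;	/* unsigned arithmetic wraps modulo 2^N */
	return (end < addr);
}
```

Note that an unaligned addr such as 0x1001 is accepted by this check, in line with the corrected documentation.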
jhb [Thu, 22 Feb 2018 00:53:05 +0000 (00:53 +0000)]
MFC 323889: Place the AAD before the plaintext/ciphertext for CIOCRYPTAEAD.
Software crypto implementations don't care how the buffer is laid out,
but hardware implementations may assume that the AAD is always before
the plain/cipher text and that the hash/tag is immediately after the end
of the plain/cipher text.
In particular, this arrangement matches the layout of both IPSec packets
and TLS frames. Linux's crypto framework also assumes this layout for
AEAD requests.
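The layout described above (AAD first, then the plain/cipher text, then the hash/tag) can be expressed as offsets into one buffer. The struct and function names here are hypothetical and are not part of the crypto framework.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical description of a single AEAD request buffer. */
struct sk_aead_layout {
	size_t aad_off;		/* AAD starts the buffer */
	size_t payload_off;	/* plain/cipher text follows the AAD */
	size_t tag_off;		/* tag immediately follows the payload */
	size_t total;		/* total buffer length */
};

static struct sk_aead_layout
sk_aead_layout(size_t aadlen, size_t payloadlen, size_t taglen)
{
	struct sk_aead_layout l;

	assert(aadlen <= SIZE_MAX - payloadlen);
	assert(aadlen + payloadlen <= SIZE_MAX - taglen);
	l.aad_off = 0;
	l.payload_off = aadlen;
	l.tag_off = aadlen + payloadlen;
	l.total = aadlen + payloadlen + taglen;
	return (l);
}
```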
Add smn(4) driver for AMD System Management Network
AMD Family 17h CPUs have an internal network used to communicate between
the host CPU and the PSP and SMU coprocessors. It exposes a simple
32-bit register space.
amdtemp(4): Add support for Family 17h temperature sensor
The sensor value is formatted similarly to previous models (same
bitfield sizes, same units), but must be read off of the internal
System Management Network (SMN) from the System Management Unit (SMU)
co-processor.
PR: 218264
Reported and tested by: Nils Beyer <nbe AT renzel.net>
Reviewed by: avg (no +1), mjoras, truckman
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D12217
gonzo [Tue, 20 Feb 2018 18:12:07 +0000 (18:12 +0000)]
MFC r325410:
Increase TX_MAX_SEGS from 10 to 20 for the if_awg.c driver
Under certain traffic patterns the awg driver does not recover from a TX queue
full condition. The actual source of the problem is not identified yet
but jmcneill@ agreed that bumping TX_MAX_SEGS to 20 is OK as a workaround
for the problem (NetBSD has it set to 128).
Also add some diagnostic printfs to prevent silent failure of bus_dma
functions in the future
PR will be kept open until root cause of the issue is identified and fixed
PR: 219927
Submitted by: Tom Vijlbrief <tvijlbrief@gmail.com>
Approved by: jmcneill
vangyzen [Mon, 19 Feb 2018 15:56:33 +0000 (15:56 +0000)]
MFC r329181
Update the MTU in affected routes when IPv6 RA changes the MTU
ip6_calcmtu() only looks at the interface MTU if neither the TCP hostcache
nor the route provides an MTU. Update the routes so they do not provide
stale MTUs.
This fixes UNH IPv6 conformance test cases v6LC_4_1_08 and v6LC_4_1_09,
which use a RA to reduce the link MTU from 1500 to 1280.
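A simplified sketch of the idea, using a hypothetical flat route array rather than the kernel routing table. It assumes that an MTU of 0 means "the route provides no MTU", and it only lowers MTUs that exceed the new link MTU; the real update logic may differ.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical route entry; mtu == 0 means the route provides none. */
struct sk_route {
	int ifindex;
	unsigned int mtu;
};

/* Clamp stale route MTUs on one interface; return how many changed. */
static int
sk_update_route_mtus(struct sk_route *rt, size_t n, int ifindex,
    unsigned int newmtu)
{
	size_t i;
	int changed = 0;

	assert(rt != NULL || n == 0);
	for (i = 0; i < n; i++) {
		if (rt[i].ifindex == ifindex && rt[i].mtu > newmtu) {
			rt[i].mtu = newmtu;
			changed++;
		}
	}
	return (changed);
}
```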
vangyzen [Mon, 19 Feb 2018 15:54:26 +0000 (15:54 +0000)]
MFC r329053
Fix ICMPv6 redirects
icmp6_redirect_input() validates that a redirect packet came from the
current gateway for the respective destination. To do this, it compares
the source address, which has an embedded scope zone id, to the next-hop
address, which does not. If the address is link-local, which should be
the case, the comparison fails and the redirect is ignored.
Insert the scope zone id into the next-hop address so the comparison
is accurate.
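The mismatch can be illustrated on raw 16-byte addresses. The helper embed_zone() is hypothetical; it assumes the KAME-style convention of storing the zone ID in bytes 2 and 3 of a link-local address, and it is not the kernel code.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical helper: embed a 16-bit zone ID into bytes 2-3 of a
 * link-local (fe80::/10) address, in network byte order.
 */
static void
embed_zone(uint8_t addr[16], uint16_t zone)
{
	assert(addr[0] == 0xfe && (addr[1] & 0xc0) == 0x80);
	addr[2] = (uint8_t)(zone >> 8);
	addr[3] = (uint8_t)(zone & 0xff);
}
```

Comparing a source address that already carries the zone ID against a next-hop address that does not will fail byte-for-byte; after embed_zone() is applied to the next hop, a plain memcmp() of the two addresses succeeds.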
Unsurprisingly, this fixes 35 UNH IPv6 conformance test cases.
ae [Mon, 19 Feb 2018 10:34:30 +0000 (10:34 +0000)]
MFC r328541:
Do not skip scope zone violation check, when mbuf has M_FASTFWD_OURS flag.
When an mbuf has the M_FASTFWD_OURS flag, the destination address is
local to this host, but the packet must still pass the scope zone violation
check, because the protocol level expects IPv6 link-local addresses to have
embedded scope zone indexes. This should fix the problem seen when ipfw is
used to forward packets to a local address and the source address of the
packet is an IPv6 LLA.
ae [Mon, 19 Feb 2018 10:30:34 +0000 (10:30 +0000)]
MFC r328540:
Assign IPv6 link-local address to loopback interfaces with unit > 0.
When an interface has IFF_LOOPBACK flag in6_ifattach() tries to assign
IPv6 loopback address to this interface. It uses in6ifa_ifpwithaddr()
to check, that interface doesn't already have given address and then
uses in6_ifattach_loopback(). If in6_ifattach_loopback() fails, it just
exits and thus skips assignment of IPv6 LLA.
Fix this using in6ifa_ifwithaddr() function. If IPv6 loopback address is
already assigned in the system, do not call in6_ifattach_loopback().
wulf [Sun, 18 Feb 2018 22:04:42 +0000 (22:04 +0000)]
MFC r328864:
psm(4): Fix panic occurring soon after PS/2 packet has been rejected by
synaptics or elantech sanity checker.
After a packet has been rejected, the packet buffer is not cleared by
setting its inputbytes counter to 0. So when this packet buffer, being an
element of a circular queue, is filled again, new data is appended to the
old data rather than overwriting it. This leads to a packet buffer overflow
after 10 rounds.
Fix it by setting the packet's inputbytes counter to 0 after rejection.
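A reduced sketch of the counter reset, with a single buffer standing in for one element of the circular queue. The struct layout, the 3-byte packet length, and the sanity check are hypothetical and are not the psm(4) code.

```c
#include <assert.h>
#include <stdbool.h>

#define PKT_LEN 3			/* hypothetical packet length */

struct packetbuf {
	int inputbytes;			/* bytes collected so far */
	unsigned char ipacket[PKT_LEN];
};

/* Hypothetical sanity check: bit 3 of the first byte must be set. */
static bool
packet_is_sane(const struct packetbuf *pb)
{
	return ((pb->ipacket[0] & 0x08) != 0);
}

/* Returns true when a complete, sane packet has been collected. */
static bool
feed_byte(struct packetbuf *pb, unsigned char c)
{
	assert(pb->inputbytes < PKT_LEN);
	pb->ipacket[pb->inputbytes++] = c;
	if (pb->inputbytes < PKT_LEN)
		return (false);
	pb->inputbytes = 0;	/* reset on accept and on reject alike */
	return (packet_is_sane(pb));
}
```

If the reset were skipped on the reject path, the next byte would be stored at index PKT_LEN and beyond, which is the overflow described above.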
https://www.illumos.org/issues/8972:
'zfs holds -H' does not properly output content in scripted mode. It uses a
tab instead of two spaces, but it still pads column widths with spaces when
it should not.
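The intended difference between the two output modes can be sketched with a hypothetical format_hold() function (not the zfs(8) source): scripted mode separates fields with a single tab and applies no padding, while the normal mode pads each column and separates with two spaces.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical formatter for one line of holds output. */
static int
format_hold(char *buf, size_t len, bool scripted, int namew, int tagw,
    const char *name, const char *tag, const char *when)
{
	assert(buf != NULL && len > 0);
	if (scripted)
		return (snprintf(buf, len, "%s\t%s\t%s\n", name, tag, when));
	return (snprintf(buf, len, "%-*s  %-*s  %s\n",
	    namew, name, tagw, tag, when));
}
```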
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Approved by: Dan McDonald <danmcd@joyent.com>
Author: Allan Jude <allanjude@freebsd.org>
https://www.illumos.org/issues/8835:
Sequential reads not aligned to block size are not detected by ZFS
prefetcher as sequential, killing prefetch and severely hurting
performance. It is caused by dmu_zfetch() in case of misaligned
sequential accesses being called with overlap of one block.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Allan Jude <allanjude@freebsd.org>
Approved by: Gordon Ross <gwr@nexenta.com>
Author: Alexander Motin <mav@FreeBSD.org>
https://www.illumos.org/issues/8652:
Clang and GCC prefer to use unsigned ints to store enums. With Clang, that
causes tautological comparison warnings when comparing a zfs_prop_t or
zpool_prop_t variable to the macro ZPROP_INVAL. It's likely that error
handling code is being silently removed as a result.
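One way to keep such comparisons meaningful, shown here as a sketch with hypothetical names (sk_prop_t, SK_PROP_INVAL) and not necessarily the approach taken upstream, is to make the invalid value an enumerator itself. With a negative enumerator present, the compiler must choose a type that can represent it.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical property enum with the invalid value as a member. */
typedef enum {
	SK_PROP_INVAL = -1,
	SK_PROP_FIRST = 0,
	SK_PROP_SECOND
} sk_prop_t;

/* Hypothetical lookup: unknown names map to SK_PROP_INVAL. */
static sk_prop_t
sk_name_to_prop(const char *name)
{
	assert(name != NULL);
	if (strcmp(name, "first") == 0)
		return (SK_PROP_FIRST);
	if (strcmp(name, "second") == 0)
		return (SK_PROP_SECOND);
	return (SK_PROP_INVAL);
}
```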
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Igor Kozhukhov <igor@dilos.org>
Approved by: Gordon Ross <gwr@nexenta.com>
Author: Alan Somers <asomers@gmail.com>
https://www.illumos.org/issues/8641:
"zpool clear" and "zinject -d" can both operate on specific vdevs, either
leaf or interior. However, due to an oversight, neither works on a "spare"
or "replacing" vdev. For example:
sudo zpool create foo raidz1 c1t5000CCA000081D61d0 c1t5000CCA000186235d0 spare c1t5000CCA000094115d0
sudo zpool replace foo c1t5000CCA000186235d0 c1t5000CCA000094115d0
$ zpool status foo
pool: foo
state: ONLINE
scan: resilvered 81.5K in 0h0m with 0 errors on Fri Sep 8 10:53:03 2017
config:
NAME STATE READ WRITE CKSUM
foo ONLINE 0 0 0
raidz1-0 ONLINE 0 0 0
c1t5000CCA000081D61d0 ONLINE 0 0 0
spare-1 ONLINE 0 0 0
c1t5000CCA000186235d0 ONLINE 0 0 0
c1t5000CCA000094115d0 ONLINE 0 0 0
spares
c1t5000CCA000094115d0 INUSE currently in use
$ sudo zinject -d spare-1 -A degrade foo
cannot find device 'spare-1' in pool 'foo'
$ sudo zpool clear foo spare-1
cannot clear errors for spare-1: no such device in pool
Even though there was nothing to clear, those commands shouldn't have
reported an error. By contrast, trying to clear "raidz1-0" works just fine:
$ sudo zpool clear foo raidz1-0
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Approved by: Gordon Ross <gwr@nexenta.com>
Author: Alan Somers <asomers@gmail.com>
Reviewed by: Alek Pinchuk <pinchuk.alek@gmail.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Approved by: Gordon Ross <gwr@nexenta.com>
Author: Sean Eric Fagan <sef@ixsystems.com>
https://www.illumos.org/issues/8856:
arc_cksum_is_equal() calls zio_push_transform() that requires abd_t*
(second arg), but a void* is passed.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Approved by: Gordon Ross <gwr@nexenta.com>
Author: Roman Strashkin <roman.strashkin@nexenta.com>
https://www.illumos.org/issues/8898:
# zfs create -o checksum=skein rpool/test
internal error: Result too large
Abort (core dumped)
Not a big deal per se, but should be handled correctly.
Reviewed by: Toomas Soome <tsoome@me.com>
Reviewed by: Andy Stormont <astormont@racktopsystems.com>
Approved by: Dan McDonald <danmcd@joyent.com>
Author: Yuri Pankov <yuri.pankov@nexenta.com>
https://www.illumos.org/issues/8897:
# zpool online -e test mirror-1
Assertion failed: nvlist_lookup_string(tgt, "path", &pathname) == 0, file ../common/libzfs_pool.c, line 2558, function zpool_vdev_online
Abort (core dumped)
Not a big deal per se, but should be handled gracefully, same way as 'offline' and 'online' without '-e'.
Also reported as: https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=221408
Reviewed by: Toomas Soome <tsoome@me.com>
Reviewed by: Igor Kozhukhov <igor@dilos.org>
Approved by: Dan McDonald <danmcd@joyent.com>
Author: Yuri Pankov <yuri.pankov@nexenta.com>
https://www.illumos.org/issues/8930:
We normally remove an unlinked node when its last user goes away and the
node becomes inactive. However, we should not do that if the filesystem
is mounted read-only including the case where it has its readonly
property set. The node will remain on the unlinked queue, so it will
not be leaked.
One particular scenario is when we receive an incremental stream into a
mounted read-only filesystem and that stream contains an unlinked file
(still on the unlinked queue). If that file is opened before the
receive and some time later after the receive it becomes inactive we
would remove it and, thus, modify the read-only filesystem. As a
result, the filesystem would diverge from its source and further
incremental receives would not be possible (without forcing a rollback).
Another related scenario, that may or may not be possible depending on an
OS / VFS policy, is when an open file is unlinked, then the filesystem is
remounted read-only, and then the file is closed.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Approved by: Gordon Ross <gwr@nexenta.com>
Author: Andriy Gapon <avg@FreeBSD.org>
https://www.illumos.org/issues/8909:
There's a race condition that exists if `zil_free_lwb` races with either
`zil_commit_waiter_timeout` and/or `zil_lwb_flush_vdevs_done`.
Here's an example panic due to this bug:
> ::status
debugging crash dump vmcore.0 (64-bit) from ip-10-110-205-40
operating system: 5.11 dlpx-5.2.2.0_2017-12-04-17-28-32b6ba51fb (i86pc)
image uuid: 4af0edfb-e58e-6ed8-cafc-d3e9167c7513
panic message:
BAD TRAP: type=e (#pf Page fault) rp=ffffff0010555970 addr=60 occurred in module "zfs" due to a NULL pointer dereference
dump content: kernel pages only
If there's an outstanding lwb that's in `zil_commit_waiter_timeout`
waiting to timeout, waiting on its waiter's CV, we must be sure not to
call `zil_free_lwb`. If we end up calling `zil_free_lwb`, then that LWB
may be freed and can result in a use-after-free situation where the
stale lwb pointer stored in the `zil_commit_waiter_t` structure of the
thread waiting on the waiter's CV is used.
A similar situation can occur if an lwb is issued to disk, and thus in
the `LWB_STATE_ISSUED` state, and `zil_free_lwb` is called while the
disk is servicing that lwb. In this situation, the lwb will be freed by
`zil_free_lwb`, which will result in a use-after-free situation when the
lwb's zio completes, and `zil_lwb_flush_vdevs_done` is called.
This race condition is prevented in `zil_close` by calling `zil_commit`
before `zil_free_lwb` is called, which will ensure all outstanding (i.e.
all lwb's in the `LWB_STATE_OPEN` and/or `LWB_STATE_ISSUED` states)
reach the `LWB_STATE_DONE` state before the lwb's are freed
(`zil_commit` will not return until all the lwb's are
`LWB_STATE_DONE`).
Further, this race condition is prevented in `zil_sync` by only calling
`zil_free_lwb` for lwb's that do not have their `lwb_buf` pointer set.
All lwb's not in the `LWB_STATE_DONE` state will have a non-null value
for this pointer; the pointer is only cleared in
`zil_lwb_flush_vdevs_done`, at which point the lwb's state will be
changed to `LWB_STATE_DONE`.
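The guard described for `zil_sync` can be sketched with hypothetical types; the sk_-prefixed names below are illustrative and are not the ZIL code. The buffer pointer is cleared only by the completion step, so "buffer pointer is NULL" implies "state is DONE".

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum sk_lwb_state { SK_LWB_OPEN, SK_LWB_ISSUED, SK_LWB_DONE };

struct sk_lwb {
	enum sk_lwb_state state;
	char *buf;	/* non-NULL until the flush-done step runs */
};

/* Stand-in for the completion callback: clear buf, then mark DONE. */
static void
sk_lwb_flush_done(struct sk_lwb *lwb)
{
	assert(lwb->state == SK_LWB_ISSUED);
	lwb->buf = NULL;
	lwb->state = SK_LWB_DONE;
}

/* Guard in the style described above: free only if buf is clear. */
static bool
sk_lwb_can_free(const struct sk_lwb *lwb)
{
	return (lwb->buf == NULL);
}
```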
This race is present in `zil_suspend`, leading to this bug.
At first glance, it would appear as though this would not be true
because `zil_suspend` will call `zil_commit`, just like `zil_close`, but
the problem is that `zil_suspend` will set the zilog's `zl_suspend`
field prior to calling `zil_commit`. Further, in `zil_commit`, if
`zl_suspend` is set, `zil_commit` will take a special branch of logic
and use `txg_wait_synced` instead of performing the normal `zil_commit`
logic.
This call to `txg_wait_synced` might be good enough for the data to
reach disk safely before it returns, but it does not ensure that all
outstanding lwb's reach the `LWB_STATE_DONE` state before it returns.
This is because, if there's an lwb "stuck" in
`zil_commit_waiter_timeout`, waiting for its lwb to timeout, it will
maintain a non-null value for its `lwb_buf` field and thus `zil_sync`
will not free that lwb. Thus, even though the lwb's data is already on
disk, the lwb will be left lingering, waiting on the CV, and will
eventually timeout and be issued to disk even though the write is
unnecessary.
So, after `zil_commit` is called from `zil_suspend`, we incorrectly
assume that there are no outstanding lwb's, and proceed to free all
lwb's found on the zilog's lwb list. As a result, we free the lwb that
will later be used by `zil_commit_waiter_timeout`.
Reviewed by: John Kennedy <jwk404@gmail.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Brad Lewis <brad.lewis@delphix.com>
Reviewed by: Igor Kozhukhov <igor@dilos.org>
Approved by: Robert Mustacchi <rm@joyent.com>
Author: Prakash Surya <prakash.surya@delphix.com>
https://www.illumos.org/issues/8603:
To help make the ZIL's code more understandable, it was suggested that
the zilog_t's "zl_writer_lock" field should be renamed to "zl_issuer_lock".
Reviewed by: C Fraire <cfraire@me.com>
Approved by: Dan McDonald <danmcd@joyent.com>
Author: Prakash Surya <prakash.surya@delphix.com>
https://www.illumos.org/issues/8677
We want to be able to run channel programs outside of synching context.
This would greatly improve performance of channel programs that just gather
information, as we won't have to wait for synching context anymore.
This feature should introduce the following:
- A new command line flag in "zfs program" to specify our intention to
run in open context.
- A new flag/option within the channel program ioctl which selects the
context.
- Appropriate error handling whenever we try a channel program in
open-context that contains zfs.sync* expressions.
- Documentation for the new feature in the manual pages.
Reviewed by: Matt Ahrens <mahrens@delphix.com>
Reviewed by: Chris Williamson <chris.williamson@delphix.com>
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Author: Serapheim Dimitropoulos <serapheim@delphix.com>
mav [Sat, 17 Feb 2018 23:54:59 +0000 (23:54 +0000)]
MFC r323002 (by emaste): zfs: do not advertise edonr which is not yet supported
illumos 4185 ("add new cryptographic checksums to ZFS: SHA-512,
Skein, Edon-R") was intentionally merged only partially in r289422,
without adding support for skein, sha512 and edonr on FreeBSD.
Support for skein and sha512 was added later on, but edonr is still not
implemented in FreeBSD.
Prior to this commit zfs(8) correctly rejected edonr, but with an error
message that claimed support:
fk@r500 ~ $zfs set checksum=edonr tank
cannot set property for 'tank': 'checksum' must be one of 'on | off | fletcher2 | fletcher4 | sha256 | sha512 | skein | edonr'
mav [Sat, 17 Feb 2018 23:51:15 +0000 (23:51 +0000)]
MFC r321104 (by jhibbits): Make ZFS not crash on mount on 32-bit systems
ZPL_VERSION is unsigned long long, not an int. With this change, a zpool
can be created on a 32-bit system (tested on powerpcspe) and mounted
correctly.