Alexander Motin [Tue, 31 Jul 2018 00:37:45 +0000 (00:37 +0000)]
MFV r336952: 9192 explicitly pass good_writes to vdev_uberblock/label_sync
Currently vdev_label_sync and vdev_uberblock_sync take a zio_t and assume
that its io_private is a pointer to the good_writes count. They should
instead accept this argument explicitly.
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Approved by: Richard Lowe <richlowe@richlowe.net>
Author: Matthew Ahrens <mahrens@delphix.com>
Alexander Motin [Tue, 31 Jul 2018 00:34:39 +0000 (00:34 +0000)]
9192 explicitly pass good_writes to vdev_uberblock/label_sync
Currently vdev_label_sync and vdev_uberblock_sync take a zio_t and assume
that its io_private is a pointer to the good_writes count. They should
instead accept this argument explicitly.
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Approved by: Richard Lowe <richlowe@richlowe.net>
Author: Matthew Ahrens <mahrens@delphix.com>
Alexander Motin [Tue, 31 Jul 2018 00:25:39 +0000 (00:25 +0000)]
MFV r336950: 9290 device removal reduces redundancy of mirrors
Mirrors are supposed to provide redundancy in the face of whole-disk failure
and silent damage (e.g. some data on disk is not right, but ZFS hasn't
detected the whole device as being broken). However, the current device
removal implementation bypasses some of the mirror's redundancy.
Alexander Motin [Tue, 31 Jul 2018 00:13:04 +0000 (00:13 +0000)]
9290 device removal reduces redundancy of mirrors
Mirrors are supposed to provide redundancy in the face of whole-disk failure
and silent damage (e.g. some data on disk is not right, but ZFS hasn't
detected the whole device as being broken). However, the current device
removal implementation bypasses some of the mirror's redundancy.
Alexander Motin [Tue, 31 Jul 2018 00:02:42 +0000 (00:02 +0000)]
MFV r336948: 9112 Improve allocation performance on high-end systems
On high-end systems running async sequential write workloads, especially
NUMA systems with flash or NVMe storage, one significant performance
bottleneck is selecting a metaslab to do allocations from. This process
can be parallelized, providing significant performance increases for
these workloads.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Serapheim Dimitropoulos <serapheim.dimitro@delphix.com>
Reviewed by: Alexander Motin <mav@FreeBSD.org>
Approved by: Gordon Ross <gwr@nexenta.com>
Author: Paul Dagnelie <pcd@delphix.com>
Alexander Motin [Mon, 30 Jul 2018 23:53:25 +0000 (23:53 +0000)]
9112 Improve allocation performance on high-end systems
On high-end systems running async sequential write workloads, especially
NUMA systems with flash or NVMe storage, one significant performance
bottleneck is selecting a metaslab to do allocations from. This process
can be parallelized, providing significant performance increases for
these workloads.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Serapheim Dimitropoulos <serapheim.dimitro@delphix.com>
Reviewed by: Alexander Motin <mav@FreeBSD.org>
Approved by: Gordon Ross <gwr@nexenta.com>
Author: Paul Dagnelie <pcd@delphix.com>
Alexander Motin [Mon, 30 Jul 2018 23:47:38 +0000 (23:47 +0000)]
MFV r336946: 9238 ZFS Spacemap Encoding V2
The current space map encoding has the following disadvantages:
[1] Assuming 512 sector size each entry can represent at most 16MB for a segment.
This makes the encoding very inefficient for large regions of space.
[2] As vdev-wide space maps have started to be used by new features (i.e.
device removal, zpool checkpoint) we've started imposing limits in the
vdevs that can be used with them based on the maximum addressable offset
(currently 64PB for a top-level vdev).
The new remains backwards compatible with the old one. The introduced
two-word entry format, besides extending the limits imposed by the single-entry
layout, also includes a vdev field and some extra padding after its prefix.
The extra padding after the prefix should is reserved for future usage (e.g.
new prefixes for future encodings or new fields for flags). The new vdev field
not only makes the space maps more self-descriptive, but also opens the doors
for pool-wide space maps.
One final important note is that the number of bits used for vdevs is reduced
to 24 bits for blkptrs. That was decided as we don't know of any setups that
use more than 16M vdevs for the time being and
we wanted to fit the vdev field in the space map. In addition that gives us
some extra bits in dva_t.
Reviewed by: Matt Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <gwilson@zfsmail.com>
Approved by: Gordon Ross <gwr@nexenta.com>
Author: Serapheim Dimitropoulos <serapheim@delphix.com>
Alexander Motin [Mon, 30 Jul 2018 22:56:24 +0000 (22:56 +0000)]
9238 ZFS Spacemap Encoding V2
The current space map encoding has the following disadvantages:
[1] Assuming 512 sector size each entry can represent at most 16MB for a segment.
This makes the encoding very inefficient for large regions of space.
[2] As vdev-wide space maps have started to be used by new features (i.e.
device removal, zpool checkpoint) we've started imposing limits in the
vdevs that can be used with them based on the maximum addressable offset
(currently 64PB for a top-level vdev).
The new remains backwards compatible with the old one. The introduced
two-word entry format, besides extending the limits imposed by the single-entry
layout, also includes a vdev field and some extra padding after its prefix.
The extra padding after the prefix should is reserved for future usage (e.g.
new prefixes for future encodings or new fields for flags). The new vdev field
not only makes the space maps more self-descriptive, but also opens the doors
for pool-wide space maps.
One final important note is that the number of bits used for vdevs is reduced
to 24 bits for blkptrs. That was decided as we don't know of any setups that
use more than 16M vdevs for the time being and
we wanted to fit the vdev field in the space map. In addition that gives us
some extra bits in dva_t.
Reviewed by: Matt Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <gwilson@zfsmail.com>
Approved by: Gordon Ross <gwr@nexenta.com>
Author: Serapheim Dimitropoulos <serapheim@delphix.com>
Alexander Motin [Mon, 30 Jul 2018 22:39:30 +0000 (22:39 +0000)]
MFV r336944: 9286 want refreservation=auto
When a ZFS volume is created with zfs create -V (but without -s), the
refreservation property is set to a value that is volsize plus the maximum
size of metadata. If refreservation is ever set to another value, it is
impossible to set it back to the automatically determined value. There are
other cases where refreservation may be wrong. These include receiving a
volume that was sent without properties and zfs clone.
We need:
zfs set refreservation=auto <volume>
zfs clone -o refreservation=auto <volume>
Each one would use the same function used by zfs create -V to determine the
proper value for refreservation.
Reviewed by: Allan Jude <allanjude@freebsd.org>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: John Kennedy <john.kennedy@delphix.com>
Reviewed by: Andy Stormont <astormont@racktopsystems.com>
Approved by: Richard Lowe <richlowe@richlowe.net>
Author: Mike Gerdts <mike.gerdts@joyent.com>
Alexander Motin [Mon, 30 Jul 2018 22:10:15 +0000 (22:10 +0000)]
9286 want refreservation=auto
When a ZFS volume is created with zfs create -V (but without -s), the
refreservation property is set to a value that is volsize plus the maximum
size of metadata. If refreservation is ever set to another value, it is
impossible to set it back to the automatically determined value. There are
other cases where refreservation may be wrong. These include receiving a
volume that was sent without properties and zfs clone.
We need:
zfs set refreservation=auto <volume>
zfs clone -o refreservation=auto <volume>
Each one would use the same function used by zfs create -V to determine the
proper value for refreservation.
Reviewed by: Allan Jude <allanjude@freebsd.org>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: John Kennedy <john.kennedy@delphix.com>
Reviewed by: Andy Stormont <astormont@racktopsystems.com>
Approved by: Richard Lowe <richlowe@richlowe.net>
Author: Mike Gerdts <mike.gerdts@joyent.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Prashanth Sreenivasa <pks@delphix.com>
Approved by: Matt Ahrens <mahrens@delphix.com>
Author: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Prashanth Sreenivasa <pks@delphix.com>
Approved by: Matt Ahrens <mahrens@delphix.com>
Author: Pavel Zakharov <pavel.zakharov@delphix.com>
Michael Tuexen [Mon, 30 Jul 2018 21:27:26 +0000 (21:27 +0000)]
Allow implicit TCP connection setup for TCP/IPv6.
TCP/IPv4 allows an implicit connection setup using sendto(), which
is used for TTCP and TCP fast open. This patch adds support for
TCP/IPv6.
While there, improve some tests for detecting multicast addresses,
which are mapped.
Michael Tuexen [Mon, 30 Jul 2018 21:13:42 +0000 (21:13 +0000)]
Send consistent SEG.WIN when using timewait codepath for TCP.
When sending TCP segments from the timewait code path, a stored
value of the last sent window is used. Use the same code for
computing this in the timewait code path as in the main code
path used in tcp_output() to avoiv inconsistencies.
Michael Tuexen [Mon, 30 Jul 2018 20:35:50 +0000 (20:35 +0000)]
Fix some TCP fast open issues.
The following issues are fixed:
* Whenever a TCP server with TCP fast open enabled, calls accept(),
recv(), send(), and close() before the TCP-ACK segment has been received,
the TCP connection is just dropped and the reception of the TCP-ACK
segment triggers the sending of a TCP-RST segment.
* Whenever a TCP server with TCP fast open enabled, calls accept(), recv(),
send(), send(), and close() before the TCP-ACK segment has been received,
the first byte provided in the second send call is not transferred.
* Whenever a TCP client with TCP fast open enabled calls sendto() followed
by close() the TCP connection is just dropped.
Rick Macklem [Mon, 30 Jul 2018 20:25:32 +0000 (20:25 +0000)]
Silence newer gcc warnings.
Newer versions of gcc generate "set, but not used" warnings.
Add __unused macros to silence these warnings.
Although the variables are not being used, they are values parsed from
arguments to callback RPCs that might be needed in the future.
The CORB and RIRB buffers exist in DMA memory, but the device reads them as
little-endian only. Read and write as LE into the DMA memory block, to work on
BE platforms.
Alexander Motin [Mon, 30 Jul 2018 19:44:14 +0000 (19:44 +0000)]
9284 arc_reclaim_thread has 2 jobs
`arc_reclaim_thread()` calls `arc_adjust()` after calling
`arc_kmem_reap_now()`; `arc_adjust()` signals `arc_get_data_buf()` to
indicate that we may no longer be `arc_is_overflowing()`.
The problem is, `arc_kmem_reap_now()` can take several seconds to
complete, has no impact on `arc_is_overflowing()`, but due to how the
code is structured, can impact how long the ARC will remain in the
`arc_is_overflowing()` state.
The fix is to use seperate threads to:
1. keep `arc_size` under `arc_c`, by calling `arc_adjust()`, which
improves `arc_is_overflowing()`
2. keep enough free memory in the system, by calling
`arc_kmem_reap_now()` plus `arc_shrink()`, which improves
`arc_available_memory()`.
Follow up to r336919 and r336921: s/efi.rt_disabled/efi.rt.disabled/
The latter matches the rest of the tree better [0]. The UPDATING entry has
been updated to reflect this, and the new tunable is now documented in
loader(8) [1].
As noted in UDPATING, the new loader tunable efi.rt_disabled may be used to
disable EFIRT at runtime. It should have no effect if you are not booted via
UEFI boot.
efirt: Add tunable to allow disabling EFI Runtime Services
Leading up to enabling EFIRT in GENERIC, allow runtime services to be
disabled with a new tunable: efi.rt_disabled. This makes it so that EFIRT
can be disabled easily in case we run into some buggy UEFI implementation
and fail to boot.
Alan Somers [Mon, 30 Jul 2018 15:46:40 +0000 (15:46 +0000)]
Make timespecadd(3) and friends public
The timespecadd(3) family of macros were imported from NetBSD back in
r35029. However, they were initially guarded by #ifdef _KERNEL. In the
meantime, we have grown at least 28 syscalls that use timespecs in some
way, leading many programs both inside and outside of the base system to
redefine those macros. It's better just to make the definitions public.
Our kernel currently defines two-argument versions of timespecadd and
timespecsub. NetBSD, OpenBSD, and FreeDesktop.org's libbsd, however, define
three-argument versions. Solaris also defines a three-argument version, but
only in its kernel. This revision changes our definition to match the
common three-argument version.
Bump _FreeBSD_version due to the breaking KPI change.
Reuse of the index variable in two nested loops resulted in only the first
argument in the list being used (fine for gzip, not fine for zstd). Also
add tests for xz and zstd, and fix the COMPRESS_SUFFIX_MAXLEN macro.
snd_hda: Print error codes in decimal, rather than hex
It's easy to confuse the error code as naked it looks decimal (EINVAL is
reported as error 16, instead of error 22, so first reading looks like EBUSY).
Ed Maste [Mon, 30 Jul 2018 15:10:06 +0000 (15:10 +0000)]
Revert accidental change from r336908
By default ld.lld should be the bootstrap linker (only) on i386 right
now. Once the i386 exp-run with LLD_IS_LD has a good result this will
also be enabled by default.
Andrew Turner [Mon, 30 Jul 2018 15:05:07 +0000 (15:05 +0000)]
As with DPCPU_DEFINE_STATIC make VNET_DEFINE_STATIC non-static on arm64 in
modules. It also fails in the same way, we are unable to relocate static
variables as the compiler uses PC-relative loads with nothing for the
kernel linker to relocate.
Andrew Turner [Mon, 30 Jul 2018 14:25:17 +0000 (14:25 +0000)]
Ensure the DPCPU and VNET module spaces are aligned to hold a pointer.
Previously they may have been aligned to a char, leading to misaligned
DPCPU and VNET variables.
David Bright [Mon, 30 Jul 2018 14:21:49 +0000 (14:21 +0000)]
Correct possible misleading error message in kqtest.
ian@ pointed out that in the test_abstime() function time(NULL) is
used twice; once in an "if" test and again in the enclosed error
message. If the true branch was taken and the process got preempted
before the second time(NULL) call, by the time the error message was
generated enough time could have elapsed that the message could claim
that the event came "too early" but print an event time that was after
the expected timeout. Correct by making the time(NULL) call only once
and using that returned time in both the "if" test and the error
message.
Ed Maste [Mon, 30 Jul 2018 12:38:08 +0000 (12:38 +0000)]
Enable ld.lld as bootstrap linker by default on i386
Akin to r327783 for amd64. lld has been usable for amd64 for quite some
time, but a couple of issues remained that affected i386. These were
recently addressed upstream in lld and merged into FreeBSD or addressed
directly in FreeBSD (r326831, r326879, r326897, r326957, r333401,
r334626, r336664).
Similarly to the intial amd64 commit this change enables lld only as the
bootstrap linker (used to link the kernel and userland libraries and
executables), while GNU ld.bfd is still installed as /usr/bin/ld and
used for ports builds. That will be changed shortly, after an exp-run.
This is a recommit of r327823 after additional lld fixes.
PR: 225128 (exp-run)
Relnotes: Yes
Sponsored by: The FreeBSD Foundation
This fixes the panic caused by deadlocking when grant-table free
callbacks are used.
The cause of the recursion is: check_free_callbacks() is always called
with the lock gnttab_list_lock held. In turn the callback function is
also called with the lock held. Then when the client uses any of the grant
reference methods which also attempt the lock the gnttab_list_lock
mutex from within the free callback a deadlock happens.
Fix this by making the gnttab_list_lock recursive.
xen-blkfront: fix memory leak in xbd_connect error path
If gnttab_grant_foreign_access() fails for any of the indirection
pages, the code breaks out of both the loops without freeing the local
variable indirectpages, causing a memory leak.
Randall Stewart [Mon, 30 Jul 2018 10:23:29 +0000 (10:23 +0000)]
This fixes a hole where rack could end up
sending an invalid segment into the reassembly
queue. This would happen if you enabled the
data after close option.
Alan Cox [Mon, 30 Jul 2018 01:54:25 +0000 (01:54 +0000)]
Prepare for adding psind == 1 support to armv6's pmap_enter().
Precompute the new PTE before entering the critical section.
Eliminate duplication of the pmap and pv list unlock operations in
pmap_enter() by implementing a single return path. Otherwise, the
duplication will only increase with the upcoming support for psind == 1.
Ifuncs selectors dispatch copyin(9) family to the suitable variant, to
set rflags.AC around userspace access. Rflags.AC bit is cleared in
all kernel entry points unconditionally even on machines not
supporting SMAP.
Reviewed by: jhb
Sponsored by: The FreeBSD Foundation
Differential revision: https://reviews.freebsd.org/D13838
Alan Somers [Sun, 29 Jul 2018 18:22:26 +0000 (18:22 +0000)]
getrusage(2): fix return value under 32-bit emulation
According to the man page, getrusage(2) should return EFAULT if the rusage
argument lies outside of the process's address space. But due to an
oversight in r100384, that's never been the case during 32-bit emulation.
Fix it.
Will Andrews [Sun, 29 Jul 2018 01:44:26 +0000 (01:44 +0000)]
beinstall: perform pre-installworld steps.
Since all post-installkernel steps are assumed to operate in the updated
installation, it's necessary to chroot all of the followup steps in the new
boot environment. Set up and mount the source and object directories at the
same paths inside the BE root, and clean up to the extent changes were made.
This commit fixes upgrading using beinstall past the new ntpd user change.
Improve testability of changes to this script while I'm here.
Don Lewis [Sun, 29 Jul 2018 00:30:06 +0000 (00:30 +0000)]
Fix the long term ULE load balancer so that it actually works. The
initial call to sched_balance() during startup is meant to initialize
balance_ticks, but does not actually do that since smp_started is
still zero at that time. Since balance_ticks does not get set,
there are no further calls to sched_balance(). Fix this by setting
balance_ticks in sched_initticks() since we know the value of
balance_interval at that time, and eliminate the useless startup
call to sched_balance(). We don't need to randomize the intial
value of balance_ticks.
Since there is now only one call to sched_balance(), we can hoist
the tests at the top of this function out to the caller and avoid
the overhead of the function call when running a SMP kernel on UP
hardware.
Important vendor changes:
PR #993: Chdir to -C directory for metalog processing
OSS-Fuzz #4969: Check size of the extended time field in zip archives
PR #973: Record informational compression level in gzip header
Important vendor changes:
PR #993: Chdir to -C directory for metalog processing
OSS-Fuzz #4969: Check size of the extended time field in zip archives
PR #973: Record informational compression level in gzip header
This produce a generic sdcard image using armv7 GENERIC kernel that
just need some u-boot (or none if the board have u-boot or a SPI flash
for example).
Brad Davis [Sat, 28 Jul 2018 20:31:03 +0000 (20:31 +0000)]
Whitespace only change, no functional change intended.
The padding makes it much easier to read, but occasionally means that commits
like this one have to be done to follow up. I intentionally kept this
separate from r336841 to try and make things easier to follow later on.
Rick Macklem [Sat, 28 Jul 2018 20:21:04 +0000 (20:21 +0000)]
Modify the NFSv4.1 server so that it allows ReclaimComplete as done by ESXi 6.7.
I believe that a ReclaimComplete with rca_one_fs == TRUE is only
to be used after a file system has been transferred to a different
file server. However, RFC5661 is somewhat vague w.r.t. this and
the ESXi 6.7 client does both a ReclaimComplete with rca_one_fs == TRUE
and one with ReclaimComplete with rca_one_fs == FALSE.
Therefore, just ignore the rca_one_fs == TRUE operation and return
NFS_OK without doing anything instead of replying NFS4ERR_NOTSUPP.
This allows the ESXi 6.7 NFSv4.1 client to do a mount.
After discussion on the NFSv4 IETF working group mailing list, doing this
along with setting a flag to note that a ReclaimComplete with rca_one_fs TRUE
was an appropriate way to handle this.
The flag that indicates that a ReclaimComplete with rca_one_fs == TRUE was
done may be used to disable replies of NFS4ERR_GRACE for non-reclaim
state operations in a future commit.
This patch along with r332790, r334492 and r336357 allow ESXi 6.7 NFSv4.1 mounts
work ok. ESX 6.5 NFSv4.1 mounts do not work well, due to what I believe are
violations of RFC-5661 and should not be used.
Warner Losh [Sat, 28 Jul 2018 19:44:20 +0000 (19:44 +0000)]
Be more conservative about setting hw.uart.console
Note when we've found a 8250 PNP node. Only try to set hw.uart.console
if we see one (otherwise ignore serial hints). The 8250 is the only
one known to have I/O ports, so limit the guessing to when we've
positively seen one. And limit this to x86 since that's the only
platform where we have I/O ports. Otherwise, we'd set the serial port
to something crazy for the platform and fall off the cliff early in
boot.
I think I recall there being a reason for a specific list of ciphers on GCE
at the time, but I do not recall what it was, and cannot find any
current GCE documentation of such a list.
So, just revert the explicit configuration and use sane openssh defaults.
PR: 230092
Submitted by: Gustavo Scalet <gustavo.scalet AT collabora.com>
MFC after: 3 days
Security: yes
Clean up execl*(3) manual page prototype formatting
Rendering of execle was missing a comma between the NULL argument and envp.
For unclear reasons, POSIX' definition of these routines comments out the
mandatory trailing NULL argument. That seems unnecessary and probably
(reasonably) confuses mdoc.
For unclear reasons, POSIX' definition of these routines spells NULL as
"(char *)0." This is needlessly unclear. One guess might be that POSIX
targets more exotic computer architectures than FreeBSD does. Fortunately,
there is no such problem on any reasonable platform for FreeBSD to support.
Spell NULL as NULL.
The comma was probably removed in r117204 while the comment and creative
spelling of NULL were added in r116537 (both 15 years ago).
Andrew Turner [Sat, 28 Jul 2018 17:21:34 +0000 (17:21 +0000)]
Use the cp15 functions to read cp15 registers rather than using assembly
functions. The former are static inline functions so will compile to a
single instruction.