Fabien Thomas [Fri, 25 Nov 2016 14:44:49 +0000 (14:44 +0000)]
IPsec RFC6479 support for replay window sizes up to 2^32 - 32 packets.
Since the previous algorithm, based on bit shifting, does not scale
with large replay windows, the algorithm used here is based on
RFC 6479: IPsec Anti-Replay Algorithm without Bit Shifting.
Updating the replay window is fast, but the window costs as many bits
of RAM as its size.
The previous implementation did not provide a lock on the replay window,
which may lead to replay issues.
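
A minimal sketch of the idea behind RFC 6479, in Python rather than the
kernel's C. The class and method names are hypothetical and this is not
the committed code: the window is kept as a ring of 32-bit words, and it
slides by zeroing whole words instead of shifting the bitmap. A lock
guards each update, which the commit message says the previous
implementation lacked.

```python
import threading


class ReplayWindow:
    """Anti-replay window stored as a ring of 32-bit bitmap words."""

    WORD_BITS = 32

    def __init__(self, window_bits):
        # One spare word beyond the requested size, so the window can
        # slide by clearing whole words instead of shifting bits.
        self.nwords = -(-window_bits // self.WORD_BITS) + 1
        self.window = (self.nwords - 1) * self.WORD_BITS
        self.words = [0] * self.nwords
        self.last = 0  # highest sequence number accepted so far
        self.lock = threading.Lock()

    def check_and_update(self, seq):
        """Return True and record seq if it is new, else return False."""
        with self.lock:
            if seq == 0 or seq + self.window <= self.last:
                return False  # invalid, or older than the window
            index = seq // self.WORD_BITS
            if seq > self.last:
                # Advance: zero every word entered since the old top.
                cur = self.last // self.WORD_BITS
                for i in range(1, min(index - cur, self.nwords) + 1):
                    self.words[(cur + i) % self.nwords] = 0
                self.last = seq
            slot = index % self.nwords
            bit = 1 << (seq % self.WORD_BITS)
            if self.words[slot] & bit:
                return False  # already seen: replay
            self.words[slot] |= bit
            return True
```

With a 64-packet window, a sequence number is accepted once, rejected
when replayed, and rejected when it falls 64 or more behind the highest
number seen. Advancing costs at most one pass over the words, however
far the sequence number jumps.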
Fabien Thomas [Fri, 25 Nov 2016 13:49:33 +0000 (13:49 +0000)]
In a dual-processor system (2*6 cores), IPsec throughput tests show a
lot of contention on the arc4 lock, which is used to generate the IV
of the ESP output packets.
This patch splits the mutex in order to reduce that contention.
Ed Maste [Fri, 25 Nov 2016 13:15:28 +0000 (13:15 +0000)]
Add WITH_LLD_AS_LD build knob
If set it installs LLD as /usr/bin/ld. LLD (as of version 3.9) is not
capable of linking the world and kernel, but can self-host and link many
substantial applications. GNU ld continues to be used for the world and
kernel build, regardless of how this knob is set.
It is on by default for arm64, and off for all other CPU architectures.
Dimitry Andric [Thu, 24 Nov 2016 22:54:55 +0000 (22:54 +0000)]
Upgrade our copies of clang, llvm, lldb, compiler-rt and libc++ to 3.9.0
release, and add lld 3.9.0. Also completely revamp the build system for
clang, llvm, lldb and their related tools.
Please note that from 3.5.0 onwards, clang, llvm and lldb require C++11
support to build; see UPDATING for more information.
Release notes for llvm, clang and lld are available here:
<http://llvm.org/releases/3.9.0/docs/ReleaseNotes.html>
<http://llvm.org/releases/3.9.0/tools/clang/docs/ReleaseNotes.html>
<http://llvm.org/releases/3.9.0/tools/lld/docs/ReleaseNotes.html>
Thanks to Ed Maste, Bryan Drewery, Andrew Turner, Antoine Brodin and Jan
Beich for their help.
virtio_console: handle short writes to a Unix domain socket gracefully.
writev() can do a short write. Retrying it results in very convoluted
and complex code, so we iterate over the iovec and do a regular
stream_write() for each entry instead.
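
The approach can be sketched as follows. This is Python, not the bhyve C
code; write_all() and write_iov() are hypothetical names, and a blocking
descriptor is assumed. Each buffer is written with its own retry loop,
so a short write only needs to resume inside one buffer and never
requires rebuilding a partially consumed vector.

```python
import os


def write_all(fd, data):
    """Write every byte of data to fd, retrying after short writes."""
    view = memoryview(data)
    while len(view) > 0:
        written = os.write(fd, view)  # may write fewer bytes than asked
        view = view[written:]


def write_iov(fd, buffers):
    """Write each buffer in turn; return the total number of bytes."""
    total = 0
    for buf in buffers:
        write_all(fd, buf)
        total += len(buf)
    return total
```

The cost is one system call per buffer instead of one per vector, in
exchange for much simpler retry logic.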
Andriy Gapon [Thu, 24 Nov 2016 21:32:04 +0000 (21:32 +0000)]
virtio_pci: fix announcement of MSI-X interrupts for queues
Queues that do not need interrupts - for instance, output queues - do
not have a corresponding entry in vtpci_msix_vq_interrupts.
So, it was wrong to increment a pointer into that array when iterating
over such a queue.
I ran into this bug while trying to use virtio_console(4) that allocates
a lot of queues with every other being an output queue without an
interrupt handler (if MultiplePorts feature is negotiated).
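
An illustrative sketch of the iteration pattern, in Python with
hypothetical names rather than the driver's own: the interrupt array
has entries only for queues that need an interrupt, so the index into
it advances only for those queues.

```python
def pair_queues_with_interrupts(queue_needs_intr, interrupts):
    """Give each queue its interrupt entry, or None if it has none."""
    assigned = []
    next_intr = 0  # index into the (shorter) interrupts list
    for needs_intr in queue_needs_intr:
        if not needs_intr:
            # No entry exists for this queue: do not advance next_intr.
            assigned.append(None)
            continue
        assigned.append(interrupts[next_intr])
        next_intr += 1
    return assigned
```

For alternating input and output queues, such as
[True, False, True, False] with two interrupt entries, the second input
queue gets the second entry instead of running past the end of the
array.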
Justin Hibbits [Thu, 24 Nov 2016 20:31:46 +0000 (20:31 +0000)]
Fix the build post-r309017 for MPC85XX/MPC85XXSPE
r309017 removed two fields from struct vmmeter, which is embedded in struct
pcpu. This caused the struct size to change, triggering the CTASSERT in
sys/pcpu.h. Add the extra 8 bytes back in as padding.
Add a warning against modifying this code without understanding it, and
an example of how not to make it more portable. I've had this lying
around uncommitted since 2009...
https://www.illumos.org/issues/7181
zfsvfs_setup() is called in both zfs_mount and zfs_resume_fs paths.
dmu_objset_set_user(zfsvfs->z_os, zfsvfs) is called early in zfsvfs_setup()
before the setup is actually completed,
thus an under-constructed zfsvfs becomes visible.
Additionally, there is nothing to serialize the two call paths. As a result two
threads can step on each other's toes.
assertion failed: zilog->zl_clean_taskq == NULL, file:
../../common/fs/zfs/zil.c, line: 1772
https://www.illumos.org/issues/7199
dsl_dataset_rollback_sync may try to free already freed blocks when it calls
dsl_destroy_head_sync_impl to destroy a temporary clone.
That happens if a snapshot to which we are rolling back and from which the
clone is created has some ZIL records.
https://www.illumos.org/issues/7200
No new blocks must be born in a dataset in the same TXG after a snapshot of the
dataset is taken.
Those blocks would have the same blk_birth as the dataset's ds_prev_snap_txg
and as such they would be presumed to belong to the snapshot while in fact they
do not.
All the datasets must be clean before sync tasks are run, so the described
scenario may happen only if one of the sync tasks dirties the dataset and
another sync task takes its snapshot.
Then, there will be another sync pass because of the dirty data and the new
blocks will be born in the same TXG when the data is written out.
It seems that almost all of the existing sync tasks modify only MOS and do not
dirty any objsets.
The only exception that I've been able to identify so far is the rollback which
can modify an objset when it zeroes out the objset's ZIL.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Brad Lewis <brad.lewis@delphix.com>
Approved by: Gordon Ross <gordon.w.ross@gmail.com>
Author: Andriy Gapon <andriy.gapon@clusterhq.com>
https://www.illumos.org/issues/7180
If a filesystem is not unmounted while the rename is being performed, then, for
example, a concurrent zfs rollback may call zfs_suspend_fs followed by
zfs_resume_fs on the same filesystem.
The latter takes the filesystem's name as an argument. If the filesystem name
changes as a result of the rename, then dmu_objset_hold(osname, zfsvfs, &os)
call in zfs_resume_fs would fail resulting in a kernel panic.
So far I have been able to reproduce this problem on FreeBSD where zfs rename
has -u option that skips the unmounting before doing the renaming.
But I think that in theory the same problem can occur on illumos as well,
because the unmounting is done in userland before invoking the rename ioctl and
there could be a race with, e.g., zfs mount.
panic: solaris assert: dmu_objset_hold(osname, zfsvfs, &zfsvfs->z_os) == 0 (0x2
== 0x0), file: /usr/devel/svn/head/sys/cddl/contrib/opensolaris/uts/common/fs/
zfs/zfs_vfsops.c, line: 2210
KDB: stack backtrace:
db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfffffe004df30710
vpanic() at vpanic+0x182/frame 0xfffffe004df30790
panic() at panic+0x43/frame 0xfffffe004df307f0
assfail3() at assfail3+0x2c/frame 0xfffffe004df30810
zfs_resume_fs() at zfs_resume_fs+0xb9/frame 0xfffffe004df30860
zfs_ioc_rollback() at zfs_ioc_rollback+0x61/frame 0xfffffe004df308a0
zfsdev_ioctl() at zfsdev_ioctl+0x65c/frame 0xfffffe004df30940
devfs_ioctl_f() at devfs_ioctl_f+0x156/frame 0xfffffe004df309a0
kern_ioctl() at kern_ioctl+0x246/frame 0xfffffe004df30a00
sys_ioctl() at sys_ioctl+0x171/frame 0xfffffe004df30ae0
amd64_syscall() at amd64_syscall+0x2db/frame 0xfffffe004df30bf0
Xfast_syscall() at Xfast_syscall+0xfb/frame 0xfffffe004df30bf0
Reviewed by: Matt Ahrens <mahrens@delphix.com>
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Approved by: Richard Lowe <richlowe@richlowe.net>
Author: Andriy Gapon <andriy.gapon@clusterhq.com>
Andriy Gapon [Thu, 24 Nov 2016 09:47:56 +0000 (09:47 +0000)]
firewire: initialize tag label to -1 in fw_xfer_alloc()
Zero can be confused for a potentially valid value.
For example, if I load and unload sbp driver I get a lot of messages
like the following:
fw_tl_free: the xfer is not in the queue (tlabel=0, flag=0x0)
send: dst=0x00 tl=0x00 rt=0 tcode=0x0 pri=0x0 src=0x000
recv: dst=0x01 tl=0x21 rt=1 tcode=0x1 pri=0x0 src=0xffc0
KDB: stack backtrace:
db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfffffe04464407e0
fw_tl_free() at fw_tl_free+0x18d/frame 0xfffffe0446440820
fw_xfer_unload() at fw_xfer_unload+0xca/frame 0xfffffe0446440840
fw_xferlist_remove() at fw_xferlist_remove+0x2f/frame 0xfffffe0446440870
sbp_detach() at sbp_detach+0x1e0/frame 0xfffffe04464408e0
device_detach() at device_detach+0x80/frame 0xfffffe0446440900
devclass_driver_deleted() at devclass_driver_deleted+0x6a/frame 0xfffffe0446440940
devclass_delete_driver() at devclass_delete_driver+0x7d/frame 0xfffffe0446440980
driver_module_handler() at driver_module_handler+0xff/frame 0xfffffe04464409d0
module_unload() at module_unload+0x32/frame 0xfffffe04464409f0
linker_file_unload() at linker_file_unload+0x24b/frame 0xfffffe0446440a40
kern_kldunload() at kern_kldunload+0xbc/frame 0xfffffe0446440a70
amd64_syscall() at amd64_syscall+0x314/frame 0xfffffe0446440bf0
Xfast_syscall() at Xfast_syscall+0xfb/frame 0xfffffe0446440bf0
Andriy Gapon [Thu, 24 Nov 2016 09:43:42 +0000 (09:43 +0000)]
fwohci: report whether PhysicalUpperBound register is implemented
Please see section 5.15 of 1394 OHCI Specification.
If the register is not implemented, then the physical response unit is
limited to the first 4GB of the physical memory.
In that case the non-cooperative debugging over firewire (using /dev/fwmem)
cannot be expected to work if a target has more RAM than that.
The method is described in gdb.4 and the Developer's Handbook.
It seems that most of the consumer hardware does not implement
PhysicalUpperBound register.
Sepherosa Ziehau [Thu, 24 Nov 2016 07:35:16 +0000 (07:35 +0000)]
hyperv/hn: Fix primary channel revocation
Since the hypervisor will not drain the TX bufring once the channels are
revoked:
- Set up the vmbus orphan handler properly.
- Make sure that suspension will not wait forever for the TX bufring to
drain.
- GC the pending TX descs on the detach path, before freeing the busdma
resources.
MFC after: 1 week
Sponsored by: Microsoft
Differential Revision: https://reviews.freebsd.org/D8559
Sepherosa Ziehau [Thu, 24 Nov 2016 06:01:29 +0000 (06:01 +0000)]
hyperv/vmbus: Fix the multi-channel revoking on vmbus side.
- Reference count the sub-channel when channel offer message is
processed, so that immediate rescind message on the same channel
will not race sub-channel open on driver side.
- Drop the above reference when sub-channel is closed, this closely
mimics the hypervisor's reaction when primary channel is closed
on the VM side. No drivers use sub-channel after primary channel
is closed.
MFC after: 1 week
Sponsored by: Microsoft
Differential Revision: https://reviews.freebsd.org/D8546
Dexuan Cui [Thu, 24 Nov 2016 05:52:28 +0000 (05:52 +0000)]
share/man/man4/Makefile: Only install Hyper-V man pages on amd64 and i386
We shouldn't install them on the architectures not supported by Hyper-V.
And, hv_ata_pci_disengage.4.gz should be removed from all architectures:
1) It should have only applied to Hyper-V;
2) For Hyper-V platforms (amd64 and i386), the related driver was removed by
r306426 | sephe | 2016-09-29 09:41:52 +0800 (Thu, 29 Sep 2016),
because now we have a better mechanism to disable the ata driver for hard
disks when the VM runs on Hyper-V.
Reviewed by: sephe, andrew, jhb
Approved by: sephe (mentor)
MFC after: 3 days
Sponsored by: Microsoft
Differential Revision: https://reviews.freebsd.org/D8572
Sepherosa Ziehau [Thu, 24 Nov 2016 05:18:45 +0000 (05:18 +0000)]
hyperv/vmbus: Fix the primary channel revoking on vmbus side.
Drivers can now use vmbus_chan_{is_revoked,set_orphan,unset_orphan}() and
vmbus_xact_ctx_orphan() to fix their attach/detach DEVMETHODs for revoked
primary channels.
MFC after: 1 week
Sponsored by: Microsoft
Differential Revision: https://reviews.freebsd.org/D8545
[rpi3] Move SOC_BRCM_BCM2837 from UP config to SMP one
Now that the BCM283x sources are buildable with the SMP option, it can be
moved to the GENERIC SMP config. SMP itself does not work on RPi3 yet due to
the lack of a PSCI monitor, which is a work in progress at the moment.
John Baldwin [Wed, 23 Nov 2016 20:21:53 +0000 (20:21 +0000)]
Fix _mips_rtld_bind() to handle ELF filters.
MIPS does not use the common _rtld_bind() to handle runtime binding.
Instead, it uses a private _mips_rtld_bind(). Update _mips_rtld_bind()
to include the changes made to _rtld_bind() in r216695 and r218476 to
support upgrading the read-locked rtld_bind_lock to a write lock when
an object with a filter is encountered.
While here, add a 'where' variable to track the location of the fixup
in the GOT to make the code flow more closely match _rtld_bind().
Mateusz Guzik [Wed, 23 Nov 2016 19:50:12 +0000 (19:50 +0000)]
cache: ensure that the number of bucket locks does not exceed hash size
The size can be changed by side effect of modifying kern.maxvnodes.
Since numbucketlocks was not modified, setting a sufficiently low value
would give more locks than actual buckets, which would then lead to
corruption.
Emmanuel Vadot [Wed, 23 Nov 2016 18:31:34 +0000 (18:31 +0000)]
Enable UEXT related nodes for Olimex A20 SOM
UEXT is the Universal EXTension connector from Olimex. It carries i2c, spi
and uart pins along with power in one connector and is found on most,
if not all, Olimex boards.
The Olimex A20 SOM EVB has two UEXT connectors, so enable the nodes found on
those two connectors.
The patch has been applied upstream; in the meantime, add the nodes to our
custom DTS.
Mark Johnston [Wed, 23 Nov 2016 17:53:07 +0000 (17:53 +0000)]
Release laundered vnode pages to the head of the inactive queue.
The swap pager enqueues laundered pages near the head of the inactive queue
to avoid another trip through LRU before reclamation. This change adds
support for this behaviour to the vnode pager and makes use of it in UFS and
ext2fs. Some ioflag handling is consolidated into a common subroutine so
that this support can be easily extended to other filesystems which make use
of the buffer cache. No changes are needed for ZFS since its putpages
routine always undirties the pages before returning, and the laundry
thread requeues the pages appropriately in this case.
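
A toy model of the queueing policy, in Python with hypothetical names;
it is not the VM code and places pages exactly at the head, where the
commit says "near the head". Pages are reclaimed from the head of the
queue, so a freshly laundered page put there is reclaimed before pages
that were deactivated in the ordinary way.

```python
from collections import deque


class InactiveQueue:
    """Inactive page queue; index 0 is the head, reclaimed first."""

    def __init__(self):
        self.pages = deque()

    def deactivate(self, page):
        # Ordinary deactivation: the page goes to the tail and must
        # age through the whole queue before it is reclaimed.
        self.pages.append(page)

    def release_laundered(self, page):
        # A page that was just cleaned skips that second trip.
        self.pages.appendleft(page)

    def reclaim(self):
        """Remove and return the page at the head, or None if empty."""
        return self.pages.popleft() if self.pages else None
```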
Julian Elischer [Wed, 23 Nov 2016 07:57:52 +0000 (07:57 +0000)]
This little BSD licensed library has been kicking around for years.
It allows one to trivially convert an absolute path to a relative path
and the reverse. The test programs themselves are very useful in scripts
but the real use comes shortly with the -r and -a arguments to ln.
These are sometimes known as the --relative and --absolute flags and
can force a symlink to be relative when you only have an absolute path.
Another place these are sometimes used is to add -a and -r args to 'realpath'.
Incredibly useful in Makefiles.
I was going to just add the files in with 'ln' but a library makes more sense.
The test programs may come out in their own right some day for scripting.
released under a BSD 2-clause:
* Copyright (c) 1997 Shigio Yamaguchi. All rights reserved.
* Copyright (c) 1999 Tama Communications Corporation. All rights reserved.
The test directory does not conform to any framework.
Not connected to build.
doc people may want to play with the manual pages.
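
To illustrate the two conversions, here is a sketch using Python's
standard posixpath module, not the library added by this commit. The
function names abs2rel and rel2abs are chosen for illustration only.
Both conversions are purely lexical and do not resolve symlinks.

```python
import posixpath


def abs2rel(path, base):
    """Express the absolute path relative to the directory base."""
    return posixpath.relpath(path, start=base)


def rel2abs(path, base):
    """Turn a path relative to base back into an absolute path."""
    return posixpath.normpath(posixpath.join(base, path))
```

For example, abs2rel("/usr/src/sys", "/usr/local/lib") gives
"../../src/sys", and rel2abs("../../src/sys", "/usr/local/lib") gives
back "/usr/src/sys". The first form is what a relative symlink created
in /usr/local/lib would need as its target.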
Brooks Davis [Tue, 22 Nov 2016 22:45:15 +0000 (22:45 +0000)]
Allocate a struct ifreq rather than using a (wrong) computed size for
the BIOCSETIF ioctl.
The kernel always copies an entire struct ifreq and IPv4 addresses will
always fit in an ifreq.
On systems with pointers larger than 64 bits, the computed size will be
less than the size of struct ifreq, potentially resulting in the kernel
attempting to copyin memory from outside the allocation.
Jilles Tjoelker [Tue, 22 Nov 2016 22:30:55 +0000 (22:30 +0000)]
open(2): Clarify non-POSIX error when opening a symlink with O_NOFOLLOW.
We return [EMLINK] instead of [ELOOP] when trying to open a symlink with
O_NOFOLLOW, so that the original case of [ELOOP] can be distinguished. Code
like cmp -h and xz takes advantage of this.
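
A sketch of how a caller might handle this, in Python with a
hypothetical helper name. It accepts both [EMLINK], as described above,
and [ELOOP], which POSIX specifies, so it also works on other systems.
Note that [ELOOP] alone cannot tell a final-component symlink apart
from too many symlinks elsewhere in the path, which is the ambiguity
[EMLINK] avoids.

```python
import errno
import os


def open_no_symlink(path):
    """Open path read-only; return None if its last component is a symlink."""
    try:
        return os.open(path, os.O_RDONLY | os.O_NOFOLLOW)
    except OSError as exc:
        # EMLINK: the error described above. ELOOP: the POSIX error.
        if exc.errno in (errno.EMLINK, errno.ELOOP):
            return None
        raise
```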
Alan Cox [Tue, 22 Nov 2016 18:13:46 +0000 (18:13 +0000)]
Remove PG_CACHED-related fields from struct vmmeter, because they are no
longer used. More precisely, they are always zero because the code that
decremented and incremented them no longer exists.
Andrew Turner [Tue, 22 Nov 2016 18:13:04 +0000 (18:13 +0000)]
Only build acpi_timer.c on x86; it fails on arm64 as it attempts to access
an invalid address. It is also unneeded on arm64 as we use the ARM Generic
Timer driver.
Obtained from: ABT Systems Ltd
Sponsored by: The FreeBSD Foundation
Adrian Chadd [Tue, 22 Nov 2016 17:36:16 +0000 (17:36 +0000)]
[net80211] high oops on the high seas, or "god damnit compilers, it's 2016 and you're supposed to save me from this."
TODO:
* drink real coffee before committing in the morning, or there's a high
risk of more obviously self-evident commits being turned into attempts
at humour.
https://www.illumos.org/issues/6428
Scenario:
$ zfs create rpool/p
$ zfs set canmount=noauto rpool/p
$ zfs umount rpool/p
$ zfs create rpool/p/c
$ zfs get -r mounted,canmount rpool/p
NAME       PROPERTY  VALUE   SOURCE
rpool/p    mounted   no      -
rpool/p    canmount  noauto  local
rpool/p/c  mounted   yes     -
rpool/p/c  canmount  on      default
In another shell ensure that rpool/p/c is in use, for example:
$ cd /rpool/p/c
Then:
$ zfs set canmount=off rpool/p
cannot unmount '/rpool/p/c': Device busy
But there is no reason to try to unmount rpool/p/c in this scenario.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Approved by: Gordon Ross <gordon.w.ross@gmail.com>
Author: Andriy Gapon <andriy.gapon@clusterhq.com>
https://www.illumos.org/issues/6412
It seems that zfs receive -F -u would mount a received filesystem after
receiving a full stream if a destination filesystem already existed (and, thus,
got destroyed and re-created) and was mounted.
How to reproduce:
$ zfs create rpool/sandbox
$ zfs create rpool/sandbox/from
$ zfs create rpool/sandbox/to
$ zfs snap rpool/sandbox/from@snap
$ zfs send rpool/sandbox/from@snap | zfs recv -v -F -u rpool/sandbox/to
receiving full stream of rpool/sandbox/from@snap into rpool/sandbox/to@snap
received 41.7KB stream in 1 seconds (41.7KB/sec)
$ zfs get mounted rpool/sandbox/to
NAME                  PROPERTY  VALUE  SOURCE
rpool/tmp/sandbox/to  mounted   yes    -
This behavior can be problematic if the mountpoint property changes either
because it had a non-inherited value or the stream contains properties because
it has been generated with either -R or -p.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Paul Dagnelie <pcd@delphix.com>
Approved by: Richard Lowe <richlowe@richlowe.net>
Author: Andriy Gapon <andriy.gapon@clusterhq.com>
The pager, due to its construction, implements clustering for the
page-ins. In particular, buildworld load demonstrates reduction of
the READ RPCs from 39k down to 24k. No change in real or CPU time was
observed.
Discussed with, and measured by: bde
No objections from: rmacklem
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week