https://www.illumos.org/issues/7181
zfsvfs_setup() is called in both zfs_mount and zfs_resume_fs paths.
dmu_objset_set_user(zfsvfs->z_os, zfsvfs) is called early in zfsvfs_setup()
before the setup is actually completed,
thus an under-constructed zfsvfs becomes visible.
Additionally, there is nothing to serialize the two call paths. As a result two
threads can step on each other's toes.
assertion failed: zilog->zl_clean_taskq == NULL, file:
../../common/fs/zfs/zil.c, line: 1772
https://www.illumos.org/issues/7199
dsl_dataset_rollback_sync may try to free already freed blocks when it calls
dsl_destroy_head_sync_impl to destroy a temporary clone.
That happens if a snapshot to which we are rolling back and from which the
clone is created has some ZIL records.
https://www.illumos.org/issues/7200
No new blocks must be born in a dataset in the same TXG after a snapshot of the
dataset is taken.
Those blocks would have the same blk_birth as the dataset's ds_prev_snap_txg
and as such they would be presumed to belong o the snapshot while in fact they
do not.
All the datasets must be clean before sync tasks are run, so the described
scenario may happen only if one of the sync tasks dirties the dataset and
another sync task takes its snapshot.
Then, there will be another sync pass because of the dirty data and the new
blocks will be born in the same TXG when the data is written out.
It seems that almost all of the existing sync tasks modify only MOS and do not
dirty any objsets.
The only exception that I've been able to identify so far is the rollback which
can modify an objset when it zeroes out the objset's ZIL.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Brad Lewis <brad.lewis@delphix.com>
Approved by: Gordon Ross <gordon.w.ross@gmail.com>
Author: Andriy Gapon <andriy.gapon@clusterhq.com>
https://www.illumos.org/issues/7180
If a filesystem is not unmounted while the rename is being performed, then, for
example, a concurrect zfs rollback may call zfs_suspend_fs followed by
zfs_resume_fs on the same filesystem.
The latter takes the filesystem's name as an argument. If the filesystem name
changes as a result of the rename, then dmu_objset_hold(osname, zfsvfs, &os)
call in zfs_resume_fs would fail resulting in a kernel panic.
So far I have been able to reproduce this problem on FreeBSD where zfs rename
has -u option that skips the unmounting before doing the renaming.
But I think that in theory the same problem can occur on illumos as well,
because the unmounting is done in userland before invoking the rename ioctl and
there could be a race with, e.g., zfs mount.
panic: solaris assert: dmu_objset_hold(osname, zfsvfs, &zfsvfs->z_os) == 0 (0x2
== 0x0), file: /usr/devel/svn/head/sys/cddl/contrib/opensolaris/uts/common/fs/
zfs/zfs_vfsops.c, line: 2210
KDB: stack backtrace:
db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfffffe004df30710
vpanic() at vpanic+0x182/frame 0xfffffe004df30790
panic() at panic+0x43/frame 0xfffffe004df307f0
assfail3() at assfail3+0x2c/frame 0xfffffe004df30810
zfs_resume_fs() at zfs_resume_fs+0xb9/frame 0xfffffe004df30860
zfs_ioc_rollback() at zfs_ioc_rollback+0x61/frame 0xfffffe004df308a0
zfsdev_ioctl() at zfsdev_ioctl+0x65c/frame 0xfffffe004df30940
devfs_ioctl_f() at devfs_ioctl_f+0x156/frame 0xfffffe004df309a0
kern_ioctl() at kern_ioctl+0x246/frame 0xfffffe004df30a00
sys_ioctl() at sys_ioctl+0x171/frame 0xfffffe004df30ae0
amd64_syscall() at amd64_syscall+0x2db/frame 0xfffffe004df30bf0
Xfast_syscall() at Xfast_syscall+0xfb/frame 0xfffffe004df30bf0
Reviewed by: Matt Ahrens <mahrens@delphix.com>
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Approved by: Richard Lowe <richlowe@richlowe.net>
avg [Thu, 24 Nov 2016 09:47:56 +0000 (09:47 +0000)]
firewire: initialize tag label to -1 in fw_xfer_alloc()
Zero can be confused for a potentially valid value.
For example, if I load and unload sbp driver I get a lot of messages
like the following:
fw_tl_free: the xfer is not in the queue (tlabel=0, flag=0x0)
send: dst=0x00 tl=0x00 rt=0 tcode=0x0 pri=0x0 src=0x000
recv: dst=0x01 tl=0x21 rt=1 tcode=0x1 pri=0x0 src=0xffc0
KDB: stack backtrace:
db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfffffe04464407e0
fw_tl_free() at fw_tl_free+0x18d/frame 0xfffffe0446440820
fw_xfer_unload() at fw_xfer_unload+0xca/frame 0xfffffe0446440840
fw_xferlist_remove() at fw_xferlist_remove+0x2f/frame 0xfffffe0446440870
sbp_detach() at sbp_detach+0x1e0/frame 0xfffffe04464408e0
device_detach() at device_detach+0x80/frame 0xfffffe0446440900
devclass_driver_deleted() at devclass_driver_deleted+0x6a/frame 0xfffffe0446440940
devclass_delete_driver() at devclass_delete_driver+0x7d/frame 0xfffffe0446440980
driver_module_handler() at driver_module_handler+0xff/frame 0xfffffe04464409d0
module_unload() at module_unload+0x32/frame 0xfffffe04464409f0
linker_file_unload() at linker_file_unload+0x24b/frame 0xfffffe0446440a40
kern_kldunload() at kern_kldunload+0xbc/frame 0xfffffe0446440a70
amd64_syscall() at amd64_syscall+0x314/frame 0xfffffe0446440bf0
Xfast_syscall() at Xfast_syscall+0xfb/frame 0xfffffe0446440bf0
avg [Thu, 24 Nov 2016 09:43:42 +0000 (09:43 +0000)]
fwohci: report whether PhysicalUpperBound register is implemented
Please see section 5.15 of 1394 OHCI Specification.
If the register is not implemented, then the physical response unit is
limited to the first 4GB of the physical memory.
In that case the non-cooperative debugging over firewire (using /dev/fwmem)
can not be expected to work if a target has more RAM than that.
The method is described in gdb.4 and the Developer's Handbook.
It seems that most of the consumer hardware does not implement
PhysicalUpperBound register.
sephe [Thu, 24 Nov 2016 07:35:16 +0000 (07:35 +0000)]
hyperv/hn: Fix primary channel revocation
Since hypervisor will not drain the TX bufring, once the channels are
revoked:
- Setup vmbus orphan handler properly.
- Make sure that suspension will not wait the TX bufring draining
forever.
- GC the pending TX descs on detach path, before freeing the busdma
stuffs.
MFC after: 1 week
Sponsored by: Microsoft
Differential Revision: https://reviews.freebsd.org/D8559
sephe [Thu, 24 Nov 2016 06:01:29 +0000 (06:01 +0000)]
hyperv/vmbus: Fix the multi-channel revoking on vmbus side.
- Reference count the sub-channel when channel offer message is
processed, so that immediate rescind message on the same channel
will not race sub-channel open on driver side.
- Drop the above reference when sub-channel is closed, this closely
mimics the hypervisor's reaction when primary channel is closed
on the VM side. No drivers use sub-channel after primary channel
is closed.
MFC after: 1 week
Sponsored by: Microsoft
Differential Revision: https://reviews.freebsd.org/D8546
dexuan [Thu, 24 Nov 2016 05:52:28 +0000 (05:52 +0000)]
share/man/man4/Makefile: Only install Hyper-V man pages on amd64 and i386
We shouldn't install them on the architectures not supported by Hyper-V.
And, hv_ata_pci_disengage.4.gz should be removed from all architectures:
1) It should have only applied to Hyper-V;
2) For Hyper-V platforms (amd64 and i386), the related driver was removed by
r306426 | sephe | 2016-09-29 09:41:52 +0800 (Thu, 29 Sep 2016),
because now we have a better mechanism to disble the ata driver for hard
disks when the VM runs on Hyper-V.
Reviewed by: sephe, andrew, jhb
Approved by: sephe (mentor)
MFC after: 3 days
Sponsored by: Microsoft
Differential Revision: https://reviews.freebsd.org/D8572
sephe [Thu, 24 Nov 2016 05:18:45 +0000 (05:18 +0000)]
hyperv/vmbus: Fix the primary channel revoking on vmbus side.
Drivers can now use vmbus_chan_{is_revoked,set_orphan,unset_orphan}() and
vmbus_xact_ctx_orphan() to fix their attach/detach DEVMETHODs for revoked
primary channels.
MFC after: 1 week
Sponsored by: Microsoft
Differential Revision: https://reviews.freebsd.org/D8545
gonzo [Thu, 24 Nov 2016 00:45:52 +0000 (00:45 +0000)]
[rpi3] Move SOC_BRCM_BCM2837 from UP config to SMP one
Now that BCM283x source are buildable with SMP option it cam be moved to
GENERIC SMP config. SMP itself does not work on RPi3 yet due to lack of
PSCI monitor which is work in progress at the moment
jhb [Wed, 23 Nov 2016 20:21:53 +0000 (20:21 +0000)]
Fix _mips_rtld_bind() to handle ELF filters.
MIPS does not use the common _rtld_bind() to handle runtime binding.
Instead, it uses a private _mips_rtld_bind(). Update _mips_rtld_bind()
to include the changes made to _rtld_bind() in r216695 and r218476 to
support upgrading the read-locked rtld_bind_lock to a write lock when
an object with a filter is encountered.
While here, add a 'where' variable to track the location of the fixup
in the GOT to make the code flow more closely match _rtld_bind().
mjg [Wed, 23 Nov 2016 19:50:12 +0000 (19:50 +0000)]
cache: ensure that the number of bucket locks does not exceed hash size
The size can be changed by side effect of modifying kern.maxvnodes.
Since numbucketlocks was not modified, setting a sufficiently low value
would give more locks than actual buckets, which would then lead to
corruption.
manu [Wed, 23 Nov 2016 18:31:34 +0000 (18:31 +0000)]
Enable UEXT related nodes for Olimex A20 SOM
UEXT are Universal EXTension connector from Olimex. They embed i2c, spi
and uart pins along power in one connector and are found on most,
if not all, Olimex boards.
The Olimex A20 SOM EVB have two UEXT connector so enable the nodes found on
those two connectors.
Patch has been applied upstream, in the meantime add the nodes to our custom
DTS.
markj [Wed, 23 Nov 2016 17:53:07 +0000 (17:53 +0000)]
Release laundered vnode pages to the head of the inactive queue.
The swap pager enqueues laundered pages near the head of the inactive queue
to avoid another trip through LRU before reclamation. This change adds
support for this behaviour to the vnode pager and makes use of it in UFS and
ext2fs. Some ioflag handling is consolidated into a common subroutine so
that this support can be easily extended to other filesystems which make use
of the buffer cache. No changes are needed for ZFS since its putpages
routine always undirties the pages before returning, and the laundry
thread requeues the pages appropriately in this case.
julian [Wed, 23 Nov 2016 07:57:52 +0000 (07:57 +0000)]
This little BSD licensed library has been kicking around for years.
It allows one to trivially convert an absolute path to a relative path
and the reverse. The test programs themselves are very useful in scripts
but the real use comes shortly with the -r and -a arguments to ln.
These are sometimes known as the --relative and --absolute flags and
can force a symlink to be relative when you only have an absolue path.
Another place these are sometimes used is to add -a and -r args to 'realpath'.
Incredibly useful in Makefiles.
I was going to just add the files in with 'ln' but a library makes more sense.
The test programs may come out in their own right some day for scripting.
released under a BSD 2-clause:
* Copyright (c) 1997 Shigio Yamaguchi. All rights reserved.
* Copyright (c) 1999 Tama Communications Corporation. All rights reserved.
The test directry does not conform to any framework.
Not connected to build.
doc people may want to play with the manual pages.
brooks [Tue, 22 Nov 2016 22:45:15 +0000 (22:45 +0000)]
Allocate a struct ifreq rather than using a (wrong) computed size for
the BIOCSETIF ioctl.
The kernel always copies an entire struct ifreq and IPv4 addresses will
always fit in an ifreq.
On systems with pointers larger than 64-bits, the computed size will be
less than the size of struct ifreq, potentially resulting in the kernel
attempting to copyin memory from outside the allocation.
jilles [Tue, 22 Nov 2016 22:30:55 +0000 (22:30 +0000)]
open(2): Clarify non-POSIX error when opening a symlink with O_NOFOLLOW.
We return [EMLINK] instead of [ELOOP] when trying to open a symlink with
O_NOFOLLOW, so that the original case of [ELOOP] can be distinguished. Code
like cmp -h and xz takes advantage of this.
alc [Tue, 22 Nov 2016 18:13:46 +0000 (18:13 +0000)]
Remove PG_CACHED-related fields from struct vmmeter, because they are no
longer used. More precisely, they are always zero because the code that
decremented and incremented them no longer exists.
andrew [Tue, 22 Nov 2016 18:13:04 +0000 (18:13 +0000)]
Only build acpi_timer.c on x86, it fails on arm64 as it attempts to access
an invalid address. It is also unneeded on arm64 as we use the ARM Generic
Timer driver.
Obtained from: ABT Systems Ltd
Sponsored by: The FreeBSD Foundation
adrian [Tue, 22 Nov 2016 17:36:16 +0000 (17:36 +0000)]
[net80211] high oops on the high seas, or "god damnit compilers, it's 2016 and you're supposed to save me from this."
TODO:
* drink real coffee before committing in the morning, or there's a high
risk of more obviously self-evident commits being turned into attempts
at humour.
kib [Tue, 22 Nov 2016 10:58:24 +0000 (10:58 +0000)]
Use buffer pager for NFS.
The pager, due to its construction, implements clustering for the
page-ins. In particular, buildworld load demonstrates reduction of
the READ RPCs from 39k down to 24k. No change in real or CPU time was
observed.
Discussed with, and measured by: bde
No objections from: rmacklem
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
adrian [Tue, 22 Nov 2016 01:22:54 +0000 (01:22 +0000)]
[net80211] flesh out more IBSS 11n support
* Pepper comments around which describe what state(s) we're in when faking
up 11n nodes.
* By default don't fake it up as 11n until we properly negotiate the 11n
capabilities using probe request/response frames.
* Send a probe request with our HT information, as the 802.11-2012 spec
suggests.
* Reassociate with the driver if we've been promoted.
This is done because although learning a peer via beacons can learn 11n
state, learning peers via hearing probe frames and broadcast frames
does not. Thus, sometimes you end up with an 11n peer in the peer
table and sometimes you don't.
Note that the probe request/response exchange may not actually succeed.
Ideally we'd put the peer into some blocking state until we've exchanged
probe request/reponse to learn capabilities, or we timeout and just
stay non-11n.
This is more an experiment to get 11n IBSS nodes actually discovering
each other and be able to transmit. There are other issues that creep
up which I'll attempt to address in future commits.
jhb [Tue, 22 Nov 2016 01:02:59 +0000 (01:02 +0000)]
Initialize 'ticks' earlier in boot after 'hz' is set.
This avoids the time-warp after kthreads have started running and the
required fixup to td_slptick and td_blktick in the EARLY_AP_STARTUP
case. Now, 'ticks' is initialized before any kthreads are created or
any context switches are performed.
rwatson [Tue, 22 Nov 2016 00:41:24 +0000 (00:41 +0000)]
Audit 'fd' and 'cmd' arguments to fcntl(2), and when generating BSM,
always audit the file-descriptor number and vnode information for all
fnctl(2) commands, not just locking-related ones. This was likely an
oversight in the original adaptation of this code from XNU.
gjb [Mon, 21 Nov 2016 23:29:28 +0000 (23:29 +0000)]
Set the 'vital' flag on the runtime and jail packages.
The default pkg(8) from pkg.freebsd.org requires libjail.so,
so mark the jail package as vital along with the runtime
package to avoid errors when libjail.so is removed. This is
a no-op for systems with WITHOUT_JAIL in src.conf(5) and pkg(8)
built from the Ports Collection.
In order to make this work without marking packages such as
the jail-lib32, for example, the jail.ucl file needed to be
split out into separate files similarly to the runtime-*.ucl
files.
Glanced at by: brd
MFC after: 5 days
Sponsored by: The FreeBSD Foundation
hiren [Mon, 21 Nov 2016 20:53:11 +0000 (20:53 +0000)]
For RTT calculations mid-session, we explicitly ignore ACKs with tsecr of 0 as
many borken middle-boxes tend to do that. But during 3whs, in syncache_expand(),
we don't do that which causes us to send a RST to such a client. Relax this
constraint by only using tsecr to compare against timestamp that we sent when it
is not 0. As a result, we'd now accept the final ACK of 3whs with tsecr of 0.
mizhka [Mon, 21 Nov 2016 19:26:22 +0000 (19:26 +0000)]
[etherswitch] add ukswitch hint that is phy offset at mdio register
This patch allows to specify PHY register offset for ukswitch. For instance,
switch MAICREL KS8995XA connected via MDIO to SoC, but PHY register starts
at 1. So hint for this case is: hint.ukswitch.0.phyoffset=1
andrew [Mon, 21 Nov 2016 18:24:05 +0000 (18:24 +0000)]
To allow for an ACPI attachment to the generic PCIe driver split off the
FDT attachment to a new file. A separate ACPI attachment will then be added
to allow arm64 servers with ACPI to use it over FDT.
This should also help with merging this with the ofwpci driver, with
further work needed to remove restrictions this driver places on resource
allocation.
Obtained from: ABT Systems Ltd
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D7319
kib [Mon, 21 Nov 2016 14:13:57 +0000 (14:13 +0000)]
Adjust r308689 to make rtld compilable with either in-tree or
(hopefully) stock gcc 4.2.1 on i386 and other arches.
In particular:
- Do not use %ebx in the asm constraints on i386, since rtld is
compiled with -fPIC and gcc cannot handle GOT-base register reload
(clang and newer gcc can).
- Avoid direct use of [static N] construct in the function
declaration/definion. In-tree gcc was patched to support this, but
stock 4.2.1 cannot handle the feature.
Requested by: bde
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
andrew [Mon, 21 Nov 2016 11:27:14 +0000 (11:27 +0000)]
Add an arm64 specific uart cpu driver. As arm64 may use ACPI to find the
uart we need to handle both it and FDT, and as such we need to have an
architecture specific driver.
Obtained from: ABT Systems Ltd
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D7796
andrew [Mon, 21 Nov 2016 11:18:00 +0000 (11:18 +0000)]
Add accelerated AES with using the ARMv8 crypto instructions. This is based
on the AES-NI code, and modified as needed for use on ARMv8. When loaded
the driver will check the appropriate field in the id_aa64isar0_el1
register to see if AES is supported, and if so the probe function will
signal the driver should attach.
With this I have seen up to 2000Mb/s from the cryptotest test with a single
thread on a ThunderX Pass 2.0.
Reviewed by: imp
Obtained from: ABT Systems Ltd
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D8297
vangyzen [Sun, 20 Nov 2016 20:13:22 +0000 (20:13 +0000)]
Fix error reporting from wcstof()
When wcstof() skipped initial space and then parsing failed, it set
endptr to the first non-space character. Fix it to correctly report
failure by setting endptr to the beginning of the input string.
The fix is from theraven@, who fixed this bug in wcstod() and
wcstold() in r227753.
While I'm here:
Move assignments out of declarations in wcstod() and wcstold().
This is against my personal preference, but it is our agreed style(9).
Set endptr correctly on malloc() failure in all three functions.
Remove an incorrect comment: This is pointer arithmetic,
so the code was not actually making that assumption.
wcstold() advanced the wcp pointer beyond leading whitespace
and then reset it back to the beginning of the string.
Do not reset it. This seems to have no functional effect,
since strtold_l() also skips leading whitespace. I'm making
the change to keep this function consistent with wcstof() and
wcstod(), and because the C11 spec prescribes the use of iswspace()
to skip leading space.
Reported by: libc++ unit test for std::stof(std::wstring)
MFC after: 8 days
Sponsored by: Dell EMC