CyberLeo.Net >> Repos - FreeBSD/FreeBSD.git/log

8423 8199 7432 Implement large_dnode pool feature

8423 Implement large_dnode pool feature
8199 multi-threaded dmu_object_alloc()
7432 Large dnode pool feature

llumos/illumos-gate@54811da5ac6b517992fdc173df5d605e4e61fdc0
https://github.com/illumos/illumos-gate/commit/54811da5ac6b517992fdc173df5d605e4e61fdc0
https://www.illumos.org/issues/8423
https://www.illumos.org/issues/8199
https://www.illumos.org/issues/7432

  ZoL issues:
  Improved dnode allocation #6564
  Clean up large dnode code #6262
  Fix dnode_hold() freeing dnode behavior #8172
  Fix dnode allocation race #6414, #6439
  Partial: Raw sends must be able to decrease nlevels #6821, #6864
  Remove unnecessary txg syncs from receive_object() Closes #7197

Author: Toomas Soome <tsoome@me.com>

6585 sha512, skein, and edonr have an unenforced dependency on extensible dataset

illumos/illumos-gate@892586e8a147c02d7f4053cc405229a13e796928
https://github.com/illumos/illumos-gate/commit/892586e8a147c02d7f4053cc405229a13e796928

https://www.illumos.org/issues/6585
  In any pool without the extensible dataset feature flag already enabled,
  creating a dataset with dedup set to use one of the new checksums would result
  in the following panic as soon as any data was added:
  panic[cpu0]/thread=ffffff0006761c40: feature_get_refcount(spa, feature,
  &refcount) != 48 (0x30 != 0x30), file: ../../common/fs/zfs/zfeature.c line 390

  ffffff0006761830 fffffffffba8fbdd ()
  ffffff0006761890 zfs:feature_do_action+11a ()
  ffffff00067618c0 zfs:spa_feature_incr+1e ()
  ffffff0006761920 zfs:dmu_object_zapify+b7 ()
  ffffff00067619b0 zfs:dsl_dataset_activate_feature+97 ()
  ffffff0006761a20 zfs:dsl_dataset_sync+ba ()
  ffffff0006761ab0 zfs:dsl_pool_sync+153 ()
  ffffff0006761b70 zfs:spa_sync+26e ()
  ffffff0006761c20 zfs:txg_sync_thread+227 ()
  ffffff0006761c30 unix:thread_start+8 ()
  Inspection showed that feature->fi_feature was 7, which is the value of
  SPA_FEATURE_EXTENSIBLE_DATASET in the spa_feature enum.
  Testing shows that the panic can be prevented by explicitly setting extensible
  dataset as a dependency for the sha512, edonr, and skein feature flags.
  Alternatively, the new checksums code could possibly be changed to obviate the
  need for the dependency.

Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Richard Laager <rlaager@wiktel.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Author: ilovezfs <ilovezfs@icloud.com>

9689 zfs range lock code should not be zpl-specific

illumos/illumos-gate@7931524763ef94dc16989451dddd206563d03bb4

Reviewed by: Serapheim Dimitropoulos <serapheim.dimitro@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Author: Matthew Ahrens <mahrens@delphix.com>

9683 Allow bypassing devid in vdev_disk_open()

illumos/illumos-gate@6fe4f3002ca33af5ab7123ed78d81899dadf2fbb

Reviewed by: Sara Hartse <sara.hartse@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Approved by: Dan McDonald <danmcd@joyent.com>
Author: Pavel Zakharov <pavel.zakharov@delphix.com>

9847 leaking dd_clones (DMU_OT_DSL_CLONES) objects

illumos/illumos-gate@17fb938fd6cdce3ff1bb47dafda0774f742249a3

Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Serapheim Dimitropoulos <serapheim@delphix.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Richard Lowe <richlowe@richlowe.net>
Author: Matthew Ahrens <mahrens@delphix.com>

9559 zfs diff handles files on delete queue in fromsnap poorly

illumos/illumos-gate@20633e304b57bc98f70fdb194081b7023adf527b

Reviewed by: Joshua M. Clulow <josh@sysmgr.org>
Approved by: Richard Lowe <richlowe@richlowe.net>
Author: Paul Dagnelie <pcd@delphix.com>

9962 zil_commit should omit cache thrash

illumos/illumos-gate@cab3a55e158118937e07d059c46f1bc14d1f254d

Reviewed by: Matt Ahrens <matt@delphix.com>
Reviewed by: Brad Lewis <brad.lewis@delphix.com>
Reviewed by: Patrick Mooney <patrick.mooney@joyent.com>
Reviewed by: Jerry Jelinek <jerry.jelinek@joyent.com>
Approved by: Joshua M. Clulow <josh@sysmgr.org>
Author: Prakash Surya <prakash.surya@delphix.com>

9963 Seperate tunable for disabling ZIL vdev flush

illumos/illumos-gate@f8fdf6812567ce1a3b76a6bee2ee3f04bd3fcaca

Reviewed by: Matt Ahrens <matt@delphix.com>
Reviewed by: Brad Lewis <brad.lewis@delphix.com>
Reviewed by: Patrick Mooney <patrick.mooney@joyent.com>
Approved by: Dan McDonald <danmcd@joyent.com>
Author: Prakash Surya <prakash.surya@delphix.com>

9993 zil writes can get delayed in zio pipeline

illumos/illumos-gate@2258ad0b755b24a55c6173b1e6bb6188389f72dd

Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Brad Lewis <brad.lewis@delphix.com>
Reviewed by: Matt Ahrens <matt@delphix.com>
Approved by: Dan McDonald <danmcd@joyent.com>
Author: George Wilson <george.wilson@delphix.com>

9880 Race in ZFS parallel mount

illumos/illumos-gate@bc4c0ff1343a311cc24933908ac6c4455af09031

Reviewed by: Jason King <jason.king@joyent.com>
Reviewed by: Sebastien Roy <sebastien.roy@delphix.com>
Approved by: Joshua M. Clulow <josh@sysmgr.org>
Author: Andy Fiddaman <omnios@citrus-it.co.uk>

9466 add JSON output support to channel programs

illumos/illumos-gate@5267591016146502784860802129b16dab6f135c

Reviewed by: Matt Ahrens <matt@delphix.com>
Reviewed by: Serapheim Dimitropoulos <serapheim.dimitro@delphix.com>
Reviewed by: Sara Hartse <sara.hartse@delphix.com>
Reviewed by: John Kennedy <john.kennedy@delphix.com>
Approved by: Dan McDonald <danmcd@joyent.com>
Author: Alek Pinchuk <apinchuk@datto.com>

9738 9112 broke third block copy allocations within one metaslab group

illumos/illumos-gate@b86e7e3f0e50748bb5bb5cc91632d72ff17f08dd

Reviewed by: Paul Dagnelie <pcd@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Author: Alexander Motin <mav@FreeBSD.org>

9690 metaslab of vdev with no space maps was flushed during removal

illumos/illumos-gate@4e75ba682600b2bf19d158577d528c3db65050e8

Reviewed by: Matt Ahrens <matt@delphix.com>
Reviewed by: Brad Lewis <brad.lewis@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Author: Serapheim Dimitropoulos <serapheim@delphix.com>

9688 aggsum_fini leaks memory

illumos/illumos-gate@29bf2d68bef208274f5a54a14cc80c4a8cb76f53

Reviewed by: Serapheim Dimitropoulos <serapheim.dimitro@delphix.com>
Reviewed by: Matt Ahrens <matt@delphix.com>
Reviewed by: Prashanth Sreenivasa <pks@delphix.com>
Reviewed by: Jorgen Lundman <lundman@lundman.net>
Reviewed by: Igor Kozhukhov <igor@dilos.org>
Approved by: Robert Mustacchi <rm@joyent.com>
Author: Paul Dagnelie <pcd@delphix.com>

9681 ztest failure in spa_history_log_internal due to spa_rename()

illumos/illumos-gate@6aee0ad76969eb0027131b3a338f2d94ae86f728

Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Serapheim Dimitropoulos <serapheim.dimitro@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Author: Matthew Ahrens <mahrens@delphix.com>

9682 page fault in dsl_async_clone_destroy() while opening pool

illumos/illumos-gate@ade2c82828f0dca1f46919aa1bd936ea1a5a0047

Reviewed by: Brad Lewis <brad.lewis@delphix.com>
Reviewed by: Matt Ahrens <matt@delphix.com>
Reviewed by: Sara Hartse <sara.hartse@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Author: Serapheim Dimitropoulos <serapheim@delphix.com>

9862 fix typo in comment in vdev_impl.h

illumos/illumos-gate@84927f52bd837f6e4882a19e43fd026f1828d910

Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Robert Mustacchi <rm@joyent.com>
Author: Allan Jude <allanjude@freebsd.org>

1700 Add SCSI UNMAP support

illumos/illumos-gate@047c81d31d0f571d6652e97143cd15281de61e84

Reviewed by: Jerry Jelinek <jerry.jelinek@joyent.com>
Reviewed by: Garrett D'Amore <garrett@damore.org>
Reviewed by: Igor Kozhukhov <igor@dilos.org>
Approved by: Dan McDonald <danmcd@joyent.com>
Author: Saso Kiselkov <saso.kiselkov@nexenta.com>

9617 too-frequent TXG sync causes excessive write inflation

illumos/illumos-gate@7928f4baf4ab3230557eb6289be68aa7a3003f38

Reviewed by: Serapheim Dimitropoulos <serapheim.dimitro@delphix.com>
Reviewed by: Brad Lewis <brad.lewis@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Andrew Stormont <andyjstormont@gmail.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Author: Matthew Ahrens <mahrens@delphix.com>

8601 memory leak in get_special_prop()

illumos/illumos-gate@e19b450bec203d8be04447ea476d7a86b36d63a1

Reviewed by: Serapheim Dimitropoulos <serapheim.dimitro@delphix.com>
Reviewed by: Sara Hartse <sara.hartse@delphix.com>
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: Matt Ahrens <matt@delphix.com>
Reviewed by: Robert Mustacchi <rm@joyent.com>
Approved by: Dan McDonald <danmcd@joyent.com>
Author: John Gallagher <john.gallagher@delphix.com>

9677 panic from zio_write_gang_block() when creating dump device on fragmented rpool

illumos/illumos-gate@7341a7de4f0489193e0cfe11049a7bcf1596a4db

Reviewed by: Matt Ahrens <matt@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Prashanth Sreenivasa <pks@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Author: Brad Lewis <brad.lewis@delphix.com>

9616 Bogus error when attempting to set property on read-only pool

illumos/illumos-gate@f62db44dbcda5dd786bb821f1e6fd3ca2e6d4391

Reviewed by: Paul Dagnelie <pcd@delphix.com>
Reviewed by: Matt Ahrens <matt@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Author: Andrew Stormont <astormont@racktopsystems.com>

9058 postmortem DTrace frequently broken under vmware

illumos/illumos-gate@793bd7e3617ae7d3d24e8c6b7d6befe35f07ec1f

Reviewed by: Tim Kordas <tim.kordas@joyent.com>
Reviewed by: Patrick Mooney <patrick.mooney@joyent.com>
Reviewed by: Yuri Pankov <yuripv@yuripv.net>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Approved by: Matthew Ahrens <mahrens@delphix.com>
Author: Sam Gwydir <sam.gwydir@joyent.com>

5882 Temporary pool names

illumos/illumos-gate@04e56356520b98d5a93c496b10f02530bb6647e0
https://github.com/illumos/illumos-gate/commit/04e56356520b98d5a93c496b10f02530bb6647e0

https://www.illumos.org/issues/5882
  This is an import of the temporary pool names functionality from ZoL:
  https://github.com/zfsonlinux/zfs/commit/
  e2282ef57edc79cdce2a4b9b7e3333c56494a807
  https://github.com/zfsonlinux/zfs/commit/
  26b42f3f9d03f85cc7966dc2fe4dfe9216601b0e
  https://github.com/zfsonlinux/zfs/commit/
  2f3ec9006146844af6763d1fa4e823fd9047fd54
  https://github.com/zfsonlinux/zfs/commit/
  00d2a8c92f614f49d23dea5d73f7ea7eb489ccf1
  https://github.com/zfsonlinux/zfs/commit/
  83e9986f6eefdf0afc387f06407087bba3ead4e9
  https://github.com/zfsonlinux/zfs/commit/
  023bbe6f017380f4a04c5060feb24dd8cdda9fce
  It is intended to assist the creation and management of virtual machines
  that have their rootfs on ZFS on hosts that also have their rootfs on
  ZFS. These situations cause SPA namespace collisions when the standard
  name rpool is used in both cases. The solution is either to give each
  guest pool a name unique to the host, which is not always desireable, or
  boot a VM environment containing an ISO image to install it, which is
  cumbersome.
  As a side note, this commit includes the removal of `zpool import -r`,
  which previously did nothing.
  patch [Magnifier] (14.3 KB) Richard Yao, 2015-04-30 04:33 PM

Reviewed by: Matt Ahrens <matt@delphix.com>
Reviewed by: Igor Kozhukhov <igor@dilos.org>
Reviewed by: John Kennedy <john.kennedy@delphix.com>
Approved by: Dan McDonald <danmcd@joyent.com>
Author: Andriy Gapon <agapon@panzura.com>

9630 add lzc_rename and lzc_destroy to libzfs_core

illumos/illumos-gate@049ba636fa37a2892809192fc671bff9158a01cd
https://github.com/illumos/illumos-gate/commit/049ba636fa37a2892809192fc671bff9158a01cd

https://www.illumos.org/issues/9630
Rename and destroy are very useful operations that deserve to be in
libzfs_core. And they are not hard to implement too.

Reviewed by: Andy Stormont <astormont@racktopsystems.com>
Reviewed by: Matt Ahrens <matt@delphix.com>
Reviewed by: Serapheim Dimitropoulos <serapheim.dimitro@delphix.com>
Approved by: Dan McDonald <danmcd@joyent.com>
Author: Andriy Gapon <avg@FreeBSD.org>

9700 ZFS resilvered mirror does not balance reads

illumos/illumos-gate@82f63c3c2bf5e4378706e8dcfccf717d67371be9

Reviewed by: Toomas Soome <tsoome@me.com>
Reviewed by: Sanjay Nadkarni <sanjay.nadkarni@nexenta.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Approved by: Matthew Ahrens <mahrens@delphix.com>
Author: Jerry Jelinek <jerry.jelinek@joyent.com>

9626 move 'static xuio_stats_t xuio_stats' to file where it use

illumos/illumos-gate@857c96d257470e097e846ab8886580991b329c32

Reviewed by: Yuri Pankov <yuripv@yuripv.net>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Approved by: Dan McDonald <danmcd@joyent.com>
Author: Igor Kozhukhov <igor@dilos.org>

9672 Reserve a ZFS replication stream feature flag for ZSTD compression

illumos/illumos-gate@acd7f809f0376580771fe4df8aaeecebe4c40b2f

Author: Allan Jude <allanjude@freebsd.org>
Reviewed by: Matt Ahrens <matt@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>

9580 Add a hash-table on top of nvlist to speed-up operations

illumos/illumos-gate@2ec7644aab2a726a64681fa66c6db8731b160de1

Reviewed by: Matt Ahrens <matt@delphix.com>
Reviewed by: Sebastien Roy <sebastien.roy@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Author: Serapheim Dimitropoulos <serapheim@delphix.com>

8375 Kernel memory leak in nvpair code

illumos/illumos-gate@843c2111b160463f014d325560ad4b051711928e

Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Prashanth Sreenivasa <pks@delphix.com>
Reviewed by: Robert Mustacchi <rm@joyent.com>
Approved by: Dan McDonald <danmcd@joyent.com>
Author: Matthew Ahrens <mahrens@delphix.com>

7261 nvlist code should enforce name length limit

illumos/illumos-gate@48dd5e630c9b1773b7b10d08a3b90b6c9062d713

Reviewed by: Sebastien Roy <sebastien.roy@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Robert Mustacchi <rm@joyent.com>
Approved by: Dan McDonald <danmcd@omniti.com>
Author: Matthew Ahrens <mahrens@delphix.com>

7263 deeply nested nvlist can overflow stack

illumos/illumos-gate@9ca527c3d3dfa7c8f304b34a9e03b5eddace838f

Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Robert Mustacchi <rm@joyent.com>
Approved by: Dan McDonald <danmcd@omniti.com>
Author: Matthew Ahrens <mahrens@delphix.com>

9621 Make createtxg and guid properties public

illumos/illumos-gate@e8d4a73c868afb740396041be80ed2b141065e76

Reviewed by: Andy Stormont <astormont@racktopsystems.com>
Reviewed by: Paul Dagnelie <pcd@delphix.com>
Reviewed by: Matt Ahrens <matt@delphix.com>
Reviewed by: Yuri Pankov <yuripv@yuripv.net>
Approved by: Robert Mustacchi <rm@joyent.com>
Author: Josh Paetzel <josh@tcbug.org>

9465 ARC check for 'anon_size > arc_c/2' can stall the system

illumos/illumos-gate@abe1fd01ce5a83718c5a840daeab4abdaec1c104

Reviewed by: Sebastien Roy <sebastien.roy@delphix.com>
Reviewed by: Matt Ahrens <matt@delphix.com>
Reviewed by: Prashanth Sreenivasa <pks@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Author: Don Brady <don.brady@delphix.com>

9577 remove zfs_dbuf_evict_key tsd

The zfs_dbuf_evict_key TSD (thread-specific data) is not necessary - we can
instead pass a flag down in a few places to prevent recursive dbuf eviction.
Making this change has 3 benefits:

1. The code semantics are easier to understand.
2. On Linux, performance is improved, because creating/removing TSD values
(by setting to NULL vs non-NULL) is expensive, and we do it very often.
3. According to Nexenta, the current semantics can cause a deadlock when
concurrently calling dmu_objset_evict_dbufs() (which is rare today, but they
are working on a "parallel unmount" change that triggers this more easily)

illumos/illumos-gate@c2919acbea007fa95c709b60d073db9a24526e01

Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Serapheim Dimitropoulos <serapheim.dimitro@delphix.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed by: Andy Stormont <astormont@racktopsystems.com>
Approved by: Richard Lowe <richlowe@richlowe.net>
Author: Matthew Ahrens <mahrens@delphix.com>

9591 ms_shift can be incorrectly changed in MOS config for indirect vdevs that have been historically expanded

illumos/illumos-gate@11f6a9680e013a7c9c57dc0b64d3e91e2eee1a6b

Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <gwilson@zfsmail.com>
Reviewed by: John Kennedy <john.kennedy@delphix.com>
Reviewed by: Prashanth Sreenivasa <pks@delphix.com>
Reviewed by: Tim Chase <tim@chase2k.com>
Approved by: Richard Lowe <richlowe@richlowe.net>
Author: Serapheim Dimitropoulos <serapheim@delphix.com>

9338 moved dnode has incorrect dn_next_type

illumos/illumos-gate@c7fbe46df966ea665df63b6e6071808987e839d1

Reviewed by: Prashanth Sreenivasa <pks@delphix.com>
Reviewed by: Serapheim Dimitropoulos <serapheim.dimitro@delphix.com>
Reviewed by: Dan Kimmel <dan.kimmel@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Author: Matthew Ahrens <mahrens@delphix.com>

9439 ZFS double-free due to failure to dirty indirect block

illumos/illumos-gate@99a19144e82244f3426f055cc73af8a937c0135c

Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Paul Dagnelie <pcd@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Author: Matthew Ahrens <mahrens@delphix.com>

9438 Holes can lose birth time info if a block has a mix of birth times

Ultimately, the problem here is that when you truncate and write a file in
the same transaction group, the dbuf for the indirect block will be zeroed
out to deal with the truncation, and then written for the write. During
this process, we will lose hole birth time information for any holes in the
range. In the case where a dnode is being freed, we need to determine
whether the block should be converted to a higher-level hole in the zio
pipeline, and if so do it when the dnode is being synced out.

illumos/illumos-gate@738e2a3ce3b2579222d6855e7fe75b5bcfcddf8d

Reviewed by: Matt Ahrens <matt@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Author: Paul Dagnelie <pcd@delphix.com>

8906 uts: illumos rootfs should support salted cksum

illumos/illumos-gate@0dd498c0020c4a0dd34d350b07f97399a475d102

Reviewed by: C Fraire <cfraire@me.com>
Reviewed by: Robert Mustacchi <rm@joyent.com>
Reviewed by: Yuri Pankov <yuripv@yuripv.net>
Approved by: Dan McDonald <danmcd@joyent.com>
Author: Toomas Soome <tsoome@me.com>

9456 ztest failure in zil_commit_waiter_timeout

illumos/illumos-gate@b6031810da58df96413bf76e068638fcab1f228a

Reviewed by: Matt Ahrens <matt@delphix.com>
Reviewed by: Serapheim Dimitropoulos <serapheim.dimitro@delphix.com>
Approved by: Matt Ahrens <mahrens@delphix.com>
Author: Prakash Surya <prakash.surya@delphix.com>

9454 ::zfs_blkstats should count embedded blocks

illumos/illumos-gate@dec267e7ea9828898b1c64462daa6636c4ef5e29

Reviewed by: Dan Kimmel <dan.kimmel@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Author: Matthew Ahrens <mahrens@delphix.com>

9424 ztest failure: "unprotected error in call to Lua API (Invalid value type 'function' for key 'error')"

illumos/illumos-gate@fe3ba4d1227d8746116ece7240682b13595c3142

Reviewed by: Sebastien Roy <sebastien.roy@delphix.com>
Reviewed by: Paul Dagnelie <pcd@delphix.com>
Reviewed by: Don Brady <don.brady@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Author: Matthew Ahrens <mahrens@delphix.com>

9486 reduce memory used by device removal on fragmented pools

In the most fragmented real-world cases, this reduces memory used by the
mapping from ~1GB to ~50MB of RAM per 1TB of storage removed. Less
fragmented cases will typically also see around 50-100MB of RAM per 1TB
of storage.

illumos/illumos-gate@cfd63e1b1bcf7ba4bf72f55ddbd87ce008d2986d

Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Serapheim Dimitropoulos <serapheim.dimitro@delphix.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed by: Tim Chase <tim@chase2k.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Author: Matthew Ahrens <mahrens@delphix.com>

9479 fix wrong format specifier for vdev_id

illumos/illumos-gate@c7a7b2fa084facd68c3d1e860dfabe20c713877b

Reviewed by: Matt Ahrens <matt@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Author: Andriy Gapon <avg@FreeBSD.org>

9457 libzfs_import.c:add_config() has a memory leak

A memory leak occurs on lines 209 and 213 because the config is not freed
in the error case. The interface to add_config() seems less than ideal -
it would be better if it copied any data necessary from the config and the
caller freed it.

illumos/illumos-gate@ddfe901b12348d31c500fb57f9174e88860a4061

Reviewed by: Matt Ahrens <matt@delphix.com>
Reviewed by: Serapheim Dimitropoulos <serapheim.dimitro@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Author: sara hartse <sara.hartse@delphix.com>

9330 stack overflow when creating a deeply nested dataset

Datasets that are deeply nested (~100 levels) are impractical. We just put
a limit of 50 levels to newly created datasets. Existing datasets should
work without a problem.

illumos/illumos-gate@5ac95da7d61660aa299c287a39277cb0372be959

Reviewed by: John Kennedy <john.kennedy@delphix.com>
Reviewed by: Matt Ahrens <matt@delphix.com>
Approved by: Garrett D'Amore <garrett@damore.org>
Author: Serapheim Dimitropoulos <serapheim.dimitro@delphix.com>

9539 Make zvol operations use _by_dnode routines

Continues what was started in 7801 add more by-dnode routines by fully
converting zvols to avoid unnecessary dnode_hold() calls. This saves a
small amount of CPU time and slightly improves latencies of operations
on zvols.

illumos/illumos-gate@8dfe5547fbf0979fc1065a8b6fddc1e940a7cf4f

Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed by: Rick McNeal <rick.mcneal@nexenta.com>
Approved by: Dan McDonald <danmcd@joyent.com>
Author: Richard Yao <richard.yao@prophetstor.com>

9523 Large alloc in zdb can cause trouble

16MB alloc in zdb_embedded_block() can cause cores in certain situations
(clang, gcc55).

OsX commit: https://github.com/openzfsonosx/zfs/commit/ced236a5da6e72ea7bf6d2919fe14e17cffe10f1
FreeBSD commit: https://svnweb.freebsd.org/base?view=revision&revision=326150
illumos/illumos-gate@03a4c2f4bfaca30115963b76445279b36468a614

Reviewed by: Igor Kozhukhov <igor@dilos.org>
Reviewed by: Andriy Gapon <avg@FreeBSD.org>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Approved by: Dan McDonald <danmcd@joyent.com>
Author: Jorgen Lundman <lundman@lundman.net>

9487 Free objects when receiving full stream as clone

All objects after the last written or freed object are not supposed to
exist after receiving the stream. We should free them accordingly, as if
a freeobjects record for them had been included in the stream.

zfsonlinux/zfs@48fbb9ddbf2281911560dfbc2821aa8b74127315
illumos/illumos-gate@7864b8192b8d30471fa2240466d516292e5765b8

Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Approved by: Dan McDonald <danmcd@joyent.com>
Author: Paul Dagnelie <pcd@delphix.com>

9464 txg_kick() fails to see that we are quiescing, forcing transactions
to their next stages without leaving them accumulate changes

Ideally we would like txg_kick() to get triggered only when we are sure
that we are not syncing AND not quiescing any txg. This way we can kick
an open TXG to the quiescing state when we are sure that there is nothing
going on and we would benefit from the different states running
concurrently.

illumos/illumos-gate@fa41d87de9ec9000964c605eb01d6dc19e4a1abe

Reviewed by: Matt Ahrens <matt@delphix.com>
Reviewed by: Brad Lewis <brad.lewis@delphix.com>
Reviewed by: Andriy Gapon <avg@FreeBSD.org>
Approved by: Dan McDonald <danmcd@joyent.com>
Author: Serapheim Dimitropoulos <serapheim@delphix.com>

9442 decrease indirect block size of spacemaps

Updates to indirect blocks of spacemaps can contribute significantly to
write inflation. Therefore we want to reduce the indirect block size of
spacemaps from 128K to 16K.

illumos/illumos-gate@221813c13b43ef48330b03725e00edee85108cf1

Reviewed by: Serapheim Dimitropoulos <serapheim.dimitro@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Albert Lee <trisk@forkgnu.org>
Reviewed by: Igor Kozhukhov <igor@dilos.org>
Approved by: Dan McDonald <danmcd@joyent.com>
Author: Matthew Ahrens <mahrens@delphix.com>

9512 zfs remap poolname@snapname coredumps

Only filesystems and volumes are valid "zfs remap" parameters: when passed
a snapshot name zfs_remap_indirects() does not handle the EINVAL returned
from libzfs_core, which results in failing an assertion and consequently
crashing.

illumos/illumos-gate@0b2e8253986c5c761129b58cfdac46d204903de1

Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: John Wren Kennedy <john.kennedy@delphix.com>
Reviewed by: Sara Hartse <sara.hartse@delphix.com>
Approved by: Matt Ahrens <mahrens@delphix.com>
Author: loli10K <ezomori.nozomu@gmail.com>

8115 parallel zfs mount

Mounting of filesystems in "filesystem/local" is done using `zfs mount -a`,
which mounts each filesystems serially. The bottleneck for each mount is
the I/O done to load metadata for each filesystem. As such, mounting
filesystems using a parallel algorithm should be a big win, and bring down
the runtime of "filesystem/local"'s start method.

illumos/illumos-gate@591e0e133f9980083db5d64ac33a30bcc3382ff7

Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: Brad Lewis <brad.lewis@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Paul Dagnelie <pcd@delphix.com>
Reviewed by: Prashanth Sreenivasa <pks@delphix.com>
Approved by: Matt Ahrens <mahrens@delphix.com>
Author: Sebastien Roy <seb@delphix.com>

9426 metaslab size can exceed offset addressable by spacemap

metaslab size can exceed offset addressable by spacemap. The vdev can
address up to 2^63 * SPA_MAXBLOCKSIZE (512). A metaslab can address up to
2^47 * 2^vdev_ashift. Therefore we may need to increase the number of
metaslabs so that the maximum metaslab size is capped at the amount that
can be addressed by the spacemap. This should happen in
vdev_metaslab_set_size().

illumos/illumos-gate@b4bf0cf0458759c67920a031021a9d96cd683cfe

Reviewed by: Paul Dagnelie <pcd@delphix.com>
Reviewed by: Matt Ahrens <matt@delphix.com>
Approved by: Dan McDonald <danmcd@joyent.com>
Author: Don Brady <don.brady@delphix.com>

9328 zap code can take advantage of c99
9329 panic in zap_leaf_lookup() due to concurrent zapification

illumos/illumos-gate@bf26014c5541b6119f34e0d95294b7f2eb105ac2

Reviewed by: Steve Gonczi <steve.gonczi@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: Brad Lewis <brad.lewis@delphix.com>
Approved by: Dan McDonald <danmcd@joyent.com>
Author: Matthew Ahrens <mahrens@delphix.com>

9403 assertion failed in arc_buf_destroy() when concurrently reading block with checksum error

This assertion (VERIFY) failure was reported when reading a block. Turns out
the problem is that if we get an i/o error (ECKSUM in this case), and there
are multiple concurrent ARC reads of the same block (from different clones),
then the ARC will put multiple buf's on the same ANON hdr, which isn't
supposed to happen, and then causes a panic when we try to arc_buf_destroy()
the buf.

illumos/illumos-gate@fa98e487a9619b7902f218663be219e787a57dad

Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Paul Dagnelie <pcd@delphix.com>
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Approved by: Matt Ahrens <mahrens@delphix.com>
Author: Matthew Ahrens <mahrens@delphix.com>

9443 panic when scrub a v10 pool

illumos/illumos-gate@bb1f424574ac8e08069d0ba993c2a41ffe796794

Reviewed by: Serapheim Dimitropoulos <serapheim.dimitro@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Andriy Gapon <avg@FreeBSD.org>
Reviewed by: Igor Kozhukhov <igor@dilos.org>
Approved by: Dan McDonald <danmcd@joyent.com>
Author: Matthew Ahrens <mahrens@delphix.com>

9237 "zpool add" fails for very large pools

illumos/illumos-gate@ae5ee1bd5e54409b4589217b160c8cf4b81e2fb8

Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Andriy Gapon <avg@FreeBSD.org>
Approved by: Matt Ahrens <mahrens@delphix.com>
Author: Alan Somers <asomers@gmail.com>

9421 zdb should detect and print out the number of "leaked" objects
9422 zfs diff and zdb should explicitly mark objects that are on the deleted queue

illumos/illumos-gate@20b5dafb425396adaebd0267d29e1026fc4dc413

Reviewed by: Matt Ahrens <matt@delphix.com>
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Approved by: Matt Ahrens <mahrens@delphix.com>
Author: Paul Dagnelie <pcd@delphix.com>

9194 mechanism to override ashift at pool creation time

illumos/illumos-gate@93a1902e519bf492c071b48ffb93e5c1c7b53fb9

Reviewed by: Serapheim Dimitropoulos <serapheim@delphix.com>
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Allan Jude <allanjude@freebsd.org>
Reviewed by: Igor Kozhukhov <igor@dilos.org>
Reviewed by: Richard Elling <Richard.Elling@RichardElling.com>
Approved by: Dan McDonald <danmcd@joyent.com>
Author: Matthew Ahrens <mahrens@delphix.com>

9434 Speculative prefetch is blocked by device removal code

illumos/illumos-gate@47b8d4b884f2dab4e4c6952b40fd740a8fc4ab32

Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Prashanth Sreenivasa <pks@delphix.com>
Approved by: Matt Ahrens <mahrens@delphix.com>
Author: Alexander Motin <mav@FreeBSD.org>

Add new files missed in r336991.

9102 zfs should be able to initialize storage devices

The first access to a disk block can incur a performance penalty on some
platforms (e.g. AWS's EBS, VMware VMDKs). Therefore it is recommended that
volumes be "thick provisioned", where supported by the platform (VMware).
Thick provisioning is time consuming and often is ignored. If the thick
provision step is omitted, customers will see suboptimal performance until
we have written to all parts of the LUN. ZFS should be able to initialize
any unused storage to remove any first-write penalty that exists.

illumos/illumos-gate@094e47e980b0796b94b1b8f51f462a64d246e516

Reviewed by: John Wren Kennedy <john.kennedy@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Approved by: Richard Lowe <richlowe@richlowe.net>

9256 zfs send space estimation off by > 10% on some datasets

illumos/illummos-gate@df477c0afa111b5205c872dab36dbfde391656de

Reviewed by: Matt Ahrens <matt@delphix.com>
Reviewed by: John Kennedy <john.kennedy@delphix.com>
Approved by: Richard Lowe <richlowe@richlowe.net>
Author: Paul Dagnelie <pcd@delphix.com>

9337 zfs get all is slow due to uncached metadata

This project's goal is to make read-heavy channel programs and zfs(1m)
administrative commands faster by caching all the metadata that they will
need in the dbuf layer. This will prevent the data from being evicted, so
that any future call to i.e. zfs get all won't have to go to disk (very
much).

illumos/illumos-gate@adb52d9262f45a04318fc6e188fe2b7f59d989a5

Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Thomas Caputi <tcaputi@datto.com>
Approved by: Richard Lowe <richlowe@richlowe.net>
Author: Matthew Ahrens <mahrens@delphix.com>

9236 nuke spa_dbgmsg

We should use zfs_dbgmsg instead of spa_dbgmsg. Or at least,
metaslab_condense() should call zfs_dbgmsg because it's important and rare
enough to always log. It's possible that the message in zio_dva_allocate()
would be too high-frequency for zfs_dbgmsg.

illumos/illumos-gate@21f7c81cc1156e9202ce3412d3ecaa697c3b2222

Reviewed by: Serapheim Dimitropoulos <serapheim.dimitro@delphix.com>
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Richard Elling <Richard.Elling@RichardElling.com>
Approved by: Richard Lowe <richlowe@richlowe.net>
Author: Matthew Ahrens <mahrens@delphix.com>

9192 explicitly pass good_writes to vdev_uberblock/label_sync

Currently vdev_label_sync and vdev_uberblock_sync take a zio_t and assume
that its io_private is a pointer to the good_writes count. They should
instead accept this argument explicitly.

illumos/illumos-gate@a3b5583021b7b45676bf1f0cc68adf7a97900b56

Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Approved by: Richard Lowe <richlowe@richlowe.net>
Author: Matthew Ahrens <mahrens@delphix.com>

9290 device removal reduces redundancy of mirrors

Mirrors are supposed to provide redundancy in the face of whole-disk failure
and silent damage (e.g. some data on disk is not right, but ZFS hasn't
detected the whole device as being broken). However, the current device
removal implementation bypasses some of the mirror's redundancy.

illumos/illumos-gate@3a4b1be953ee5601bab748afa07c26ed4996cde6

Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Prashanth Sreenivasa <pks@delphix.com>
Reviewed by: Sara Hartse <sara.hartse@delphix.com>
Reviewed by: Serapheim Dimitropoulos <serapheim@delphix.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed by: Tim Chase <tim@chase2k.com>
Approved by: Richard Lowe <richlowe@richlowe.net>
Author: Matthew Ahrens <mahrens@delphix.com>

9112 Improve allocation performance on high-end systems

On high-end systems running async sequential write workloads, especially
NUMA systems with flash or NVMe storage, one significant performance
bottleneck is selecting a metaslab to do allocations from. This process
can be parallelized, providing significant performance increases for
these workloads.

illumos/illumos-gate@f78cdc34af236a6199dd9e21376f4a46348c0d56

Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Serapheim Dimitropoulos <serapheim.dimitro@delphix.com>
Reviewed by: Alexander Motin <mav@FreeBSD.org>
Approved by: Gordon Ross <gwr@nexenta.com>
Author: Paul Dagnelie <pcd@delphix.com>

9238 ZFS Spacemap Encoding V2

The current space map encoding has the following disadvantages:
[1] Assuming 512 sector size each entry can represent at most 16MB for a segment.
This makes the encoding very inefficient for large regions of space.
[2] As vdev-wide space maps have started to be used by new features (i.e.
device removal, zpool checkpoint) we've started imposing limits in the
vdevs that can be used with them based on the maximum addressable offset
(currently 64PB for a top-level vdev).

The new remains backwards compatible with the old one. The introduced
two-word entry format, besides extending the limits imposed by the single-entry
layout, also includes a vdev field and some extra padding after its prefix.

The extra padding after the prefix should is reserved for future usage (e.g.
new prefixes for future encodings or new fields for flags). The new vdev field
not only makes the space maps more self-descriptive, but also opens the doors
for pool-wide space maps.

One final important note is that the number of bits used for vdevs is reduced
to 24 bits for blkptrs. That was decided as we don't know of any setups that
use more than 16M vdevs for the time being and
we wanted to fit the vdev field in the space map. In addition that gives us
some extra bits in dva_t.

illumos/illumos-gate@17f11284b49b98353b5119463254074fd9bc0a28

Reviewed by: Matt Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <gwilson@zfsmail.com>
Approved by: Gordon Ross <gwr@nexenta.com>
Author: Serapheim Dimitropoulos <serapheim@delphix.com>

9286 want refreservation=auto

When a ZFS volume is created with zfs create -V (but without -s), the
refreservation property is set to a value that is volsize plus the maximum
size of metadata. If refreservation is ever set to another value, it is
impossible to set it back to the automatically determined value. There are
other cases where refreservation may be wrong. These include receiving a
volume that was sent without properties and zfs clone.

We need:

zfs set refreservation=auto <volume>
zfs clone -o refreservation=auto <volume>

Each one would use the same function used by zfs create -V to determine the
proper value for refreservation.

illumos/illumos-gate@1c10ae76c0cb31326c320e7cef1d3f24a1f47125

Reviewed by: Allan Jude <allanjude@freebsd.org>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: John Kennedy <john.kennedy@delphix.com>
Reviewed by: Andy Stormont <astormont@racktopsystems.com>
Approved by: Richard Lowe <richlowe@richlowe.net>
Author: Mike Gerdts <mike.gerdts@joyent.com>

9189 Add debug to vdev_label_read_config when txg check fails

illumos/illumos-gate@b6bf6e1540f30bd97b8d6e2c21d95e17841e0f23

Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Prashanth Sreenivasa <pks@delphix.com>
Approved by: Matt Ahrens <mahrens@delphix.com>
Author: Pavel Zakharov <pavel.zakharov@delphix.com>

9284 arc_reclaim_thread has 2 jobs

`arc_reclaim_thread()` calls `arc_adjust()` after calling
`arc_kmem_reap_now()`; `arc_adjust()` signals `arc_get_data_buf()` to
indicate that we may no longer be `arc_is_overflowing()`.

The problem is, `arc_kmem_reap_now()` can take several seconds to
complete, has no impact on `arc_is_overflowing()`, but due to how the
code is structured, can impact how long the ARC will remain in the
`arc_is_overflowing()` state.

The fix is to use seperate threads to:

1. keep `arc_size` under `arc_c`, by calling `arc_adjust()`, which
improves `arc_is_overflowing()`

2. keep enough free memory in the system, by calling
`arc_kmem_reap_now()` plus `arc_shrink()`, which improves
`arc_available_memory()`.

illumos/illumos-gate@de753e34f9c399037936e8bc547d823bba9d4b0d

Reviewed by: Matt Ahrens <mahrens@delphix.com>
Reviewed by: Serapheim Dimitropoulos <serapheim@delphix.com>
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: Dan Kimmel <dan.kimmel@delphix.com>
Reviewed by: Paul Dagnelie <pcd@delphix.com>
Reviewed by: Dan McDonald <danmcd@joyent.com>
Reviewed by: Tim Kordas <tim.kordas@joyent.com>
Approved by: Garrett D'Amore <garrett@damore.org>
Author: Brad Lewis <brad.lewis@delphix.com>

9280 Assertion failure while running removal_with_ganging test with 4K devices

illumos/illumos-gate@243952c7eeef020886e3e2e3df99a513df40584a

Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: John Kennedy <john.kennedy@delphix.com>
Approved by: Garrett D'Amore <garrett@damore.org>
Author: Matt Ahrens <Matt.Ahrens@delphix.com>

9188 increase size of dbuf cache to reduce indirect block decompression

illumos/illumos-gate@268bbb2a2fa79c36d4695d13a595ba50a7754b76

With compressed ARC (6950) we use up to 25% of our CPU to decompress indirect
blocks, under a workload of random cached reads. To reduce this decompression
cost, we would like to increase the size of the dbuf cache so that more
indirect blocks can be stored uncompressed.

If we are caching entire large files of recordsize=8K, the indirect blocks
use 1/64th as much memory as the data blocks (assuming they have the same
compression ratio). We suggest making the dbuf cache be 1/32nd of all memory,
so that in this scenario we should be able to keep all the indirect blocks
decompressed in the dbuf cache. (We want it to be more than the 1/64th that
the indirect blocks would use because we need to cache other stuff in the
dbuf cache as well.)

In real world workloads, this won't help as dramatically as the example
above, but we think it's still worth it because the risk of decreasing
performance is low. The potential negative performance impact is that we
will be slightly reducing the size of the ARC (by ~3%).

Reviewed by: Dan Kimmel <dan.kimmel@delphix.com>
Reviewed by: Prashanth Sreenivasa <pks@delphix.com>
Reviewed by: Paul Dagnelie <pcd@delphix.com>
Reviewed by: Sanjay Nadkarni <sanjay.nadkarni@nexenta.com>
Reviewed by: Allan Jude <allanjude@freebsd.org>
Reviewed by: Igor Kozhukhov <igor@dilos.org>
Approved by: Garrett D'Amore <garrett@damore.org>
Author: George Wilson <george.wilson@delphix.com>

9321 arc_loan_compressed_buf() can increment arc_loaned_bytes by the wrong value

illumos/illumos-gate@9be12bd737714550277bd02b0c693db560976990

arc_loan_compressed_buf() increments arc_loaned_bytes by psize unconditionally
In the case of zfs_compressed_arc_enabled=0, when the buf is returned via
arc_return_buf(), if ARC_BUF_COMPRESSED(buf) is false, then arc_loaned_bytes
is decremented by lsize, not psize.

Switch to using arc_buf_size(buf), instead of psize, which will return
psize or lsize, depending on the result of ARC_BUF_COMPRESSED(buf).

Reviewed by: Matt Ahrens <matt@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Approved by: Garrett D'Amore <garrett@damore.org>
Author: Allan Jude <allanjude@freebsd.org>

9235 rename zpool_rewind_policy_t to zpool_load_policy_t

illumos/illumos-gate@5dafeea3ebd2dd77affc802bcb90f63faf01589f

We want to be able to pass various settings during import/open of a pool,
which are not only related to rewind. Instead of adding a new policy and
duplicate a bunch of code, we should just rename rewind_policy to a more
generic term like load_policy.

For instance, we'd like to set spa->spa_import_flags from the nvlist,
rather from a flags parameter passed to spa_import as in some cases we want
those flags not only for the import case, but also for the open case. One
such flag could be ZFS_IMPORT_MISSING_LOG (as used in zdb) which would
allow zfs to open a pool when logs are missing.

Reviewed by: Matt Ahrens <matt@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Author: Pavel Zakharov <pavel.zakharov@delphix.com>

9191 dump vdev tree to zfs_dbgmsg when spa load fails due to missing log devices

illumos/illumos-gate@ccef24b493bcbd146fcd6d8946666cae081470b6

Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Matt Ahrens <matt@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Author: Pavel Zakharov <pavel.zakharov@delphix.com>

9187 racing condition between vdev label and spa_last_synced_txg in vdev_validate

illumos/illumos-gate@d1de72cfa29ab77ff80e2bb0e668a6afa5bccaf0

ztest failed with uncorrectable IO error despite having the fix for #7163.
Both sides of the mirror have CANT_OPEN_BAD_LABEL, which also distinguishes
it from that issue.

Definitely seems like a racing condition between the vdev_validate and spa_sync:
1. Thread A (spa_sync): vdev label is updated to latest txg
2. Thread B (vdev_validate): vdev label's txg is compared to spa_last_synced_txg and is ahead.
3. Thread A (spa_sync): spa_last_synced_txg is updated to latest txg.

Solution: do not check txg in vdev_validate unless config lock is held.

Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Matt Ahrens <matthew.ahrens@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Author: Pavel Zakharov <pavel.zakharov@delphix.com>

Add files missed from r331695.

9166 zfs storage pool checkpoint

illumos/illumos-gate@8671400134a11c848244896ca51a7db4d0f69da4

The idea of Storage Pool Checkpoint (aka zpool checkpoint) deals with
exactly that.  It can be thought of as a “pool-wide snapshot” (or a
variation of extreme rewind that doesn’t corrupt your data).  It remembers
the entire state of the pool at the point that it was taken and the user
can revert back to it later or discard it.  Its generic use case is an
administrator that is about to perform a set of destructive actions to ZFS
as part of a critical procedure.  She takes a checkpoint of the pool before
performing the actions, then rewinds back to it if one of them fails or puts
the pool into an unexpected state.  Otherwise, she discards it.  With the
assumption that no one else is making modifications to ZFS, she basically
wraps all these actions into a “high-level transaction”.

Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: John Kennedy <john.kennedy@delphix.com>
Reviewed by: Dan Kimmel <dan.kimmel@delphix.com>
Approved by: Richard Lowe <richlowe@richlowe.net>
Author: Serapheim Dimitropoulos <serapheim.dimitro@delphix.com>

9213 zfs: sytem typo

illumos/illumos-gate@edc8ef7d921c96b23969898aeb766cb24960bda7

Reviewed by: C Fraire <cfraire@me.com>
Reviewed by: Andy Fiddaman <omnios@citrus-it.co.uk>
Approved by: Joshua M. Clulow <josh@sysmgr.org>
Author: Toomas Soome <tsoome@me.com>

9084 spa_*_ashift must ignore spare devices

illumos/illumos-gate@b037f3dbd69cef4a7ffd576ad33e07bfaf0b1e84

Reviewed by: Prashanth Sreenivasa <pks@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Approved by: Dan McDonald <danmcd@joyent.com>
Author: Prakash Surya <prakash.surya@delphix.com>

8484 Implement aggregate sum and use for arc counters

In pursuit of improving performance on multi-core systems, we should
implements fanned out counters and use them to improve the performance of
some of the arc statistics. These stats are updated extremely frequently,
and can consume a significant amount of CPU time.

Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Approved by: Dan McDonald <danmcd@joyent.com>
Author: Paul Dagnelie <pcd@delphix.com>

9164 assert: newds == os->os_dsl_dataset

illumos/illumos-gate@5f5913bb83405db87f982abee80162a479d363af
https://github.com/illumos/illumos-gate/commit/5f5913bb83405db87f982abee80162a479d363af

https://www.illumos.org/issues/9164
  This issue has been reported by Alan Somers as
  https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=225877

  dmu_objset_refresh_ownership() first disowns a dataset (and releases
  it) and then owns it again. There is an assert that the new dataset
  object is the same as the old dataset object.  When running ZFS Test
  Suite on FreeBSD we see this panic from zpool_upgrade_007_pos test:

  panic: solaris assert: newds == os->os_dsl_dataset (0xfffff80045f4c000
  == 0xfffff80021ab4800)

  I see that the old dataset has dsl_dataset_evict_async() pending in
  ds_dbu.dbu_tqent and its ds_dbuf is NULL.

Reviewed by: Matt Ahrens <mahrens@delphix.com>
Reviewed by: Don Brady <don.brady@delphix.com>
Approved by: Richard Lowe <richlowe@richlowe.net>
Author: Andriy Gapon <avg@FreeBSD.org>

8984 fix for 6764 breaks ACL inheritance

illumos/illumos-gate@e9bacc6d1a71ea3f7082038b2868de8c4dd98bdc
https://github.com/illumos/illumos-gate/commit/e9bacc6d1a71ea3f7082038b2868de8c4dd98bdc

https://www.illumos.org/issues/8984
  Consider a directory configured as:
  drwx-ws---+ 2 henson cpp 3 Jan 23 12:35 dropbox/
  user:henson:rwxpdDaARWcC--:f-i----:allow
  owner@:--------------:f-i----:allow
  group@:--------------:f-i----:allow
  everyone@:--------------:f-i----:allow
  owner@:rwxpdDaARWcC--:-di----:allow
  group:cpp:-wx-----------:-------:allow
  owner@:rwxpdDaARWcC--:-------:allow
  A new file created in this directory ends up looking like:
  rw-r--r-+ 1 astudent cpp 0 Jan 23 12:39 testfile
  user:henson:rw-pdDaARWcC--:------I:allow
  owner@:--------------:------I:allow
  group@:--------------:------I:allow
  everyone@:--------------:------I:allow
  owner@:rw-p--aARWcCos:-------:allow
  group@:r-----a-R-c--s:-------:allow
  everyone@:r-----a-R-c--s:-------:allow
  with extraneous group@ and everyone@ entries allowing read access that
  shouldn't exist.
  Per Albert Lee on the zfs mailing list:
  "aclinherit=passthrough/passthrough-x should still
  ignore the requested mode when an inheritable ACE for owner@ group@,
  or everyone@ is present in the parent directory.
  It appears there was an oversight in my fix for
  https://www.illumos.org/issues/6764 which made calling zfs_acl_chmod
  from zfs_acl_inherit unconditional. I think the parent ACL check for
  aclinherit=passthrough needs to be reintroduced in zfs_acl_inherit."
  We have a large number of faculty who use dropbox directories like the example
  to have students submit projects. All of these directories are now allowing

Reviewed by: Sam Zaydel <szaydel@racktopsystems.com>
Reviewed by: Paul B. Henson <henson@acm.org>
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Approved by: Matthew Ahrens <mahrens@delphix.com>
Author: Dominik Hassler <hadfl@omniosce.org>

8940 Sending an intra-pool resumable send stream may result in EXDEV

illumos/illumos-gate@544132fce3fa6583f01318f9559adc46614343a7

"zfs send -t <token>" for an incremental send should be able to resume
successfully when sending to the same pool: a subtle issue in
zfs_iter_children() doesn't currently allow this.

Because resuming from a token requires "guid" -> "dataset" mapping
(guid_to_name()), we have to walk the whole hierarchy to find the right
snapshots to send.
When resuming an incremental send both source and destination live in the
same pool and have the same guid: this is where zfs_iter_children() gets
confused and picks up the wrong snapshot, so we end up trying to send an
incremental "destination@snap1 -> source@snap2" stream instead of
"source@snap1 -> source@snap2": this fails with an "Invalid cross-device
link" (EXDEV) error.

Reviewed by: Paul Dagnelie <pcd@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Approved by: Hans Rosenfeld <rosenfeld@grumpf.hope-2000.org>
Author: loli10K <ezomori.nozomu@gmail.com>

9080 recursive enter of vdev_indirect_rwlock from vdev_indirect_remap()

illumos/illumos-gate@bdfded42e66b9fc1395ff2401aa2952f7c44ae34

A scenario came up where a callback executed by vdev_indirect_remap() on a vdev, calls
vdev_indirect_remap() on the same vdev and tries to reacquire vdev_indirect_rwlock that
was already acquired from the first call to vdev_indirect_remap(). The specific scenario,
is that we want to remap a block pointer that is snapshoted but its dataset's remap_deadlist
is not cached. So in order to add it we issue a read through a vdev_indirect_remap() on the
same vdev, which brings up the aforementioned issue.

Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Approved by: Hans Rosenfeld <rosenfeld@grumpf.hope-2000.org>
Author: Serapheim Dimitropoulos <serapheim.dimitro@delphix.com>

Missed pieces of r329799.

9079 race condition in starting and ending condesing thread for indirect vdevs

illumos/illumos-gate@667ec66f1b4f491d5e839644e0912cad1c9e7122

The timeline of the race condition is the following:
[1] Thread A is about to finish condesing the first vdev in spa_condense_indirect_thread(),
so it calls the spa_condense_indirect_complete_sync() sync task which sets the
spa_condensing_indirect field to NULL. Waiting for the sync task to finish, thread A
sleeps until the txg is done. When this happens, thread A will acquire spa_async_lock
and set spa_condense_thread to NULL.
[2] While thread A waits for the txg to finish, thread B which is running spa_sync() checks
whether it should condense the second vdev in vdev_indirect_should_condense() by checking
the spa_condensing_indirect field which was set to NULL by spa_condense_indirect_thread()
from thread A. So it goes on and tries to spawn a new condensing thread in
spa_condense_indirect_start_sync() and the aforementioned assertions fails because thread A
has not set spa_condense_thread to NULL (which is basically the last thing it does before
returning).

The main issue here is that we rely on both spa_condensing_indirect and spa_condense_thread to
signify whether a condensing thread is running. Ideally we would only use one throughout the
codebase. In addition, for managing spa_condense_thread we currently use spa_async_lock which
basically tights condensing to scrubing when it comes to pausing and resuming those actions
during spa export.

Reviewed by: Matt Ahrens <mahrens@delphix.com>
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Approved by: Hans Rosenfeld <rosenfeld@grumpf.hope-2000.org>
Author: Serapheim Dimitropoulos <serapheim@delphix.com>

r329793 | mav | 2018-02-22 04:21:03 +0200 (чт, 22 февр. 2018) | 58 lines

9075 Improve ZFS pool import/load process and corrupted pool recovery

illumos/illumos-gate@6f7938128a2c5e23f4b970ea101137eadd1470a1

Some work has been done lately to improve the debugability of the ZFS pool
load (and import) process. This includes:

https://www.illumos.org/issues/7638: Refactor spa_load_impl into several functions
https://www.illumos.org/issues/8961: SPA load/import should tell us why it failed
https://www.illumos.org/issues/7277: zdb should be able to print zfs_dbgmsg's

To iterate on top of that, there's a few changes that were made to make the
import process more resilient and crash free. One of the first tasks during the
pool load process is to parse a config provided from userland that describes
what devices the pool is composed of. A vdev tree is generated from that config,
and then all the vdevs are opened.

The Meta Object Set (MOS) of the pool is accessed, and several metadata objects
that are necessary to load the pool are read. The exact configuration of the
pool is also stored inside the MOS. Since the configuration provided from
userland is external and might not accurately describe the vdev tree
of the pool at the txg that is being loaded, it cannot be relied upon to safely
operate the pool. For that reason, the configuration in the MOS is read early
on. In the past, the two configurations were compared together and if there was
a mismatch then the load process was aborted and an error was returned.

The latter was a good way to ensure a pool does not get corrupted, however it
made the pool load process needlessly fragile in cases where the vdev
configuration changed or the userland configuration was outdated. Since the MOS
is stored in 3 copies, the configuration provided by userland doesn't have to be
perfect in order to read its contents. Hence, a new approach has been adopted:
The pool is first opened with the untrusted userland configuration just so that
the real configuration can be read from the MOS. The trusted MOS configuration
is then used to generate a new vdev tree and the pool is re-opened.

When the pool is opened with an untrusted configuration, writes are disabled
to avoid accidentally damaging it. During reads, some sanity checks are
performed on block pointers to see if each DVA points to a known vdev;
when the configuration is untrusted, instead of panicking the system if those
checks fail we simply avoid issuing reads to the invalid DVAs.

This new two-step pool load process now allows rewinding pools accross
vdev tree changes such as device replacement, addition, etc. Loading a pool
from an external config file in a clustering environment also becomes much
safer now since the pool will import even if the config is outdated and didn't,
for instance, register a recent device addition.

With this code in place, it became relatively easy to implement a
long-sought-after feature: the ability to import a pool with missing top level
(i.e. non-redundant) devices. Note that since this almost guarantees some loss
Of data, this feature is for now restricted to a read-only import.

Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Andrew Stormont <andyjstormont@gmail.com>
Approved by: Hans Rosenfeld <rosenfeld@grumpf.hope-2000.org>
Author: Pavel Zakharov <pavel.zakharov@delphix.com>

9075 Improve ZFS pool import/load process and corrupted pool recovery

illumos/illumos-gate@6f7938128a2c5e23f4b970ea101137eadd1470a1

Some work has been done lately to improve the debugability of the ZFS pool
load (and import) process. This includes:

https://www.illumos.org/issues/7638: Refactor spa_load_impl into several functions
https://www.illumos.org/issues/8961: SPA load/import should tell us why it failed
https://www.illumos.org/issues/7277: zdb should be able to print zfs_dbgmsg's

To iterate on top of that, there's a few changes that were made to make the
import process more resilient and crash free. One of the first tasks during the
pool load process is to parse a config provided from userland that describes
what devices the pool is composed of. A vdev tree is generated from that config,
and then all the vdevs are opened.

The Meta Object Set (MOS) of the pool is accessed, and several metadata objects
that are necessary to load the pool are read. The exact configuration of the
pool is also stored inside the MOS. Since the configuration provided from
userland is external and might not accurately describe the vdev tree
of the pool at the txg that is being loaded, it cannot be relied upon to safely
operate the pool. For that reason, the configuration in the MOS is read early
on. In the past, the two configurations were compared together and if there was
a mismatch then the load process was aborted and an error was returned.

The latter was a good way to ensure a pool does not get corrupted, however it
made the pool load process needlessly fragile in cases where the vdev
configuration changed or the userland configuration was outdated. Since the MOS
is stored in 3 copies, the configuration provided by userland doesn't have to be
perfect in order to read its contents. Hence, a new approach has been adopted:
The pool is first opened with the untrusted userland configuration just so that
the real configuration can be read from the MOS. The trusted MOS configuration
is then used to generate a new vdev tree and the pool is re-opened.

When the pool is opened with an untrusted configuration, writes are disabled
to avoid accidentally damaging it. During reads, some sanity checks are
performed on block pointers to see if each DVA points to a known vdev;
when the configuration is untrusted, instead of panicking the system if those
checks fail we simply avoid issuing reads to the invalid DVAs.

This new two-step pool load process now allows rewinding pools accross
vdev tree changes such as device replacement, addition, etc. Loading a pool
from an external config file in a clustering environment also becomes much
safer now since the pool will import even if the config is outdated and didn't,
for instance, register a recent device addition.

With this code in place, it became relatively easy to implement a
long-sought-after feature: the ability to import a pool with missing top level
(i.e. non-redundant) devices. Note that since this almost guarantees some loss
Of data, this feature is for now restricted to a read-only import.

Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Andrew Stormont <andyjstormont@gmail.com>
Approved by: Hans Rosenfeld <rosenfeld@grumpf.hope-2000.org>
Author: Pavel Zakharov <pavel.zakharov@delphix.com>

8942 zfs promote .../%recv should be an error

illumos/illumos-gate@add927f8c8d101e16c23eb9cd270be4fd7edf7d5

Reported on the ZFSonLinux https://github.com/zfsonlinux/zfs/issues/4843,
fixed by https://github.com/zfsonlinux/zfs/pull/6339:

If we are in the middle of an incremental zfs receive, the child .../%recv
will exist. If you concurrently run zfs promote .../%recv, it will "work",
but then zfs gets confused. For example, there's no obvious way to destroy
the containing filesystem (because it is now a clone of its invisible child).

Attempting to do this promote should be an error. We could fix this by
having zfs_ioc_promote() check if zc_name contains a %, similar to
zfs_ioc_rename().

Reviewed by: Paul Dagnelie <pcd@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Approved by: Dan McDonald <danmcd@joyent.com>
Author: loli10K <ezomori.nozomu@gmail.com>

8941 zpool add: assertion failed in get_replication() with nested interior VDEVs

illumos/illumos-gate@ac0215f4d618163d117a40fbf77a3f944852cb7b

When replacing a faulted device which was previously handled by a spare
multiple levels of nested interior VDEVs will be present in the pool
configuration: get_replication() needs to handle this situation gracefully
to let zpool add new devices to the pool

Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Andrew Stormont <andyjstormont@gmail.com>
Approved by: Dan McDonald <danmcd@joyent.com>
Author: loli10K <ezomori.nozomu@gmail.com>

8477 Assertion failed in vdev_state_dirty(): spa_writeable(spa)

illumos/illumos-gate@f4c1745bd6c9829a05ecec15759ede7757100ab5

Illumos 4080 allows "zpool clear" to work on readonly pools: i don't think
this is the intended behaviour, we shouldn't be allowed to clear readonly
pools. Probably.

A fix is already in the ZFS on Linux repository to addess this issue:
https://github.com/zfsonlinux/zfs/commit/92e43c17188d47f47b69318e4884096dec380e36

Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Approved by: Dan McDonald <danmcd@joyent.com>
Author: loli10K <ezomori.nozomu@gmail.com>

8408 dsl_props_set_sync_impl() does not handle nested nvlists correctly

illumos/illumos-gate@85723e5eec42f46dbfdb4c09b9e1ed66501d1ccf

When iterating over the input nvlist in dsl_props_set_sync_impl() when we
don't preserve the nvpair name before looking up ZPROP_VALUE, so when we
later go to process it nvpair_name() is always "value" instead of the actual
property name.

This results in a couple of bugs in the recv code:

- received properties are not restored correctly when failing to receive
an incremental send stream
- received properties are not completely replaced by the new ones when
successfully receiving an incremental send stream

This was discovered on ZFS on Linux (fixed in
https://github.com/zfsonlinux/zfs/commit/5f1346c29997dd4e02acf4c19c875d5484f33b1e)

Reviewed by: Paul Dagnelie <pcd@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Approved by: Dan McDonald <danmcd@joyent.com>
Author: loli10K <ezomori.nozomu@gmail.com>

9036 zfs: duplicate 'const' declaration specifier

illumos/illumos-gate@f02c28e434fb4d81d95122bab187fb43d4f19c2a

Reviewed by: Yuri Pankov <yuripv@yuripv.net>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Approved by: Dan McDonald <danmcd@joyent.com>
Author: Toomas Soome <tsoome@me.com>

9035 zfs: this statement may fall through

illumos/illumos-gate@46ac8fdfc5a1f9d8240c79a6ae5b2889cbe83553

Reviewed by: Yuri Pankov <yuripv@yuripv.net>
Reviewed by: Andy Fiddaman <omnios@citrus-it.co.uk>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Approved by: Dan McDonald <danmcd@joyent.com>
Author: Toomas Soome <tsoome@me.com>

8962 zdb should work on non-idle pools

illumos/illumos-gate@e144c4e6c90e7d4dccaad6db660ee42b6e7ba04f

Currently `zdb` consistently fails to examine non-idle pools as it fails
during the `spa_load()` process. The main problem seems to be that
`spa_load_verify()` fails as can be seen below:

$ sudo zdb -d -G dcenter
    zdb: can't open 'dcenter': I/O error

ZFS_DBGMSG(zdb):
    spa_open_common: opening dcenter
    spa_load(dcenter): LOADING
    disk vdev '/dev/dsk/c4t11d0s0': best uberblock found for spa dcenter. txg 40824950
    spa_load(dcenter): using uberblock with txg=40824950
    spa_load(dcenter): UNLOADING
    spa_load(dcenter): RELOADING
    spa_load(dcenter): LOADING
    disk vdev '/dev/dsk/c3t10d0s0': best uberblock found for spa dcenter. txg 40824952
    spa_load(dcenter): using uberblock with txg=40824952
    spa_load(dcenter): FAILED: spa_load_verify failed [error=5]
    spa_load(dcenter): UNLOADING

This change makes `spa_load_verify()` a dryrun when ran from `zdb`. This is
done by creating a global flag in zfs and then setting it in `zdb`.

Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Andy Stormont <astormont@racktopsystems.com>
Approved by: Dan McDonald <danmcd@joyent.com>
Author: Pavel Zakharov <pavel.zakharov@delphix.com>