MFV r288063: make dataset property de-registration operation O(1)
A change to a property on a dataset must be propagated to its descendants
in case that property is inherited. For datasets whose information is
not currently loaded into memory (e.g. a snapshot that isn't currently
mounted), there is nothing to do; the property change will take effect
the next time that dataset is loaded. To handle updates to datasets that
are in-core, ZFS registers a callback entry for each property of each
loaded dataset with the dsl directory that holds that dataset. There
is a dsl directory associated with each live dataset that references
both the live dataset and any snapshots of the live dataset. A property
change is effected by doing a traversal of the tree of dsl directories
for a pool, starting at the directory sourcing the change, and invoking
these callbacks.
The current implementation both registers and de-registers properties
individually for each loaded dataset. While registration for a property is
O(1) (insert into a list), de-registration is O(n) (search list and then
remove). The 'n' for de-registration, however, is not limited to the size
(number of snapshots + 1) of the dsl directory. The eviction portion
of the life cycle for the in core state of datasets is asynchronous,
which allows multiple copies of the dataset information to be in-core
at once. Only one of these copies is active at any time with the rest
going through tear down processing, but all copies contribute to the
cost of performing a dsl_prop_unregister().
One way to create multiple, in-flight copies of dataset information
is by performing "zfs list" operations from multiple threads
concurrently. In-core dataset information is loaded on demand and then
evicted when reference counts drops to zero. For datasets that are not
mounted, there is no persistent reference count to keep them resident.
So, a list operation will load them, compute the information required to
do the list operation, and then evict them. When performing this operation
from multiple threads it is possible that some of the in-core dataset
information will be reused, but also possible to lose the race and load
the dataset again, even while the same information is being torn down.
Compounding the performance issue further is a change made for illumos
issue 5056 which made dataset eviction single threaded. In environments
using automation to manage ZFS datasets, it is now possible to create
enough of a backlog of dataset evictions to consume excessive amounts
of kernel memory and to bog down the system.
The fix employed here is to make property de-registration O(1). With this
change in place, it is hoped that a single thread is more than sufficient
to handle eviction processing. If it isn't, the problem can be solved
by increasing the number of threads devoted to the eviction taskq.
sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dsl_dataset.c
sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dsl_dir.c:
sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dsl_prop.c:
sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/dsl_dataset.h:
sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/dsl_dir.h:
sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/dsl_prop.h:
Associate dsl property callback records with both the
dsl directory and the dsl dataset that is registering the
callback. Both connections are protected by the dsl directory's
"dd_lock".
When linking callbacks into a dsl directory, group them by
the property type. This helps reduce the space penalty for the
double association (the property name pointer is stored once
per dsl_dir instead of in each record) and reduces the number of
strcmp() calls required to do callback processing when updating
a single property. Property types are stored in a linked list
since currently ZFS registers a maximum of 10 property types
for each dataset.
Note that the property buckets/records associated with a dsl
directory are created on demand, but only freed when the dsl
directory is freed. Given the static nature of property types
and their small number, there is no benefit to freeing the few
bytes of memory used to represent the property record earlier.
When a property record becomes empty, the dsl directory is either
going to become unreferenced a little later in this thread of
execution, or there is a high chance that another dataset is
going to be loaded that would recreate the bucket anyway.
Replace dsl_prop_unregister() with dsl_prop_unregister_all().
All callers of dsl_prop_unregister() are trying to remove
all property registrations for a given dsl dataset anyway. By
changing the API, we can avoid doing any lookups of callbacks
by property type and just traverse the list of all callbacks
for the dataset and free each one.
sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dmu_objset.c:
sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_vfsops.c:
Replace use of dsl_prop_unregister() with the new
dsl_prop_unregister_all() API.
illumos/illumos-gate@03bad06fbb261fd4a7151a70dfeff2f5041cce1f
Author: Justin Gibbs <gibbs@scsiguy.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Approved by: Dan McDonald <danmcd@omniti.com>
Illumos issue:
6171 dsl_prop_unregister() slows down dataset eviction
https://www.illumos.org/issues/6171
bsd.obj.mk handles the needs fine. When an objdir exists it will
just rm -Rf the objdir. When it does not exist though it will
call 'clean' and 'cleandepend', which properly recurse in bsd.progs.mk.
Implement support for reading USB quirks from the kernel environment.
Refer to the usb_quirk(4) manual page for more details on how to use
this new feature.
Fix running make in src directories without a Makefile giving confusing errors.
This fixes the following errors:
make: don't know how to make bsd.README. Stop
make: don't know how to make auto.obj.mk. Stop
This is easily seen in sys/dev/*.
The new behavior is now the expected output:
make: no target to make.
This would happen as MAKESYSPATH (.../share/mk) is auto added to the -I list.
Any directory where make is ran in the src tree that has no local Makefile
would then try executing the target in share/mk/Makefile, which by default
was to build the first entry in FILES. Of course, because bsd.README and
auto.obj.mk are not in the current directory the error is shown.
This check only works for bmake, but I will still MFC it with an extra
'!defined(.PARSEDIR) ||' guard for stable/10.
adrian [Thu, 24 Sep 2015 17:23:41 +0000 (17:23 +0000)]
Fix up error path handling after the recent churn.
* Don't free the mbuf in the tx path - it uses the transmit path now,
so the caller frees the mbuf.
* Don't decrement the node ref upon error - that's up to the caller to
do as well.
This avoids needing a large boot partition / file system in order to
accommodate multiple kernels, and provides consistency with userland
debug. This also simplifies the process of moving kernel debug files
to a separate package and installing them on demand.
In addition, change kernel debug file extension to .debug, to match
userland debug files.
When using the supported kernel installation method the
/usr/lib/debug/boot/kernel directory will be renamed (to kernel.old)
as is done with /boot/kernel.
Developers wishing to maintain the historical behavior of installing
debug files in /boot/kernel/ can set KERN_DEBUGDIR="" in src.conf(5).
Reviewed by: bdrewery, brooks, imp, markj
Relnotes: yes
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D1006
Fix most cases of bsd.progs.mk running duplicate or missing commands.
This mostly fixes an interaction with bsd.test.mk with PROGS and SCRIPTS.
This was most notable with 'make clean' and 'make install', which r281055
and r272055 attempted to address but were inadequate.
It also addresses similar issues in bsd.progs.mk when not using bsd.test.mk.
This also fixes cases of NOT running commands in the parent when using
bsd.progs.mk:
- 'make clean' was not run for the main process for Makefiles which had both
FILES and SUBDIR but no PROGS or SCRIPTS. This usually was just a
leftover Kyuafile.auto. One such example is usr.bin/bmake/tests/sysmk/t1/2.
- 'make obj' was not running in the current directory with bsd.test.mk due
to early inclusion of bsd.subdir.mk. This was not really a problem due to
the SUBDIRS using 'mkdir -p' for their objdirs.
There were subtle bugs causing this wrong behavior:
1. bsd.progs.mk needs to set SCRIPTS to empty when recursing to avoid
the sub-makes from installing, cleaning or building the SCRIPTS;
only the parent make should be doing this. r281055 effectively did
the same but wasn't enough.
2. CLEANFILES may contain (especially from *.test.mk) files which only
the parent should clean, such as from FILES and SCRIPTS. To resolve
sub-makes also cleaning these, reset CLEANFILES and CLEANDIRS in the
children before including bsd.prog.mk. A tempting alternative would be
to only handle CLEANFILES in the parent but then the child bsd.prog.mk
CLEANFILES of per-PROGS wouldn't be setup.
3. bsd.subdir.mk was included too soon in bsd.test.mk. It needs to be
included after bsd.prog.mk as the SCRIPTS logic is short-circuitted if
'install:' is already defined (which bsd.subdir.mk does). There is
actually no need to include bsd.subdir.mk from bsd.test.mk as bsd.prog.mk
and bsd.obj.mk will do so in the proper order. The description in r257095
covers this for FILES and was fixed differently, though changing the
handling of target(install) in bsd.prog.mk may make sense after more
research.
4. bsd.progs.mk had extra logic to handle recursing SCRIPTS if PROGS was
empty, which isn't its business to be doing. SCRIPTS is handled fine
by bsd.prog.mk. This mostly reverts and reworks the fix in r259209 and
partially reverts r272055.
5. bsd.progs.mk has no need to depend 'all:' on SCRIPTS and FILES. These
are handled by bsd.prog.mk/bsd.files.mk fine. This also partially reverts
r272055.
6. bsd.progs.mk was not drop-in safe for bsd.prog.mk. Move the PROGS
check from r273186 to allow it to be used safely.
Specific tested cases:
SCRIPTS:no PROGS:no FILES:yes SUBDIR:yes
usr.bin/bmake/tests/sysmk/t1/2
A full buildworld/installworld/clean comparison with mtree was also done.
The only relevant difference was the new fixed behavior of removing
Kyuafile.auto from the objdir in 'clean'.
Converting SCRIPTS to be a special case FILES group will make this less
fragile and is being explored.
One known remaining issue is 'cleandepend' removing the tags files for
every recursive call.
Note that the 'make clean' command runs for the CURDIR last, which can make
it appear to run multiple times when cleaning in tests/, but each command is
for a SUBDIR returning up the chain. This is purely bsd.subdir.mk behavior.
META_MODE: Fix 2nd build causing everything to rebuild due to changed CC.
In the first build the TOOLSDIR does not exit yet which causes CC to default
to the sys.mk version. Once a TOOLSDIR is created during the build though,
this logic was changing CC to ${TOOLSDIR}/usr/bin/cc even though that file
did not exist. Thus CC went from 'cc' to '/usr/bin/cc' which forced a
rebuild of everything while using the same compiler. Check that TOOLSDIR is
not empty to avoid this. If there is actually a TOOLSDIR cc then it will be
used and properly rebuild.
META_MODE: Follow-up r287865 and define CCACHE_DIR as realpath'd.
Filemon(4) will record paths as they are seen, not as fully resolved. make(1)
will take the .MAKE.META.IGNORE_PATHS values and resolve them. This creates
a discrepancy if CCACHE_DIR is a symlink. Fix this by ensuring it is
resolved for its actual usage.
We allow to modify only few fields in mode pages now, but still it is
not good if they unexpectedly change during failover. Also this fixes
reporting of "Mode parameters changed" UAs on secondary node.
Make HA peers announce their parameters on connect.
HA protocol requires strict version, parameters and configuration match.
Differences there may cause full set of problems up to kernel panic.
To avoid that, validate peer parameters on connect, and abort connection
immediately if some mismatch detected.
jeff [Tue, 22 Sep 2015 23:57:52 +0000 (23:57 +0000)]
Some refactoring of the buf/vm interface.
- Eliminate bogus page replacement that is inconsistently applied in the
invalidation loop in brelse. This has been a no-op in modern times as
biodone() is responsible for cleaning up after bogus pages. This
would've spammed the console with printfs at a minimum.
- Allow the compiler and human readers alike to reason about allocbuf()
by splitting it into constituent parts.
- Separate the VM manipulating and buf manipulating code in brelse() and
bufdone() so that the intentions are clear. This makes it evident that
there are several duplicated buf pages loops that will be consolidated
at a later time.
Call ast when handling irq from userspace, otherwise we could miss
reschedule. Right now arm_cpu_intr() does critical_exit() as the last
action, so the impact is not serious.
Remove duplicated interrupt disable in restore_registers macro, when
returning to usermode. The do_ast macro disabled interrupts for us.
Reviewed by: andrew
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D3714
Change vm_page_unwire() such that it (1) accepts PQ_NONE as the specified
queue and (2) returns a Boolean indicating whether the page's wire count
transitioned to zero.
Exploit this change in vfs_vmio_release() to avoid pointlessly enqueueing
a page that is about to be freed.
(An earlier version of this change was developed by attilio@ and kmacy@.
Any errors in this version are my own.)
addr2line: skip CUs lacking debug info instead of bailing out
Some binaries (such as the FreeBSD kernel) contain a mixture of CUs
with and without debug information. Previously translate() exited upon
encountering a CU without debug information. Instead, just move on to
the next CU.
Reported by: royger
Reviewed by: royger
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D3712
This limits CLEANFILES removal to just bsd.obj.mk now and removes the need
for NOPATH_FILES.
This reverts r96529 which was done due to the command line being too long
for libc. Since then all architectures now use 256k for ARG_MAX (r170102).
Regardless of that, the libc CLEANFILES is only 72k now. Others
may be larger but not likely to hit the limit. If needed, we can improve
the bsd.obj.mk clean: target to split up the list via bmake features.
This also removes some workarounds that are no longer needed.
- a.out removal
- OBJS.tmp, which has not been needed since r117080.
- *.so, which has not been needed since a .so->.So rename in r42450.
This also fixes STATICOBJS and SHLIB_LINK not being in the .NOPATH list.
adrian [Tue, 22 Sep 2015 02:57:18 +0000 (02:57 +0000)]
Begin fleshing out basic power-on / power-off and A-MPDU TX support.
* Add a new method to control NIC poweron / network-sleep / power off;
* Add in A-MPDU TX negotiation support, but comment it out because it
does break TX traffic;
* blank out the tx buffer before sending a firmware message, just in case;
* go into network-sleep once associated;
TODO:
* figure out why ampdu negotiation isn't working and breaking TX traffic,
then enable it.
Fix installation of 32bit libraries after r288074.
FILES is not used when LIBRARIES_ONLY is set, which is used to build and
install the lib32 sysroot. All of the csu files do quality as "libraries"
for this case so just undefine LIBRARIES_ONLY.
This is still better than the previous realinstall handling as it does
not hook into META_MODE properly.
des [Mon, 21 Sep 2015 17:26:35 +0000 (17:26 +0000)]
Restore the upstream (and documented) behavior of searching for modules
both in /usr/lib and /usr/local/lib, thus simplifying the use of modules
from ports, without breaking the compat32 case again.
Ensure that maxproc does not exceed pid_max, at the time of boot.
Changes to kern.pid_max mib after the boot can break this relation.
The maxfiles value was calculated by the MAXFILES formula based on
maxproc value, but this change decouples them, and MAXFILES now
references maxusers. Without manual tuning, the maxfiles default
value remains as it was prior to this commit. But for systems which
have tuned maxproc and rely on maxfiles to adjust, additional
reconfiguration is needed.
Reported by: rwatson
Reviewed by: emaste
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
In the UDP over IPv6 implementation several cases are using the wrong protocol,
e.g., based on wrong "next header" assumptions (which does not have to point to
the upper layer protocol), or using hard-coded UDP instead of UDP or UDP-Lite
possibly switching protocols. Fix those cases for UDP-Lite to work correctly.
https://github.com/illumos/illumos-gate/commit/c546f36aa898d913ff77674fb5ff97f15b2e08b4
https://www.illumos.org/issues/6220
5408 introduced a memleak in l2arc, namely the member b_thawed gets leaked when
an arc_hdr is realloced from full to l2only.
Reviewed by: Saso Kiselkov <saso.kiselkov@nexenta.com>
Reviewed by: Simon Klinkert <simon.klinkert@gmail.com>
Reviewed by: George Wilson <george@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Author: Arne Jansen <sensille@gmx.net>
Unify nd6 state switching by using newly-created nd6_llinfo_setstate()
function. The change is mostly mechanical with the following exception:
Last piece of nd6_resolve_slow() was refactored: ND6_LLINFO_PERMANENT
condition was removed as always-true, explicit ND6_LLINFO_NOSTATE ->
ND6_LLINFO_INCOMPLETE state transition was removed as duplicate.
Add "stale" timer back to nd6_cache_lladdr().
Setting timer was accidentally removed in r276844 due to misleading
comment on its meaningless. Add it back to restore proper behaviour.
adrian [Mon, 21 Sep 2015 02:30:22 +0000 (02:30 +0000)]
Convert if_rsu to use a deferred transmit task rather than using rsu_start()
to do it directly.
Ensure that we re-queue starting transmit upon TX completion.
This solves two issues:
* It stops tx stalls - before this, if the transmit path filled the
mbuf queue then it'd never start another transmit.
* It enforces ordering - this is very required for 802.11n which
requires frames to be transmitted in the order they're queued.
Since everything remotely involved in USB has an unlock/thing/relock
pattern with that mutex, the only way to guarantee TX ordering is
to 100% defer it into a separate thread.
This now survives an iperf test and gets a reliable 30mbit/sec.
adrian [Mon, 21 Sep 2015 02:12:01 +0000 (02:12 +0000)]
Drain the mbuf queue upon rsu_stop().
Correctly (I hope!) remove net80211 references before doing so.
Just doing a dumb mbufq drain isn't enough.
If enough traffic occurs and the mbuf queue fills up then transmit
stalls (which I'm not fixing in this commit!) but then the mbuf queue
stays full until the driver is removed. There's also the net80211
node refcounting leak.
This just ensures that during rsu_stop and detach the mbuf queue
is purged (and references!) so the queue-full situation can be
recovered from.