Metadata-intensive workloads can cause the ARC to become permanently
filled with dnode_t objects as they're pinned by the VFS layer.
Subsequent data-intensive workloads may only benefit from about
25% of the potential ARC (arc_c_max - arc_meta_limit).
In order to help track metadata usage more precisely, the other_size
metadata arcstat has replaced with dbuf_size, dnode_size and bonus_size.
The new zfs_arc_dnode_limit tunable, which defaults to 10% of
zfs_arc_meta_limit, defines the minimum number of bytes which is desirable
to be consumed by dnodes. Attempts to evict non-metadata will trigger
async prune tasks if the space used by dnodes exceeds this limit.
The new zfs_arc_dnode_reduce_percent tunable specifies the amount by
which the excess dnode space is attempted to be pruned as a percentage of
the amount by which zfs_arc_dnode_limit is being exceeded. By default,
it tries to unpin 10% of the dnodes.
The problem of dnode metadata pinning was observed with the following
testing procedure (in this example, zfs_arc_max is set to 4GiB):
- Create a large number of small files until arc_meta_used exceeds
arc_meta_limit (3GiB with default tuning) and arc_prune
starts increasing.
- Create a 3GiB file with dd. Observe arc_mata_used. It will still
be around 3GiB.
- Repeatedly read the 3GiB file and observe arc_meta_limit as before.
It will continue to stay around 3GiB.
With this modification, space for the 3GiB file is gradually made
available as subsequent demands on the ARC are made. The previous behavior
can be restored by setting zfs_arc_dnode_limit to the same value as the
zfs_arc_meta_limit.
Signed-off-by: Tim Chase <tim@chase2k.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #4345
Issue #4512
Issue #4773
Closes #4858
pstef [Sat, 11 Aug 2018 19:20:06 +0000 (19:20 +0000)]
indent(1): revert r334640 and r334632
While STACKSIZE macro is indeed problematic on some systems, the commits
were wrong to shrink il[] and cstk[], because they need to be of the same
size as p_stack[] as they're accessed with the same index ps.tos.
kp [Sat, 11 Aug 2018 16:34:30 +0000 (16:34 +0000)]
pf: Fix 'set skip on' for groups
The pfi_skip_if() function sometimes caused skipping of groups to work,
if the members of the group used the groupname as a name prefix.
This is often the case, e.g. group lo usually contains lo0, lo1, ...,
but not always.
Rather than relying on the name explicitly check for group memberships.
netchild [Sat, 11 Aug 2018 16:29:54 +0000 (16:29 +0000)]
- Correct the description when jobs are executed related to load avg
to match reality (slightly different to what was submitted in the
PR: use english word instead of math-symbol).
- Wrap the corresponding part to below 80 characters per line.
netchild [Sat, 11 Aug 2018 16:12:23 +0000 (16:12 +0000)]
Re-enable reading byte swapped NFS_MAGIC dumps.
Fix bug introduced in r98542: previously to this revision the byte-swapped
value was compared at this place. The current check is in a conditional
section where the non-byte-swapped value was already checked to be not
the value which is checked again. As byte-swapping is activated afterwards,
it only makes sense if the byte-swapped value is checked.
Submitted by: Keith White <kwhite@site.uottawa.ca>
PR: 200059
MFC after: 1 month
Sponsored by: Essen Hackathon
netchild [Sat, 11 Aug 2018 13:18:19 +0000 (13:18 +0000)]
Add svnlite to places where svn is mentioned.
The Makefile part in the PR is solved already differently, so this
part is skipped form the PR The man page change change is slightly
changed to adapt to the way the Makefile works and to the spirit
of what is intended here.
Submitted by: Juan Ramón Molina Menor <info@juanmolina.eu>
PR: 194910
Sponsored by: Essen Hackathon
dim [Sat, 11 Aug 2018 10:42:12 +0000 (10:42 +0000)]
Pull in r338481 from upstream llvm trunk (by Chandler Carruth):
[x86] Fix a really subtle miscompile due to a somewhat glaring bug in
EFLAGS copy lowering.
If you have a branch of LLVM, you may want to cherrypick this. It is
extremely unlikely to hit this case empirically, but it will likely
manifest as an "impossible" branch being taken somewhere, and will be
... very hard to debug.
Hitting this requires complex conditions living across complex
control flow combined with some interesting memory (non-stack)
initialized with the results of a comparison. Also, because you have
to arrange for an EFLAGS copy to be in *just* the right place, almost
anything you do to the code will hide the bug. I was unable to reduce
anything remotely resembling a "good" test case from the place where
I hit it, and so instead I have constructed synthetic MIR testing
that directly exercises the bug in question (as well as the good
behavior for completeness).
The issue is that we would mistakenly assume any SETcc with a valid
condition and an initial operand that was a register and a virtual
register at that to be a register *defining* SETcc...
It isn't though....
This would in turn cause us to test some other bizarre register,
typically the base pointer of some memory. Now, testing this register
and using that to branch on doesn't make any sense. It even fails the
machine verifier (if you are running it) due to the wrong register
class. But it will make it through LLVM, assemble, and it *looks*
fine... But wow do you get a very unsual and surprising branch taken
in your actual code.
The fix is to actually check what kind of SETcc instruction we're
dealing with. Because there are a bunch of them, I just test the
may-store bit in the instruction. I've also added an assert for
sanity that ensure we are, in fact, *defining* the register operand.
=D
sevan [Sat, 11 Aug 2018 10:21:21 +0000 (10:21 +0000)]
Drop the ternary operator for calculating ssid display length in list_scan().
Regardless if a verbose scan is required or not, we'd still want to display the
full SSID name by default so use the IEE80211_NWID_LEN constant to set the
value to use instead.
cem [Sat, 11 Aug 2018 02:56:43 +0000 (02:56 +0000)]
stat(1): cache id->name resolution
When invoked on a large list of files, it is most common for a small number of
uids/gids to own most of the results.
Like ls(1), use pwcache(3) to avoid repeatedly looking up the same IDs.
Example microbenchmark and non-scientific results:
$ time (find /usr/src -type f -print0 | xargs -0 stat >/dev/null)
BEFORE:
3.62s user 5.23s system 102% cpu 8.655 total
3.47s user 5.38s system 102% cpu 8.647 total
AFTER:
1.23s user 1.81s system 108% cpu 2.810 total
1.43s user 1.54s system 107% cpu 2.754 total
Does this microbenchmark have any real-world significance? Until a use case
is demonstrated otherwise, I doubt it. Ordinarily I would be resistant to
optimizing pointless microbenchmarks in base utilities (e.g., recent totally
gratuitous changes to yes(1)). However, the pwcache(3) APIs actually
simplify stat(1) logic ever so slightly compared to the raw APIs they wrap,
so I think this is at worst harmless.
PR: 230491
Reported by: Thomas Hurst <tom AT hur.st>
Discussed with: gad@
Use cached feature info in spa_add_feature_stats()
Avoid issuing I/O to the pool when retrieving feature flags information.
Trying to read the ZAPs from disk means that zpool clear would hang if
the pool is suspended and recovery would require a reboot. To keep the
feature stats resident in memory, we hang a cached nvlist off of the
spa. It is built up from disk the first time spa_add_feature_stats() is
called, and refreshed thereafter using the cached feature reference
counts. spa_add_feature_stats() gets called at pool import time so we
can be sure the cached nvlist will be available if the pool is later
suspended.
Signed-off-by: Ned Bass <bass6@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3082
dim [Fri, 10 Aug 2018 19:57:55 +0000 (19:57 +0000)]
In r308100, an explicit -fexceptions flag was added for the C sources
from LLVM's libunwind, which end up in libgcc_eh.a and libgcc_s.so.
This is because the unwinder needs the unwinder data for its own
functions.
However, for the C++ sources in libunwind, -fexceptions is already the
default, and this can have the side effect of generating a reference to
__gxx_personality_v0, the so-called personality function, which is
normally provided by the C++ ABI library (libcxxrt or libsupc++).
If the reference ends up in the eventual libgcc_s.so, linking any
non-C++ programs against it will fail with "undefined reference to
`__gxx_personality_v0'".
Note that at high optimization levels, the reference is usually
optimized away, which is why we have never noticed this problem before.
With clang 7.0.0 though, higher optimization levels don't help anymore,
since the addition of address-significance tables [1] in
<https://reviews.llvm.org/rL337339>. Effectively, this always causes a
reference to __gxx_personality_v0.
After discussion with the upstream author of that change, it turns out
that we should compile libunwind sources with the -fno-exceptions
-funwind-tables flags instead. This ensures unwind tables are
generated, but no references to any personality functions are emitted.
cem [Fri, 10 Aug 2018 19:19:07 +0000 (19:19 +0000)]
Walk back r337554 while discussion continues
The idea was to get the uncontroversial mechanical change out of the way,
then get the meatier functional changes reviewed subsequently. I had not
realized that the immediately adjacent issue was addressed in a different
direction in r334506 (see Warner's guidance in D15592).
Discussion continues, trying to determine if there is a secondary issue
still[1] and how best to fix it. With 12-related activities coming up,
while that is ongoing, just take this back for now.
[1]: Shutdown-time eventhandler events fire normally during panic's reboot
path. Driver callbacks that attempt to issue and wait on interrupt-
completed IO may never complete, hanging the system. This is particularly
obnoxious in the shutdown/panic path, as the debugger cannot be entered
anymore and the hang prevents reboot restoring availability.
(There's nothing CAM-specific about this problem -- any shutdown
event-triggered driver could do something like this during panic. But most
NICs, etc. don't try to send spin-down commands at shutdown. ;-))
kevans [Fri, 10 Aug 2018 15:29:06 +0000 (15:29 +0000)]
boot tagging: minor fixes
msgbufinit may be called multiple times as we initialize the msgbuf into a
progressively larger buffer. This doesn't happen as of now on head, but it
may happen in the future and we generally support this. As such, only print
the boot tag if we've just initialized the buffer for the first time.
The boot tag also now has a newline appended to it for better visibility,
and has been switched to a normal printf, by requesto f bde, after we've
denoted that the msgbuf is mapped.
imp [Fri, 10 Aug 2018 15:16:41 +0000 (15:16 +0000)]
Update man page to include FreeBSD-specific details.
While this implements a standards-conforming C11 function, there's
implementation details the programmer needs to know. Include those
here. Make changes inspired by comments on the initial review as well,
though mostly this involves stealing the epoch verbage from
gettimeofday(2). Add myself to authors since I've now changed a
substantial amount of this man page.
imp [Fri, 10 Aug 2018 15:16:36 +0000 (15:16 +0000)]
Remove assert.h and commented out _DIAGASSERT.
Remove assert.h and _DIAGASSERT to create a paper-trail of changes
from NetBSD. Specifically didn't fix other style issues since I
don't want this to diverge from the NetBSD original too much and
that's too niggling a change to be worth future merge hassles.
imp [Fri, 10 Aug 2018 15:16:30 +0000 (15:16 +0000)]
Bring in timespce_get form NetBSD.
Bring in the functionality for timespec_get from NetBSD. I've lightly
edited the .c file to remove _DIAGASSERT because FreeBSD doesn't have
that functionality and the typical #define'ing it to assert isn't
right here. The man page is verbatim from NetBSD, but will be revised
as part of a larger cleanup of the time man pages (they are
inconsistent and vague in all the wrong places).
kevans [Fri, 10 Aug 2018 13:32:02 +0000 (13:32 +0000)]
net80211: Drain ageq before cleaning it up.
The comment above ieee80211_ageq_cleanup specifically notes that the queue
is assumed to be empty, and in order to make it so, ieee80211_ageq_drain
must be used.
emaste [Fri, 10 Aug 2018 10:37:25 +0000 (10:37 +0000)]
readelf: display NT_GNU_PROPERTY_TYPE_0 note name
NT_GNU_PROPERTY_TYPE_0 in a .note.gnu.property section "contains a
program property note which describes special handling requirements
for linker and run-time loader." (from the System V Application Binary
Interface - Linux Extensions")
Intel CET uses two processor-specific program properties in
NT_GNU_PROPERTY_TYPE_0: GNU_PROPERTY_X86_FEATURE_1_IBT to indicate that
all executable sections are compatible with Indirect Branch Tracking,
and GNU_PROPERTY_X86_FEATURE_1_SHSTK to indicate that sections are
compatible with shadow stack.
A later change should add decoding of the individual properties.
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
perf: 2.75x faster ddt_entry_compare()
First 256bits of ddt_key_t is a block checksum, which are expected
to be close to random data. Hence, on average, comparison only needs to
look at first few bytes of the keys. To reduce number of conditional
jump instructions, the result is computed as: sign(memcmp(k1, k2)).
Sign of an integer 'a' can be obtained as: `(0 < a) - (a < 0)` := {-1, 0, 1} ,
which is computed efficiently. Synthetic performance evaluation of
original and new algorithm over 1G random keys on 2.6GHz Intel(R) Xeon(R)
CPU E5-2660 v3:
old 6.85789 s
new 2.49089 s
perf: 2.8x faster vdev_queue_offset_compare() and vdev_queue_timestamp_compare()
Compute the result directly instead of using conditionals
perf: zfs_range_compare()
Speedup between 1.1x - 2.5x, depending on compiler version and
optimization level.
perf: spa_error_entry_compare()
`bcmp()` is not suitable for comparator use. Use `memcmp()` instead.
jhibbits [Fri, 10 Aug 2018 03:28:40 +0000 (03:28 +0000)]
powerpc: Add lwsync and ptesync 'sync' opcode variants to ddb disassembler
The canonical form of sync is:
sync L, E (if Category Elemental Memory Barriers implemented)
The L bits (2) denote the type of sync:
0 -- hwsync
1 -- lwsync
2 -- ptesync or hwsync
It's been found that most 32-bit CPUs designed prior to the introduction of
lwsync will ignore the L bits. However, some cores, particularly the e500 core,
will trigger an illegal instruction exception. Adding these variants will make
it easier to see which sync variant is actually being used in case of a trap.
kevans [Fri, 10 Aug 2018 00:10:57 +0000 (00:10 +0000)]
Makefile.inc1: Add libl to -legacy as well
libl is needed for config(8), which is a bootstrap-tool. It is possible to
build a system WITHOUT_TOOLCHAIN to exclude lex and thus, libl. We still
need to support building from this kind of host, though.
While here, group the config(8) dependencies together and add a small
explanation. These can likely both be scoped more clearly, but this will
need some further investigation.
cy [Fri, 10 Aug 2018 00:04:32 +0000 (00:04 +0000)]
Identify the return value (rval) that led to the IPv4 NAT failure
in ipf_nat_checkout() and report it in the frb_natv4out and frb_natv4in
dtrace probes.
This is currently being used to diagnose NAT failures in PR/208566. It's
rather handy so this commit makes it available for future diagnosis and
debugging efforts.
cem [Thu, 9 Aug 2018 21:53:32 +0000 (21:53 +0000)]
cam(4): Add an xpt-neutral flag indicating a valid panic CCB
No functional change.
Note that this change is careful to set the CCB header xflags after
foo_fill_bar() routines, which generally zero existing flags. An earlier
version of this patch mistakenly set the flag before the fill routines.
Submitted by: Scott Ferris <sferris AT isilon.com>, jhibbits@
Reviewed by: bdrewery@, markj@, and non-committer FreeBSD contributor Anton Rang
Sponsored by: Dell EMC Isilon
dim [Thu, 9 Aug 2018 21:28:31 +0000 (21:28 +0000)]
Add optional LLVM BPF target support
BPF (eBPF) is an independent instruction set architecture which is
introduced in Linux a few years ago. Originally, eBPF execute
environment was only inside Linux kernel. However, recent years there
are some user space implementation (https://github.com/iovisor/ubpf,
https://doc.dpdk.org/guides/prog_guide/bpf_lib.html) and kernel space
implementation for FreeBSD is going on
(https://github.com/YutaroHayakawa/generic-ebpf).
The BPF target support can be enabled using WITH_LLVM_TARGET_BPF, as it
is not built by default.
kevans [Thu, 9 Aug 2018 20:29:44 +0000 (20:29 +0000)]
libnv: Remove -I${SRCTOP}/sys
This should have been done as part of r336019 -- including ${SRCTOP}/sys is
not a good business model for something that's build in legacy/bootstrap
stages.
Beyond that, libnv seems to build quite alright as legacy, part of
buildworld, and standalone without. Axe it.
Reported by: truckman (head building stable/11)
Tested by: Shawn Webb (HardenedBSD)
MFC after: 3 days
markj [Thu, 9 Aug 2018 18:25:49 +0000 (18:25 +0000)]
Account for the lowmem handlers in the inactive queue scan target.
Before r329882 the target would be computed after lowmem handlers run
and free pages. On some systems a significant amount of page
reclamation happens this way. However, with r329882 the target is
computed first, which can lead to unnecessary reclamation from the
page cache, and this in turn may result in excessive swapping.
Instead, adjust the target after running lowmem handlers. Don't
invoke the lowmem handlers before the PID controller, though, since
that would hide the true rate of page allocation.
Reviewed by: alc, kib (previous version)
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D16606
kevans [Thu, 9 Aug 2018 17:47:47 +0000 (17:47 +0000)]
BOOT_TAG: Make a config(5) option, expose as sysctl and loader tunable
BOOT_TAG lived shortly in sys/msgbuf.h, but this wasn't necessarily great
for changing it or removing it. Move it into subr_prf.c and add options for
it to opt_printf.h.
One can specify both the BOOT_TAG and BOOT_TAG_SZ (really, size of the
buffer that holds the BOOT_TAG). We expose it as kern.boot_tag and also add
a loader tunable by the same name that we'll fetch upon initialization of
the msgbuf.
This allows for flexibility and also ensures that there's a consistent way
to figure out the boot tag of the running kernel, rather than relying on
headers to be in-sync.
cxgbe(4): Add support for high priority filters on T6+. They have their
own region in the TCAM starting with T6, unlike previous chips where
they were in the same region as normal filters.
These filters "hit" before anything else in the LE's lookup. The exact
order is:
a) High priority filters
b) TOE's active region (TCAM and/or hash)
c) Servers (TOE hw listeners)
d) Normal filters
luporl [Thu, 9 Aug 2018 14:04:51 +0000 (14:04 +0000)]
[ppc] Fix kernel panic when using BOOTP_NFSROOT
On PowerPC (and possibly other architectures), that doesn't use
EARLY_AP_STARTUP, the config task queue may be used initialized.
This was observed while trying to mount the root fs from NFS, as
reported here: https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=230168.
This patch has 2 main changes:
1- Perform a basic initialization of qgroup_config, similar to
what is done in taskqgroup_adjust, but simpler.
This makes qgroup_config ready to be used during NFS root mount.
2- When EARLY_AP_STARTUP is not used, call inm_init() and
in6m_init() right before SI_SUB_ROOT_CONF, because bootp needs
to send multicast packages to request an IP.
avg [Thu, 9 Aug 2018 11:21:31 +0000 (11:21 +0000)]
add an option for ddb ps command to print process arguments
We use ps to collect the information of all processes in textdump. But
it doesn't contain process arguments which however sometimes are very
useful for debugging. The new 'a' modifier adds that capability.
While here, remove 'm' modifier from ddb.4. It was in the manual page
from its very first revision, but I could not find any evidence of the
code ever supporting it.
Submitted by: Terry Hu <thu@panzura.com>
Reviewed by: kib
MFC after: 1 week
Sponsored by: Panzura
Differential Revision: https://reviews.freebsd.org/D16603
kevans [Thu, 9 Aug 2018 02:06:25 +0000 (02:06 +0000)]
isoboot, gptboot: Fix WITHOUT_LOADER_GELI (gptboot) and isoboot in general
gptboot was broken when r316078 added the LOADER_GELI_SUPPORT #ifdef to
not pass geliargs via __exec. KARGS_FLAGS_EXTARG must not be used if we're
not going to pass an additional argument to __exec.
kevans [Thu, 9 Aug 2018 01:32:09 +0000 (01:32 +0000)]
kern: Add a BOOT_TAG marker at the beginning of boot dmesg
From the "newly licensed to drive" PR department, add a BOOT_TAG marker (by
default, --<<BOOT>>--, to the beginning of each boot's dmesg. This makes it
easier to do textproc magic to locate the start of each boot and, of
particular interest to some, the dmesg of the current boot.
The PR has a dmesg(8) component as well that I've opted not to include for
the moment- it was the more contentious part of this PR.
bde@ also made the statement that this boot tag should be written with an
ordinary printf, which I've- for the moment- declined to change about this
patch to keep it more transparent to observer of the boot process.
PR: 43434
Submitted by: dak <aurelien.nephtali@wanadoo.fr> (basically rewritten)
MFC after: maybe never
brooks [Wed, 8 Aug 2018 22:45:30 +0000 (22:45 +0000)]
Terminate filter_create_ext() args with NULL, not 0.
filter_create_ext() is documented to take a NULL terminated set of
arguments. 0 is promoted to an int so this would fail on 64-bit
systems if the value was not passed in a register. On all currently
supported 64-bit architectures it is.
kevans [Wed, 8 Aug 2018 21:51:19 +0000 (21:51 +0000)]
ls(1): Enable colors with COLORTERM is set in the environment
COLORTERM is the de facto standard, while CLICOLOR is generally specific to
FreeBSD and ls(1).
PR: 230101
Submitted by: D Green <dfrg@xsmail.com> (with manpage additions by myself)
Reviewed by: cem ("LGTM" in PR; pre-manpage changes)
MFC after: 1 week
kevans [Wed, 8 Aug 2018 21:21:28 +0000 (21:21 +0000)]
apply(1): Fix magic number substitution with magic character ' '
Using a space as the magic character would result in problems if the command
started with a number:
- For a 'valid' number n, n < size of argv, it would erroneously get
replaced with that argument; e.g. `apply -a ' ' -d 1rm x => `execxrm x`
- For an 'invalid' number n, n >= size of argv, it would segfault.
e.g. `apply -a ' ' 2to3 test.py` would try to access argv[2]
This problem occurred because apply(1) would prepend "exec " to the command
string before doing the actual magic number replacements, so it would come
across "exec 2to3 1" and assume that the " 2" is also a magic number to be
replaced.
Re-work this to instead just append "exec " to the command sbuf and
workaround the ugliness. This also simplifies stuff in the process.
leitao [Wed, 8 Aug 2018 21:19:07 +0000 (21:19 +0000)]
powerpc64/powernv: re-read RTC after polling
If OPAL_RTC_READ is busy and does not return the information on the first run,
as returning OPAL_BUSY_EVENT, the system will crash since ymd and hmsm variable
will contain junk values.
This is happening because we were not calling OPAL_RTC_READ again after
OPAL_POLL_EVENTS' return, which would finally replace the old/junk hmsm and ymd
values.
The code was also mixing OPAL_RTC_READ and OPAL_POLL_EVENTS return values.
This patch fix this logic and guarantee that we call OPAL_RTC_READ after
OPAL_POLL_EVENTS return, and guarantee the code will only proceed if
OPAL_RTC_READ returns OPAL_SUCCESS.
rmacklem [Wed, 8 Aug 2018 20:30:12 +0000 (20:30 +0000)]
Fix the err() arguments for a nfssvc(8) failure.
argv has been incremented during argument handling, so elements of the
array are no longer valid. Change the err() arguments so only the
first string pointer in argv is used.
Found during code inspection.
rmacklem [Wed, 8 Aug 2018 20:21:45 +0000 (20:21 +0000)]
Assorted fixes to handling of LayoutRecall callbacks, mostly error handling.
After a re-read of the appropriate section of RFC5661, I decided that a
few things should be changed related to LayoutRecall callback handling.
Here are the things fixed by this patch.
- For two of the three cases that LayoutRecall is done, I now think
setting the clora_changed argument false is correct.
- All errors other than NFSERR_DELAY returned by LayoutRecall appear
permanent, so don't retry for any of them. (NFSERR_DELAY is retried by
newnfs_request(), so it is not affected by this patch.)
- Instead of waiting "forever" (actually until the process is SIGTERM'd)
for Layouts to be returned during a mirror copy, fail and return
ENXIO after about 1minute.
Waiting for a <ctrl>C made sense when pnfsdscopymr() was done by itself,
but did not make sense when done via find(1).
This patch only affects the pNFS server.