Kyle Evans [Mon, 13 Aug 2018 03:38:32 +0000 (03:38 +0000)]
Use INCS for non-sys/ libnvpair and libzfs_core includes
While nothing was wrong with libnvpair.h, libzfs_core.h was only guarded by
MK_CDDL rather than MK_CDDL && MK_ZFS. Rather than ugl'if'ying
include/Makefile to impose the extra restriction, just move the non-sys/
includes into INCS with the respect lib builds.
This has the added bonus of allowing third party packagers to try and split
these libs out of the FreeBSD-runtime package, if they are so inclined.
The sys/ include was left alone- generally userland libraries shouldn't
install kernel headers.
Justin Hibbits [Sun, 12 Aug 2018 20:33:55 +0000 (20:33 +0000)]
ipmi/opal: Enable polled mode and proper callback
Fix a NULL dereference that would occur any time an ioctl() was done, due to a
missing ipmi_enqueue_request callback. Just use the default for now, until we
decide to properly enable IPMI interrupts.
Introduce kstats about the dbuf hash and dbuf cache
to make it easier to inspect state. This should help
with debugging and understanding of these portions
of the codebase.
Correct format of dbuf kstat file.
Introduce a dbc column to dbufs kstat to indicate if
a dbuf is in the dbuf cache.
Introduce field filtering in the dbufstat python script.
Introduce a no header option to the dbufstat python script.
Introduce a test case to test basic mru->mfu list movement
in the ARC.
Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Closes #6906
This commit preserves the recursive function dbuf_hold_impl() but moves
the local variables and function arguments to the heap to minimize
the stack frame size. Enough space is initially allocated on the
stack for 20 levels of recursion. This technique was based on commit 34229a2f2ac07363f64ddd63e014964fff2f0671 which reduced stack usage of
traverse_visitbp().
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
John Baldwin [Sun, 12 Aug 2018 01:54:05 +0000 (01:54 +0000)]
Add an overview section to bus_dma.9.
Describe the role of tags and mapping objects as abstractions.
Describe static vs dynamic transaction types and give a brief overview
of the set of functions and object life cycles used for static vs
dynamic.
While here, fix a few other typos and expand a bit on parent tags.
Certain function must never be automatically inlined by gcc because
they are stack heavy or called recursively. This patch flags all
such functions I've found as 'noinline' to prevent gcc from making
the optimization.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
The following scenario can result in garbage in the dn_spill field.
The db->db_blkptr must be set to NULL when DNODE_FLAG_SPILL_BLKPTR
is clear to ensure the dn_spill field is cleared.
Current txg = A.
* A new spill buffer is created. Its dbuf is initialized with
db_blkptr = NULL and it's dirtied.
Current txg = B.
* The spill buffer is modified. It's marked as dirty in this txg.
* Additional changes make the spill buffer unnecessary because the
xattr fits into the bonus buffer, so it's removed. The dbuf is
undirtied in this txg, but it's still referenced and cannot be
destroyed.
Current txg = C.
* Starts syncing of txg A
* dbuf_sync_leaf() is called for the spill buffer. Since db_blkptr
is NULL, dbuf_check_blkptr() is called.
* The dbuf starts being written and it reaches the ready state
(not done yet).
* A new change makes the spill buffer necessary again.
sa_build_layouts() ends up calling dbuf_find() to locate the
dbuf. It finds the old dbuf because it has not been destroyed yet
(it will be destroyed when the previous write is done and there
are no more references). The old dbuf has db_blkptr != NULL.
* txg A write is complete and the dbuf released. However it's still
referenced, so it's not destroyed.
Current txg = D.
* Starts syncing of txg B
* dbuf_sync_leaf() is called for the bonus buffer. Its contents are
directly copied into the dnode, overwriting the blkptr area because,
in txg B, the bonus buffer was big enough to hold the entire xattr.
* At this point, the db_blkptr of the spill buffer used in txg C
gets corrupted.
Signed-off-by: Peng <peng.hse@xtaotech.com> Signed-off-by: Tim Chase <tim@chase2k.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3937
Currently there is no mechanism to inspect which dbufs are being
cached by the system. There are some coarse counters in arcstats
by they only give a rough idea of what's being cached. This patch
aims to improve the current situation by adding a new dbufs kstat.
When read this new kstat will walk all cached dbufs linked in to
the dbuf_hash. For each dbuf it will dump detailed information
about the buffer. It will also dump additional information about
the referenced arc buffer and its related dnode. This provides a
more complete view in to exactly what is being cached.
With this generic infrastructure in place utilities can be written
to post-process the data to understand exactly how the caching is
working. For example, the data could be processed to show a list
of all cached dnodes and how much space they're consuming. Or a
similar list could be generated based on dnode type. Many other
ways to interpret the data exist based on what kinds of questions
you're trying to answer.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Prakash Surya <surya1@llnl.gov>
This feature adds support for variable length dnodes. Our motivation is
to eliminate the overhead associated with using spill blocks. Spill
blocks are used to store system attribute data (i.e. file metadata) that
does not fit in the dnode's bonus buffer. By allowing a larger bonus
buffer area the use of a spill block can be avoided. Spill blocks
potentially incur an additional read I/O for every dnode in a dnode
block. As a worst case example, reading 32 dnodes from a 16k dnode block
and all of the spill blocks could issue 33 separate reads. Now suppose
those dnodes have size 1024 and therefore don't need spill blocks. Then
the worst case number of blocks read is reduced to from 33 to two--one
per dnode block. In practice spill blocks may tend to be co-located on
disk with the dnode blocks so the reduction in I/O would not be this
drastic. In a badly fragmented pool, however, the improvement could be
significant.
ZFS-on-Linux systems that make heavy use of extended attributes would
benefit from this feature. In particular, ZFS-on-Linux supports the
xattr=sa dataset property which allows file extended attribute data
to be stored in the dnode bonus buffer as an alternative to the
traditional directory-based format. Workloads such as SELinux and the
Lustre distributed filesystem often store enough xattr data to force
spill bocks when xattr=sa is in effect. Large dnodes may therefore
provide a performance benefit to such systems.
Other use cases that may benefit from this feature include files with
large ACLs and symbolic links with long target names. Furthermore,
this feature may be desirable on other platforms in case future
applications or features are developed that could make use of a
larger bonus buffer area.
Implementation
--------------
The size of a dnode may be a multiple of 512 bytes up to the size of
a dnode block (currently 16384 bytes). A dn_extra_slots field was
added to the current on-disk dnode_phys_t structure to describe the
size of the physical dnode on disk. The 8 bits for this field were
taken from the zero filled dn_pad2 field. The field represents how
many "extra" dnode_phys_t slots a dnode consumes in its dnode block.
This convention results in a value of 0 for 512 byte dnodes which
preserves on-disk format compatibility with older software.
Similarly, the in-memory dnode_t structure has a new dn_num_slots field
to represent the total number of dnode_phys_t slots consumed on disk.
Thus dn->dn_num_slots is 1 greater than the corresponding
dnp->dn_extra_slots. This difference in convention was adopted
because, unlike on-disk structures, backward compatibility is not a
concern for in-memory objects, so we used a more natural way to
represent size for a dnode_t.
The default size for newly created dnodes is determined by the value of
a new "dnodesize" dataset property. By default the property is set to
"legacy" which is compatible with older software. Setting the property
to "auto" will allow the filesystem to choose the most suitable dnode
size. Currently this just sets the default dnode size to 1k, but future
code improvements could dynamically choose a size based on observed
workload patterns. Dnodes of varying sizes can coexist within the same
dataset and even within the same dnode block. For example, to enable
automatically-sized dnodes, run
# zfs set dnodesize=auto tank/fish
The user can also specify literal values for the dnodesize property.
These are currently limited to powers of two from 1k to 16k. The
power-of-2 limitation is only for simplicity of the user interface.
Internally the implementation can handle any multiple of 512 up to 16k,
and consumers of the DMU API can specify any legal dnode value.
The size of a new dnode is determined at object allocation time and
stored as a new field in the znode in-memory structure. New DMU
interfaces are added to allow the consumer to specify the dnode size
that a newly allocated object should use. Existing interfaces are
unchanged to avoid having to update every call site and to preserve
compatibility with external consumers such as Lustre. The new
interfaces names are given below. The versions of these functions that
don't take a dnodesize parameter now just call the _dnsize() versions
with a dnodesize of 0, which means use the legacy dnode size.
New DMU interfaces:
dmu_object_alloc_dnsize()
dmu_object_claim_dnsize()
dmu_object_reclaim_dnsize()
New ZAP interfaces:
zap_create_dnsize()
zap_create_norm_dnsize()
zap_create_flags_dnsize()
zap_create_claim_norm_dnsize()
zap_create_link_dnsize()
The constant DN_MAX_BONUSLEN is renamed to DN_OLD_MAX_BONUSLEN. The
spa_maxdnodesize() function should be used to determine the maximum
bonus length for a pool.
These are a few noteworthy changes to key functions:
* The prototype for dnode_hold_impl() now takes a "slots" parameter.
When the DNODE_MUST_BE_FREE flag is set, this parameter is used to
ensure the hole at the specified object offset is large enough to
hold the dnode being created. The slots parameter is also used
to ensure a dnode does not span multiple dnode blocks. In both of
these cases, if a failure occurs, ENOSPC is returned. Keep in mind,
these failure cases are only possible when using DNODE_MUST_BE_FREE.
If the DNODE_MUST_BE_ALLOCATED flag is set, "slots" must be 0.
dnode_hold_impl() will check if the requested dnode is already
consumed as an extra dnode slot by an large dnode, in which case
it returns ENOENT.
* The function dmu_object_alloc() advances to the next dnode block
if dnode_hold_impl() returns an error for a requested object.
This is because the beginning of the next dnode block is the only
location it can safely assume to either be a hole or a valid
starting point for a dnode.
* dnode_next_offset_level() and other functions that iterate
through dnode blocks may no longer use a simple array indexing
scheme. These now use the current dnode's dn_num_slots field to
advance to the next dnode in the block. This is to ensure we
properly skip the current dnode's bonus area and don't interpret it
as a valid dnode.
zdb
---
The zdb command was updated to display a dnode's size under the
"dnsize" column when the object is dumped.
For ZIL create log records, zdb will now display the slot count for
the object.
ztest
-----
Ztest chooses a random dnodesize for every newly created object. The
random distribution is more heavily weighted toward small dnodes to
better simulate real-world datasets.
Unused bonus buffer space is filled with non-zero values computed from
the object number, dataset id, offset, and generation number. This
helps ensure that the dnode traversal code properly skips the interior
regions of large dnodes, and that these interior regions are not
overwritten by data belonging to other dnodes. A new test visits each
object in a dataset. It verifies that the actual dnode size matches what
was stored in the ztest block tag when it was created. It also verifies
that the unused bonus buffer space is filled with the expected data
patterns.
ZFS Test Suite
--------------
Added six new large dnode-specific tests, and integrated the dnodesize
property into existing tests for zfs allow and send/recv.
Send/Receive
------------
ZFS send streams for datasets containing large dnodes cannot be received
on pools that don't support the large_dnode feature. A send stream with
large dnodes sets a DMU_BACKUP_FEATURE_LARGE_DNODE flag which will be
unrecognized by an incompatible receiving pool so that the zfs receive
will fail gracefully.
While not implemented here, it may be possible to generate a
backward-compatible send stream from a dataset containing large
dnodes. The implementation may be tricky, however, because the send
object record for a large dnode would need to be resized to a 512
byte dnode, possibly kicking in a spill block in the process. This
means we would need to construct a new SA layout and possibly
register it in the SA layout object. The SA layout is normally just
sent as an ordinary object record. But if we are constructing new
layouts while generating the send stream we'd have to build the SA
layout object dynamically and send it at the end of the stream.
For sending and receiving between pools that do support large dnodes,
the drr_object send record type is extended with a new field to store
the dnode slot count. This field was repurposed from unused padding
in the structure.
ZIL Replay
----------
The dnode slot count is stored in the uppermost 8 bits of the lr_foid
field. The bits were unused as the object id is currently capped at
48 bits.
Resizing Dnodes
---------------
It should be possible to resize a dnode when it is dirtied if the
current dnodesize dataset property differs from the dnode's size, but
this functionality is not currently implemented. Clearly a dnode can
only grow if there are sufficient contiguous unused slots in the
dnode block, but it should always be possible to shrink a dnode.
Growing dnodes may be useful to reduce fragmentation in a pool with
many spill blocks in use. Shrinking dnodes may be useful to allow
sending a dataset to a pool that doesn't support the large_dnode
feature.
Feature Reference Counting
--------------------------
The reference count for the large_dnode pool feature tracks the
number of datasets that have ever contained a dnode of size larger
than 512 bytes. The first time a large dnode is created in a dataset
the dataset is converted to an extensible dataset. This is a one-way
operation and the only way to decrement the feature count is to
destroy the dataset, even if the dataset no longer contains any large
dnodes. The complexity of reference counting on a per-dnode basis was
too high, so we chose to track it on a per-dataset basis similarly to
the large_block feature.
Signed-off-by: Ned Bass <bass6@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3542
Kyle Evans [Sun, 12 Aug 2018 00:08:14 +0000 (00:08 +0000)]
getopt_long(3): Document behavior of leading characters in optstring
Leading '+', '-', and ':' in optstring have special meaning. We briefly
mention that the first two have special meaning in that we say
POSIXLY_CORRECT turns them off, but we don't actually document their
meaning. Add a paragraph to RETURN VALUES explaining how they control
the treatment of non-option arguments.
A leading ':' has no mention; add a note that it suppresses warnings about
missing arguments.
Kyle Evans [Sat, 11 Aug 2018 23:50:09 +0000 (23:50 +0000)]
Merge libbe(3)/bectl(8) from projects/bectl into head
bectl(8) is an administrative interface for working with ZFS boot
environments, intended to provide a superset of the functionality provided
by sysutils/beadm.
libbe(3) is the back-end library that the required functionality has been
pulled out into for later reuse.
These were originally written for GSoC 2017 under the mentorship of
allanjude@.
bectl(8) has proven pretty stable in my testing, with the known bug
documented in the man page.
Kyle Evans [Sat, 11 Aug 2018 22:45:39 +0000 (22:45 +0000)]
libbe(3)/bectl(8): More SYSROOT/GCC build fixes
- Missing include path
- Fully specify libzfs's dependencies (except for deps pulled in by other
deps) in Makefile.inc1
- Drop WARNS back down to 2 for libbe(3). I do this with much hesitation,
but the libzfs headers are apparently a hot warning-filled mess as far as
GCC 4.2 is concerned.
As described in the comment above arc_adapt_thread() it is critical
that the arc_adapt_thread() function never sleep while holding a hash
lock. This behavior was possible in the Linux implementation because
the arc_prune() logic was implemented to be synchronous. Under
illumos the analogous dnlc_reduce_cache() function is asynchronous.
To address this the arc_do_user_prune() function is has been reworked
in to two new functions as follows:
* arc_prune_async() is an asynchronous implementation which dispatches
the prune callback to be run by the system taskq. This makes it
suitable to use in the context of the arc_adapt_thread().
* arc_prune() is a synchronous implementation which depends on the
arc_prune_async() implementation but blocks until the outstanding
callbacks complete. This is used in arc_kmem_reap_now() where it
is safe, and expected, that memory will be freed.
This patch additionally adds the zfs_arc_meta_strategy module option
while allows the meta reclaim strategy to be configured. It defaults
to a balanced strategy which has been proved to work well under Linux
but the illumos meta-only strategy can be enabled.
Signed-off-by: Tim Chase <tim@chase2k.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Navdeep Parhar [Sat, 11 Aug 2018 21:10:08 +0000 (21:10 +0000)]
cxgbe(4): Move all control queues to the adapter.
There used to be one control queue per adapter (the mgmtq) that was
initialized during adapter init and one per port that was initialized
later during port init. This change moves all the control queues (one
per port/channel) to the adapter so that they are initialized during
adapter init and are available before any port is up. This allows the
driver to issue ctrlq work requests over any channel without having to
bring up any port.
Metadata-intensive workloads can cause the ARC to become permanently
filled with dnode_t objects as they're pinned by the VFS layer.
Subsequent data-intensive workloads may only benefit from about
25% of the potential ARC (arc_c_max - arc_meta_limit).
In order to help track metadata usage more precisely, the other_size
metadata arcstat has replaced with dbuf_size, dnode_size and bonus_size.
The new zfs_arc_dnode_limit tunable, which defaults to 10% of
zfs_arc_meta_limit, defines the minimum number of bytes which is desirable
to be consumed by dnodes. Attempts to evict non-metadata will trigger
async prune tasks if the space used by dnodes exceeds this limit.
The new zfs_arc_dnode_reduce_percent tunable specifies the amount by
which the excess dnode space is attempted to be pruned as a percentage of
the amount by which zfs_arc_dnode_limit is being exceeded. By default,
it tries to unpin 10% of the dnodes.
The problem of dnode metadata pinning was observed with the following
testing procedure (in this example, zfs_arc_max is set to 4GiB):
- Create a large number of small files until arc_meta_used exceeds
arc_meta_limit (3GiB with default tuning) and arc_prune
starts increasing.
- Create a 3GiB file with dd. Observe arc_mata_used. It will still
be around 3GiB.
- Repeatedly read the 3GiB file and observe arc_meta_limit as before.
It will continue to stay around 3GiB.
With this modification, space for the 3GiB file is gradually made
available as subsequent demands on the ARC are made. The previous behavior
can be restored by setting zfs_arc_dnode_limit to the same value as the
zfs_arc_meta_limit.
Signed-off-by: Tim Chase <tim@chase2k.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #4345
Issue #4512
Issue #4773
Closes #4858
While STACKSIZE macro is indeed problematic on some systems, the commits
were wrong to shrink il[] and cstk[], because they need to be of the same
size as p_stack[] as they're accessed with the same index ps.tos.
Kristof Provost [Sat, 11 Aug 2018 16:34:30 +0000 (16:34 +0000)]
pf: Fix 'set skip on' for groups
The pfi_skip_if() function sometimes caused skipping of groups to work,
if the members of the group used the groupname as a name prefix.
This is often the case, e.g. group lo usually contains lo0, lo1, ...,
but not always.
Rather than relying on the name explicitly check for group memberships.
- Correct the description when jobs are executed related to load avg
to match reality (slightly different to what was submitted in the
PR: use english word instead of math-symbol).
- Wrap the corresponding part to below 80 characters per line.
Fix bug introduced in r98542: previously to this revision the byte-swapped
value was compared at this place. The current check is in a conditional
section where the non-byte-swapped value was already checked to be not
the value which is checked again. As byte-swapping is activated afterwards,
it only makes sense if the byte-swapped value is checked.
Submitted by: Keith White <kwhite@site.uottawa.ca>
PR: 200059
MFC after: 1 month
Sponsored by: Essen Hackathon
The Makefile part in the PR is solved already differently, so this
part is skipped form the PR The man page change change is slightly
changed to adapt to the way the Makefile works and to the spirit
of what is intended here.
Submitted by: Juan Ramón Molina Menor <info@juanmolina.eu>
PR: 194910
Sponsored by: Essen Hackathon
Dimitry Andric [Sat, 11 Aug 2018 10:42:12 +0000 (10:42 +0000)]
Pull in r338481 from upstream llvm trunk (by Chandler Carruth):
[x86] Fix a really subtle miscompile due to a somewhat glaring bug in
EFLAGS copy lowering.
If you have a branch of LLVM, you may want to cherrypick this. It is
extremely unlikely to hit this case empirically, but it will likely
manifest as an "impossible" branch being taken somewhere, and will be
... very hard to debug.
Hitting this requires complex conditions living across complex
control flow combined with some interesting memory (non-stack)
initialized with the results of a comparison. Also, because you have
to arrange for an EFLAGS copy to be in *just* the right place, almost
anything you do to the code will hide the bug. I was unable to reduce
anything remotely resembling a "good" test case from the place where
I hit it, and so instead I have constructed synthetic MIR testing
that directly exercises the bug in question (as well as the good
behavior for completeness).
The issue is that we would mistakenly assume any SETcc with a valid
condition and an initial operand that was a register and a virtual
register at that to be a register *defining* SETcc...
It isn't though....
This would in turn cause us to test some other bizarre register,
typically the base pointer of some memory. Now, testing this register
and using that to branch on doesn't make any sense. It even fails the
machine verifier (if you are running it) due to the wrong register
class. But it will make it through LLVM, assemble, and it *looks*
fine... But wow do you get a very unsual and surprising branch taken
in your actual code.
The fix is to actually check what kind of SETcc instruction we're
dealing with. Because there are a bunch of them, I just test the
may-store bit in the instruction. I've also added an assert for
sanity that ensure we are, in fact, *defining* the register operand.
=D
Sevan Janiyan [Sat, 11 Aug 2018 10:21:21 +0000 (10:21 +0000)]
Drop the ternary operator for calculating ssid display length in list_scan().
Regardless if a verbose scan is required or not, we'd still want to display the
full SSID name by default so use the IEE80211_NWID_LEN constant to set the
value to use instead.
Conrad Meyer [Sat, 11 Aug 2018 02:56:43 +0000 (02:56 +0000)]
stat(1): cache id->name resolution
When invoked on a large list of files, it is most common for a small number of
uids/gids to own most of the results.
Like ls(1), use pwcache(3) to avoid repeatedly looking up the same IDs.
Example microbenchmark and non-scientific results:
$ time (find /usr/src -type f -print0 | xargs -0 stat >/dev/null)
BEFORE:
3.62s user 5.23s system 102% cpu 8.655 total
3.47s user 5.38s system 102% cpu 8.647 total
AFTER:
1.23s user 1.81s system 108% cpu 2.810 total
1.43s user 1.54s system 107% cpu 2.754 total
Does this microbenchmark have any real-world significance? Until a use case
is demonstrated otherwise, I doubt it. Ordinarily I would be resistant to
optimizing pointless microbenchmarks in base utilities (e.g., recent totally
gratuitous changes to yes(1)). However, the pwcache(3) APIs actually
simplify stat(1) logic ever so slightly compared to the raw APIs they wrap,
so I think this is at worst harmless.
PR: 230491
Reported by: Thomas Hurst <tom AT hur.st>
Discussed with: gad@
Kyle Evans [Sat, 11 Aug 2018 01:02:27 +0000 (01:02 +0000)]
libbe(3)/bectl(8): Kill off the 'add' functionality for now
The mostly-undocumented 'add' functionality, from initial read-through, is
intended for construction of deep ("bdrewery style") boot environments.
However, it's mostly broken at this point. `#if SOON` it out on both sides
so that we're not exposing a broken API/feature.
Use cached feature info in spa_add_feature_stats()
Avoid issuing I/O to the pool when retrieving feature flags information.
Trying to read the ZAPs from disk means that zpool clear would hang if
the pool is suspended and recovery would require a reboot. To keep the
feature stats resident in memory, we hang a cached nvlist off of the
spa. It is built up from disk the first time spa_add_feature_stats() is
called, and refreshed thereafter using the cached feature reference
counts. spa_add_feature_stats() gets called at pool import time so we
can be sure the cached nvlist will be available if the pool is later
suspended.
Signed-off-by: Ned Bass <bass6@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3082
Kyle Evans [Fri, 10 Aug 2018 21:23:56 +0000 (21:23 +0000)]
libbe(3): More error handling bits
be_add_child functionality gets split out into separate places as a bonus.
A lot of places here we'll gloss over libzfs errors, because they shouldn't
be happening given the conditions that we're operating under. "Unknown
error" is what I'm intending to use for the moment to indicate an
exceptional circumstance- exceptional enough that we can't tell the consumer
did because we're not so certain that they did anything.
Dimitry Andric [Fri, 10 Aug 2018 19:57:55 +0000 (19:57 +0000)]
In r308100, an explicit -fexceptions flag was added for the C sources
from LLVM's libunwind, which end up in libgcc_eh.a and libgcc_s.so.
This is because the unwinder needs the unwinder data for its own
functions.
However, for the C++ sources in libunwind, -fexceptions is already the
default, and this can have the side effect of generating a reference to
__gxx_personality_v0, the so-called personality function, which is
normally provided by the C++ ABI library (libcxxrt or libsupc++).
If the reference ends up in the eventual libgcc_s.so, linking any
non-C++ programs against it will fail with "undefined reference to
`__gxx_personality_v0'".
Note that at high optimization levels, the reference is usually
optimized away, which is why we have never noticed this problem before.
With clang 7.0.0 though, higher optimization levels don't help anymore,
since the addition of address-significance tables [1] in
<https://reviews.llvm.org/rL337339>. Effectively, this always causes a
reference to __gxx_personality_v0.
After discussion with the upstream author of that change, it turns out
that we should compile libunwind sources with the -fno-exceptions
-funwind-tables flags instead. This ensures unwind tables are
generated, but no references to any personality functions are emitted.
Conrad Meyer [Fri, 10 Aug 2018 19:19:07 +0000 (19:19 +0000)]
Walk back r337554 while discussion continues
The idea was to get the uncontroversial mechanical change out of the way,
then get the meatier functional changes reviewed subsequently. I had not
realized that the immediately adjacent issue was addressed in a different
direction in r334506 (see Warner's guidance in D15592).
Discussion continues, trying to determine if there is a secondary issue
still[1] and how best to fix it. With 12-related activities coming up,
while that is ongoing, just take this back for now.
[1]: Shutdown-time eventhandler events fire normally during panic's reboot
path. Driver callbacks that attempt to issue and wait on interrupt-
completed IO may never complete, hanging the system. This is particularly
obnoxious in the shutdown/panic path, as the debugger cannot be entered
anymore and the hang prevents reboot restoring availability.
(There's nothing CAM-specific about this problem -- any shutdown
event-triggered driver could do something like this during panic. But most
NICs, etc. don't try to send spin-down commands at shutdown. ;-))
Kyle Evans [Fri, 10 Aug 2018 15:29:06 +0000 (15:29 +0000)]
boot tagging: minor fixes
msgbufinit may be called multiple times as we initialize the msgbuf into a
progressively larger buffer. This doesn't happen as of now on head, but it
may happen in the future and we generally support this. As such, only print
the boot tag if we've just initialized the buffer for the first time.
The boot tag also now has a newline appended to it for better visibility,
and has been switched to a normal printf, by requesto f bde, after we've
denoted that the msgbuf is mapped.
Warner Losh [Fri, 10 Aug 2018 15:16:41 +0000 (15:16 +0000)]
Update man page to include FreeBSD-specific details.
While this implements a standards-conforming C11 function, there's
implementation details the programmer needs to know. Include those
here. Make changes inspired by comments on the initial review as well,
though mostly this involves stealing the epoch verbage from
gettimeofday(2). Add myself to authors since I've now changed a
substantial amount of this man page.
Warner Losh [Fri, 10 Aug 2018 15:16:36 +0000 (15:16 +0000)]
Remove assert.h and commented out _DIAGASSERT.
Remove assert.h and _DIAGASSERT to create a paper-trail of changes
from NetBSD. Specifically didn't fix other style issues since I
don't want this to diverge from the NetBSD original too much and
that's too niggling a change to be worth future merge hassles.
Warner Losh [Fri, 10 Aug 2018 15:16:30 +0000 (15:16 +0000)]
Bring in timespce_get form NetBSD.
Bring in the functionality for timespec_get from NetBSD. I've lightly
edited the .c file to remove _DIAGASSERT because FreeBSD doesn't have
that functionality and the typical #define'ing it to assert isn't
right here. The man page is verbatim from NetBSD, but will be revised
as part of a larger cleanup of the time man pages (they are
inconsistent and vague in all the wrong places).
Kyle Evans [Fri, 10 Aug 2018 13:32:02 +0000 (13:32 +0000)]
net80211: Drain ageq before cleaning it up.
The comment above ieee80211_ageq_cleanup specifically notes that the queue
is assumed to be empty, and in order to make it so, ieee80211_ageq_drain
must be used.
Ed Maste [Fri, 10 Aug 2018 10:37:25 +0000 (10:37 +0000)]
readelf: display NT_GNU_PROPERTY_TYPE_0 note name
NT_GNU_PROPERTY_TYPE_0 in a .note.gnu.property section "contains a
program property note which describes special handling requirements
for linker and run-time loader." (from the System V Application Binary
Interface - Linux Extensions")
Intel CET uses two processor-specific program properties in
NT_GNU_PROPERTY_TYPE_0: GNU_PROPERTY_X86_FEATURE_1_IBT to indicate that
all executable sections are compatible with Indirect Branch Tracking,
and GNU_PROPERTY_X86_FEATURE_1_SHSTK to indicate that sections are
compatible with shadow stack.
A later change should add decoding of the individual properties.
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
perf: 2.75x faster ddt_entry_compare()
First 256bits of ddt_key_t is a block checksum, which are expected
to be close to random data. Hence, on average, comparison only needs to
look at first few bytes of the keys. To reduce number of conditional
jump instructions, the result is computed as: sign(memcmp(k1, k2)).
Sign of an integer 'a' can be obtained as: `(0 < a) - (a < 0)` := {-1, 0, 1} ,
which is computed efficiently. Synthetic performance evaluation of
original and new algorithm over 1G random keys on 2.6GHz Intel(R) Xeon(R)
CPU E5-2660 v3:
old 6.85789 s
new 2.49089 s
perf: 2.8x faster vdev_queue_offset_compare() and vdev_queue_timestamp_compare()
Compute the result directly instead of using conditionals
perf: zfs_range_compare()
Speedup between 1.1x - 2.5x, depending on compiler version and
optimization level.
perf: spa_error_entry_compare()
`bcmp()` is not suitable for comparator use. Use `memcmp()` instead.
Justin Hibbits [Fri, 10 Aug 2018 03:28:40 +0000 (03:28 +0000)]
powerpc: Add lwsync and ptesync 'sync' opcode variants to ddb disassembler
The canonical form of sync is:
sync L, E (if Category Elemental Memory Barriers implemented)
The L bits (2) denote the type of sync:
0 -- hwsync
1 -- lwsync
2 -- ptesync or hwsync
It's been found that most 32-bit CPUs designed prior to the introduction of
lwsync will ignore the L bits. However, some cores, particularly the e500 core,
will trigger an illegal instruction exception. Adding these variants will make
it easier to see which sync variant is actually being used in case of a trap.
Kyle Evans [Fri, 10 Aug 2018 00:10:57 +0000 (00:10 +0000)]
Makefile.inc1: Add libl to -legacy as well
libl is needed for config(8), which is a bootstrap-tool. It is possible to
build a system WITHOUT_TOOLCHAIN to exclude lex and thus, libl. We still
need to support building from this kind of host, though.
While here, group the config(8) dependencies together and add a small
explanation. These can likely both be scoped more clearly, but this will
need some further investigation.
Cy Schubert [Fri, 10 Aug 2018 00:04:32 +0000 (00:04 +0000)]
Identify the return value (rval) that led to the IPv4 NAT failure
in ipf_nat_checkout() and report it in the frb_natv4out and frb_natv4in
dtrace probes.
This is currently being used to diagnose NAT failures in PR/208566. It's
rather handy so this commit makes it available for future diagnosis and
debugging efforts.