Alexander Motin [Mon, 8 Jul 2019 10:21:38 +0000 (10:21 +0000)]
MFC r349321: Improve AHCI Enclosure Management and SES interoperation.
Since SES specs do not define mechanism to map enclosure slots to SATA
disks, AHCI EM code I written many years ago appeared quite useless,
that always bugged me. I was thinking whether it was a good idea, but
if LSI HBAs do that, why I shouldn't?
This change introduces simple non-standard mechanism for the mapping
into both AHCI EM and SES code, that makes AHCI EM on capable controllers
(most of Intel's) a first-class SES citizen, allowing it to report disk
physical path to GEOM, show devices inserted into each enclosure slot in
`sesutil map` and `getencstat`, control locate and fault LEDs for specific
devices with `sesutil locate adaX on` and `sesutil fault adaX on`, etc.
I've successfully tested this on Supermicro X10DRH-i motherboard connected
with sideband cable of its S-SATA Mini-SAS connector to SAS815TQ backplane.
It can indicate with LEDs Locate, Fault and Rebuild/Remap SES statuses for
each disk identical to real SES of Supermicro SAS2 backplanes.
Alexander Motin [Mon, 8 Jul 2019 10:19:49 +0000 (10:19 +0000)]
MFC r341811 (by delphij):
Remove questionable initialization for ICH8M, rely on BIOS to properly
initialize the controller.
According to the datasheet, the old code checks if port 2 (P2E, 0x4) was
the only enabled port (except port 0, which was ignored by mask 0xfe),
and issue a write to the PCS register to disable all but port 0, right
before ahci_ctlr_reset.
Some other operating systems would issue a port enable to all ports, but
since the current code only does the special initialization for ICH8M,
it entirely and rely on BIOS to do the right thing (the alternative
would be https://reviews.freebsd.org/D18300?id=50922 , should we see
reports that we really need to do it).
Alexander Motin [Mon, 8 Jul 2019 10:18:11 +0000 (10:18 +0000)]
MFC r340092 (by imp): Implement ability to turn on/off PHYs for AHCI devices.
As part of Chuck's work on fixing kernel crashes caused by disk I/O
errors, it is useful to be able to trigger various kinds of
errors. This patch allows causing an AHCI-attached disk to disappear,
by having the driver keep the PHY disabled when the driver would
otherwise enable the PHY. It also allows making the disk reappear by
having the driver go back to setting the PHY enable/disable state as
it normal would and simulating the hardware event that causes a bus
rescan.
Alexander Motin [Sun, 7 Jul 2019 18:50:23 +0000 (18:50 +0000)]
MFC r349243: Optimize xpt_getattr().
Do not allocate temporary buffer for attributes we are going to return
as-is, just make sure to NUL-terminate them. Do not zero temporary 64KB
buffer for CDAI_TYPE_SCSI_DEVID, XPT tells us how much data it filled
and there are also length fields inside the returned data also.
Alexander Motin [Sun, 7 Jul 2019 18:49:39 +0000 (18:49 +0000)]
MFC r349220: Add wakeup_any(), cheaper wakeup_one() for taskqueue(9).
wakeup_one() and underlying sleepq_signal() spend additional time trying
to be fair, waking thread with highest priority, sleeping longest time.
But in case of taskqueue there are many absolutely identical threads, and
any fairness between them is quite pointless. It makes even worse, since
round-robin wakeups not only make previous CPU affinity in scheduler quite
useless, but also hide from user chance to see CPU bottlenecks, when
sequential workload with one request at a time looks evenly distributed
between multiple threads.
This change adds new SLEEPQ_UNFAIR flag to sleepq_signal(), making it wakeup
thread that went to sleep last, but no longer in context switch (to avoid
immediate spinning on the thread lock). On top of that new wakeup_any()
function is added, equivalent to wakeup_one(), but setting the flag.
On top of that taskqueue(9) is switchied to wakeup_any() to wakeup its
threads.
As result, on 72-core Xeon v4 machine sequential ZFS write to 12 ZVOLs
with 16KB block size spend 34% less time in wakeup_any() and descendants
then it was spending in wakeup_one(), and total write throughput increased
by ~10% with the same as before CPU usage.
Alexander Motin [Sun, 7 Jul 2019 18:44:51 +0000 (18:44 +0000)]
MFC r349178: Optimize kern.geom.conf* sysctls.
On large systems those sysctls may generate megabytes of output. Before
this change sbuf(9) code was resizing buffer by 4KB each time many times,
generating tons of TLB shootdowns. Unfortunately in this case existing
sbuf_new_for_sysctl() mechanism, supposed to help with this issue, is not
applicable, since all the sbuf writes are done in different kernel thread.
This change improves situation in two ways:
- on first sysctl call, not providing any output buffer, it sets special
sbuf drain function, just counting the data and so not needing big buffer;
- on second sysctl call it uses as initial buffer size value saved on
previous call, so that in most cases there will be no reallocation, unless
GEOM topology changed significantly.
For busy ARC situation when arc_size close to arc_c is desired. But
then it is quite likely that aggsum_compare(&arc_size, arc_c) will need
to flush per-CPU buckets to find exact comparison result. Doing that
often in a hot path penalizes whole idea of aggsum usage there, since it
replaces few simple atomic additions with dozens of lock acquisitions.
Replacing aggsum_compare() with aggsum_upper_bound() in code increasing
arc_p when ARC is growing (arc_size < arc_c) according to PMC profiles
allows to save ~5% of CPU time in aggsum code during sequential write
to 12 ZVOLs with 16KB block size on large dual-socket system.
I suppose there some minor arc_p behavior change due to lower precision
of the new code, but I don't think it is a big deal, since it should
affect only very small window in time (aggsum buckets are flushed every
second) and in ARC size (buckets are limited to 10 average ARC blocks
per CPU).
Alexander Motin [Sun, 7 Jul 2019 18:40:35 +0000 (18:40 +0000)]
MFC r349039: Alike to ZoL disable metaslab allocation tracing code.
It is too generous to collect in production debug traces that can only
be read with kernel debugger. Illumos includes special code in their
mdb debugger to read it, we don't.
Alexander Motin [Sun, 7 Jul 2019 18:38:40 +0000 (18:38 +0000)]
MFC r349029: Update td_runtime of running thread on each statclock().
Normally td_runtime is updated on context switch, but there are some kernel
threads that due to high absolute priority may run for many seconds without
context switches (yes, that is bad, but that is true), which means their
td_runtime was not updated all that time, that made them invisible for top
other then as some general CPU usage.
Alexander Motin [Sun, 7 Jul 2019 18:37:46 +0000 (18:37 +0000)]
MFC r349006: Move write aggregation memory copy out of vq_lock.
Memory copy is too heavy operation to do under the congested lock.
Moving it out reduces congestion by many times to almost invisible.
Since the original zio removed from the queue, and the child zio is
not executed yet, I don't see why would the copy need protection.
My guess it just remained like this from the time when lock was not
dropped here, which was added later to fix lock ordering issue.
Multi-threaded sequential write tests with both HDD and SSD pools
with ZVOL block sizes of 4KB, 16KB, 64KB and 128KB all show major
reduction of lock congestion, saving from 15% to 35% of CPU time
and increasing throughput from 10% to 40%.
Alexander Motin [Sun, 7 Jul 2019 18:31:53 +0000 (18:31 +0000)]
MFC r349284: Make ELEMENT INDEX validation more strict.
SES specifications tell: "The Additional Element Status descriptors shall
be in the same order as the status elements in the Enclosure Status
diagnostic page". It allows us to question ELEMENT INDEX that is lower
then values we already processed. There are many SAS2 enclosures with
this kind of problem.
While there, add more specific error messages for cases when ELEMENT INDEX
is obviously wrong. Also skip elements with INVALID bit set.
Alexander Motin [Sun, 7 Jul 2019 18:29:10 +0000 (18:29 +0000)]
MFC r349281: Fix individual_element_index when some type has 0 elements.
When some type has 0 elements, saved_individual_element_index was set
to -1 on second type bump, since individual_element_index was not
restored after the first. To me it looks easier just to increment
saved_individual_element_index separately than think when to save it.
Rick Macklem [Fri, 5 Jul 2019 22:36:31 +0000 (22:36 +0000)]
MFC: r348590, r348591
Modify mountd so that it incrementally updates the kernel exports upon a reload.
Without this patch, mountd would delete/load all exports from the exports
file(s) when it receives a SIGHUP. This works fine for small exports file(s),
but can take several seconds to do when there are large numbers (10000+) of
exported file systems. Most of this time is spent doing the system calls
that delete/export each of these file systems. When the "-S" option
has been specified (the default these days), the nfsd threads are suspended
for several seconds while the reload is done.
This patch changes mountd so that it only does system calls for file systems
where the exports have been changed/added/deleted as compared to the exports
done for the previous load/reload of the exports file(s).
Basically, when SIGHUP is posted to mountd, it saves the exportlist structures
from the previous load and creates a new set of structures from the current
exports file(s). Then it compares the current with the previous and only does
system calls for cases that have been changed/added/deleted.
The nfsd threads do not need to be suspended until the comparison step is
being done. This results in a suspension period of milliseconds for a server
with 10000+ exported file systems.
There is some code using a LOGDEBUG() macro that allow runtime debugging
output via syslog(LOG_DEBUG,...) that can be enabled by creating a file
called /var/log/mountd.debug. This code is expected to be replaced with
code that uses dtrace by cy@ in the near future, once issues w.r.t. dtrace
in stable/12 have been resolved.
The patch should not change the usage of the exports file(s), but improves
the performance of reloading large exports file(s) where there are only a
small number of changes done to the file(s).
MFC r349507:
Need to wait for epoch callbacks to complete before detaching a
network interface.
This particularly manifests itself when an INP has multicast options
attached during a network interface detach. Then the IPv4 and IPv6
leave group call which results from freeing the multicast address, may
access a freed ifnet structure. These are the steps to reproduce:
service mdnsd onestart # installed from ports
ifconfig epair create
ifconfig epair0a 0/24 up
ifconfig epair0a destroy
MFC r349506:
Implement API for draining EPOCH(9) callbacks.
The epoch_drain_callbacks() function is used to drain all pending
callbacks which have been invoked by prior epoch_call() function calls
on the same epoch. This function is useful when there are shared
memory structure(s) referred to by the epoch callback(s) which are not
refcounted and are rarely freed. The typical place for calling this
function is right before freeing or invalidating the shared
resource(s) used by the epoch callback(s). This function can sleep and
is not optimized for performance.
MFC r340404, r340415, r340417, r340419 and r340420:
Synchronize epoch(9) code with -head to make merging new patches
and features easier.
The FreeBSD version has been bumped to force recompilation of
external kernel modules.
Sponsored by: Mellanox Technologies
MFC r340404:
Uninline epoch(9) entrance and exit. There is no proof that modern
processors would benefit from avoiding a function call, but bloating
code. In fact, clang created an uninlined real function for many
object files in the network stack.
- Move epoch_private.h into subr_epoch.c. Code copied exactly, avoiding
any changes, including style(9).
- Remove private copies of critical_enter/exit.
MFC r340415:
The dualism between epoch_tracker and epoch_thread is fragile and
unnecessary. So, expose CK types to kernel and use a single normal
structure for epoch_tracker.
Reviewed by: jtl, gallatin
MFC r340417:
With epoch not inlined, there is no point in using _lite KPI. While here,
remove some unnecessary casts.
MFC r340419:
style(9), mostly adjusting overly long lines.
MFC r340420:
epoch(9) revert r340097 - no longer a need for multiple sections per cpu
I spoke with Samy Bahra and recent changes to CK to make ck_epoch_call and
ck_epoch_poll not modify the record have eliminated the need for this.
MFC r349369:
Convert all IPv4 and IPv6 multicast memberships into using a STAILQ
instead of a linear array.
The multicast memberships for the inpcb structure are protected by a
non-sleepable lock, INP_WLOCK(), which needs to be dropped when
calling the underlying possibly sleeping if_ioctl() method. When using
a linear array to keep track of multicast memberships, the computed
memory location of the multicast filter may suddenly change, due to
concurrent insertion or removal of elements in the linear array. This
in turn leads to various invalid memory access issues and kernel
panics.
To avoid this problem, put all multicast memberships on a STAILQ based
list. Then the memory location of the IPv4 and IPv6 multicast filters
become fixed during their lifetime and use after free and memory leak
issues are easier to track, for example by: vmstat -m | grep multi
All list manipulation has been factored into inline functions
including some macros, to easily allow for a future hash-list
implementation, if needed.
Rick Macklem [Fri, 5 Jul 2019 00:55:46 +0000 (00:55 +0000)]
MFC: r348452
Replace a single linked list with a hash table of lists.
mountd.c uses a single linked list of "struct exportlist" structures,
where there is one of these for each exported file system on the NFS server.
This list gets long if there are a large number of file systems exported and
the list must be searched for each line in the exports file(s) when
SIGHUP causes the exports file(s) to be reloaded.
A simple benchmark that traverses SLIST() elements and compares two 32bit
fields in the structure for equal (which is what the search is)
appears to take a couple of nsec. So, for a server with 72000 exported file
systems, this can take about 5sec during reload of the exports file(s).
By replacing the single linked list with a hash table with a target of
10 elements per list, the time should be reduced to less than 1msec.
Peter Errikson (who has a server with 72000+ exported file systems) ran
a test program using 5 hashes to see how they worked.
fnv_32_buf(fsid,..., 0)
fnv_32_buf(fsid,..., FNV1_32_INIT)
hash32_buf(fsid,..., 0)
hash32_buf(fsid,..., HASHINIT)
- plus simply using the low order bits of fsid.val[0].
The first three behaved about equally well, with the first one being
slightly better than the others.
It has an average variation of about 4.5% about the target list length
and that is what this patch uses.
Peter Errikson also tested this hash table version and found that the
performance wasn't measurably improved by a larger hash table, so a
load factor of 10 appears adequate.
Ed Maste [Thu, 4 Jul 2019 19:52:50 +0000 (19:52 +0000)]
MFC r349239, r349241: update vm_map_protect.9
Clarify that vm_map_protect cannot upgrade max_protection
It's implied by the man page's RETURN VALUES section, but be explicit in
the description that vm_map_protect can not set new protection bits that
are already in each entry's max_protection.
Clarify vm_map_protect max_protection downgrade
As reported in review D20709 by brooks calling vm_map_protect to set a
new max_protection value downgrades existing mappings if necessary (as
opposed to returning an error).
MFC r345970: network.subr: improve configuration of cloned gif(4) interfaces
ifconfig(8) syntax allows to specify only single address_family,
so we need additional invocation of ifconfig to support configuration
of cloned gif interface that may use different address families
for its internal and external addresses.
Also, ifconfig(8) does not allow to omit "inet6" keyword for address family
specifying IPv6 addresses as outer addresses of the interface.
Also, address_family is not "parameter" and it has to go before parameters
including "tunnel" keyword, so "ifconfig gif0 tunnel inet6 $oip1 $oip2" would be
wrong syntax and only "ifconfig gif0 inet6 tunnel $oip1 $oip2" is right.
At least since version 4.0.0, QEMU became bug-compatible with PowerVM's
vty, by inserting a \0 after every \r. As this confuses loader's
interpreter and as a \0 coming from the console doesn't seem reasonable,
it's now being filtered at OFW console input.
Build lib32 libl. The library is built from usr.bin/lex/lib. It would be
better to move this directory to lib/libl, but this requires more extensive
changes to Makefile.inc1. This simple fix can be MFCed quickly.
The vsc_rx_ready and the RX virtqueue is protected by the rx_mtx lock.
However, pci_vtnet_ping_rxq() (currently called only once after each
device reset) accesses those without acquiring the lock.
Both virtio_net and e82545 network frontends have code to validate and
generate MAC addresses. These functionalities are replicated in the two
files, so we move them in a separate compilation unit.
bhyve: virtio: introduce vq_kick_enable() and vq_kick_disable()
The VirtIO standard supports two schemes for notification suppression:
a notification enable bit and a more sophisticated one (event_idx) that
also supports delayed notifications. Currently bhyve fully supports
only the first scheme. This patch hides the notification suppression
internals by means of two inline routines, vq_kick_enable() and
vq_kick_disable(), and makes the code more readable.
Moreover, further improve readability by replacing the call to mb()
with a call to atomic_thread_fence_seq_cst(), which is already used
in virtio.c
On vtnet device reset it is necessary to wait for threads to stop TX and
RX processing. However, the rx_in_progress variable (used for to wait for
RX processing to stop) is actually useless, and can be removed. Acquiring
and releasing the RX lock is enough to synchronize correctly. Moreover,
it is possible to reset the device while holding both TX and RX locks, so
that the "resetting" variable becomes unnecessary for the RX thread, and
can be protected by the TX lock (instead of being volatile).
Eric van Gyzen [Wed, 3 Jul 2019 19:52:24 +0000 (19:52 +0000)]
MFC r349285
VirtIO SCSI: validate seg_max on attach
Until head r349278 (stable/12 r349690), bhyve presented a seg_max
to the guest that was too large. Detect this case and clamp it to
the virtqueue size. Otherwise, we would fail the "too many segments
to enqueue" assertion in virtqueue_enqueue().
I hit this by running a guest with a MAXPHYS of 256 KB.
Eric van Gyzen [Wed, 3 Jul 2019 19:50:22 +0000 (19:50 +0000)]
MFC r349278
bhyve: Fix vtscsi maximum segment config
The seg_max value reported to the guest should be two less than the
host's maximum, in order to leave room for the request and the
response. This is analogous to r347033 for virtio_block.
We hit the "too many segments to enqueue" assertion on OneFS because
we increase MAXPHYS to 256 KB.
The POWER8NVL (POWER8 NVLink) architecturally behaves identically to the
POWER8, with a different PVR identifier. Mark it as such, so it shows up
appropriately to the user.
The second statements on the lines are not guarded by the `if' condition.
This triggers a warning with newer gcc. It's relatively harmless given the
usage, but incorrect. Instead, wrap the statements so they're properly
guarded.
r345829: powerpc: Apply r178139 from sparc64 to powerpc's fpu_sqrt
r345831: powerpc: Allow emulating optional FPU instructions on CPUs with an FPU
r349402: powerpc/booke: Handle misaligned floating point loads/stores as on AIM
MFC r349522:
Need to apply the PCIM_BAR_MEM_BASE mask to the physical memory
address before returning it to the user. Some of the least significant
bits have special meaning and should be masked away.
MFC r349409 and r349410:
Fix support for LIBUSB_HOTPLUG_ENUMERATE in libusb. Currently all
devices are enumerated regardless of of the LIBUSB_HOTPLUG_ENUMERATE
flag. Make sure when the flag is not specified no arrival events are
generated for currently enumerated devices.
MFC r349368:
Free all allocated unit IDs in cuse(3) after the client character
devices have been destroyed to avoid creating character devices with
identical name.
Ed Maste [Wed, 3 Jul 2019 17:34:26 +0000 (17:34 +0000)]
MFC r349268: nandsim: correct test to avoid out-of-bounds access
Previously nandsim_chip_status returned EINVAL iff both of user-provided
chip->ctrl_num and chip->num were out of bounds. If only one failed the
bounds check arbitrary memory would be read and returned.
The NAND framework is not built by default, nandsim is not intended for
production use (it is a simulator), and the nandsim device has root-only
permissions.
admbugs: 827
Reported by: Daniel Hodson of elttam
Security: kernel information leak or DoS
Sponsored by: The FreeBSD Foundation
While working on PR/238796 I discovered an unused variable in frdest,
the next hop structure. It is likely this contributes to PR/238796
though other factors remain to be investigated.