Ian Lepore [Mon, 31 Jul 2017 15:24:40 +0000 (15:24 +0000)]
Check the clock-halted flag every time the clock is read, not just once
at startup. The flag stays set until the clock is loaded with good time,
so we need to keep saying the time is invalid until that happens.
Alexander Motin [Mon, 31 Jul 2017 15:23:19 +0000 (15:23 +0000)]
Improve FHA locality control for NFS read/write requests.
This change adds two new tunables, allowing to control serialization for
read and write NFS requests separately. It does not change the default
behavior since there are too many factors to consider, but gives additional
space for further experiments and tuning.
The main motivation for this change is very low write speed in case of ZFS
with sync=always or when NFS clients requests sychronous operation, when
every separate request has to be written/flushed to ZIL, and requests are
processed one at a time. Setting vfs.nfsd.fha.write=0 in that case allows
to increase ZIL throughput by several times by coalescing writes and cache
flushes. There is a worry that doing it may increase data fragmentation
on disks, but I suppose it should not happen for pool with SLOG.
Ian Lepore [Mon, 31 Jul 2017 14:57:02 +0000 (14:57 +0000)]
Switch from using iic_transfer() to iicdev_readfrom/writeto(), mostly so
that transfers will be done with proper ownership of the bus. No
behavioral changes. Also add a detach() method.
Andrew Gallatin [Mon, 31 Jul 2017 14:56:35 +0000 (14:56 +0000)]
Don't request CTLTYPE_OPAQUE if we can't print them.
The intent is to skip expensive opaque sysctls like tcp_pcblist unless
they are explicitly requested. Sysctl nodes like this don't show up in
sysctl -a, but they do generate output that winds up being dropped,
unless the user specifically requested binary/hex output or opaques.
This reduces the runtime of sysctl in many circumstances on a loaded
system. It also reduces the likelihood that simply gathering
diagnostics on a sick machine (stuck lock, etc) via sysctl -a might
push it over the edge into a total lockup.
Add inpcb pointer to struct ipsec_ctx_data and pass it to the pfil hook
from enc_hhook().
This should solve the problem when pf is used with if_enc(4) interface,
and outbound packet with existing PCB checked by pf, and this leads to
deadlock due to pf does its own PCB lookup and tries to take rlock when
wlock is already held.
Now we pass PCB pointer if it is known to the pfil hook, this helps to
avoid extra PCB lookup and thus rlock acquiring is not needed.
For inbound packets it is safe to pass NULL, because we do not held any
PCB locks yet.
How network VF works with hn(4) on Hyper-V in non-transparent mode:
- Each network VF has a cooresponding hn(4).
- The network VF and the it's cooresponding hn(4) have the same hardware
address.
- Once the network VF is up, e.g. ifconfig VF up:
o All of the transmission should go through the network VF.
o Most of the reception goes through the network VF.
o Small amount of reception may go through the cooresponding hn(4).
This reception will happen, even if the the cooresponding hn(4) is
down. The cooresponding hn(4) will change the reception interface
to the network VF, so that network layer and application layer will
be tricked into thinking that these packets were received by the
network VF.
o The cooresponding hn(4) pretends the physical link is down.
- Once the network VF is down or detached:
o All of the transmission should go through the cooresponding hn(4).
o All of the reception goes through the cooresponding hn(4).
o The cooresponding hn(4) fallbacks to the original physical link
detection logic.
All these features are mainly used to help live migration, during which
the network VF will be detached, while the network communication to the
VM must not be cut off. In order to reach this level of live migration
transparency, we use failover mode lagg(4) with the network VF and the
cooresponding hn(4) attached to it.
To ease user configuration for both network VF and non-network VF, the
lagg(4) will be created by the following rules, and the configuration
of the cooresponding hn(4) will be applied to the lagg(4) automatically.
Sponsored by: Microsoft
Differential Revision: https://reviews.freebsd.org/D11635
Ian Lepore [Mon, 31 Jul 2017 01:36:51 +0000 (01:36 +0000)]
Use the new clock_schedule() to arrange for clock_settime() to be called
at the right time to keep the RTC hardware time in sync, instead of using
pause_sbt() to sleep until the right time.
Ian Lepore [Mon, 31 Jul 2017 01:18:21 +0000 (01:18 +0000)]
Add clock_schedule(), a feature that allows realtime clock drivers to
request that their clock_settime() methods be called at a given offset
from top-of-second. This adds a timeout_task to the rtc_instance so that
each clock can be separately added to taskqueue_thread with the scheduling
it prefers, instead of looping through all the clocks at once with a
single task on taskqueue_thread. If a driver doesn't call clock_schedule()
the default is the old behavior: clock_settime() is queued immediately.
The motivation behind this is that I was on the path of adding identical
code to a third RTC driver to figure out a delta to top-of-second and
sleep for that amount of time because writing the the RTC registers resets
the hardware's concept of top-of-second. (Sometimes it's not top-of-second,
some RTC clocks tick over a half second after you set their time registers.)
Worst-case would be to sleep for almost a full second, which is a rude thing
to do on a shared task queue thread.
Ian Lepore [Mon, 31 Jul 2017 00:54:50 +0000 (00:54 +0000)]
Add taskqueue_enqueue_timeout_sbt(), because sometimes you want more control
over the scheduling precision than 'ticks' can offer, and because sometimes
you're already working with sbintime_t units and it's dumb to convert them
to ticks just so they can get converted back to sbintime_t under the hood.
- Remove 'if_rtwn_load="YES"' line from loader.conf; the module was
renamed in r319733 + it will be loaded automatically as a dependency.
- Move new sentence to new line.
- Add short description for dev.rtwn.%d.rx_buf_size tunable.
Since device can pass multiple frames in a single payload temporary
Rx buffer was big enough to hold all of them; now the driver can
concatenate a single frame from multiple payloads.
The Rx buffer size may be configured via tunable (dev.rtwn.%d.rx_buf_size).
Tested with:
- rtl8188cus, rtl8188eu and rtl8821au (STA mode).
- (by kevlo) rtl8192cu and rtl8188eu.
Scott Long [Sun, 30 Jul 2017 22:34:24 +0000 (22:34 +0000)]
Change from using underbar function names to normal function names for
the informational print functions. Collapse the debug API a bit to be
more generic and not require as much code duplication. While here, fix
a bug in MPS that was already fixed in MPR.
Ian Lepore [Sun, 30 Jul 2017 19:58:31 +0000 (19:58 +0000)]
Fix AM/PM mode handling. The bits to mask off in the hours register changes
between 12/24 hour mode. Also fix conversion between 12 and 24 hour mode.
It's not as easy as adding/subtracting 12, because the clock doesn't roll
over 11->0, it rolls over 12->1; 0 isn't a valid hour in AM/PM mode.
Ian Lepore [Sun, 30 Jul 2017 18:46:38 +0000 (18:46 +0000)]
Bugfixes and enhancements...
Don't enable the oscillator when it is found to be stopped at init time,
just let the first setting of valid time start it. But still report a dead
battery if it's stopped at init time.
Don't force the chip into 24hr mode, just cope with whatever mode it is
already in.
Align the RTC clock to top of second when setting it.
Ian Lepore [Sun, 30 Jul 2017 16:17:06 +0000 (16:17 +0000)]
Switch from using iic_transfer() to iicdev_readfrom/writeto(), mostly so
that transfers will be done with proper ownership of the bus. No
behavioral changes.
Alexander Motin [Sun, 30 Jul 2017 15:19:07 +0000 (15:19 +0000)]
Attach ichwd(4) only to ISA bus of the LPC bridge.
Resource allocation for parent device does not look good by itself, but
attempt to allocate them for unrelated device just does not end up good.
On Asus X99-E WS/USB3.1 system reporting ISA bridge via both PCI and ACPI
this reported to cause kernel panic on shutdown due to messed resources:
https://bugs.freenas.org/issues/25237.
Pull in r309503 from upstream clang trunk (by Richard Smith):
PR33902: Invalidate line number cache when adding more text to
existing buffer.
This led to crashes as the line number cache would report a bogus
line number for a line of code, and we'd try to find a nonexistent
column within the line when printing diagnostics.
This fixes an assertion when building the graphics/champlain port.
Scott Long [Sun, 30 Jul 2017 06:53:58 +0000 (06:53 +0000)]
Split the interrupt setup code into two parts: allocation and configuration.
Do the allocation before requesting the IOCFacts message. This triggers
the LSI firmware to recognize the multiqueue should be enabled if available.
Multiqueue isn't used by the driver yet, but this also fixes a problem with
the cached IOCFacts not matching latter checks, leading to potential problems
with error recovery.
As a side-effect, fetch the driver tunables as early as possible.
Ian Lepore [Sat, 29 Jul 2017 23:45:57 +0000 (23:45 +0000)]
Replace the pcf8563 i2c RTC driver with a new nxprtc driver which handles
all the chips in the NXP PCA212x and PCA/PCF85xx series. In addition to
supporting more chips, this driver uses the countdown timer on the chips as
a fractional seconds counter, giving it a resolution of about 15 milliseconds.
Rick Macklem [Sat, 29 Jul 2017 20:08:25 +0000 (20:08 +0000)]
Add a new "-N" option to umount(8), that does a forced dismount of an NFS mount
point.
The new "-N" option does a forced dismount of an NFS mount point, but avoids
doing any checking of the mounted-on path, so that it will not get hung
when a vnode lock is held by another hung process on the mounted-on vnode.
The most common case of this is a "umount" with the "-f" option.
Other than avoiding checking the mounted-on path, it performs the same
forced dismount as a successful "umount -f" would do.
This commit includes a content change to the man page.
Rick Macklem [Sat, 29 Jul 2017 19:52:47 +0000 (19:52 +0000)]
Add kernel support for the NFS client forced dismount "umount -N" option.
When an NFS mount is hung against an unresponsive NFS server, the "umount -f"
option can be used to dismount the mount. Unfortunately, "umount -f" gets
hung as well if a "umount" without "-f" has already been done. Usually,
this is because of a vnode lock being held by the "umount" for the mounted-on
vnode.
This patch adds kernel code so that a new "-N" option can be added to "umount",
allowing it to avoid getting hung for this case.
It adds two flags. One indicates that a forced dismount is about to happen
and the other is used, along with setting mnt_data == NULL, to handshake
with the nfs_unmount() VFS call.
It includes a slight change to the interface used between the client and
common NFS modules, so I bumped __FreeBSD_version to ensure both modules are
rebuilt.
Ian Lepore [Sat, 29 Jul 2017 17:00:23 +0000 (17:00 +0000)]
Add inline functions to convert between sbintime_t and decimal time units.
Use them in some existing code that is vulnerable to roundoff errors.
The existing constant SBT_1NS is a honeypot, luring unsuspecting folks into
writing code such as long_timeout_ns*SBT_1NS to generate the argument for a
sleep call. The actual value of 1ns in sbt units is ~4.3, leading to a
large roundoff error giving a shorter sleep than expected when multiplying
by the trucated value of 4 in SBT_1NS. (The evil honeypot aspect becomes
clear after you waste a whole day figuring out why your sleeps return early.)
Don't use libc++ when cross-building for gcc arches
Since we imported clang 5.0.0, the version check in Makefile.inc1 which
checks whether to use libc++ fires even when the compiler for the target
architecture is gcc 4.2.1. This is because only X_COMPILER_VERSION is
checked. Also check X_COMPILER_TYPE, so it will only use libc++ when an
external gcc toolchain is used.
Currently in Virtio driver without TSO/GSO features enabled, the max scatter
gather segments for the TX path can be 4, which limits the support for 9K JUMBO
frames. 9K JUMBO frames results in more than 4 scatter gather segments and
virtio driver fails to send the frame down to host OS. With TSO/GSO feature
enabled max scatter gather segments can be 64, then 9K JUMBO frames are fine,
this is making virtio driver to support JUMBO frames only with TSO/GSO.
Increasing the VTNET_MIN_TX_SEGS which is the case for non TSO/GSO to 32 to
support upto 64K JUMBO frames to Host.
Ed Schouten [Sat, 29 Jul 2017 08:35:07 +0000 (08:35 +0000)]
Be a bit more liberal about sysctl naming.
On the systems on which I tested this exporter, I never ran into metrics
that were named in such a way that they couldn't be exported to
Prometheus metrics directly. Now it turns out that on systems with NUMA,
the sysctl tree contains metrics named dev.${driver}.${index}.%domain.
For these metrics, the % in the name is problematic, as Prometheus
doesn't allow this symbol to be used.
Remove the assertions that were originally put in place to prevent the
exporter from generating malformed output and add code to deal with it
accordingly. For metric names, convert any unsupported character to an
underscore. For label values, perform string escaping.
Rick Macklem [Sat, 29 Jul 2017 02:25:49 +0000 (02:25 +0000)]
Fix possible crash for the NFSv4.1 pNFS client.
If the nfsrpc_createlayoutrpc() call in nfsrpc_getcreatelayout() fails,
the code used nfhpp when it might be set NULL. This patch checks for
the error cases (laystat != 0) and avoids using nfhpp for the failure case.
This would only affect NFSv4.1 mounts with the "pnfs" option.
Found while testing the "umount -N" patch not yet in head.
Rick Macklem [Fri, 28 Jul 2017 21:07:57 +0000 (21:07 +0000)]
Modify /etc/rc.d/nfsd so it doesn't force a startup of nfsuserd for NFSv4.
Given that RFC7530 allows uid/gids to be placed in owner/owner_group
strings directly, many NFSv4 environments don't need the nfsuserd.
This small patch modified /etc/rc.d/nfsd so that it does not force
startup of the nfsuserd daemon unless nfs_server_managegids is enabled.
This implies that nfsuserd_enable="YES" must be added to /etc/rc.conf
for NFSv4 server environments that use Kerberos mounts or clients that
do not support the uid/gid in string capability.
Since this could be considered a POLA violation, it will not be MFC'd.
Pull in r308891 from upstream llvm trunk (by Benjamin Kramer):
[CodeGenPrepare] Cut off FindAllMemoryUses if there are too many uses.
This avoids excessive compile time. The case I'm looking at is
Function.cpp from an old version of LLVM that still had the giant
memcmp string matcher in it. Before r308322 this compiled in about 2
minutes, after it, clang takes infinite* time to compile it. With
this patch we're at 5 min, which is still bad but this is a
pathological case.
The cut off at 20 uses was chosen by looking at other cut-offs in LLVM
for user scanning. It's probably too high, but does the job and is
very unlikely to regress anything.
Fixes PR33900.
* I'm impatient and aborted after 15 minutes, on the bug report it was
killed after 2h.
Pull in r308986 from upstream llvm trunk (by Simon Pilgrim):
[X86][CGP] Reduce memcmp() expansion to 2 load pairs (PR33914)
D35067/rL308322 attempted to support up to 4 load pairs for memcmp
inlining which resulted in regressions for some optimized libc memcmp
implementations (PR33914).
Until we can match these more optimal cases, this patch reduces the
memcmp expansion to a maximum of 2 load pairs (which matches what we
do for -Os).
This patch should be considered for the 5.0.0 release branch as well
Sean Bruno [Fri, 28 Jul 2017 18:11:53 +0000 (18:11 +0000)]
binmiscctl should use modfind instead of kldfind
kldfind() only matches kernel modules, so if you link imgact_binmisc directly
into the kernel, binmiscctl can't find it, tries to load it, and errors
out with:
Can't load imgact_binmisc kernel module: File exists
A quick search of other base commands shows that the correct procedure is to
call modfind(), and then try kldload() if that fails.
PR: 218593
Submitted by: Dan Nelson <dnelson_1901@yahoo.com>
MFC after: 1 week
Marcin Wojtas [Fri, 28 Jul 2017 11:51:55 +0000 (11:51 +0000)]
Fix remapping VM attributes on Armada 38x
pmap_remap_vm_attr() function requires indexes to
pte2_attr_tab as the arguments (VM_MEMATTR_).
Mistakenly, instead of them, actual values from the
table were used (PTE2_ATTR_), when applying
work-around for Marvell Armada 38x SoCs.