brooks [Fri, 30 Mar 2018 18:50:13 +0000 (18:50 +0000)]
Use an accessor function to access ifr_data.
This fixes 32-bit compat (no ioctl command defintions are required
as struct ifreq is the same size). This is believed to be sufficent to
fully support ifconfig on 32-bit systems.
manu [Fri, 30 Mar 2018 16:37:08 +0000 (16:37 +0000)]
efinet: Do not return only if ReceiveFilter fails
If the network interface or the uefi implementation do not support the
ReceiveFilter interface do not return only and just print a message.
U-Boot doesn't support is and likely never will. Also even if this fails
it doesn't mean that network in EFI isn't supported.
ken [Fri, 30 Mar 2018 15:28:25 +0000 (15:28 +0000)]
Bring in the Broadcom/Emulex Fibre Channel driver, ocs_fc(4).
The ocs_fc(4) driver supports the following hardware:
Emulex 16/8G FC GEN 5 HBAS
LPe15004 FC Host Bus Adapters
LPe160XX FC Host Bus Adapters
Emulex 32/16G FC GEN 6 HBAS
LPe3100X FC Host Bus Adapters
LPe3200X FC Host Bus Adapters
The driver supports target and initiator mode, and also supports FC-Tape.
Note that the driver only currently works on little endian platforms. It
is only included in the module build for amd64 and i386, and in GENERIC
on amd64 only.
kib [Fri, 30 Mar 2018 10:55:31 +0000 (10:55 +0000)]
Make vm_map_max/min/pmap KBI stable.
There are out of tree consumers of vm_map_min() and vm_map_max(), and
I believe there are consumers of vm_map_pmap(), although the later is
arguably less in the need of KBI-stable interface. For the consumers
benefit, make modules using this KPI not depended on the struct vm_map
layout.
Reviewed by: alc, markj
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D14902
emaste [Fri, 30 Mar 2018 03:38:08 +0000 (03:38 +0000)]
makefs: sync fragment and block size with newfs
r222319 in newfs raised the default blocksize for UFS/FFS filesystems
from 16K to 32K and the default fragment size from 2K to 4K, with a
rationale that most disks were now running with 4K sectors.
MFC after: 2 weeks
Relnotes: Yes
Sponsored by: The FreeBSD Foundation
landonf [Thu, 29 Mar 2018 19:44:15 +0000 (19:44 +0000)]
bhnd(4): include a subset of the ChipCommon capability flags in bhnd_chipid;
this provides early access to device capability flags required by bhnd(4)
bus and bhndb(4) bridge drivers.
davidcs [Thu, 29 Mar 2018 17:36:34 +0000 (17:36 +0000)]
1. Add additional debug prints.
2. Break transmit when IFF_DRV_RUNNING is OFF.
3. set desc_count=0 for default case in switch in ql_rcv_isr()
MFC after:5 days
markj [Thu, 29 Mar 2018 17:19:59 +0000 (17:19 +0000)]
Have TD_LOCKS_DEC() assert that td_locks is positive.
This makes it easier to catch lock accounting bugs, since the problem
is otherwise only detected upon a return to user mode (or never, for
kernel threads).
brooks [Thu, 29 Mar 2018 15:58:49 +0000 (15:58 +0000)]
GC never enabled support for SIOCGADDRROM and SIOCGCHIPID.
When de(4) was imported in 1997 the world was not ready for these ioctls.
In over 20 years that hasn't changed so it seems safe to assume their
time will never come.
markj [Thu, 29 Mar 2018 14:27:40 +0000 (14:27 +0000)]
Fix the background laundering mechanism after r329882.
Rather than using the number of inactive queue scans as a metric for
how many clean pages are being freed by the page daemon, have the
page daemon keep a running counter of the number of pages it has freed,
and have the laundry thread use that when computing the background
laundering threshold.
jeff [Thu, 29 Mar 2018 02:54:50 +0000 (02:54 +0000)]
Implement several enhancements to NUMA policies.
Add a new "interleave" allocation policy which stripes pages across
domains with a stride or width keeping contiguity within a multi-page
region.
Move the kernel to the dedicated numbered cpuset #2 making it possible
to assign kernel threads and memory policy separately from user. This
also eliminates the need for the complicated interrupt binding code.
Add a sysctl API for viewing and manipulating domainsets. Refactor some
of the cpuset_t manipulation code using the generic bitset type so that
it can be used for both. This probably belongs in a dedicated subr file.
kevans [Thu, 29 Mar 2018 00:55:11 +0000 (00:55 +0000)]
stand: Add workaround for HP BIOS issues
hrs@ and kuriyama@ have found that on some HP BIOS, a system will fail to
boot immediately after installation with the claim that it can't work out
which disk they are booting from.
They tracked it down to a buffer overrun, and found that it could be
alleviated by doing a dummy read before-hand.
jhb [Thu, 29 Mar 2018 00:12:50 +0000 (00:12 +0000)]
Reformat the enum of syscall argument types.
List enum values on separate lines to minimize diffs as new types are
added. Split the enum values up into groups and use some simple sorting
within groups (scalar enums are sorted by size, then base, all other
groups are generally sorted alphabetically).
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: John Kennedy <john.kennedy@delphix.com>
Approved by: Garrett D'Amore <garrett@damore.org>
Author: Matt Ahrens <Matt.Ahrens@delphix.com>
With compressed ARC (6950) we use up to 25% of our CPU to decompress indirect
blocks, under a workload of random cached reads. To reduce this decompression
cost, we would like to increase the size of the dbuf cache so that more
indirect blocks can be stored uncompressed.
If we are caching entire large files of recordsize=8K, the indirect blocks
use 1/64th as much memory as the data blocks (assuming they have the same
compression ratio). We suggest making the dbuf cache be 1/32nd of all memory,
so that in this scenario we should be able to keep all the indirect blocks
decompressed in the dbuf cache. (We want it to be more than the 1/64th that
the indirect blocks would use because we need to cache other stuff in the
dbuf cache as well.)
In real world workloads, this won't help as dramatically as the example
above, but we think it's still worth it because the risk of decreasing
performance is low. The potential negative performance impact is that we
will be slightly reducing the size of the ARC (by ~3%).
Reviewed by: Dan Kimmel <dan.kimmel@delphix.com>
Reviewed by: Prashanth Sreenivasa <pks@delphix.com>
Reviewed by: Paul Dagnelie <pcd@delphix.com>
Reviewed by: Sanjay Nadkarni <sanjay.nadkarni@nexenta.com>
Reviewed by: Allan Jude <allanjude@freebsd.org>
Reviewed by: Igor Kozhukhov <igor@dilos.org>
Approved by: Garrett D'Amore <garrett@damore.org>
Author: George Wilson <george.wilson@delphix.com>
arc_loan_compressed_buf() increments arc_loaned_bytes by psize unconditionally
In the case of zfs_compressed_arc_enabled=0, when the buf is returned via
arc_return_buf(), if ARC_BUF_COMPRESSED(buf) is false, then arc_loaned_bytes
is decremented by lsize, not psize.
Switch to using arc_buf_size(buf), instead of psize, which will return
psize or lsize, depending on the result of ARC_BUF_COMPRESSED(buf).
Reviewed by: Matt Ahrens <matt@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Approved by: Garrett D'Amore <garrett@damore.org>
Author: Allan Jude <allanjude@freebsd.org>
We want to be able to pass various settings during import/open of a pool,
which are not only related to rewind. Instead of adding a new policy and
duplicate a bunch of code, we should just rename rewind_policy to a more
generic term like load_policy.
For instance, we'd like to set spa->spa_import_flags from the nvlist,
rather from a flags parameter passed to spa_import as in some cases we want
those flags not only for the import case, but also for the open case. One
such flag could be ZFS_IMPORT_MISSING_LOG (as used in zdb) which would
allow zfs to open a pool when logs are missing.
Reviewed by: Matt Ahrens <matt@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Author: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Matt Ahrens <matt@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Author: Pavel Zakharov <pavel.zakharov@delphix.com>
ztest failed with uncorrectable IO error despite having the fix for #7163.
Both sides of the mirror have CANT_OPEN_BAD_LABEL, which also distinguishes
it from that issue.
Definitely seems like a racing condition between the vdev_validate and spa_sync:
1. Thread A (spa_sync): vdev label is updated to latest txg
2. Thread B (vdev_validate): vdev label's txg is compared to spa_last_synced_txg and is ahead.
3. Thread A (spa_sync): spa_last_synced_txg is updated to latest txg.
Solution: do not check txg in vdev_validate unless config lock is held.
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Matt Ahrens <matthew.ahrens@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Author: Pavel Zakharov <pavel.zakharov@delphix.com>
The idea of Storage Pool Checkpoint (aka zpool checkpoint) deals with
exactly that. It can be thought of as a “pool-wide snapshot” (or a
variation of extreme rewind that doesn’t corrupt your data). It remembers
the entire state of the pool at the point that it was taken and the user
can revert back to it later or discard it. Its generic use case is an
administrator that is about to perform a set of destructive actions to ZFS
as part of a critical procedure. She takes a checkpoint of the pool before
performing the actions, then rewinds back to it if one of them fails or puts
the pool into an unexpected state. Otherwise, she discards it. With the
assumption that no one else is making modifications to ZFS, she basically
wraps all these actions into a “high-level transaction”.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: John Kennedy <john.kennedy@delphix.com>
Reviewed by: Dan Kimmel <dan.kimmel@delphix.com>
Approved by: Richard Lowe <richlowe@richlowe.net>
Author: Serapheim Dimitropoulos <serapheim.dimitro@delphix.com>
Reviewed by: Sebastien Roy <sebastien.roy@delphix.com>
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Gordon Ross <gordon.w.ross@gmail.com>
Approved by: Richard Lowe <richlowe@richlowe.net>
Author: Andrew Stormont <astormont@racktopsystems.com>
We do not have libfakekernel, but need to reduce code divergence.
jeff [Wed, 28 Mar 2018 18:47:35 +0000 (18:47 +0000)]
Restore r331606 with a bugfix to setup cpuset_domain[] earlier on all
platforms. Original commit message as follows:
Only use CPUs in the domain the device is attached to for default
assignment. Device drivers are able to override the default assignment
if they bind directly. There are severe performance penalties for
handling interrupts on remote CPUs and this should only be done in
very controlled circumstances.
The idea of Storage Pool Checkpoint (aka zpool checkpoint) deals with
exactly that. It can be thought of as a “pool-wide snapshot” (or a
variation of extreme rewind that doesn’t corrupt your data). It remembers
the entire state of the pool at the point that it was taken and the user
can revert back to it later or discard it. Its generic use case is an
administrator that is about to perform a set of destructive actions to ZFS
as part of a critical procedure. She takes a checkpoint of the pool before
performing the actions, then rewinds back to it if one of them fails or puts
the pool into an unexpected state. Otherwise, she discards it. With the
assumption that no one else is making modifications to ZFS, she basically
wraps all these actions into a “high-level transaction”.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: John Kennedy <john.kennedy@delphix.com>
Reviewed by: Dan Kimmel <dan.kimmel@delphix.com>
Approved by: Richard Lowe <richlowe@richlowe.net>
Author: Serapheim Dimitropoulos <serapheim.dimitro@delphix.com>
hselasky [Wed, 28 Mar 2018 17:39:23 +0000 (17:39 +0000)]
Fix for regression issue in USB keyboard driver after r304735.
A series of zero delay callouts can happen causing high CPU usage of the
timer subsystem when trying to repeat keys, because the time of the
absolute timeout is not moving forward. The condition clears when all
keys are released.
ae [Wed, 28 Mar 2018 12:44:28 +0000 (12:44 +0000)]
Rework ipfw rules parsing and printing code.
Introduce show_state structure to keep information about printed opcodes.
Split show_static_rule() function into several smaller functions. Make
parsing and printing opcodes into several passes. Each printed opcode
is marked in show_state structure and will be skipped in next passes.
Now show_static_rule() function is simple, it just prints each part
of rule separately: action, modifiers, proto, src and dst addresses,
options. The main goal of this change is avoiding occurrence of wrong
result of `ifpw show` command, that can not be parsed by ipfw(8).
Also now it is possible to make some simple static optimizations
by reordering of opcodes in the rule.
avg [Wed, 28 Mar 2018 08:55:31 +0000 (08:55 +0000)]
ZFS vn_rele_async: catch up with the use of refcount(9) for the vnode use count
It's not sufficient nor required to use the vnode interlock when
checking if we are going to drop the last use count as the code in
vputx() uses refcount (atomic) operations for both checking and
decrementing the use code. Apply the same method to vn_rele_async().
While here, remove vn_rele_inactive(), a wrapper around vrele() that
didn't add any value.
Also, the change required making vfs_refcount_release_if_not_last()
public. I've made vfs_refcount_acquire_if_not_zero() public as well.
They are in sys/refcount.h now. While making the move I've dropped the
vfs_ prefix.
mjg [Wed, 28 Mar 2018 03:15:42 +0000 (03:15 +0000)]
seq: disable preemption around seq_write_*
This is a long standing performance bug which happened to not cause trouble
in practice due to rather limited use of these primitives.
The read side expects the writer to finish soon(tm) hence it loops with one
pause in-between. But it is possible the writer gets preempted in which case
the waiting can take a long time, especially so if it got preempted by the
reader. In principle this may never clean itself up.
In the current kernel seq is only used to obtain stable fp + capabilities
state. In order for looping at least once to occur there has to be a
concurrent writer modifying the fd slot for the very fd we are trying to
read. That is, for any looping to occur in the first place the program has
to be multithreaded and be doing something fishy to begin with. As such,
the indefinite looping is rather hard to run into unless you really try
(and I did not).
brooks [Tue, 27 Mar 2018 21:14:39 +0000 (21:14 +0000)]
Don't access userspace directly from the kernel in nxge(4).
Update to what the previous code seemed to be doing via the correct
interfaces. Further issues exist in xge_ioctl_registers(), but this is
debugging code in a driver that has few users and they don't appear to
be crashes or leaks.
brooks [Tue, 27 Mar 2018 21:06:18 +0000 (21:06 +0000)]
Copy flags over ifr_union directly rather than via casts through ifr_data.
No functional change in practice. If the sbni driver supported
64-bit big-endian system, this would be an ABI changes, but it is
i386-only. The old version leaked a word of stack on 64-bit systems.
brooks [Tue, 27 Mar 2018 21:03:29 +0000 (21:03 +0000)]
Copy flags over ifr_union directly rather than via casts through ifr_data.
No functional change in practice. If the sbni driver supported
64-bit big-endian system, this would be an ABI changes, but it is
i386-only. The old version leaked a word of stack on 64-bit systems.
jhb [Tue, 27 Mar 2018 20:54:57 +0000 (20:54 +0000)]
Use the offload transmit queue to set flags on TLS connections.
Requests to modify the state of TLS connections need to be sent on the
same queue as TLS record transmit requests to ensure ordering.
However, in order to use the offload transmit queue in t4_set_tcb_field(),
the function needs to be updated to do proper flow control / credit
management when queueing a request to an offload queue. This required
passing a pointer to the toepcb itself to this function, so while here
remove the 'tid' and 'iqid' parameters and obtain those values from the
toepcb in t4_set_tcb_field() itself.
brooks [Tue, 27 Mar 2018 20:51:49 +0000 (20:51 +0000)]
Improve copy-and-pasted versions of SIOCGIFADDR.
The original implementation used a reference to ifr_data and a cast to
do the equivalent of accessing ifr_addr. This was copied multiple
times since 1996.
brooks [Tue, 27 Mar 2018 18:26:50 +0000 (18:26 +0000)]
Fix access to ifru_buffer on freebsd32.
Make all kernel accesses to ifru_buffer go via access functions
which take the process ABI into account and use an appropriate union
to access members in the correct place in struct ifreq.
kib [Tue, 27 Mar 2018 18:05:51 +0000 (18:05 +0000)]
Fix several leaks of kernel stack data through paddings.
It is random collection of fixes for issues not yet corrected,
reported at https://tsyrklevi.ch/clang_analyzer/freebsd_013017/. Many
issues from that list were already corrected. Most of them are for
compat32, old compat32 or affect both primary host ABI and compat32.
The freebsd32_kldstat(), for instance, was already fixed by using
malloc(M_ZERO). Patch includes correction to report the supplied
version back, which is just pedantic.
Reviewed by: brooks, emaste (previous version)
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D14868
cem [Tue, 27 Mar 2018 17:58:00 +0000 (17:58 +0000)]
opencrypto: Add mechanism to pass multiple crypto blocks to some ciphers
xforms that support processing of multiple blocks at a time (to support more
efficient modes, for example) can define the encrypt_ and decrypt_multi
interfaces. If these interfaces are not present, the generic cryptosoft
code falls back on the block-at-a-time encrypt/decrypt interfaces.
Stream ciphers may support arbitrarily sized inputs (equivalent to an input
block size of 1 byte) but may be more efficient if a larger block is passed.
eugen [Tue, 27 Mar 2018 17:37:08 +0000 (17:37 +0000)]
Fix instructions in the zfsboot manual page.
zfsloader(8) fails to probe a slice containing ZFS pool if its second sector
contains traces of BSD label (DISKMAGIC == 0x82564557).
Fix manual page to show working example erasing such traces.
PR: 226714
Approved by: avg (mentor)
MFC after: 3 days
kib [Tue, 27 Mar 2018 15:29:32 +0000 (15:29 +0000)]
Allow to specify PCP on packets not belonging to any VLAN.
According to 802.1Q-2014, VLAN tagged packets with VLAN id 0 should be
considered as untagged, and only PCP and DEI values from the VLAN tag
are meaningful. See for instance
https://www.cisco.com/c/en/us/td/docs/switches/connectedgrid/cg-switch-sw-master/software/configuration/guide/vlan0/b_vlan_0.html.
Make it possible to specify PCP value for outgoing packets on an
ethernet interface. When PCP is supplied, the tag is appended, VLAN
id set to 0, and PCP is filled by the supplied value. The code to do
VLAN tag encapsulation is refactored from the if_vlan.c and moved into
if_ethersubr.c.
Drivers might have issues with filtering VID 0 packets on
receive. This bug should be fixed for each driver.
trasz [Tue, 27 Mar 2018 14:54:02 +0000 (14:54 +0000)]
Add trailing slash for consistency.
For some reason, the other link - https://lists.FreeBSD.org/ - needs
the trailing slash, otherwise man(8) renders it in a weird way. No
idea why's that. At least try to be consistent. Revert it when the
other link gets fixed.
avg [Tue, 27 Mar 2018 14:31:42 +0000 (14:31 +0000)]
vfs_donmount: in certain cases try r/o mount if r/w mount fails
If the operation is not an update, if neither r/w nor r/o mode is
explicitly requested, if the error code hints at the possibility of the
media being read-only, and if the fallback is allowed, then we can try
to automatically downgrade to the readonly mode.
This is especially useful for auto-mounting of removable media that
sometimes can happen to be write-protected.
The fallback to r/o is not enabled by default. It can be requested on a
per-mount basis with a new mount option, 'autoro'. Or it can be
globally allowed by setting vfs.default_autoro.
jeff [Tue, 27 Mar 2018 03:37:04 +0000 (03:37 +0000)]
Only use CPUs in the domain the device is attached to for default
assignment. Device drivers are able to override the default assignment
if they bind directly. There are severe performance penalties for
handling interrupts on remote CPUs and this should only be done in
very controlled circumstances.
jeff [Tue, 27 Mar 2018 03:27:02 +0000 (03:27 +0000)]
Move vm_ndomains to vm.h where it can be used with a single header include
rather than requiring a half-dozen. Many non-vm files may want to know
the number of valid domains.
cem [Mon, 26 Mar 2018 23:54:59 +0000 (23:54 +0000)]
Update to Zstandard 1.3.4
Includes our local patch to conditionalize use of __builtin_clz(ll) on
Clang's __has_builtin() (which is just defined to false when building with
GCC).
The issue is tracked upstream at https://github.com/facebook/zstd/pull/884 .
Otherwise, these are vanilla Zstandard 1.3.4 files.
cem [Mon, 26 Mar 2018 20:30:07 +0000 (20:30 +0000)]
cryptodev: Match intent for enc_xform ciphers with blocksize != ivsize
No functional change for Skipjack, AES-ICM, Blowfish, CAST-128, Camellia,
DES3, Rijndael128, DES. All of these have identical IV and blocksizes
declared in the associated enc_xform.
Functional changes for:
* AES-GCM: block len of 1, IV len of 12
* AES-XTS: block len of 16, IV len of 8
* NULL: block len of 4, IV len of 0
For these, it seems like the IV specified in the enc_xform is correct (and
the blocksize used before was wrong).
Additionally, the not-yet-OCFed cipher Chacha20 has a logical block length
of 1 byte, and a 16 byte IV + nonce.
Rationalize references to IV lengths to refer to the declared ivsize, rather
than declared blocksize.
sbruno [Mon, 26 Mar 2018 19:53:36 +0000 (19:53 +0000)]
CC Cubic: fix underflow for cubic_cwnd()
Singed calculations in cubic_cwnd() can result in negative cwnd
value which is then cast to an unsigned value. Values less than
1 mss are generally bad for other parts of the code, also fixed.
cem [Mon, 26 Mar 2018 19:53:02 +0000 (19:53 +0000)]
vmci(4): Fix GCC build and rationalize vmci_kernel_defs.h
To fix the GCC build, remove multiple redundant declarations of
vmci_send_datagram() (the copy in vmci.h as well as the extern definition in
vmci_queue_pair.c were wholly redundant).
Also to fix the GCC build, include a non-empty format string in the vmci(4)
definition of ASSERT(). It seems harmless either way, but adding the
stringified invariant is easier than masking the warning.
The other vmci_kernel_defs.h changes are cosmetic and simply match macros to
existing definitions.
kevans [Mon, 26 Mar 2018 19:06:25 +0000 (19:06 +0000)]
lualoader: Actually re-raise error in try_include
It was previously only printed, but we do actually want to raise it as a
full blown error so that things don't look OK when they've actually gone
wrong.
The second parameter to error, level, is set to 2 here so that the error
message reflects the position of the try_include caller, rather than the
try_include itself. Example:
LUA ERROR: /boot/lua/loader.lua:46: /boot/lua/local.lua:1: attempt to call a
nil value (global 'cxcint').