Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: John Kennedy <john.kennedy@delphix.com>
Approved by: Garrett D'Amore <garrett@damore.org>
Author: Matt Ahrens <Matt.Ahrens@delphix.com>
With compressed ARC (6950) we use up to 25% of our CPU to decompress indirect
blocks, under a workload of random cached reads. To reduce this decompression
cost, we would like to increase the size of the dbuf cache so that more
indirect blocks can be stored uncompressed.
If we are caching entire large files of recordsize=8K, the indirect blocks
use 1/64th as much memory as the data blocks (assuming they have the same
compression ratio). We suggest making the dbuf cache be 1/32nd of all memory,
so that in this scenario we should be able to keep all the indirect blocks
decompressed in the dbuf cache. (We want it to be more than the 1/64th that
the indirect blocks would use because we need to cache other stuff in the
dbuf cache as well.)
In real world workloads, this won't help as dramatically as the example
above, but we think it's still worth it because the risk of decreasing
performance is low. The potential negative performance impact is that we
will be slightly reducing the size of the ARC (by ~3%).
Reviewed by: Dan Kimmel <dan.kimmel@delphix.com>
Reviewed by: Prashanth Sreenivasa <pks@delphix.com>
Reviewed by: Paul Dagnelie <pcd@delphix.com>
Reviewed by: Sanjay Nadkarni <sanjay.nadkarni@nexenta.com>
Reviewed by: Allan Jude <allanjude@freebsd.org>
Reviewed by: Igor Kozhukhov <igor@dilos.org>
Approved by: Garrett D'Amore <garrett@damore.org>
Author: George Wilson <george.wilson@delphix.com>
arc_loan_compressed_buf() increments arc_loaned_bytes by psize unconditionally
In the case of zfs_compressed_arc_enabled=0, when the buf is returned via
arc_return_buf(), if ARC_BUF_COMPRESSED(buf) is false, then arc_loaned_bytes
is decremented by lsize, not psize.
Switch to using arc_buf_size(buf), instead of psize, which will return
psize or lsize, depending on the result of ARC_BUF_COMPRESSED(buf).
Reviewed by: Matt Ahrens <matt@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Approved by: Garrett D'Amore <garrett@damore.org>
Author: Allan Jude <allanjude@freebsd.org>
We want to be able to pass various settings during import/open of a pool,
which are not only related to rewind. Instead of adding a new policy and
duplicate a bunch of code, we should just rename rewind_policy to a more
generic term like load_policy.
For instance, we'd like to set spa->spa_import_flags from the nvlist,
rather from a flags parameter passed to spa_import as in some cases we want
those flags not only for the import case, but also for the open case. One
such flag could be ZFS_IMPORT_MISSING_LOG (as used in zdb) which would
allow zfs to open a pool when logs are missing.
Reviewed by: Matt Ahrens <matt@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Author: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Matt Ahrens <matt@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Author: Pavel Zakharov <pavel.zakharov@delphix.com>
ztest failed with uncorrectable IO error despite having the fix for #7163.
Both sides of the mirror have CANT_OPEN_BAD_LABEL, which also distinguishes
it from that issue.
Definitely seems like a racing condition between the vdev_validate and spa_sync:
1. Thread A (spa_sync): vdev label is updated to latest txg
2. Thread B (vdev_validate): vdev label's txg is compared to spa_last_synced_txg and is ahead.
3. Thread A (spa_sync): spa_last_synced_txg is updated to latest txg.
Solution: do not check txg in vdev_validate unless config lock is held.
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Matt Ahrens <matthew.ahrens@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Author: Pavel Zakharov <pavel.zakharov@delphix.com>
The idea of Storage Pool Checkpoint (aka zpool checkpoint) deals with
exactly that. It can be thought of as a “pool-wide snapshot” (or a
variation of extreme rewind that doesn’t corrupt your data). It remembers
the entire state of the pool at the point that it was taken and the user
can revert back to it later or discard it. Its generic use case is an
administrator that is about to perform a set of destructive actions to ZFS
as part of a critical procedure. She takes a checkpoint of the pool before
performing the actions, then rewinds back to it if one of them fails or puts
the pool into an unexpected state. Otherwise, she discards it. With the
assumption that no one else is making modifications to ZFS, she basically
wraps all these actions into a “high-level transaction”.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: John Kennedy <john.kennedy@delphix.com>
Reviewed by: Dan Kimmel <dan.kimmel@delphix.com>
Approved by: Richard Lowe <richlowe@richlowe.net>
Author: Serapheim Dimitropoulos <serapheim.dimitro@delphix.com>
Reviewed by: Sebastien Roy <sebastien.roy@delphix.com>
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Gordon Ross <gordon.w.ross@gmail.com>
Approved by: Richard Lowe <richlowe@richlowe.net>
Author: Andrew Stormont <astormont@racktopsystems.com>
We do not have libfakekernel, but need to reduce code divergence.
jeff [Wed, 28 Mar 2018 18:47:35 +0000 (18:47 +0000)]
Restore r331606 with a bugfix to setup cpuset_domain[] earlier on all
platforms. Original commit message as follows:
Only use CPUs in the domain the device is attached to for default
assignment. Device drivers are able to override the default assignment
if they bind directly. There are severe performance penalties for
handling interrupts on remote CPUs and this should only be done in
very controlled circumstances.
The idea of Storage Pool Checkpoint (aka zpool checkpoint) deals with
exactly that. It can be thought of as a “pool-wide snapshot” (or a
variation of extreme rewind that doesn’t corrupt your data). It remembers
the entire state of the pool at the point that it was taken and the user
can revert back to it later or discard it. Its generic use case is an
administrator that is about to perform a set of destructive actions to ZFS
as part of a critical procedure. She takes a checkpoint of the pool before
performing the actions, then rewinds back to it if one of them fails or puts
the pool into an unexpected state. Otherwise, she discards it. With the
assumption that no one else is making modifications to ZFS, she basically
wraps all these actions into a “high-level transaction”.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: John Kennedy <john.kennedy@delphix.com>
Reviewed by: Dan Kimmel <dan.kimmel@delphix.com>
Approved by: Richard Lowe <richlowe@richlowe.net>
Author: Serapheim Dimitropoulos <serapheim.dimitro@delphix.com>
hselasky [Wed, 28 Mar 2018 17:39:23 +0000 (17:39 +0000)]
Fix for regression issue in USB keyboard driver after r304735.
A series of zero delay callouts can happen causing high CPU usage of the
timer subsystem when trying to repeat keys, because the time of the
absolute timeout is not moving forward. The condition clears when all
keys are released.
ae [Wed, 28 Mar 2018 12:44:28 +0000 (12:44 +0000)]
Rework ipfw rules parsing and printing code.
Introduce show_state structure to keep information about printed opcodes.
Split show_static_rule() function into several smaller functions. Make
parsing and printing opcodes into several passes. Each printed opcode
is marked in show_state structure and will be skipped in next passes.
Now show_static_rule() function is simple, it just prints each part
of rule separately: action, modifiers, proto, src and dst addresses,
options. The main goal of this change is avoiding occurrence of wrong
result of `ifpw show` command, that can not be parsed by ipfw(8).
Also now it is possible to make some simple static optimizations
by reordering of opcodes in the rule.
avg [Wed, 28 Mar 2018 08:55:31 +0000 (08:55 +0000)]
ZFS vn_rele_async: catch up with the use of refcount(9) for the vnode use count
It's not sufficient nor required to use the vnode interlock when
checking if we are going to drop the last use count as the code in
vputx() uses refcount (atomic) operations for both checking and
decrementing the use code. Apply the same method to vn_rele_async().
While here, remove vn_rele_inactive(), a wrapper around vrele() that
didn't add any value.
Also, the change required making vfs_refcount_release_if_not_last()
public. I've made vfs_refcount_acquire_if_not_zero() public as well.
They are in sys/refcount.h now. While making the move I've dropped the
vfs_ prefix.
mjg [Wed, 28 Mar 2018 03:15:42 +0000 (03:15 +0000)]
seq: disable preemption around seq_write_*
This is a long standing performance bug which happened to not cause trouble
in practice due to rather limited use of these primitives.
The read side expects the writer to finish soon(tm) hence it loops with one
pause in-between. But it is possible the writer gets preempted in which case
the waiting can take a long time, especially so if it got preempted by the
reader. In principle this may never clean itself up.
In the current kernel seq is only used to obtain stable fp + capabilities
state. In order for looping at least once to occur there has to be a
concurrent writer modifying the fd slot for the very fd we are trying to
read. That is, for any looping to occur in the first place the program has
to be multithreaded and be doing something fishy to begin with. As such,
the indefinite looping is rather hard to run into unless you really try
(and I did not).
brooks [Tue, 27 Mar 2018 21:14:39 +0000 (21:14 +0000)]
Don't access userspace directly from the kernel in nxge(4).
Update to what the previous code seemed to be doing via the correct
interfaces. Further issues exist in xge_ioctl_registers(), but this is
debugging code in a driver that has few users and they don't appear to
be crashes or leaks.
brooks [Tue, 27 Mar 2018 21:06:18 +0000 (21:06 +0000)]
Copy flags over ifr_union directly rather than via casts through ifr_data.
No functional change in practice. If the sbni driver supported
64-bit big-endian system, this would be an ABI changes, but it is
i386-only. The old version leaked a word of stack on 64-bit systems.
brooks [Tue, 27 Mar 2018 21:03:29 +0000 (21:03 +0000)]
Copy flags over ifr_union directly rather than via casts through ifr_data.
No functional change in practice. If the sbni driver supported
64-bit big-endian system, this would be an ABI changes, but it is
i386-only. The old version leaked a word of stack on 64-bit systems.
jhb [Tue, 27 Mar 2018 20:54:57 +0000 (20:54 +0000)]
Use the offload transmit queue to set flags on TLS connections.
Requests to modify the state of TLS connections need to be sent on the
same queue as TLS record transmit requests to ensure ordering.
However, in order to use the offload transmit queue in t4_set_tcb_field(),
the function needs to be updated to do proper flow control / credit
management when queueing a request to an offload queue. This required
passing a pointer to the toepcb itself to this function, so while here
remove the 'tid' and 'iqid' parameters and obtain those values from the
toepcb in t4_set_tcb_field() itself.
brooks [Tue, 27 Mar 2018 20:51:49 +0000 (20:51 +0000)]
Improve copy-and-pasted versions of SIOCGIFADDR.
The original implementation used a reference to ifr_data and a cast to
do the equivalent of accessing ifr_addr. This was copied multiple
times since 1996.
brooks [Tue, 27 Mar 2018 18:26:50 +0000 (18:26 +0000)]
Fix access to ifru_buffer on freebsd32.
Make all kernel accesses to ifru_buffer go via access functions
which take the process ABI into account and use an appropriate union
to access members in the correct place in struct ifreq.
kib [Tue, 27 Mar 2018 18:05:51 +0000 (18:05 +0000)]
Fix several leaks of kernel stack data through paddings.
It is random collection of fixes for issues not yet corrected,
reported at https://tsyrklevi.ch/clang_analyzer/freebsd_013017/. Many
issues from that list were already corrected. Most of them are for
compat32, old compat32 or affect both primary host ABI and compat32.
The freebsd32_kldstat(), for instance, was already fixed by using
malloc(M_ZERO). Patch includes correction to report the supplied
version back, which is just pedantic.
Reviewed by: brooks, emaste (previous version)
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D14868
cem [Tue, 27 Mar 2018 17:58:00 +0000 (17:58 +0000)]
opencrypto: Add mechanism to pass multiple crypto blocks to some ciphers
xforms that support processing of multiple blocks at a time (to support more
efficient modes, for example) can define the encrypt_ and decrypt_multi
interfaces. If these interfaces are not present, the generic cryptosoft
code falls back on the block-at-a-time encrypt/decrypt interfaces.
Stream ciphers may support arbitrarily sized inputs (equivalent to an input
block size of 1 byte) but may be more efficient if a larger block is passed.
eugen [Tue, 27 Mar 2018 17:37:08 +0000 (17:37 +0000)]
Fix instructions in the zfsboot manual page.
zfsloader(8) fails to probe a slice containing ZFS pool if its second sector
contains traces of BSD label (DISKMAGIC == 0x82564557).
Fix manual page to show working example erasing such traces.
PR: 226714
Approved by: avg (mentor)
MFC after: 3 days
kib [Tue, 27 Mar 2018 15:29:32 +0000 (15:29 +0000)]
Allow to specify PCP on packets not belonging to any VLAN.
According to 802.1Q-2014, VLAN tagged packets with VLAN id 0 should be
considered as untagged, and only PCP and DEI values from the VLAN tag
are meaningful. See for instance
https://www.cisco.com/c/en/us/td/docs/switches/connectedgrid/cg-switch-sw-master/software/configuration/guide/vlan0/b_vlan_0.html.
Make it possible to specify PCP value for outgoing packets on an
ethernet interface. When PCP is supplied, the tag is appended, VLAN
id set to 0, and PCP is filled by the supplied value. The code to do
VLAN tag encapsulation is refactored from the if_vlan.c and moved into
if_ethersubr.c.
Drivers might have issues with filtering VID 0 packets on
receive. This bug should be fixed for each driver.
trasz [Tue, 27 Mar 2018 14:54:02 +0000 (14:54 +0000)]
Add trailing slash for consistency.
For some reason, the other link - https://lists.FreeBSD.org/ - needs
the trailing slash, otherwise man(8) renders it in a weird way. No
idea why's that. At least try to be consistent. Revert it when the
other link gets fixed.
avg [Tue, 27 Mar 2018 14:31:42 +0000 (14:31 +0000)]
vfs_donmount: in certain cases try r/o mount if r/w mount fails
If the operation is not an update, if neither r/w nor r/o mode is
explicitly requested, if the error code hints at the possibility of the
media being read-only, and if the fallback is allowed, then we can try
to automatically downgrade to the readonly mode.
This is especially useful for auto-mounting of removable media that
sometimes can happen to be write-protected.
The fallback to r/o is not enabled by default. It can be requested on a
per-mount basis with a new mount option, 'autoro'. Or it can be
globally allowed by setting vfs.default_autoro.
jeff [Tue, 27 Mar 2018 03:37:04 +0000 (03:37 +0000)]
Only use CPUs in the domain the device is attached to for default
assignment. Device drivers are able to override the default assignment
if they bind directly. There are severe performance penalties for
handling interrupts on remote CPUs and this should only be done in
very controlled circumstances.
jeff [Tue, 27 Mar 2018 03:27:02 +0000 (03:27 +0000)]
Move vm_ndomains to vm.h where it can be used with a single header include
rather than requiring a half-dozen. Many non-vm files may want to know
the number of valid domains.
cem [Mon, 26 Mar 2018 23:54:59 +0000 (23:54 +0000)]
Update to Zstandard 1.3.4
Includes our local patch to conditionalize use of __builtin_clz(ll) on
Clang's __has_builtin() (which is just defined to false when building with
GCC).
The issue is tracked upstream at https://github.com/facebook/zstd/pull/884 .
Otherwise, these are vanilla Zstandard 1.3.4 files.
cem [Mon, 26 Mar 2018 20:30:07 +0000 (20:30 +0000)]
cryptodev: Match intent for enc_xform ciphers with blocksize != ivsize
No functional change for Skipjack, AES-ICM, Blowfish, CAST-128, Camellia,
DES3, Rijndael128, DES. All of these have identical IV and blocksizes
declared in the associated enc_xform.
Functional changes for:
* AES-GCM: block len of 1, IV len of 12
* AES-XTS: block len of 16, IV len of 8
* NULL: block len of 4, IV len of 0
For these, it seems like the IV specified in the enc_xform is correct (and
the blocksize used before was wrong).
Additionally, the not-yet-OCFed cipher Chacha20 has a logical block length
of 1 byte, and a 16 byte IV + nonce.
Rationalize references to IV lengths to refer to the declared ivsize, rather
than declared blocksize.
sbruno [Mon, 26 Mar 2018 19:53:36 +0000 (19:53 +0000)]
CC Cubic: fix underflow for cubic_cwnd()
Singed calculations in cubic_cwnd() can result in negative cwnd
value which is then cast to an unsigned value. Values less than
1 mss are generally bad for other parts of the code, also fixed.
cem [Mon, 26 Mar 2018 19:53:02 +0000 (19:53 +0000)]
vmci(4): Fix GCC build and rationalize vmci_kernel_defs.h
To fix the GCC build, remove multiple redundant declarations of
vmci_send_datagram() (the copy in vmci.h as well as the extern definition in
vmci_queue_pair.c were wholly redundant).
Also to fix the GCC build, include a non-empty format string in the vmci(4)
definition of ASSERT(). It seems harmless either way, but adding the
stringified invariant is easier than masking the warning.
The other vmci_kernel_defs.h changes are cosmetic and simply match macros to
existing definitions.
kevans [Mon, 26 Mar 2018 19:06:25 +0000 (19:06 +0000)]
lualoader: Actually re-raise error in try_include
It was previously only printed, but we do actually want to raise it as a
full blown error so that things don't look OK when they've actually gone
wrong.
The second parameter to error, level, is set to 2 here so that the error
message reflects the position of the try_include caller, rather than the
try_include itself. Example:
LUA ERROR: /boot/lua/loader.lua:46: /boot/lua/local.lua:1: attempt to call a
nil value (global 'cxcint').
kevans [Mon, 26 Mar 2018 19:01:22 +0000 (19:01 +0000)]
lualoader: Implement try_include and use it for including the local module
This provides a way to optionally include a module without having to wrap it
in filesystem checks. try_include is a little more robust, using the lua
search path instead of forcing us to explicitly consider all of the places
we could want to include a module. Errors are still generally raised from
trying to load the module, but ENOENT will not get raised unless we're doing
a verbose load.
This will also be used to split out logo/brand graphics into their own files
so that we can safely scale up the number of graphics included without
worrying about the extra memory consumption- opting to lazily load graphics
instead.
Reviewed by: cem
Differential Revision: https://reviews.freebsd.org/D14658
manu [Mon, 26 Mar 2018 18:39:38 +0000 (18:39 +0000)]
release: arm: Copy boot.scr from ports
Latest u-boot update need u-boot script to load and start ubldr.
(See D14230 for more details)
Copy this file for our arm release on the fat partition.
kib [Mon, 26 Mar 2018 16:31:12 +0000 (16:31 +0000)]
Allow to specify for vm_fault_quick_hold_pages() that nofault mode
should be honored.
We must not sleep or acquire any MI VM locks if TDP_NOFAULTING is
specified. On the other hand, there were some callers in the tree
which set TDP_NOFAULTING for larger scope than needed, I fixed the
code which I wrote, but I suspect that linuxkpi and out of tree drm
drivers might abuse this still.
So only enable the mode for vm_fault_quick_hold_pages() where
vm_fault_hold() is not called when specifically asked by user. I
decided to use vm_prot_t flag to not change KPI. Since number of
flags in vm_prot_t is limited, I reused the same flag which was
already consumed for vm_map_lookup().
Reported and tested by: pho (as part of the larger patch)
Reviewed by: markj
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D14825
kevans [Mon, 26 Mar 2018 13:45:17 +0000 (13:45 +0000)]
loader efifb: implement uga_autoresize as a call to text_autoresize
UGA does not have the same kind of mode enumeration that GOP does. Implement
it instead as a call to text_autoresize so that firmwares with only UGA
present still get some kind of autoresizing behavior.
While here, rename a typo'd "gop" to "uga", although it will remain unused
for the time being.
markj [Sun, 25 Mar 2018 23:23:19 +0000 (23:23 +0000)]
Clamp IFLIB_RX_COPY_THRESH to MHLEN in iflib_rxd_pkt_get().
If one has added fields to struct mbuf such that MHLEN is smaller than
this threshold (128), iflib_rxd_pkt_get() may otherwise overrun the
internal mbuf buffer while copying.
imp [Sun, 25 Mar 2018 16:56:49 +0000 (16:56 +0000)]
The PNP info has to follow the module definition. Move it from just
after the array to its proper location. Otherwise, the linker.hints
file has things out of order and we associated it with whatever was
the previous module.
mp [Sun, 25 Mar 2018 00:57:00 +0000 (00:57 +0000)]
Add VMCI (Virtual Machine Communication Interface) driver
In a virtual machine, VMCI is exposed as a regular PCI device. The primary
communication mechanisms supported are a point-to-point bidirectional
transport based on a pair of memory-mapped queues, and asynchronous
notifications in the form of datagrams and doorbells. These features are
available to kernel level components such as vSockets through the VMCI
kernel API. In addition to this, the VMCI kernel API provides support for
receiving events related to the state of the VMCI communication channels,
and the virtual machine itself.
manu [Sat, 24 Mar 2018 21:30:24 +0000 (21:30 +0000)]
Add dtb overlays support
DTB Overlays are useful to change/add nodes to a dtb without the need to
modify it.
Add support for building dtbo during buildkernel.
The goal of DTBO present in the FreeBSD source tree is to fill a gap in
time when we submit changes upstream (Linux). Instead of waiting 2 to 4 months
we can add a DTBO in tree in the meantime.
This is not for adding DTBO for capes/hat/addon boards, those will be
better to put in a ports.
This is also not for enabling a i2c/spi/pwm controller on certain pins,
each user have a different use case for those (which pins to use etc ...)
and we cannot have all possible configuration.
Add a dtbo for sun8i-h3-sid which add the SID node missing in upstream dts.
jtl [Sat, 24 Mar 2018 13:18:09 +0000 (13:18 +0000)]
This change adds a flag to the DAD entry to indicate whether it is
currently on the queue. This prevents accidentally doubly-removing a DAD
entry from the queue, while also simplifying some of the logic in
nd6_dad_stop().
kib [Sat, 24 Mar 2018 12:57:58 +0000 (12:57 +0000)]
Improve the lcall $7,$0 syscall emulation on amd64.
Current code, which copies the potential syscall arguments into the
current frame, puts an arbitrary limit on the number of syscall
arguments. Apparently, mmap(2) and lseek(2) (?) require larger
number. But there is an issue that stack is only need to be mapped to
contain the number of arguments required by the syscall, so copying
arbitrary large number of words from the stack is not completely safe.
Use different approach to convert lcall frame into int $0x80 frame in
place, by doing the retl in kernel. This also allows to stop proceed
vfork case specially, and stop making assumptions about %cs at the
syscall time.
Also, improve comments with the formulations provided by bde.
Reviewed and tested by: bde
Sponsored by: The FreeBSD Foundation
MFC after: 1 week