Mark Johnston [Mon, 19 Oct 2020 16:57:40 +0000 (16:57 +0000)]
uma: Respect uk_reserve in keg_drain()
When a reserve of free items is configured for a zone, the reserve must
not be reclaimed under memory pressure. Modify keg_drain() to simply
respect the reserved pool.
While here remove an always-false uk_freef == NULL check (kegs that
shouldn't be drained should set _NOFREE instead), and make sure that the
keg_drain() KTR statement does not reference an uninitialized variable.
Reviewed by: alc, rlibby
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D26772
Mark Johnston [Mon, 19 Oct 2020 16:55:03 +0000 (16:55 +0000)]
uma: Avoid depleting keg reserves when filling a bucket
zone_import() fetches a free or partially free slab from the keg and
then uses its items to populate an array, typically filling a bucket.
If a single allocation causes the keg to drop below its minimum reserve,
the inner loop ends. However, if the bucket is still not full and
M_USE_RESERVE is specified, the outer loop will continue to fetch items
from the keg.
If M_USE_RESERVE is specified and the number of free items is below the
reserved limit, we should return only a single item. Otherwise, if the
bucket size is larger than the reserve, all of the reserved items may
end up in a single per-CPU bucket, invisible to other CPUs.
Reviewed by: rlibby
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D26771
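As a rough illustration of the rule above, the following C sketch models a keg as two counters and fills a bucket from it. struct toy_keg and toy_zone_import() are hypothetical stand-ins, not UMA's structures or functions. The real zone_import() works on slabs rather than bare counters, and treating "at or below the reserve" as the cutoff is an assumption.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a keg: only the counters matter here. */
struct toy_keg {
	size_t free_items;	/* items currently free in the keg */
	size_t reserve;		/* items kept back for M_USE_RESERVE callers */
};

/*
 * Move up to 'max' items from the keg into a bucket and return how many
 * were taken.  Callers without use_reserve never dip into the reserve.
 * Callers with use_reserve get exactly one item once the keg is at or
 * below its reserve, so the reserve cannot vanish into one bucket.
 */
static size_t
toy_zone_import(struct toy_keg *keg, size_t max, bool use_reserve)
{
	size_t n = 0;

	assert(keg != NULL);
	while (n < max && keg->free_items > 0) {
		if (keg->free_items <= keg->reserve) {
			/* At or below the reserve. */
			if (!use_reserve || n > 0)
				break;
			/* Hand out a single reserved item and stop. */
			keg->free_items--;
			return (1);
		}
		keg->free_items--;
		n++;
	}
	return (n);
}
```

With free_items = 10 and reserve = 4, an ordinary request for 8 items gets 6. A following M_USE_RESERVE-style request gets only 1.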
Mark Johnston [Mon, 19 Oct 2020 16:54:06 +0000 (16:54 +0000)]
vmem: Allocate btags before looping in vmem_xalloc()
BT_MAXALLOC (4) is the number of boundary tags required to complete an
allocation in the worst case: two to clip a free segment, and two to
import from a parent arena. vmem_xalloc() preallocates four boundary
tags before attempting a search to simplify the segment allocation code.
It implements a loop that:
1) ensures that BT_MAXALLOC boundary tags are available,
2) attempts to find and clip a free segment satisfying the allocation
constraints, and failing that,
3) attempts to import a segment.
On !UMA_MD_SMALL_ALLOC platforms the btag zone has to handle recursion:
it needs boundary tags to allocate boundary tags. Thus we reserve
2 * BT_MAXALLOC * mp_ncpus tags for use when recursing: the factor of 2
is because there are two layers of vmem arenas, the per-domain arena and
global arena. For a single thread, 2 * BT_MAXALLOC tags should be
sufficient.
Because of the way the loop is structured, BT_MAXALLOC tags are not
sufficient. The first bt_fill() call may allocate BT_MAXALLOC tags,
then import a segment (consuming two tags), then attempt to top up the
preallocation before carving into the imported free segment, thus
requiring up to six tags in the worst case. Because we don't
preallocate that many, this bug can cause deadlocks in rare scenarios.
Fix the problem by moving the preallocation out of the loop. This assumes
that only a single import is ever required to satisfy an allocation
request.
Thanks to manu, emaste and lwhsu for helping test debug patches.
Reported by: Jenkins (hardware CI lab)
Reviewed by: alc, kib, rlibby
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D26770
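A toy count of boundary-tag allocations makes the six-versus-four arithmetic above concrete. It assumes a request that needs exactly one import (two tags) followed by one clip (two tags). toy_bt_fill() and tags_allocated() are hypothetical helpers, not vmem's bt_fill().

```c
#include <assert.h>
#include <stdbool.h>

#define BT_MAXALLOC 4	/* worst-case tags per allocation, per the text */

/* Hypothetical: top the local pool up to BT_MAXALLOC; return tags added. */
static int
toy_bt_fill(int *avail)
{
	int added = BT_MAXALLOC - *avail;

	*avail = BT_MAXALLOC;
	return (added);
}

/*
 * Count tags allocated for a request that needs one import (2 tags)
 * and then one clip (2 tags).  fill_in_loop selects the old structure
 * (top up at the start of every pass) or the new one (fill once).
 */
static int
tags_allocated(bool fill_in_loop)
{
	int avail = 0, total = 0;
	bool imported = false;

	if (!fill_in_loop)
		total += toy_bt_fill(&avail);	/* new: fill once, up front */
	for (;;) {
		if (fill_in_loop)
			total += toy_bt_fill(&avail);	/* old: top up each pass */
		avail -= 2;		/* first pass imports, second pass clips */
		assert(avail >= 0);
		if (imported)
			break;		/* that was the clip */
		imported = true;	/* that was the import */
	}
	return (total);
}
```

tags_allocated(true) returns 6 and tags_allocated(false) returns 4, which matches the worst cases described above.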
Mark Johnston [Mon, 19 Oct 2020 15:24:35 +0000 (15:24 +0000)]
vmx: Implement pmap (de)activation in C
Rewrite the code that maintains pm_active and invalidates EPTP-tagged
TLB entries in C. Previously this work was done in vmx_enter_guest(),
in assembly, but there is no good reason for that and it makes the TLB
invalidation algorithm for nested page tables harder to review.
No functional change intended, except that an error from the invept
instruction now results in a kernel panic rather than a vmexit. Such
errors should occur only as a result of VMM bugs.
Reviewed by: grehan, kib
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D26830
Andrew Turner [Mon, 19 Oct 2020 12:06:16 +0000 (12:06 +0000)]
Move the arm64 userspace access checks to macros
In the functions that copy between userspace and kernel space, we check that
the userspace address is valid before performing the copy. These checks are
mostly identical within each type of function, so create two macros to perform
the check.
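The kind of range check these macros perform can be written in C roughly as follows. This is not the arm64 assembly itself. TOY_VM_MAXUSER_ADDRESS is a placeholder limit chosen for illustration. Comparing the length against the remaining space avoids overflowing when computing addr + len.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical (exclusive) top of the user address space. */
#define TOY_VM_MAXUSER_ADDRESS	UINT64_C(0x0001000000000000)

/*
 * True if [addr, addr + len) lies entirely below the user limit.
 * Checking len against the room left above addr never computes
 * addr + len, so a huge addr or len cannot wrap around.
 */
#define TOY_USER_RANGE_OK(addr, len)					\
	((uint64_t)(addr) <= TOY_VM_MAXUSER_ADDRESS &&			\
	    (uint64_t)(len) <= TOY_VM_MAXUSER_ADDRESS - (uint64_t)(addr))
```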
Kyle Evans [Sun, 18 Oct 2020 23:42:00 +0000 (23:42 +0000)]
libbe(3): document be_snapshot()
While toying around with lua bindings for libbe(3), I discovered that I
apparently never documented this, despite having documented
be_is_auto_snapshot_name that references it.
Bjoern A. Zeeb [Sun, 18 Oct 2020 21:34:04 +0000 (21:34 +0000)]
net80211: factor out the priv(9) checks into OS-specific code.
Factor out the priv(9) checks into OS-specific code so other OSes can equally
implement them. This sorts out those XXX in the net80211 code.
We provide 3 arguments (cmd, vap, ifp) where available to the functions, in
order to allow other OSes to use that data, and also so that the information
is available should we add auditing to these checks. For now the arguments are
marked __unused.
Alex Richardson [Sun, 18 Oct 2020 18:35:23 +0000 (18:35 +0000)]
Significantly speed up mkimg_test
It turns out that the majority of the test time for the mkimg tests isn't
mkimg itself but rather the use of jot and hexdump, which can be quite slow
on emulated platforms such as QEMU.
On QEMU-RISC-V this reduces the time for `kyua test mkimg_test` from 655
seconds to 200. And for CheriBSD on QEMU-CHERI this saves 4-5 hours (25%
of the time for the entire testsuite!) since jot ends up triggering slow
functions inside the QEMU emulation a lot.
Implement flowid calculation for outbound connections to balance
connections over multiple paths.
Multipath routing relies on mbuf flowid data for both transit
and outbound traffic. The current code fills the mbuf flowid from inp_flowid
for connection-oriented sockets. However, inp_flowid is currently
not calculated for outbound connections.
This change creates simple hashing functions and starts calculating hashes
for TCP, UDP/UDP-Lite and raw IP if multipath routes are present in the
system.
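Here is a minimal sketch of the idea, assuming IPv4 and using 32-bit FNV-1a purely as a stand-in hash. The kernel's actual hash functions are not shown here. Hashing the 4-tuple and protocol gives every packet of a connection the same flowid, so a multipath route keeps that connection on one path while different connections spread out. toy_flowid_v4() and toy_pick_path() are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>

/* Fold the 4 bytes of v into a 32-bit FNV-1a state (stand-in hash). */
static uint32_t
fnv1a_u32(uint32_t h, uint32_t v)
{
	for (int i = 0; i < 4; i++) {
		h ^= (v >> (i * 8)) & 0xff;
		h *= 16777619u;		/* FNV-1a 32-bit prime */
	}
	return (h);
}

/* Hypothetical flowid for an IPv4 connection. */
static uint32_t
toy_flowid_v4(uint32_t src, uint32_t dst, uint16_t sport, uint16_t dport,
    uint8_t proto)
{
	uint32_t h = 2166136261u;	/* FNV-1a 32-bit offset basis */

	h = fnv1a_u32(h, src);
	h = fnv1a_u32(h, dst);
	h = fnv1a_u32(h, ((uint32_t)sport << 16) | dport);
	h = fnv1a_u32(h, proto);
	return (h);
}

/* Hypothetical next-hop choice: same flowid, same path. */
static unsigned
toy_pick_path(uint32_t flowid, unsigned npaths)
{
	assert(npaths > 0);
	return (flowid % npaths);
}
```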
If the SIM freezes the queue at exactly the wrong moment, after
another thread has started to send in a CCB and has already checked
that the queue wasn't frozen, iscsi_action() would end up being
called even though the queue is now frozen.
Add a check to make sure this doesn't happen. Perhaps this should
be fixed at the CAM level instead, but given that the send queue and
the SIM are governed by two separate mutexes, that is somewhat hard to do.
Reviewed by: imp, mav
MFC after: 2 weeks
Sponsored by: NetApp, Inc.
Sponsored by: Klara, Inc.
Differential Revision: https://reviews.freebsd.org/D26750
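The general shape of such a fix is to re-check the frozen state under the lock at the point of dispatch, rather than trusting an earlier check. This userland sketch uses a pthread mutex. struct toy_session and toy_action() are hypothetical. A real driver would hand the request back for requeueing rather than just bump a counter.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical session state; names are illustrative only. */
struct toy_session {
	pthread_mutex_t	lock;		/* protects all fields below */
	bool		frozen;		/* set when the SIM freezes its queue */
	int		sent;		/* requests dispatched */
	int		requeued;	/* requests bounced back for retry */
};

/*
 * Re-check the frozen flag under the lock right before dispatching,
 * so a freeze that raced with the caller's earlier check is honored.
 */
static bool
toy_action(struct toy_session *s)
{
	bool dispatched;

	pthread_mutex_lock(&s->lock);
	if (s->frozen) {
		s->requeued++;		/* caller retries after release */
		dispatched = false;
	} else {
		s->sent++;
		dispatched = true;
	}
	pthread_mutex_unlock(&s->lock);
	return (dispatched);
}
```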
Bjoern A. Zeeb [Sun, 18 Oct 2020 00:27:20 +0000 (00:27 +0000)]
net80211: update for (more) VHT160 support
Implement two macros, IEEE80211_VHTCAP_SUPP_CHAN_WIDTH_IS_160MHZ()
and its 80+80 counterpart, to check vhtcaps for the appropriate
levels of support, and use the macros throughout the code.
Add vht160_chan_ranges/is_vht160_valid_freq and handle them analogously
to vht80 in various parts of the code.
Add ieee80211_add_channel_cbw(), which also takes the CBW flag
fields, and make the former ieee80211_add_channel() a wrapper around it.
With the CBW flags we can add HT/VHT channels by passing them to
getflags() for the 2/5 GHz functions.
In ifconfig(8), add regdomain_addchans() support for VHT160
and VHT80P80.
With this (+ regdomain.xml updates), VHT160 channels can be
configured, listed, and pass regdomain where appropriate.
Tested with: iwlwifi
Reviewed by: adrian
MFC after: 10 days
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D26712
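The Supported Channel Width Set field that such macros inspect occupies bits 2-3 of the VHT Capabilities Info field. Value 1 means 160 MHz, value 2 means 160 and 80+80 MHz, and value 3 is reserved. The sketch below uses hypothetical toy_* names rather than the net80211 macros, and it ignores the later Extended NSS BW Support refinement.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical names for the Supported Channel Width Set (bits 2-3). */
#define TOY_VHTCAP_SUPP_CHAN_WIDTH_MASK	0x0000000cu
#define TOY_VHTCAP_SUPP_CHAN_WIDTH_S	2

static inline uint32_t
toy_supp_chan_width(uint32_t vhtcap)
{
	return ((vhtcap & TOY_VHTCAP_SUPP_CHAN_WIDTH_MASK) >>
	    TOY_VHTCAP_SUPP_CHAN_WIDTH_S);
}

/* 160 MHz is supported with either encoding 1 or 2. */
static inline bool
toy_vhtcap_is_160mhz(uint32_t vhtcap)
{
	uint32_t w = toy_supp_chan_width(vhtcap);

	return (w == 1 || w == 2);
}

/* 80+80 MHz is only advertised by encoding 2. */
static inline bool
toy_vhtcap_is_160_80p80mhz(uint32_t vhtcap)
{
	return (toy_supp_chan_width(vhtcap) == 2);
}
```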
Bjoern A. Zeeb [Sat, 17 Oct 2020 22:47:08 +0000 (22:47 +0000)]
ddb: add show sysinit command
Add a show sysinit command to ddb (similar to show vnet_sysinit), which
proved helpful in debugging some ordering issues behind panics during
early-to-mid kernel startup.
Mateusz Guzik [Sat, 17 Oct 2020 21:22:40 +0000 (21:22 +0000)]
cache: don't automatically evict negative entries if usage is low
The previous scheme only looked at the negative entry count in relation to the
total count, leading to many spurious evictions if the cache is not
significantly populated.
Instead, only attempt eviction if the negative entry count goes beyond the
namecache capacity.
Mitchell Horne [Sat, 17 Oct 2020 17:31:06 +0000 (17:31 +0000)]
riscv: zero reserved PTE bits for L2 PTEs
As was done for L3 PTEs in r362853, mask out the reserved bits when
extracting the physical address from an L2 PTE. Future versions of the
spec or custom implementations may make use of these reserved bits, in
which case the resulting physical address could be incorrect.
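Here is a sketch of the masking for an Sv39-style PTE. In that layout, the PPN occupies bits 10-53 and bits 54-63 are reserved. The TOY_* constant names are hypothetical, and FreeBSD's own macros may differ. Masking before the shift keeps a set reserved bit from leaking into the physical address.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical names for an Sv39-style PTE layout. */
#define TOY_PTE_PPN0_S	10				/* PPN starts at bit 10 */
#define TOY_PTE_HI_MASK	UINT64_C(0xffc0000000000000)	/* reserved bits 54-63 */
#define TOY_PAGE_SHIFT	12

static uint64_t
toy_pte_to_phys(uint64_t pte)
{
	/* Drop the reserved high bits, then the flag bits, then scale. */
	return (((pte & ~TOY_PTE_HI_MASK) >> TOY_PTE_PPN0_S) << TOY_PAGE_SHIFT);
}
```

Without the mask, a set bit 54 would end up as bit 56 of the returned address.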
Mateusz Guzik [Sat, 17 Oct 2020 08:48:58 +0000 (08:48 +0000)]
cache: rework parts of negative entry management
- declutter sysctl vfs.cache by moving relevant entries into
vfs.cache.neg
- add a little more parallelism to eviction by replacing the
global lock with an atomically modified counter
- track more statistics
Matt Macy [Sat, 17 Oct 2020 01:06:04 +0000 (01:06 +0000)]
Update OpenZFS to 2.0.0-rc3-gfc5966
- fix panic due to tqid overflow
- Improve libzfs_error_init messages
- Expose zfetch_max_idistance tunable
- Make dbufstat work on FreeBSD
- Fix EIO after resuming receive of new dataset over an existing one
Ryan Moeller [Fri, 16 Oct 2020 20:27:20 +0000 (20:27 +0000)]
bhyve: Update TX descriptor base address and host mapping on change
bhyve sometimes segfaults when using an e1000 NIC with a Windows guest.
We only update our tdba and cached host mapping when the low address
register is written and when tx is enabled, but not when the high address
or length registers are written. Windows 10 has been observed to occasionally
enable tx first and then write the registers in the order low, high, len. This
leaves us with a bogus base address and mapping, which causes a segfault later
when we try to copy from a descriptor that has unpredictable garbage in a
pointer.
Updating the address and mapping when any of those registers change seems to
fix that particular issue.
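The approach can be sketched as recomputing the cached base address whenever any ring register is written while tx is enabled. That way, the guest's write order stops mattering. The register identifiers, struct toy_tx, and the "mapping" (just a copied address) are hypothetical simplifications. A real device model would map guest memory at that point.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical register identifiers, not real e1000 offsets. */
enum toy_reg { TOY_TDBAL, TOY_TDBAH, TOY_TDLEN, TOY_TCTL };

struct toy_tx {
	uint32_t tdbal, tdbah, tdlen;
	int	 tx_enabled;
	uint64_t tdba;		/* cached 64-bit ring base address */
	uint64_t mapped_base;	/* stands in for the cached host mapping */
	uint32_t mapped_len;
};

/* Recompute the cached base and "mapping" from the current registers. */
static void
toy_tx_update_ring(struct toy_tx *t)
{
	t->tdba = ((uint64_t)t->tdbah << 32) | t->tdbal;
	t->mapped_base = t->tdba;	/* real code maps guest memory here */
	t->mapped_len = t->tdlen;
}

static void
toy_tx_write(struct toy_tx *t, enum toy_reg reg, uint32_t val)
{
	switch (reg) {
	case TOY_TDBAL: t->tdbal = val; break;
	case TOY_TDBAH: t->tdbah = val; break;
	case TOY_TDLEN: t->tdlen = val; break;
	case TOY_TCTL:  t->tx_enabled = (val != 0); break; /* any nonzero = on */
	}
	/* Refresh on every write while enabled, whatever the order. */
	if (t->tx_enabled)
		toy_tx_update_ring(t);
}
```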
Kyle Evans [Fri, 16 Oct 2020 17:03:27 +0000 (17:03 +0000)]
MFC r366760: lua: update to 5.3.6
This release contains some minor bugfixes; notably:
- 2x minor Makefile fixes (not used in base)
- Long brackets with a huge number of '=' overflow some internal buffer
arithmetic.
- Joining an upvalue with itself can cause a use-after-free crash.
See here for examples: http://www.lua.org/bugs.html#5.3.5
Mitchell Horne [Fri, 16 Oct 2020 13:37:58 +0000 (13:37 +0000)]
arm64: export a few more HWCAPs
These were missed in the previous pass. The extensions (partially)
supported by this change are:
- ARMv8.2-FHM, Floating-point multiplication variant
- ARMv8.4-LSE, Large System Extensions
- ARMv8.4-DIT, Data Independent Timing instructions
Reviewed by: andrew, markj
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D26707
Kyle Evans [Fri, 16 Oct 2020 13:04:28 +0000 (13:04 +0000)]
lua: update to 5.3.6
This release contains some minor bugfixes; notably:
- 2x minor Makefile fixes (not used in base)
- Long brackets with a huge number of '=' overflow some internal buffer
arithmetic.
- Joining an upvalue with itself can cause a use-after-free crash.
See here for examples: http://www.lua.org/bugs.html#5.3.5
Marcin Wojtas [Fri, 16 Oct 2020 11:27:01 +0000 (11:27 +0000)]
Trigger soft lifetime expiration on sequence number
This patch adds a limit of 80% of UINT32_MAX on the sequence number.
When the sequence number reaches this limit, the kernel sends an SADB_EXPIRE
message to the IKE daemon, which is responsible for performing rekeying.
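The threshold can be computed exactly in integer arithmetic, since UINT32_MAX (4294967295) is divisible by 5. The sketch below also latches a flag so that the notification fires once per SA. struct toy_sa and toy_sa_next_seq() are hypothetical, and the fire-once behavior is an assumption.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* 80% of the 32-bit sequence space: UINT32_MAX / 5 * 4 = 3435973836. */
#define TOY_SEQ_SOFT_LIMIT	(UINT32_MAX / 5 * 4)

/* Hypothetical per-SA state. */
struct toy_sa {
	uint32_t seq;
	bool	 soft_expired;	/* soft expiry already reported */
};

/* Advance the sequence number; true means "send SADB_EXPIRE now". */
static bool
toy_sa_next_seq(struct toy_sa *sa)
{
	sa->seq++;
	if (!sa->soft_expired && sa->seq >= TOY_SEQ_SOFT_LIMIT) {
		sa->soft_expired = true;
		return (true);
	}
	return (false);
}
```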
Marcin Wojtas [Fri, 16 Oct 2020 11:25:45 +0000 (11:25 +0000)]
Add support for IPsec ESN and pass relevant information to crypto layer
Implement support for including the IPsec ESN (Extended Sequence Number) in
both encrypt and authenticate mode (e.g. AES-CBC and SHA256) and combined
mode (e.g. AES-GCM). Both the ESP and AH protocols are updated. Additionally,
pass relevant information about ESN to the crypto layer.
For the ETA (encrypt-then-authenticate) mode, the ESN is stored in a separate
crp_esn buffer because the high-order 32 bits of the sequence number are
appended after the Next Header (RFC 4303).
For the AEAD modes, the high-order 32 bits of the sequence number
[e.g. RFC 4106, Section 5, AAD Construction] are included as part of
crp_aad (SPI + ESN (32 high-order bits) + Seq nr (32 low-order bits)).
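Here is a sketch of the 12-byte AAD layout described above, with each field in network byte order. toy_esp_aad_esn() is a hypothetical helper. Without ESN, the AAD would be 8 bytes: the SPI plus the 32-bit sequence number.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Store a 32-bit value in network (big-endian) byte order. */
static void
put_be32(uint8_t *p, uint32_t v)
{
	p[0] = (uint8_t)(v >> 24);
	p[1] = (uint8_t)(v >> 16);
	p[2] = (uint8_t)(v >> 8);
	p[3] = (uint8_t)v;
}

/*
 * Hypothetical helper: AAD for AES-GCM with ESN (RFC 4106, Section 5).
 * Layout is SPI | seq high 32 bits | seq low 32 bits.  Returns length.
 */
static size_t
toy_esp_aad_esn(uint8_t aad[12], uint32_t spi, uint64_t seq)
{
	put_be32(aad, spi);
	put_be32(aad + 4, (uint32_t)(seq >> 32));
	put_be32(aad + 8, (uint32_t)seq);
	return (12);
}
```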
Marcin Wojtas [Fri, 16 Oct 2020 11:24:12 +0000 (11:24 +0000)]
Implement anti-replay algorithm with ESN support
As RFC 4304 describes, it is the responsibility of the anti-replay algorithm
to provide the appropriate value of the Extended Sequence Number.
This patch introduces an anti-replay algorithm with ESN support based on
RFC 4304; however, to avoid performance regressions, the window implementation
is based on RFC 6479, which was already implemented in FreeBSD.
To keep things clean and improve code readability, the window implementation
is kept in separate functions.
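The core of ESN-aware anti-replay is recovering the high-order 32 bits, which are not sent on the wire. This sketch follows the two-case inference given in RFC 4303, Appendix A. It is not the FreeBSD code, and it leaves out the RFC 6479 bitmap window entirely. toy_esn_seqh() is a hypothetical name.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Infer the high 32 bits of a received ESN from its low 32 bits (seql),
 * the highest sequence number accepted so far (t = th:tl), and the
 * window size w.  uint32_t arithmetic makes tl - w + 1 wrap modulo
 * 2^32, as the RFC's comparison expects.
 */
static uint32_t
toy_esn_seqh(uint64_t t, uint32_t w, uint32_t seql)
{
	uint32_t th = (uint32_t)(t >> 32);
	uint32_t tl = (uint32_t)t;
	uint32_t bottom = tl - w + 1;	/* low 32 bits of the window bottom */

	assert(w > 0);
	if (tl >= w - 1) {
		/* Case A: the window lies within one 2^32 subspace. */
		return (seql >= bottom ? th : th + 1);
	}
	/* Case B: the window straddles a subspace boundary. */
	return (seql >= bottom ? th - 1 : th);
}
```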
Set the default stack size for Linux apps to 8MB. This matches Linux's
defaults, makes core files smaller, and fixes applications that use
pthread_join(3) incorrectly, namely Steam.
This is based on a patch submitted by Jason Yang, which I've reworked
to set the limit instead of only changing the value reported (which
is enough to fix the bug for Linux pthreads, but could be confusing).
PR: 248225
Submitted by: Jason_YH_Yang at wistron.com (earlier version)
Analyzed by: Alex S <iwtcex@gmail.com>
Reviewed by: emaste
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D26778
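Here is a userland analogue of the same policy, using getrlimit(2) and setrlimit(2) with RLIMIT_STACK: set the soft limit to 8 MiB, clamped to the hard limit. This is only an illustration, because the actual change lives in the kernel's Linux emulation. toy_set_linux_stack_default() is a hypothetical name.

```c
#include <assert.h>
#include <sys/resource.h>

#define TOY_LINUX_STACK	(8UL * 1024 * 1024)	/* 8 MiB */

/*
 * Hypothetical helper: set the soft stack limit to 8 MiB, clamped to
 * the hard limit.  Returns 0 on success, -1 with errno set otherwise.
 */
static int
toy_set_linux_stack_default(void)
{
	struct rlimit rl;

	if (getrlimit(RLIMIT_STACK, &rl) != 0)
		return (-1);
	rl.rlim_cur = TOY_LINUX_STACK;
	if (rl.rlim_max != RLIM_INFINITY && rl.rlim_cur > rl.rlim_max)
		rl.rlim_cur = rl.rlim_max;
	return (setrlimit(RLIMIT_STACK, &rl));
}
```

Setting the limit, rather than only reporting 8 MiB, keeps what a process observes consistent with what it actually gets, which is the reasoning given above.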
Marcin Wojtas [Fri, 16 Oct 2020 11:21:56 +0000 (11:21 +0000)]
Add support for ESN in AES-NI crypto driver
This patch adds support for IPsec ESN (Extended Sequence Numbers) in
encrypt and authenticate mode (e.g. AES-CBC and SHA256) and combined mode
(e.g. AES-GCM).
For the encrypt and authenticate mode, the ESN is stored in a separate
crp_esn buffer because the high-order 32 bits of the sequence number are
appended after the Next Header (RFC 4303).
For the combined modes, the high-order 32 bits of the sequence number
[e.g. RFC 4106, Section 5, AAD Construction] are part of crp_aad
(prepared by the netipsec layer when ESN support is enabled), so no
changes are needed for the combined modes.
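For the encrypt and authenticate path, the effect on the MAC input can be sketched as appending the 4 high-order bytes after the ESP data, as described above. toy_eta_mac_input() is hypothetical and copies into a new buffer for clarity. A driver would more likely feed those bytes to the MAC incrementally. Where the bytes come from (crp_esn in the text) is not modelled.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical helper: bytes fed to the MAC with ESN enabled, namely
 * the ESP data through the Next Header byte, followed by the high 32
 * bits of the sequence number in network byte order.  Those 4 bytes
 * are authenticated but not transmitted.  Caller frees the result.
 */
static uint8_t *
toy_eta_mac_input(const uint8_t *esp, size_t esp_len, uint32_t seqh,
    size_t *out_len)
{
	uint8_t *buf = malloc(esp_len + 4);

	if (buf == NULL)
		return (NULL);
	memcpy(buf, esp, esp_len);
	buf[esp_len + 0] = (uint8_t)(seqh >> 24);
	buf[esp_len + 1] = (uint8_t)(seqh >> 16);
	buf[esp_len + 2] = (uint8_t)(seqh >> 8);
	buf[esp_len + 3] = (uint8_t)seqh;
	*out_len = esp_len + 4;
	return (buf);
}
```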
Marcin Wojtas [Fri, 16 Oct 2020 11:18:13 +0000 (11:18 +0000)]
Add support for ESN in cryptosoft
This patch adds support for IPsec ESN (Extended Sequence Numbers) in
encrypt and authenticate mode (e.g. AES-CBC and SHA256) and combined mode
(e.g. AES-GCM).
For encrypt and authenticate mode, the ESN is stored in a separate crp_esn
buffer because the high-order 32 bits of the sequence number are
appended after the Next Header (RFC 4303).
For combined modes, the high-order 32 bits of the sequence number [e.g.
RFC 4106, Section 5, AAD Construction] are part of crp_aad (prepared by
the netipsec layer when ESN support is enabled), so no changes are needed
for the combined modes.
Marcin Wojtas [Fri, 16 Oct 2020 11:06:33 +0000 (11:06 +0000)]
Prepare crypto framework for IPsec ESN support
This permits requests (netipsec ESP and AH protocols) to provide the
IPsec ESN (Extended Sequence Numbers) in a separate buffer.
As with separate output buffers and separate AAD buffers, not all drivers
support this feature. Consumers must request use of this feature via a new
session flag.
Michael Tuexen [Fri, 16 Oct 2020 10:44:48 +0000 (10:44 +0000)]
Improve the handling of cookie lifetimes.
The staleness reported in an error cause is in us, not ms.
Enforce limits on the lifetime via sysctl and socket options
consistently. Update the description of the sysctl variable to
use the right unit. Also do some minor cleanups.
This also fixes an integer overflow issue if the peer can
modify the cookie. This was reported by Felix Weinrank by fuzz testing
the userland stack and in
https://oss-fuzz.com/testcase-detail/4800394024452096
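The sketch below illustrates two of the points above. First, the staleness is computed in microseconds with 64-bit arithmetic and saturated at UINT32_MAX, so it cannot wrap. Second, a requested lifetime is clamped to configured bounds. Both helpers and their bounds are hypothetical. The microsecond unit of the field follows RFC 4960, Section 3.3.10.3.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: Measure of Staleness (microseconds) for a
 * Stale Cookie error cause.  64-bit math plus saturation means a
 * peer-influenced expiry time cannot overflow the 32-bit field.
 */
static uint32_t
toy_cookie_staleness_us(uint64_t now_us, uint64_t expired_at_us)
{
	uint64_t diff;

	if (now_us <= expired_at_us)
		return (0);		/* not stale */
	diff = now_us - expired_at_us;
	return (diff > UINT32_MAX ? UINT32_MAX : (uint32_t)diff);
}

/* Hypothetical helper: clamp a requested cookie lifetime (ms). */
static uint32_t
toy_clamp_cookie_life_ms(uint32_t ms, uint32_t min_ms, uint32_t max_ms)
{
	assert(min_ms <= max_ms);
	if (ms < min_ms)
		return (min_ms);
	if (ms > max_ms)
		return (max_ms);
	return (ms);
}
```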
Jessica Clarke [Thu, 15 Oct 2020 18:03:14 +0000 (18:03 +0000)]
kldxref: Avoid buffer overflows in parse_pnp_list
We convert a string like "W32:vendor/device" into "I:vendor;I:device",
where the output is longer than the input, but only allocate space equal
to the length of the input, leading to a buffer overflow.
Instead use open_memstream so we get a safe dynamically-grown buffer.
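Here is a minimal sketch of the open_memstream(3) approach for the single case mentioned above. With a memory stream, the code never has to predict the output size. toy_expand_w32() is hypothetical and handles only one "W32:" descriptor. The real parse_pnp_list() covers the whole format language.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical helper: rewrite "W32:first/second" as "I:first;I:second".
 * The memory stream grows as needed.  Returns a malloc'd string, or
 * NULL on malformed input or allocation failure.
 */
static char *
toy_expand_w32(const char *in)
{
	const char *slash;
	char *out = NULL;
	size_t outlen = 0;
	FILE *fp;

	if (strncmp(in, "W32:", 4) != 0 || (slash = strchr(in + 4, '/')) == NULL)
		return (NULL);
	fp = open_memstream(&out, &outlen);
	if (fp == NULL)
		return (NULL);
	fprintf(fp, "I:%.*s;I:%s", (int)(slash - (in + 4)), in + 4, slash + 1);
	if (fclose(fp) != 0) {
		free(out);
		return (NULL);
	}
	return (out);
}
```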
Glen Barber [Thu, 15 Oct 2020 17:12:58 +0000 (17:12 +0000)]
Increase the amd64 ISO ESP file size from 800KB to 1024KB.
At some point over the last week, the bootx64.efi file has grown
past the 800KB threshold, so it can no longer be copied to
the EFI/BOOT directory.
# stat -f %z efiboot.znWo7m
819200
# stat -f %z stand-test.PIEugN/EFI/BOOT/bootx64.efi
842752
The comment in the script that creates the ISOs suggests that 800KB
is the maximum allowed for the boot code; however, I was able to
boot an ISO with a 1024KB boot partition. Additionally, I verified
against an ISO from OtherOS, where the boot EFI partition is 2.4MB.
Brooks Davis [Thu, 15 Oct 2020 17:05:21 +0000 (17:05 +0000)]
physio: Don't store user addresses in bio_data
Only assign the address from the iovec to bio_data if it is a kernel
address. This was the single place where bio_data stored (however
briefly) a userspace pointer.
Nathan Whitehorn [Thu, 15 Oct 2020 13:43:43 +0000 (13:43 +0000)]
Provide a slightly more tolerant set of thermal parameters for PowerMac
motherboard temperatures. In particular, the U4 northbridge die is very
hard to cool or heat effectively with fans and is not responsive to load.
It generally sits around 64C, where it seems happy, so (like Linux) just
declare that to be its target temperature.
This makes the PowerMac G5 much less loud, with no change in the
temperatures of any system components.
With some popular multiplayer games (such as Counter-Strike: Global
Offensive), the Linux Steam client likes to occasionally scan the game
process memory, presumably as part of anti-cheat measures. It turns out
the client also expects each inode entry to be followed by a space
character; otherwise, its parsing code crashes.
PR: 248216
Submitted by: Alex S <iwtcex@gmail.com>
MFC after: 2 weeks
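Here is a sketch of a maps-style line formatter in which the inode is always followed by a space, even for mappings without a pathname. The field layout follows the Linux /proc/<pid>/maps format, without its column padding. toy_format_maps_line() is hypothetical and is not the linprocfs code.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: format one maps-style line.  The "%lu " after
 * the inode emits the trailing space unconditionally, so a parser
 * that splits on it works for anonymous mappings too.
 */
static int
toy_format_maps_line(char *buf, size_t len, unsigned long start,
    unsigned long end, const char *perms, unsigned long offset,
    unsigned major, unsigned minor, unsigned long inode, const char *path)
{
	return (snprintf(buf, len, "%08lx-%08lx %s %08lx %02x:%02x %lu %s\n",
	    start, end, perms, offset, major, minor, inode,
	    path != NULL ? path : ""));
}
```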
Wei Hu [Thu, 15 Oct 2020 11:44:28 +0000 (11:44 +0000)]
Hyper-V: hn: Relinquish cpu in HN_LOCK to avoid deadlock
The try-lock loop in HN_LOCK keeps the thread spinning on the CPU if the lock
is not available. This can cause a deadlock if the thread holding the lock
is sleeping. Relinquish the CPU to work around this problem, even though
it doesn't completely solve the issue: priority inversion could still cause
a livelock, however unlikely that is. A more complete solution may be
needed in the future.
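Here is a userland analogue of the change, assuming POSIX threads: retry pthread_mutex_trylock(), but call sched_yield() between attempts instead of spinning. The kernel code uses its own locking and sleep primitives. toy_hn_lock() is hypothetical. As noted above, yielding does not rule out livelock under priority inversion.

```c
#include <assert.h>
#include <pthread.h>
#include <sched.h>

/*
 * Hypothetical try-lock loop that gives up the CPU between attempts,
 * so a runnable lock holder can make progress instead of being
 * starved by a spinning waiter.
 */
static void
toy_hn_lock(pthread_mutex_t *m)
{
	while (pthread_mutex_trylock(m) != 0)
		sched_yield();
}
```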
Wei Hu [Thu, 15 Oct 2020 05:57:20 +0000 (05:57 +0000)]
Hyper-V: pcib: Check revoke status during device attach
It is possible that the vmbus pcib channel is revoked during the attach path.
The attach path could be waiting for a response from the host, and this
response will never arrive since, from the host's point of view, the channel
has already been revoked. Check for this situation while waiting for
completion, and return failure if it happens.
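The check can be sketched as a wait loop that also watches a revoked flag and gives up once the flag is set. This version polls C11 atomics every millisecond. toy_wait_or_revoked() and the choice of ENODEV are assumptions, not the vmbus pcib code.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <time.h>

/*
 * Hypothetical helper: wait for a host response, but stop if the
 * channel is revoked, since a revoked channel never delivers one.
 * Returns 0 when completed, ENODEV when revoked.
 */
static int
toy_wait_or_revoked(atomic_bool *done, atomic_bool *revoked)
{
	const struct timespec tick = { 0, 1000000 };	/* 1 ms */

	for (;;) {
		if (atomic_load(done))
			return (0);		/* completion wins if both are set */
		if (atomic_load(revoked))
			return (ENODEV);
		nanosleep(&tick, NULL);
	}
}
```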