Brooks Davis [Fri, 23 Oct 2020 22:27:45 +0000 (22:27 +0000)]
Only use ASAN when using the in-tree compiler
When building FreeBSD 11 on a FreeBSD 12 system with
CROSS_TOOLCHAIN=llvm10 we end up trying to link against the packaged
version of the sanitizer library. This resulted in a requirement for
getentropy(3) which is not present in FreeBSD 11.
Reviewed by: emaste
MFC after: 3 days
Differential Revision: https://reviews.freebsd.org/D26903
xhci: Handle the case when MSI-X BAR is the same as IO BAR.
PCIe allows for MSI-X BAR to be either dedicated, or MSI-X Table may
be co-located in some functional BAR. In the later case xhci(4) is
unable to allocate active resource for the table because BAR is
already activated.
Handle it by checking for this special case, and not try to alloc
resource if MSI-X BAR is IO.
Reported and tested by: emaste
Reviewed by: emaste, hselasky
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
Differential revision: https://reviews.freebsd.org/D26913
Ed Maste [Fri, 23 Oct 2020 16:35:23 +0000 (16:35 +0000)]
libelf: add compression header support
GNU and Oracle libelf implementations added support for section
compression, intended to reduce the size of DWARF debug info (which
might be an order of magnitude larger than the code).
There are two compressed ELF section formats:
1. Old GNU - sections are renmaed to start with 'z'. Section contains
a magic number, uncompressed size, and compressed data.
2. Oracle and New GNU - compressed sections use the SHF_COMPRESSED flag.
The compression header contains the compression type, uncompressed
size, and uncompressed alignment.
The second style is preferred and this change implements only that one.
Mateusz Guzik [Fri, 23 Oct 2020 15:56:22 +0000 (15:56 +0000)]
cache: reduce memory waste in struct namecache
The previous scheme for calculating the total size was doing sizeof
on the struct and then adding the wanted space for the buffer.
nc_name is at offset 58 while sizeof(struct namecache) is 64.
With CACHE_PATH_CUTOFF of 39 bytes and 1 byte of padding we were
allocating 104 bytes for the entry and never accounting for the 6
byte padding, wasting that space.
Mark Johnston [Fri, 23 Oct 2020 14:16:52 +0000 (14:16 +0000)]
ntb: Add Intel Xeon Gen3 support
The NTB hardware starting with Skylake has some changes to the register
map and the doorbell interface. Add a new NTB_XEON_GEN3 device type and
use it to conditionalize driver logic that differs from the existing
Xeon code.
Stefan Eßer [Fri, 23 Oct 2020 10:00:56 +0000 (10:00 +0000)]
Udpate calendar man-page to mention the search path added in r366962.
Calendar files in /usr/lcoal/share/calendar take precedence over files in
the base system. They can be provided by a port or package, but since such
a port has not been committed, yet, no specific port name is suggested.
In fact, multiple ports could exist (e.g. per locale) without conflicting
with each other.
Stefan Eßer [Fri, 23 Oct 2020 09:22:23 +0000 (09:22 +0000)]
Add search of LOCALBASE/share/calendar for calendars supplied by a port.
Calendar files in LOCALBASE override similarily named ones in the base
system. This could easily be changed if the base system calendars should
have precedence, but it could lead to a violation of POLA since then the
port's files were ignored unless those in base have been deleted.
There was no definition of _PATH_LOCALBASE in paths.h, but verbatim uses
of /usr/local existed for _PATH_DEFPATH. Use _PATH_LOCALBASE here to ease
a consistent modification of this prefix.
Fix for loading cuse.ko via rc.d . Make sure we declare the cuse(3)
module by name and not only by the version information, so that
"kldstat -q -m cuse" works.
Navdeep Parhar [Fri, 23 Oct 2020 01:36:54 +0000 (01:36 +0000)]
cxgbe(4): refine the values reported in if_ratelimit_query.
- Get the number of classes from chip_params.
- Get the number of ethofld tids from the firmware.
- Do not let tcp_ratelimit allocate all traffic classes.
John Baldwin [Fri, 23 Oct 2020 00:23:54 +0000 (00:23 +0000)]
Handle CPL_RX_DATA on active TLS sockets.
In certain edge cases, the NIC might have only received a partial TLS
record which it needs to return to the driver. For example, if the
local socket was closed while data was still in flight, a partial TLS
record might be pending when the connection is closed. Receiving a
RST in the middle of a TLS record is another example. When this
happens, the firmware returns the the partial TLS record as plain TCP
data via CPL_RX_DATA. Handle these requests by returning an error to
OpenSSL (via so_error for KTLS or via an error TLS record header for
the older Chelsio OpenSSL interface).
Mateusz Guzik [Thu, 22 Oct 2020 19:28:12 +0000 (19:28 +0000)]
vfs: prevent avoidable evictions on mkdir of existing directories
mkdir -p /foo/bar/baz will mkdir each path component and ignore EEXIST.
The NOCACHE lookup will make the namecache unnecessarily evict the existing entry,
and then fallback to the fs lookup routine eventually leading namei to return an
error as the directory is already there.
For invocations like mkdir -p /usr/obj/usr/src/sys/GENERIC/modules this triggers
fallbacks to the slowpath for concurrently executing lookups.
Alan Cox [Thu, 22 Oct 2020 17:47:51 +0000 (17:47 +0000)]
Micro-optimize uma_small_alloc(). Replace bzero(..., PAGE_SIZE) by
pagezero(). Ultimately, they use the same method for bulk zeroing, but
the generality of bzero() requires size and alignment checks that
pagezero() does not.
Add support for IP over infiniband, IPoIB, to lagg(4). Currently only
the failover protocol is supported due to limitations in the IPoIB
architecture. Refer to the lagg(4) manual page for how to configure
and use this new feature. A new network interface type,
IFT_INFINIBANDLAG, has been added, similar to the existing
IFT_IEEE8023ADLAG .
ifconfig(8) has been updated to accept a new laggtype argument when
creating lagg(4) network interfaces. This new argument is used to
distinguish between ethernet and infiniband type of lagg(4) network
interface. The laggtype argument is optional and defaults to
ethernet. The lagg(4) command line syntax is backwards compatible.
Size of the per-process semaphore undo structure (semusz) depends on
the number of the per-process undos. If kern.ipc.semume is adjusted,
semusz must be adjusted as well, and it makes no sense to delegate
adjustment to user. Make it automatic.
Reported and tested by: Olef <o.vandestadt@gmail.com>
PR: 250361
Reviewed by: jhb, markj
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D26826
Navdeep Parhar [Thu, 22 Oct 2020 08:40:25 +0000 (08:40 +0000)]
cxgbe(4): fix the size of the iq/eq maps.
The firmware can allocate ingress and egress context ids anywhere from
its configured range. Size the iq/eq maps to match the entire range
instead of assuming that the firmware always allocates the first
available context id.
Use ELR register value instead of LR for PMC_TRAPFRAME_TO_PC macro since
it's the former that indicates PC if the interrupted execution thread.
This fixes a bug where pmcstat lost the leaf function of the call chain
and started with the second function in the chain.
Although this change is an improvement over the previous logic there is still
posibility for incomplete data: if the leaf function does not have stack
variables and does not call any other functions compiler would not generate
a stack frame for it and the FP value would point to the caller's frame, so
instead of the actual "caller1 -> caller2 -> leaf" chain only
"caller1 -> leaf" would be captured.
Sponsored by: Ampere Computing
Submitted by: Klara, Inc.
Add support for stacked VLANs (IEEE 802.1ad, AKA Q-in-Q).
802.1ad interfaces are created with ifconfig using the "vlanproto" parameter.
Eg., the following creates a 802.1Q VLAN (id #42) over a 802.1ad S-VLAN
(id #5) over a physical Ethernet interface (em0).
VLAN_MTU, VLAN_HWCSUM and VLAN_TSO capabilities should be properly
supported. VLAN_HWTAGGING is only partially supported, as there is
currently no IFCAP_VLAN_* denoting the possibility to set the VLAN
EtherType to anything else than 0x8100 (802.1ad uses 0x88A8).
Brooks Davis [Wed, 21 Oct 2020 16:00:15 +0000 (16:00 +0000)]
vmapbuf: don't smuggle address or length in buf
Instead, add arguments to vmapbuf. Since this argument is
always a pointer use a type of void * and cast to vm_offset_t in
vmapbuf. (In CheriBSD we've altered vm_fault_quick_hold_pages to
take a pointer and check its bounds.)
In no other situtation does b_data contain a user pointer and vmapbuf
replaces b_data with the actual mapping.
It helps to reduce complexity with debugging of large ipfw rulesets.
Also define several constants and translators, that can by used by
dtrace scripts with this probe.
Improve FPU Tag Word reconstruction on i386 to indicate register states.
Improve the code reconstructing en_tw in struct fpreg32 from FXSAVE
results so that all register states are indicated correctly. The
previous code unconditionally mapped non-empty register state to
'normalized value' constant. The new code explicitly distinguishes
the 'zero value' and 'special value' constants as well. This improves
consistency between real FSAVE and translation from FXSAVE, and
ensures that tests using PT_GETFPREGS can rely on a single correct
value independently of the underlying implementation.
PR: 250454
Sponsored by: The FreeBSD Foundation
Obtained from: Moritz Systems
Submitted by: Michał Górny <mgorny@moritz.systems>
Discussed with: emaste
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D26856
John Baldwin [Tue, 20 Oct 2020 17:50:18 +0000 (17:50 +0000)]
Add a kernel crypto driver using assembly routines from OpenSSL.
Currently, this supports SHA1 and SHA2-{224,256,384,512} both as plain
hashes and in HMAC mode on both amd64 and i386. It uses the SHA
intrinsics when present similar to aesni(4), but uses SSE/AVX
instructions when they are not.
Note that some files from OpenSSL that normally wrap the assembly
routines have been adapted to export methods usable by 'struct
auth_xform' as is used by existing software crypto routines.
John Baldwin [Tue, 20 Oct 2020 16:48:45 +0000 (16:48 +0000)]
Use a template assembly file to generate the embedded MFS.
This uses the .incbin directive to pull in the MFS image contents.
Using assembly directly ensures that symbols can be defined with the
name and properties (such as .size) desired without having to rename
symbols, etc. via a second objcopy invocation. Since it is compiled
by the C compiler driver, it also avoids the need for all of the
EMBEDFS* make variables.
Xin LI [Tue, 20 Oct 2020 01:29:45 +0000 (01:29 +0000)]
Further refinements of ptsname_r(3) interface:
- Hide ptsname_r under __BSD_VISIBLE for now as the specification
is not finalized at this time.
- Keep Symbol.map sorted.
- Avoid the interposing of ptsname_r(3) from an user application
from breaking ptsname(3) by making the implementation a static
method and call the static function from ptsname(3) instead.
Cy Schubert [Mon, 19 Oct 2020 20:37:38 +0000 (20:37 +0000)]
Destroy cloned interfaces at netif stop, netif restart and shutdown.
This is especially important during shutdown because a child interface
of lagg with WOL enabled will not enable WOL at interface shutdown and
thus no WOL to wake up the device (and machine).
PR: 158734, 109980
Reported by: Antonio Huete Jimenez <tuxillo at quantumachine.net>
Marat N.Afanasyev <marat at zealot.ksu.ru>
reviewed by: kp
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D26797
John Baldwin [Mon, 19 Oct 2020 20:08:50 +0000 (20:08 +0000)]
Re-enable receive flow control for TOE TLS sockets.
Flow control was disabled during initial TOE TLS development to
workaround a hang (and to match the Linux TOE TLS support for T6).
The rest of the TOE TLS code maintained credits as if flow control was
enabled which was inherited from before the workaround was added with
the exception that the receive window was allowed to go negative.
This negative receive window handling (rcv_over) was because I hadn't
realized the full implications of disabling flow control.
To clean this up, re-enable flow control on TOE TLS sockets. The
existing TPF_FORCE_CREDITS workaround is sufficient for the original
hang. Now that flow control is enabled, remove the rcv_over
workaround and instead assert that the receive window never goes
negative matching plain TCP TOE sockets.
Alex Richardson [Mon, 19 Oct 2020 19:50:57 +0000 (19:50 +0000)]
Major improvement to build parallelism for googletest internal tests
Currently the googletest internal tests build after the matching library.
However, each of these is serialized at the top level makefile.
Additionally some of the tests (e.g. the gmock-matches-test) take up to
90 seconds to build with clang -O2. Having to wait for this test to
complete before continuing to the next directory seriously slows down the
parllelism of a -j32 build.
Before this change running `make -C lib/googletest -j32 -s` in buildenv
took 202 seconds, now it's 153 due to improved parallelism.
Reviewed By: emaste (no objection)
Differential Revision: https://reviews.freebsd.org/D26748
nullfs: ensure correct lock is taken after bypass.
If lower VOP relocked the lower vnode, it is possible that nullfs
vnode was reclaimed meantime. In this case nullfs vnode no longer
shares lock with lower vnode, which breaks locking protocol.
Check for the condition and acquire nullfs vnode lock if detected.
Reported and tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
It is a common pattern for filesystems' VOP_INACTIVE() implementation
to forcibly reclaim the vnode when its state is final. For instance,
UFS vnode with zero link count is removed, and since it is
inactivated, the last open reference on it is dropped.
On the other hand, vnode might get spurious usecount reference for
many reasons. If the spurious reference exists while vgonel() checks
for active state of the vnode, it would recurse into VOP_INACTIVE().
Fix it by checking and not doing inactivation when vgone() was called
from inactive VOP.
Reported and tested by: pho
Discussed with: mjg
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Mateusz Guzik [Mon, 19 Oct 2020 18:51:51 +0000 (18:51 +0000)]
cache: promote negative entries based on more than one hit
During tinderbox and similar workloads negative entries get at least one
hit before they get evicted. In the current scheme this avoidably promotes
them.
John Baldwin [Mon, 19 Oct 2020 18:24:06 +0000 (18:24 +0000)]
Check TF_TOE not the tod pointer to determine if TOE is active.
The TF_TOE flag is the check used in the rest of the network stack to
determine if TOE is active on a socket. There is at least one path in
the cxgbe(4) TOE driver that can leave the tod pointer non-NULL on a
socket not using TOE.
Reported by: Sony Arpita Das <sonyarpitad@chelsio.com>
Reviewed by: np
Sponsored by: Chelsio Communications
Differential Revision: https://reviews.freebsd.org/D26803
John Baldwin [Mon, 19 Oct 2020 18:21:41 +0000 (18:21 +0000)]
Mark asymmetric cryptography via OCF deprecated for 14.0.
Only one MIPS-specific driver implements support for one of the
asymmetric operations. There are no in-kernel users besides
/dev/crypto. The only known user of the /dev/crypto interface was the
engine in OpenSSL releases before 1.1.0. 1.1.0 includes a rewritten
engine that does not use the asymmetric operations due to lack of
documentation.
Mark Johnston [Mon, 19 Oct 2020 17:07:19 +0000 (17:07 +0000)]
icmp6: Count packets dropped due to an invalid hop limit
Pad the icmp6stat structure so that we can add more counters in the
future without breaking compatibility again, last done in r358620.
Annotate the rarely executed error paths with __predict_false while
here.
Reviewed by: bz, melifaro
Sponsored by: NetApp, Inc.
Sponsored by: Klara, Inc.
Differential Revision: https://reviews.freebsd.org/D26578
Mark Johnston [Mon, 19 Oct 2020 16:57:59 +0000 (16:57 +0000)]
link_elf_obj: Colour VM objects
This will cause the VM to back sufficiently large .text sections, such
as those in zfs.ko or amdgpu.ko on amd64, with superpage mappings when
possible.
Reviewed by: alc, kib
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D26802
Mark Johnston [Mon, 19 Oct 2020 16:57:40 +0000 (16:57 +0000)]
uma: Respect uk_reserve in keg_drain()
When a reserve of free items is configured for a zone, the reserve must
not be reclaimed under memory pressure. Modify keg_drain() to simply
respect the reserved pool.
While here remove an always-false uk_freef == NULL check (kegs that
shouldn't be drained should set _NOFREE instead), and make sure that the
keg_drain() KTR statement does not reference an uninitialized variable.
Reviewed by: alc, rlibby
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D26772
Mark Johnston [Mon, 19 Oct 2020 16:55:03 +0000 (16:55 +0000)]
uma: Avoid depleting keg reserves when filling a bucket
zone_import() fetches a free or partially free slab from the keg and
then uses its items to populate an array, typically filling a bucket.
If a single allocation causes the keg to drop below its minimum reserve,
the inner loop ends. However, if the bucket is still not full and
M_USE_RESERVE is specified, the outer loop will continue to fetch items
from the keg.
If M_USE_RESERVE is specified and the number of free items is below the
reserved limit, we should return only a single item. Otherwise, if the
bucket size is larger than the reserve, all of the reserved items may
end up in a single per-CPU bucket, invisible to other CPUs.
Reviewed by: rlibby
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D26771
Mark Johnston [Mon, 19 Oct 2020 16:54:06 +0000 (16:54 +0000)]
vmem: Allocate btags before looping in vmem_xalloc()
BT_MAXALLOC (4) is the number of boundary tags required to complete an
allocation in the worst case: two to clip a free segment, and two to
import from a parent arena. vmem_xalloc() preallocates four boundary
tags before attempting a search to simplify the segment allocation code.
It implements a loop that:
1) ensures that BT_MAXALLOC boundary tags are available,
2) attempts to find and clip a free segment satisfying the allocation
constraints, and failing that,
3) attempts to import a segment.
On !UMA_MD_SMALL_ALLOC platforms the btag zone has to handle recusion:
it needs boundary tags to allocate boundary tags. Thus we reserve
2 * BT_MAXALLOC * mp_ncpus tags for use when recursing: the factor of 2
is because there are two layers of vmem arenas, the per-domain arena and
global arena. For a single thread, 2 * BT_MAXALLOC tags should be
sufficient.
Because of the way the loop is structured, BT_MAXALLOC tags are not
sufficient. The first bt_fill() call may allocate BT_MAXALLOC tags,
then import a segment (consuming two tags), then attempt to top up the
preallocation before carving into the imported free segment, thus
requiring up to six tags in the worst case. Because we don't
preallocate that many, this bug can cause deadlocks in rare scenarios.
Fix the problem by moving the preallocation out the loop. This assumes
that only a single import is ever required to satisfy an allocation
request.
Thanks to manu, emaste and lwhsu for helping test debug patches.
Reported by: Jenkins (hardware CI lab)
Reviewed by: alc, kib, rlibby
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D26770
Mark Johnston [Mon, 19 Oct 2020 15:24:35 +0000 (15:24 +0000)]
vmx: Implement pmap (de)activation in C
Rewrite the code that maintains pm_active and invalidates EPTP-tagged
TLB entries in C. Previously this work was done in vmx_enter_guest(),
in assembly, but there is no good reason for that and it makes the TLB
invalidation algorithm for nested page tables harder to review.
No functional change intended. Now, an error from the invept
instruction results in a kernel panic rather than a vmexit. Such errors
should occur only as a result of VMM bugs.
Reviewed by: grehan, kib
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D26830