Add support for Multi-Physical Function Switch, MPFS, in mlx5en.
MPFS is a logical switch in the Mellanox device which forward packets
based on a hardware driven L2 address table, to one or more physical-
or virtual- functions. The physical- or virtual- function is required
to tell the MPFS by using the MPFS firmware commands, which unicast
MAC addresses it is requesting from the physical port's traffic.
Broadcast and multicast traffic however, is copied to all listening
physical- and virtual- functions and does not need a rule in the MPFS
switching table.
Make sure the transmit loop doesn't get starved in ipoib.
When the software send queue gets filled up, callbacks to
if_transmit will stop. Make sure the transmit callback
routine checks the send queue and outputs any remaining
mbufs. Else the remaining mbufs may simply sit in the
output queue blocking the transmit path.
MFC after: 3 days
Sponsored by: Mellanox Technologies
Convert pnmatch to single element array in regexec calls
The regexec function is declared as taking an array of regmatch_t
elements, and passing in the pointer to singleton element, while
correct, triggers a Coverity warning. Convert the singleton into
an array of one to silence the warning.
Kyle Evans [Wed, 2 Oct 2019 01:27:50 +0000 (01:27 +0000)]
caroot: add @generated tags to extracted .pem
As is the current trend; while these files are manually curated, they are
still generated. If they end up in a review, it would be helpful to also
take the hint and hide them.
Kyle Evans [Wed, 2 Oct 2019 01:06:37 +0000 (01:06 +0000)]
[3/3] etcupdate and mergemaster support for certctl
This commit add support for certctl in mergemaster and etcupdate. Both will
either rehash or prompt for rehash as new certificates are
trusted/blacklisted.
This work was done primarily by allanjude@, with minor contributions by
myself.
No objection from: secteam
Differential Revision: https://reviews.freebsd.org/D17389
Kyle Evans [Wed, 2 Oct 2019 01:05:29 +0000 (01:05 +0000)]
[1/3] Initial infrastructure for SSL root bundle in base
This setup will add the trusted certificates from the Mozilla NSS bundle
to base.
This commit includes:
- CAROOT option to opt out of installation of certs
- mtree amendments for final destinations
- infrastructure to fetch/update certs, along with instructions
A follow-up commit will add a certctl(8) utility to give the user control
over trust specifics. Another follow-up commit will actually commit the
initial result of updatecerts.
This work was done primarily by allanjude@, with minor contributions by
myself.
No objection from: secteam
Relnotes: yes
Differential Revision: https://reviews.freebsd.org/D16856
Alexander Motin [Tue, 1 Oct 2019 20:09:25 +0000 (20:09 +0000)]
Improve latency of synchronous 128KB writes.
Before my ZIL space optimization few years ago 128KB writes were logged
as two 64KB+ records in two 128KB log blocks. After that change it became
~124KB+/4KB+ in two 128KB log blocks to free space in the second block
for another record. Unfortunately in case of 128KB only writes, when space
in the second block remained unused, that change increased write latency by
imbalancing checksum computation time between parallel threads.
This change introduces new 68KB log block size, used for both writes below
67KB and 128KB-sharp writes. Writes of 68-127KB are still using one 128KB
block to not increase processing overhead. Writes above 131KB are still
using full 128KB blocks, since possible saving there is small. Mixed loads
will likely also fall back to previous 128KB, since code uses maximum of
the last 10 requested block sizes.
On a simple 128KB write test with queue depth of 1 this change demonstrates
~15-20% performance improvement.
Ian Lepore [Tue, 1 Oct 2019 19:39:00 +0000 (19:39 +0000)]
Add 8 and 16 bit versions of atomic_cmpset and atomic_fcmpset for arm.
This adds 8 and 16 bit versions of the cmpset and fcmpset functions. Macros
are used to generate all the flavors from the same set of instructions; the
macro expansion handles the couple minor differences between each size
variation (generating ldrexb/ldrexh/ldrex for 8/16/32, etc).
In addition to handling new sizes, the instruction sequences used for cmpset
and fcmpset are rewritten to be a bit shorter/faster, and the new sequence
will not return false when *dst==*old but the store-exclusive fails because
of concurrent writers. Instead, it just loops like ldrex/strex sequences
normally do until it gets a non-conflicted store. The manpage allows LL/SC
architectures to bogusly return false, but there's no reason to actually do
so, at least on arm.
Kyle Evans [Tue, 1 Oct 2019 18:14:37 +0000 (18:14 +0000)]
Move httpd to simple_httpd...
This avoids PATH conflicts with a real httpd, as a user will likely almost
always prefer the more fully-featured httpd. This also lines up with the
historical name of the program.
Alan Cox [Tue, 1 Oct 2019 15:33:47 +0000 (15:33 +0000)]
In short, pmap_enter_quick_locked("user space", ..., VM_PROT_READ) doesn't
work. More precisely, it doesn't set ATTR_AP(ATTR_AP_USER) in the page
table entry, so any attempt to read from the mapped page by user space
generates a page fault. This problem has gone unnoticed because the page
fault handler, vm_fault(), will ultimately call pmap_enter(), which
replaces the non-working page table entry with one that has
ATTR_AP(ATTR_AP_USER) set.
This change reduces the number of page faults during a "buildworld" by
about 19.4%.
Kyle Evans [Tue, 1 Oct 2019 14:55:16 +0000 (14:55 +0000)]
Move simple_httpd out of picobsd, add HTTPD option (default OFF)
picobsd/tinyware has had this compact HTTPD server for a long time, and some
people do use it. Move it out into usr.sbin well in advance of any action
being taken on picobsd.
This has been gated behind an HTTPD option defaulted to *off*, primarily for
two reasons:
1.) This code likely needs a good audit, as it's been living off in picobsd
land for a long time, and
2.) We don't currently ship an httpd and this may not be a welcome surprise.
Reviewed by: eugen
Differential Revision: https://reviews.freebsd.org/D21724
Currently only suspend requests are acknowledged by writing an empty
string back to the xenstore control node, but poweroff or reboot
requests are not acknowledged and FreeBSD simply proceeds to perform
the desired action.
Fix this by acknowledging all requests, and remove the suspend specific
ack done in the handler.
Sponsored by: Citrix Systems R&D
MFC after: 3 days
Matt Macy [Mon, 30 Sep 2019 22:00:48 +0000 (22:00 +0000)]
Add iflag=fullblock to dd
Normally, count=n means read(2) will be called n times on the input to dd. If
the read() returns short, as may happen when reading from a pipe, fewer bytes
will be copied from the input. With conv=sync the buffer is padded with zeros
to fill the rest of the block.
iflag=fullblock causes dd to continue reading until the block is full, so that
count=n means n full blocks are copied. This flag is compatible with illumos
and GNU dd and is used in the ZFS test suite.
Submitted by: Ryan Moeller
Reviewed by: manpages, mmacy@
MFC after: 1 week
Sponsored by: iXsystems, Inc.
Differential Revision: https://reviews.freebsd.org/D21441
Matt Macy [Mon, 30 Sep 2019 21:56:42 +0000 (21:56 +0000)]
Add oflag=fsync and oflag=sync capability to dd
Sets the O_FSYNC flag on the output file. oflag=fsync and oflag=sync are
synonyms just as O_FSYNC and O_SYNC are synonyms. This functionality is
intended to improve portability of dd commands in the ZFS test suite.
Submitted by: Ryan Moeller
Reviewed by: manpages, mmacy@
MFC after: 1 week
Sponsored by: iXsytems, Inc.
Differential Revision: https://reviews.freebsd.org/D21422
Pull in r357528 from upstream llvm trunk (by Craig Topper):
[X86] Check MI.isConvertibleTo3Addr() before calling
convertToThreeAddress in X86FixupLEAs.
X86FixupLEAs just assumes convertToThreeAddress will return nullptr
for any instruction that isn't convertible.
But the code in convertToThreeAddress for X86 assumes that any
instruction coming in has at least 2 operands and that the second one
is a register. But those properties aren't guaranteed of all
instructions. We should check the instruction property first.
Pull in r365720 from upstream llvm trunk (by Craig Topper):
[X86] Don't convert 8 or 16 bit ADDs to LEAs on Atom in FixupLEAPass.
We use the functions that convert to three address to do the
conversion, but changing an 8 or 16 bit will cause it to create a
virtual register. This can't be done after register allocation where
this pass runs.
I've switched the pass completely to a white list of instructions
that can be converted to LEA instead of a blacklist that was
incorrect. This will avoid surprises if we enhance the three address
conversion function to include additional instructions in the future.
Fixes PR42565.
This should fix assertions/segfaults when compiling certain ports with
CPUTYPE=atom.
Mark Johnston [Mon, 30 Sep 2019 15:59:07 +0000 (15:59 +0000)]
Add IFLIB_SINGLE_IRQ_RX_ONLY.
As of r347221 the iflib legacy interrupt mode setup assumes that drivers
perform both receive and transmit processing from the interrupt handler.
This assumption is invalid in the vmxnet3 driver, so introduce the
IFLIB_SINGLE_IRQ_RX_ONLY flag to make iflib avoid tx processing in the
interrupt handler.
PR: 239118
Reported and tested by: Juraj Lutter <otis@sk.freebsd.org>
Obtained from: marius
Reviewed by: gallatin
MFC after: 3 days
Differential Revision: https://reviews.freebsd.org/D21831
Michael Tuexen [Mon, 30 Sep 2019 12:06:57 +0000 (12:06 +0000)]
Don't use stack memory which is not initialized.
Thanks to Mark Wodrich for reporting this issue for the userland stack in
https://github.com/sctplab/usrsctp/issues/380
This issue was also found for usrsctp by OSS-fuzz in
https://bugs.chromium.org/p/oss-fuzz/issues/detail?id=17778
ections into expected offset in binary format.
Calculate binary file offset using address field, bacause software know only offset to known data, not where to load segment.
With that patch, kernel .data section can have any alignment/offset - kernel boor fine.
amd64 pmap: batch chunk removal in pmap_remove_pages
pv list lock is the main bottleneck during poudriere -j 104 and
pmap_remove_pages is the most impactful consumer. It frees chunks with the lock
held even though it plays no role in correctness. Moreover chunks are often
freed in groups, sample counts during buildkernel (0-sized frees removed):
Michael Tuexen [Sun, 29 Sep 2019 10:45:13 +0000 (10:45 +0000)]
RFC 7112 requires a host to put the complete IP header chain
including the TCP header in the first IP packet.
Enforce this in tcp_output(). In addition make sure that at least
one byte payload fits in the TCP segement to allow making progress.
Without this check, a kernel with INVARIANTS will panic.
This issue was found by running an instance of syzkaller.
Reviewed by: jtl@
MFC after: 3 days
Sponsored by: Netflix, Inc.
Differential Revision: https://reviews.freebsd.org/D21665
This API is still young enough that I would expect no one to be dependant on
this yet... Swap the ordering while it's young to match Linux values to
potentially ease implementation of linuxolator syscall, being able to reuse
existing constants.
fdt_slicer: bump to SI_ORDER_THIRD following r347183
r347183 bumped GEOM classes to SI_ORDER_SECOND to resolve a race between
them and the initialization of devsoftc.mtx in devinit, but missed this
dependency on g_flashmap that may now lose the race against GEOM
classes/g_init.
There's a great comment that describes the situation that has also been
updated with the new ordering of GEOM classes.
This driver is for the usb phy present on rockchip SoC.
It only support RK3399 and host mode for now.
The driver expose the usb clock needed by the usb controller.
On rockchip board it seems that the value in the DTS
are not enough for reseting the chip, I don't know if
the value are really incorrect or if DELAY is not precise
enough or if the rockchip gpio driver have some "lag" of some
kind or not.
For now just add more delay.
Module resets where not implemented when rockchip clocks were commited.
Implement them.
Since all resets registers are contiguous a driver only need to give
the start offset and the number of resets. This avoid to have to declare
every resets.
PLL_MIPI is the last important PLL that we missed.
Add support for it.
Since it's one of the possible parent for TCON0 also add this clock
now that we can.
While here add some info about what video related clocks should be
enabled at boot and with what frequency.
Warner Losh [Sat, 28 Sep 2019 17:15:48 +0000 (17:15 +0000)]
Revert the mode_t -> int changes and add a warning in the BUGS section instead.
While FreeBSD's implementation of these expect an int inside of libc, that's an
implementation detail that we can hide from the user as it's the natural
promotion of the current mode_t type and before it is used in the kernel, it's
converted back to the narrower type that's the current definition of mode_t. As
such, documenting int is at best confusing and at worst misleading. Instead add
a note that these args are variadic and as such calling conventions may differ
from non-variadic arguments.
John Baldwin [Sat, 28 Sep 2019 14:20:28 +0000 (14:20 +0000)]
Disable build of LOCAL_MODULES for cross-builds by default.
WITHOUT_LOCAL_MODULES can be set to disable LOCAL_MODULES for native
builds. WITH_LOCAL_MODULES can be set to leave it enabled for cross
builds.
This does not use a knob in kern.opts.mk because the options framework
does not currently support options whose default varies on the build
type. I discussed a few options there with Warner (e.g. maybe having
a tri-state where the default value is "auto" and having Makefile.inc1
apply logic when MK_LOCAL_MODULES is set to "auto"), but Warner ok'd
this approach for now until a better solution is implemented.
Requested by: many
Reviewed by: imp (in person at EuroBSDCon)
Differential Revision: https://reviews.freebsd.org/D21608
John Baldwin [Sat, 28 Sep 2019 14:14:42 +0000 (14:14 +0000)]
Disable REPRODUCIBLE_BUILD for kernel builds.
The REPRODUCIBLE_BUILD option is actually managed in two separate
files. src.opts.mk governs the setting for world builds and
kern.opts.mk governs it for kernel builds. r350550 only changed the
default for world builds.
Michael Tuexen [Sat, 28 Sep 2019 13:13:23 +0000 (13:13 +0000)]
Replacing MD5 by SipHash improves the performance of the TCP time stamp
initialisation, which is important when the host is dealing with a
SYN flood.
This affects the computation of the initial TCP sequence number for
the client side.
This has been discussed with secteam@.
Reviewed by: gallatin@
Sponsored by: Netflix, Inc.
Differential Revision: https://reviews.freebsd.org/D21616
Michael Tuexen [Sat, 28 Sep 2019 13:05:37 +0000 (13:05 +0000)]
Ensure that the INP lock is released before leaving [gs]etsockopt()
for RACK specific socket options.
These issues were found by a syzkaller instance.
Reviewed by: rrs@
Sponsored by: Netflix, Inc.
Differential Revision: https://reviews.freebsd.org/D21825
bhyve: support for enabling/disabling the net backend
Extend the net backend interface with two functions, namely netbe_rx_disable()
and netbe_rx_enable(), which can be used by the net device emulators to stop
the backend from invoking the receive callback. This is useful for device
emulators, i.e., on hardware resets or to implement receive backpressure.
The mevent module has been extendede to support the addition of a disabled
event. To prevent race conditions, the net backends will start with receive
operation disabled. A follow-up patch will use the new functionalities in
the virtio-net device.
Move the SysV IPC stuff out of the 'abi' rc script, into a new one:
'sysvipc' - it has nothing to do with ABIs, and I'd like to later
rename 'abi' to 'linux', which better describes its purpose and also
matches the rcvar name.
Reviewed by: emaste
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D21615
powerpc/booke64: Align initial stack setting to match that of aim64's
Clang9/LLD9 appears to get quite confused with the instruction stream used
to obtain the tmpstack pointer, almost as though it thinks this is a C
function, so tries to optimize it. Since the AIM64 method doesn't use the
TOC to obtain the tmpstack, just follow that model, and lld won't get
confused.
dpaa(4): Fix memcpy size for threshold copy in NCSW contrib
On 64-bit platforms uintptr_t makes the copy twice as large as it should be.
This code isn't actually used in FreeBSD, since it's for guest mode only,
not hypervisor mode, but fixing it for completeness sake.
Mark Johnston [Sat, 28 Sep 2019 01:42:59 +0000 (01:42 +0000)]
Fix some problems with the SPARSE_MAPPING option in the kernel linker.
- Ensure that the end of the mapping passed to vm_page_wire() is
page-aligned. vm_page_wire() expects this.
- Wire pages before reading data into them.
- Apply protections specified in the segment descriptor using
vm_map_protect() once relocation processing is done.
- On amd64, ensure that we load KLDs above KERNBASE, since they
are compiled with the "kernel" memory model by default.
Warner Losh [Fri, 27 Sep 2019 20:56:44 +0000 (20:56 +0000)]
Push and pop xtrace correctly for run_early_customize
run_early_customize is run as a shell list, not as a subshell, so that the side
effects of setting variables can affect later stages of the build (for better or
worse, it's been like this since it was introduced). It therefore has the side
effect of turning off xtrace always, which limits the usefulness of sh -x
nanobsd.sh. Remember the old setting and only turn off tracing after the command
if tracing was off before. All the other places where we do similar things we use
a subshell, so we don't need to do this.
Warner Losh [Fri, 27 Sep 2019 20:56:31 +0000 (20:56 +0000)]
Remove workaround for building on FreeBSD hosts prior to FreeBSD 10.
rm -x was introduced in the FreeBSD 10 time frame. 4 years ago I added a
function to cope with building nanobsd images on hosts as old FreeBSD 7 that
lacked rm -x. The workaround is no longer needed as FreeBSD 9 hasn't been
supported for almost 3 years. Eliminate the wrapper and use rm -x directly
again.
Make fractional delays for top(1) work for interactive mode.
In r334906, the -s option was changed to allow fractional times, but
this only functioned correctly for batch mode. In interactive mode, any
delay below 1.0 would get floored to zero. This would put top(1) into a
tight loop, which could be difficult to interrupt.
Fix this by storing the -s option value (after validation) into a struct
timeval, and using that struct consistently for delaying with select(2).
Next up is to allow interactive entry of a fractional delay value.
Andrew Gallatin [Fri, 27 Sep 2019 20:08:19 +0000 (20:08 +0000)]
kTLS: Fix a bug where we would not encrypt anon data inplace.
Software Kernel TLS needs to allocate a new destination crypto
buffer when encrypting data from the page cache, so as to avoid
overwriting shared clear-text file data with encrypted data
specific to a single socket. When the data is anonymous, eg, not
tied to a file, then we can encrypt in place and avoid allocating
a new page. This fixes a bug where the existing code always
assumes the data is private, and never encrypts in place. This
results in unneeded page allocations and potentially more memory
bandwidth consumption when doing socket writes.
When the code was written at Netflix, ktls_encrypt() looked at
private sendfile flags to determine if the pages being encrypted
where part of the page cache (coming from sendfile) or
anonymous (coming from sosend). This was broken internally at
Netflix when the sendfile flags were made private, and the
M_WRITABLE() check was added. Unfortunately, M_WRITABLE() will
always be false for M_NOMAP mbufs, since one cannot just mtod()
them.
This change introduces a new flags field to the mbuf_ext_pgs
struct by stealing a byte from the tls hdr. Note that the current
header is still 2 bytes larger than the largest header we
support: AES-CBC with explicit IV. We set MBUF_PEXT_FLAG_ANON
when creating an unmapped mbuf in m_uiotombuf_nomap() (which is
the path that socket writes take), and we check for that flag in
ktls_encrypt() when looking for anon pages.
Andrew Gallatin [Fri, 27 Sep 2019 19:17:40 +0000 (19:17 +0000)]
kTLS support for TLS 1.3
TLS 1.3 requires a few changes because 1.3 pretends to be 1.2
with a record type of application data. The "real" record type is
then included at the end of the user-supplied plaintext
data. This required adding a field to the mbuf_ext_pgs struct to
save the record type, and passing the real record type to the
sw_encrypt() ktls backend functions.
The current mechanism is bogus in several ways:
- the limit is a percentage of total entries added, which means negative
entries get evicted all the time even if there are plenty of resources
- evicting code is almost not concurrent, which makes it unable to
remove entries fast enough when doing something as simple as -j 104
buildworld
- there is no support for performing mass removal if necessary
Vast majority of negative entries never get any hits. Only evicting
them when the filesystem demands it results in a significant growth of
the namecache with almost no improvement in the hit ratio.
Sample result about afer 90 minutes of poudriere -j 104:
cache: make negative list shrinking a little bit concurrent
Continue protecting demotion from the hotlist and selection of the
target list with the ncneg_shrink_lock lock, but drop it before
relocking to zap the node.
While here count how many times we skipped shrinking due to the lock
being already taken.