Andrew Gallatin [Sat, 19 Dec 2020 22:04:46 +0000 (22:04 +0000)]
Filter TCP connections to SO_REUSEPORT_LB listen sockets by NUMA domain
In order to efficiently serve web traffic on a NUMA
machine, one must avoid as many NUMA domain crossings as
possible. With SO_REUSEPORT_LB, a number of workers can share a
listen socket. However, even if a worker sets affinity to a core
or set of cores on a NUMA domain, it will receive connections
associated with all NUMA domains in the system. This will lead to
cross-domain traffic when the server writes to the socket or
calls sendfile(), and memory is allocated on the server's local
NUMA node, but transmitted on the NUMA node associated with the
TCP connection. Similarly, when the server reads from the socket,
he will likely be reading memory allocated on the NUMA domain
associated with the TCP connection.
This change provides a new socket ioctl, TCP_REUSPORT_LB_NUMA. A
server can now tell the kernel to filter traffic so that only
incoming connections associated with the desired NUMA domain are
given to the server. (Of course, in the case where there are no
servers sharing the listen socket on some domain, then as a
fallback, traffic will be hashed as normal to all servers sharing
the listen socket regardless of domain). This allows a server to
deal only with traffic that is local to its NUMA domain, and
avoids cross-domain traffic in most cases.
This patch, and a corresponding small patch to nginx to use
TCP_REUSPORT_LB_NUMA allows us to serve 190Gb/s of kTLS encrypted
https media content from dual-socket Xeons with only 13% (as
measured by pcm.x) cross domain traffic on the memory controller.
Andrew Gallatin [Sat, 19 Dec 2020 21:46:09 +0000 (21:46 +0000)]
Optionally bind ktls threads to NUMA domains
When ktls_bind_thread is 2, we pick a ktls worker thread that is
bound to the same domain as the TCP connection associated with
the socket. We use roughly the same code as netinet/tcp_hpts.c to
do this. This allows crypto to run on the same domain as the TCP
connection is associated with. Assuming TCP_REUSPORT_LB_NUMA
(D21636) is in place & in use, this ensures that the crypto source
and destination buffers are local to the same NUMA domain as we're
running crypto on.
This change (when TCP_REUSPORT_LB_NUMA, D21636, is used) reduces
cross-domain traffic from over 37% down to about 13% as measured
by pcm.x on a dual-socket Xeon using nginx and a Netflix workload.
Gordon Bergling [Sat, 19 Dec 2020 14:54:28 +0000 (14:54 +0000)]
libc: Fix most issues reported by mandoc
- varios "new sentence, new line" warnings
- varios "sections out of conventional order" warnings
- varios "unusual Xr order" warnings
- varios "missing section argument" warnings
- varios "no blank before trailing delimiter" warnings
- varios "normalizing date format" warnings
Gordon Bergling [Sat, 19 Dec 2020 13:51:46 +0000 (13:51 +0000)]
zonectl(8): Fix a few issues reported by mandoc
- Add missing quotation mark for a comment above the .Dd
- inserting missing end of block: Sh breaks Bd
- skipping paragraph macro: Pp before Bl
- skipping paragraph macro: Pp before Bd
- empty block: Bd
Gordon Bergling [Sat, 19 Dec 2020 13:36:59 +0000 (13:36 +0000)]
bluetooth: Fix a mandoc related issues
- new sentence, new line
- sections out of conventional order: Sh FILES
- unusual Xr order: bthost(1) after bthidd(8)
- no blank before trailing delimiter
- whitespace at end of input line
- sections out of conventional order: Sh EXIT STATUS
Gordon Bergling [Sat, 19 Dec 2020 12:47:40 +0000 (12:47 +0000)]
ipfw(8): Fix a few mandoc related issues
- no blank before trailing delimiter
- missing section argument: Xr inet_pton
- skipping paragraph macro: Pp before Ss
- unusual Xr order: syslogd after sysrc
- tab in filled text
There were a few multiline NAT examples which used the .Dl macro with
tabs. I converted them to .Bd, which is a more suitable macro for that case.
Gordon Bergling [Sat, 19 Dec 2020 11:47:38 +0000 (11:47 +0000)]
nvmecontrol(8): Fix a few mandoc related issues and add a SEE ALSO section
- inserting missing end of block: Ss breaks Bl
- skipping paragraph macro: Pp before Ss
- referenced manual not found: Xr nvme 4 (2 times)
- unknown standard specifier: St The
The macro .St can only be used for standards known by mdoc(7). So add a
SEE ALSO section and add a reference to the NVM Express Base Specification.
Gordon Bergling [Sat, 19 Dec 2020 10:15:58 +0000 (10:15 +0000)]
bhnd_erom(9): Fix a few mandoc related issues
- skipping paragraph macro: Pp before Bl
- skipping paragraph macro: Pp after Ss
- skipping paragraph macro: Pp at the end of Ss
- unusual Xr punctuation: none before bhnd_driver_get_erom_class(9)
- unusual Xr punctuation: none before bus_space(9)
Gordon Bergling [Sat, 19 Dec 2020 10:11:37 +0000 (10:11 +0000)]
bhnd(9): Fix a few mandoc related issues
- skipping paragraph macro: Pp before Bl
- skipping paragraph macro: Pp at the end of Ss
- missing section argument: Xr device_set_desc
- unusual Xr punctuation: none before bhnd_erom(9)
Gordon Bergling [Sat, 19 Dec 2020 09:55:02 +0000 (09:55 +0000)]
disk(9): Fix a few mandoc related errors
- function name without markup: g_io_deliver()
- function name without markup: disk_gone()
- sections out of conventional order: Sh SEE ALSO
- referenced manual not found: Xr MAKE_DEV 9
Actually the man page of MAKE_DEV has never existed.
Ryan Libby [Sat, 19 Dec 2020 08:38:31 +0000 (08:38 +0000)]
rtld-elf: link udivmoddi4 from compiler_rt
This fixes the gcc9 build of rtld-elf32 on amd64, which needed an
implementation of udivmoddi4.
rtld-elf uses certain functions normally found in libc, and so it
includes certain files from libc in its own build. It has two
mechanisms to include files from libc: one that rebuilds source files in
the rtld-elf environment, and one that extracts object files from a
purpose-built no-SSP PIC archive.
In addition to libc functions, rtld-elf may need to link functions
normally found in libcompiler_rt (formerly libgcc). Now, add an ability
to rebuild libcompiler_rt source files in the rtld-elf environment. We
don't yet have a need for an object file extraction mechanism.
libcompiler_rt could also supply udivdi3 and umoddi3, but leave them
alone for now.
Ryan Libby [Sat, 19 Dec 2020 08:38:27 +0000 (08:38 +0000)]
rtld-libc: fix incremental build
ar cr is an update of an archive, not a creation of a new one. During
incremental builds (e.g. with meta mode) the archive was not getting
cleaned, and so could retain now-deleted objects from previous builds.
Now, delete the archive before creating/updating it.
Kyle Evans [Sat, 19 Dec 2020 03:30:06 +0000 (03:30 +0000)]
kern: cpuset: allow jails to modify child jails' roots
This partially lifts a restriction imposed by r191639 ("Prevent a superuser
inside a jail from modifying the dedicated root cpuset of that jail") that's
perhaps beneficial after r192895 ("Add hierarchical jails."). Jails still
cannot modify their own cpuset, but they can modify child jails' roots to
further restrict them or widen them back to the modifying jails' own mask.
As a side effect of this, the system root may once again widen the mask of
jails as long as they're still using a subset of the parent jails' mask.
This was previously prevented by the fact that cpuset_getroot of a root set
will return that root, rather than the root's parent -- cpuset_modify uses
cpuset_getroot since it was introduced in r327895, previously it was just
validating against set->cs_parent which allowed the system root to widen
jail masks.
Pedro F. Giffuni [Sat, 19 Dec 2020 03:07:38 +0000 (03:07 +0000)]
login(1): when exporting variables check the result of setenv(3)
When exporting a variable we correctly check all the preconditions that
could make setenv(3) fail. Checking the setenv(3) return value seems
redundant, but given that login(1) is critical, it doesn't hurt to have
a post-check.
This change is based on the "Principles of Secure Coding" course by
Matthew Bishop, PhD., which specifically discusses this code in FreeBSD.
(This change redoes r368776 due to a silly mistake)
Pedro F. Giffuni [Sat, 19 Dec 2020 02:23:53 +0000 (02:23 +0000)]
login(1): when exporting variables check the result of setenv(3)
When exporting a variable we correctly check all the preconditions that
could make setenv(3) fail. Checking the setenv(3) return value seems
redundant, but given that login(1) is critical, it doesn't hurt to have
a post-check.
This change is based on the "Principles of Secure Coding" course by
Matthew Bishop, PhD., which specifically discusses this code in FreeBSD.
Jessica Clarke [Fri, 18 Dec 2020 23:31:36 +0000 (23:31 +0000)]
usb: Replace ITUNERNET vendor with MICROCHIP and improve product names
These Mini-Box LCDs are using Microchip components and sub-licensed product
IDs. Whilst here, update the constant names and descriptions for the products
to use the names listed on the manufacturer's website rather than vague ones.
The picoLCD 4x20 is named that on the manufacturer's website so prefer that
name, even though linux-usb.org lists it with the numbers reversed as one might
expect.
Kirk McKusick [Fri, 18 Dec 2020 23:28:27 +0000 (23:28 +0000)]
Rename pass4check() to freeblock() and move from pass4.c to inode.c.
The new name more accurately describes what it does and the file move
puts it with other similar functions. Done in preparation for future
cleanups. No functional differences intended.
Sponsored by: Netflix
Historic Footnote: my last FreeBSD svn commit
Switch direct rt fields access in rtsock.c to newly-create field acessors.
rtsock code was build around the assumption that each rtentry record
in the system radix tree is a ready-to-use sockaddr. This assumptions
turned out to be not quite true:
* masks have their length tweaked, so we have rtsock_fix_netmask() hack
* IPv6 addresses have their scope embedded, so we have another explicit
deembedding hack.
Change the code to decouple rtentry internals from rtsock code using
newly-created rtentry accessors. This will allow to eventually eliminate
both of the hacks and change rtentry dst/mask format.
Mark Johnston [Fri, 18 Dec 2020 16:04:48 +0000 (16:04 +0000)]
acpi: Ensure that adjacent memory affinity table entries are coalesced
The SRAT may contain multiple distinct entries that together describe a
contiguous region of physical memory. In this case we were not
coalescing the corresponding entries in the memory affinity table, which
led to fragmented phys_avail[] entries. Since r338431 the vm_phys_segs[]
entries derived from phys_avail[] will be coalesced, resulting in a
situation where vm_phys_segs[] entries do not have a covering
phys_avail[] entry. vm_page_startup() will not add such segments to the
physical memory allocator, leaving them unused.
Reported by: Don Morris <dgmorris@earthlink.net>
Reviewed by: kib, vangyzen
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D27620
Jessica Clarke [Fri, 18 Dec 2020 15:07:14 +0000 (15:07 +0000)]
virtio_mmio: Fix feature negotiation copy-paste issue in r361943
This caused us to write to the low half of the feature word twice, once with
the high bits and once with the low bits. Common legacy device implementations
seem to be fairly lenient about being able to write to the feature bits
multiple times, but Arm's models use a stricter implementation that will ignore
the second write. This fixes using vtnet(4) on those models.
Marcin Wojtas [Fri, 18 Dec 2020 10:09:21 +0000 (10:09 +0000)]
Fix abort in jemalloc extent coalescing.
Fix error in extent_try_coalesce_impl(), which could cause abort
to happen when trying to coalesce extents backwards. The error could
happen because of how extent_before_get() function works. This function
gets address of previous extent, by subtracting page size from current
extent address. If current extent is located at PAGE_SIZE offset, this
address resolved to 0x0000. An assertion in rtree_leaf_elm_lookup
then caused the running program to abort.
This problem was discovered when trying to build world on 32-bit
machines with ASLR and PIE enabled. The problem was encountered
on armv7 and i386 machines, but most likely other 32-bit
architectures are affected as well.
While this patch fixes one problem with buildworld on 32-bit platforms
with ASLR, the build still fails, however it happens much later
and due to lack of memory.
The change is aligned with accepted fix in the upstream Jemalloc
repository (https://github.com/jemalloc/jemalloc/pull/1973).
As it doesn't apply on top of Jemalloc tree, its updated version
was eventually merged: https://github.com/jemalloc/jemalloc/pull/2003
pci_iov: When pci_iov_detach(9) is called, destroy VF children
instead of bailing out with EBUSY if there are any.
If driver module is unloaded, or just device is forcibly detached from
the driver, there is no way for driver to correctly unload otherwise.
Esp. if there are resources dedicated to the VFs which prevent turning
down other resources.
Peter Grehan [Fri, 18 Dec 2020 00:38:48 +0000 (00:38 +0000)]
Fix issues with various VNC clients.
- support VNC version 3.3 (macos "Screen Sharing" builtin client)
- wait until client has requested an update prior to sending framebuffer data
- don't send an update if no framebuffer updates detected
- increase framebuffer poll frequency to 30Hz, and double that when
kbd/mouse input detected
- zero uninitialized array elements in rfb_send_server_init_msg()
- fix overly large allocation in rfb_init()
- use atomics for flags shared between input and output threads
- use #defines for constants
This work was contributed by Marko Kiiskila, with reuse of some earlier
work by Henrik Gulbrandsen.
John Baldwin [Thu, 17 Dec 2020 20:31:17 +0000 (20:31 +0000)]
Use a template assembly file for firmware object files.
Similar to r366897, this uses the .incbin directive to pull in a
firmware file's contents into a .fwo file. The same scheme for
computing symbol names from the filename is used as before to maximize
compatiblity and not require rebuilding existing .fwo files for
NO_CLEAN builds. Using ld -o binary requires extra hacks in linkers
to either specify ABI options (e.g. soft- vs hard-float) or to ignore
ABI incompatiblities when linking certain objects (e.g. object files
with only data). Using the compiler driver avoids the need for these
hacks as the compiler driver is able to set all the appropriate ABI
options.
Ed Maste [Thu, 17 Dec 2020 19:58:29 +0000 (19:58 +0000)]
Add initial version of git commit message preparation hook
Start with a slightly modified version of the SVN commit template, to
allow developers to experiment. This will be updated in the future as
our process and techniques evolve.
This can be installed by copying or symlinking into the .git/hooks/
directory.
Feedback from: cem, jhb
Reviewed by: imp
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D27633
Fix a race in tty_signal_sessleader() with unlocked read of s_leader.
Since we do not own the session lock, a parallel killjobc() might
reset s_leader to NULL after we checked it. Read s_leader only once
and ensure that compiler is not allowed to reload.
While there, make access to t_session somewhat more pretty by using
local variable.
Kyle Evans [Thu, 17 Dec 2020 18:24:36 +0000 (18:24 +0000)]
lualoader: cli: provide a show-module-options loader command
This effectively dumps everything lualoader knows about to the console using
the libsa pager; that particular lua interface was added in r368591.
A pager stub implementation has been added that just dumps the output as-is
as a compat shim for older loader binaries that do not have lpager. This
stub should be moved into a more appropriate .lua file if we add anything
else that needs the pager.
Emmanuel Vadot [Thu, 17 Dec 2020 17:11:14 +0000 (17:11 +0000)]
Add IRQ resource to SPIBUS
Add capability to SPIBUS to have child device with IRQ.
For example many ADC chip have a dedicated pin to signal "data ready"
and the host can just wait for a interrupt to go out and read the result.
It is the same code as in R282674 and R282702 for IICBUS by Michal Meloun
Submitted by: Oskar Holmund <oskar.holmlund@ohdata.se>
Differential Revision: https://reviews.freebsd.org/D27396
Change POSIX compliance level for visibility of strerror_l(3).
Third-party code tests for strerror_l(3) without specifying
_POSIX_SOURCE, and then expects that the function is prototyped with
_POSIX_SOURCE set to 200112.
Reported and tested by: antoine
Sponsored by: The FreeBSD Foundation
MFC after: 13 days
[bhyve] virtio-net: Do not allow receiving packets until features have been negotiated.
Enforce the requirement that the RX callback cannot be called after a reset until the features have been negotiated.
This fixes a race condition where the receive callback is called during a device reset.
Mitchell Horne [Thu, 17 Dec 2020 15:00:19 +0000 (15:00 +0000)]
bsdinstall: remove VTOC8 partition scheme option
Now that sparc64 has been removed, there are no kernels built with
support for the VTOC8 partitioning scheme by default. Remove the option
from the installer, as it is unsupported on all installer images
produced by re@.
The existing logic is nice in theory, but in practice freebsd-update will
not preserve the timestamps on these files. When doing a major upgrade, e.g.
from 12.1-RELEASE -> 12.2-RELEASE, pwd.mkdb et al. appear in the INDEX and
we clobber the timestamp several times in the process of packaging up the
existing system into /var/db/freebsd-update/files and extracting for
comparisons. This leads to these files not getting regenerated when they're
most likely to be needed.
Measures could be taken to preserve timestamps, but it's unclear whether
the complexity and overhead of doing so is really outweighed by the marginal
benefit.
I observed this issue when pkg subsequently failed to install a package that
wanted to add a user, claiming that the user was removed in the process.
bapt@ pointed to this pre-existing bug with freebsd-update as the cause.
Pedro F. Giffuni [Thu, 17 Dec 2020 02:54:32 +0000 (02:54 +0000)]
/etc/services: attempt to bring the database to this century 2/2.
This is the final half of splitting r358153 in two, in order to avoid a build
system bugs and being able to merge an earlier change to previous releases.
Add a note to UPDATING to avoid people building from very old systems from
having issues with mergemaster
Rick Macklem [Thu, 17 Dec 2020 00:20:57 +0000 (00:20 +0000)]
Make mountcritremote dependent upon nfscbd.
Although it is not often needed, the nfscbd(8) should be running when
NFSv4 mounts are done if callback functionality is required.
Callback functionality is required for the NFSv4 server to issue
delegations or pNFS layouts.
This patch adds nfscbd to the mountcritremote's REQUIRED line
to ensure it is started before NFS mounts specified in /etc/fstab
are done.
Brooks Davis [Thu, 17 Dec 2020 00:00:21 +0000 (00:00 +0000)]
newvers.sh: Speed up git_tree_modified
We're looking for file content differences, so ask the question of git
more directly. This helps a lot, saving tens of thousands of fork()s,
when the builder and editor see different stat() results (e.g., UIDs),
as they might with containers.
Mateusz Guzik [Wed, 16 Dec 2020 18:01:41 +0000 (18:01 +0000)]
fd: remove redundant saturation check from fget_unlocked_seq
refcount_acquire_if_not_zero returns true on saturation.
The case of 0 is handled by looping again, after which the originally
found pointer will no longer be there.
Michal Meloun [Wed, 16 Dec 2020 14:39:24 +0000 (14:39 +0000)]
Use the standard method for localizing of MSI-X table bar.
Current way, hardcoded value plus heuristic is not conform to the PCI(e)
specification and it fails on systems where MSI-X bar is not initialized by
BIOS/ACPI (many arm or arm64 systems for example).
Instead, use the standard PCI(e) capability for determining of
MSIX table bar address.