prison0's hostuuid will get set by the hostid rc script, either after
generating it and saving it to /etc/hostid or by simply reading /etc/hostid.
Some things (e.g. arbitrary MAC address generation) may use the hostuuid as
a factor in early boot, so providing a way to read /etc/hostid (if it's
available) and using it before userland starts up is desirable. The code is
written such that the preload doesn't *have* to be /etc/hostid, thus not
assuming that there will be newline at the end of the buffer or even the
exact shape of the newline. White trailing whitespace/non-printables
trimmed, the result will be validated as a valid uuid before it's used for
early boot purposes.
The preload can be turned off with hostuuid_load="NO" in /boot/loader.conf,
just as other preloads; it's worth noting that this is a 37-byte file, the
overhead is believed to be generally minimal.
It doesn't seem necessary at this time to be concerned with kern.hostid.
One does wonder if we should consider validating hostuuids coming in
via jail_set(2); some bits seem to care about uuid form and we bother
validating format of smbios-provided uuid and in-fact whatever uuid comes
from /etc/hostid.
Stop attempting to use FADT legacy hardware flag, it is almost always
a lie.
Instead, unless user explicitly disabled the calibration, calibrate
against 8254 ISA clock. Then, obtain the rough value of the expected
TSC frequency from CPUID leafs 0x15/0x16 or even from the CPU
marketing name string. If calibration results look unbelievably bogus
comparing with CPUID leafs report, use CPUID one.
Intel does not recommend to use CPUID leaf 0x16 for the value of the
system time frequency, indeed the error there might be up to 1% which
e.g. makes ntpd give up. If ISA clock is present, we win, if not, we
get some frequency that allows the machine to boot without enormous
delay.
Next improvement would be to use HPET for re-calibration if we decided
that ISA clock gives bogus results, after HPETs are enumerated. This
is a much bigger change since we probably would need to re-evaluate
some constants depending on TSC frequency.
Reviewed by: emaste, jhb, scottl
Tested by: scottl
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
Differential revision: https://reviews.freebsd.org/D24426
Rick Macklem [Wed, 15 Apr 2020 21:27:52 +0000 (21:27 +0000)]
Fix the NFSv4.2 extended attribute support for remove extended attrbute.
I missed the "atomic" field of the RemoveExtendedAttribute operation's
reply when I implemented it. It worked between FreeBSD client and server,
since it was missed for both, but it did not conform to RFC 8276.
This patch adds the field for both client and server.
Thanks go to Frank for doing interoperability testing of the extended
attribute support against patches for Linux.
Submitted by: Frank van der Linden <fllinden@amazon.com>
Reported by: Frank van der Linden <fllinden@amazon.com>
Default moea64_bpvo_pool_size 327680 was insufficient for initial
memory mapping at boot time on systems with, for example, 64G and
no huge pages enabled.
Submitted by: Andre Silva <afscoelho@gmail.com>
Reviewed by: jhibbits, alfredo
Approved by: jhibbits (mentor)
Sponsored by: Eldorado Research Institute (eldorado.org.br)
Differential Revision: https://reviews.freebsd.org/D24102
Brooks Davis [Wed, 15 Apr 2020 20:23:55 +0000 (20:23 +0000)]
Export argc, argv, envc, envv, and ps_strings in auxargs.
This simplifies discovery of these values, potentially with reducing the
number of syscalls we need to make at runtime. Longer term, we wish to
convert the startup process to pass an auxargs pointer to _start() and
use that rather than walking off the end of envv. This is cleaner,
more C-friendly, and for systems with strong bounds (e.g. CHERI)
necessary.
Brooks Davis [Wed, 15 Apr 2020 20:19:59 +0000 (20:19 +0000)]
Introduce an AUXARGS_ENTRY_PTR() macro.
As the name implys, it uses the a_ptr member of the auxarg entry (except
in compat32 where it uses a_val). This is more correct and required for
systems where a_val is not the same size or hardware type as a_ptr (e.g.
CHERI).
John Baldwin [Wed, 15 Apr 2020 19:28:51 +0000 (19:28 +0000)]
Set inp_flowid's for TOE connections.
KTLS uses the flowid to distribute software encryption tasks among its
pool of worker threads. Without this change, all software KTLS
requests for TOE sockets ended up on the first worker thread.
Note that the flowid for TOE sockets created via connect() is not a
hash of the 4-tuple, but is instead the id of the TOE pcb (tid). The
flowid of TOE sockets created from TOE listen sockets do use the
4-tuple RSS hash as the flowid since the firmware provides the hash in
the message containing the original SYN.
Revert commit a9ad65a2b from llvm git (by Nemanja Ivanovic):
[PowerPC] Change default for unaligned FP access for older subtargets
This is a fix for https://bugs.llvm.org/show_bug.cgi?id=40554
Some CPU's trap to the kernel on unaligned floating point access and
there are kernels that do not handle the interrupt. The program then
fails with a SIGBUS according to the PR. This just switches the
default for unaligned access to only allow it on recent server CPUs
that are known to allow this.
This upstream commit causes a compiler hang when building certain ports
(e.g. security/nss, multimedia/x264) for powerpc64. The hang has been
reported in https://bugs.llvm.org/show_bug.cgi?id=45186, but in the mean
time it is more convenient to revert the commit.
validate_uuid: absorb the rest of parse_uuid with a flags arg
This makes the naming annoyance (validate_uuid vs. parse_uuid) less of an
issue and centralizes all of the functionality into the new KPI while still
making the extra validation optional. The end-result is all the same as far
as hostuuid validation-only goes.
Brooks Davis [Wed, 15 Apr 2020 18:15:58 +0000 (18:15 +0000)]
Fix -Wvoid-pointer-to-enum-cast warnings.
This pattern is used in callbacks with void * data arguments and seems
both relatively uncommon and relatively harmless. Silence the warning
by casting through uintptr_t.
sshd: Warn about missing ssh-keygen only when necessary
The sshd service is using ssh-keygen to generate missing SSH keys.
If ssh-keygen is missing, it prints the following message:
> /etc/rc.d/sshd: WARNING: /usr/bin/ssh-keygen does not exist.
It makes sense when the key is not generated yet and
cannot be created because ssh-keygen is missing.
The problem is that even if the key is present on the host,
the sshd service would still warn about missing ssh-keygen
(even though it does not need it).
mmc_fdt_helpers: Do not schedule a card detection is there is no cd gpio
If the fdt node doesn't have a cd-gpios properties or if the node is set
as non-removable we do not init the card detection timeout task as it is
useless so don't schedule it too.
kern uuid: break format validation out into a separate KPI
This new KPI, validate_uuid, strictly validates the formatting of the input
UUID and, optionally, populates a given struct uuid.
As noted in the header, the key differences are that the new KPI won't
recognize an empty string as a nil UUID and it won't do any kind of semantic
validation on it. Also key is that populating a struct uuid is optional, so
the caller doesn't necessarily need to allocate a bogus one on the stack
just to validate the string.
This KPI has specifically been broken out in support of D24288, which will
preload /etc/hostid in loader so that early boot hostuuid users (e.g.
anything that calls ether_gen_addr) can have a valid hostuuid to work with
once it's been stashed in /etc/hostid.
Add an implementatation of the 'Virtual Machine Generation ID' spec to
Bhyve. The spec provides a randomly generated GUID (at bhyve start) in
device memory, along with an ACPI device with _CID VM_Gen_Counter and ADDR
evaluating to a Package pointing at that GUID.
A GPE is defined which Notifies the ACPI Device when the generation changes
(such as when a snapshot is rolled back). At this time, Bhyve does not
support snapshotting, so the GPE is never actually raised.
To allow more general use of the bootrom region, separate initialization from
allocation, and allocation from loading a file.
The bootrom segment is the high 16MB of the low 4GB region.
Each allocation in the segment creates a new mapping with specified protection.
By default, allocation begins at the low end of the range. However, the
BOOTROM_ALLOC_TOP flag is provided to locate a provided bootrom in the high
region it is expected to be in.
The existing ROM-file loading code is refactored to use the new interface.
It is not valid to pass BUS_SPACE_UNRESTRICTED to bus_dma_tag_create()'s
nsegments parameter as it is interpreted as a very large segment count.
Subsequent allocation operations on the tag will preallocate some multiple of
that count. BUS_SPACE_UNRESTRICTED therefore indicates something like:
malloc(infinity).
John Baldwin [Wed, 15 Apr 2020 00:14:50 +0000 (00:14 +0000)]
Remove support for geli(4) algorithms deprecated in r348206.
This removes support for reading and writing volumes using the
following algorithms:
- Triple DES
- Blowfish
- MD5 HMAC integrity
In addition, this commit adds an explicit whitelist of supported
algorithms to give a better error message when an invalid or
unsupported algorithm is used by an existing volume.
Reviewed by: cem
Sponsored by: Chelsio Communications
Differential Revision: https://reviews.freebsd.org/D24343
tests: audit: mark closefrom test an expected fail for now
closefrom has been converted to close_range internally; remediation is
underway for this, marking it as an expected fail for now while proper
course is determined.
Rick Macklem [Tue, 14 Apr 2020 22:57:21 +0000 (22:57 +0000)]
Fix the NFSv2 extended attribute support to handle 0 length attributes.
I did not realize that zero length attributes are allowed, but they are.
This patch fixes the NFSv4.2 client and server to handle zero length
extended attributes correctly.
Submitted by: Frank van der Linden <fllinden@amazon.com> (earlier version)
Reported by: Frank van der Linden <fllinder@amazon.com>
Reorganise nd6 notification code to avoid direct rtentry field access.
One of the goals of the new routing KPI defined in r359823 is to entirely hide
`struct rtentry` from the consumers. Doing so will allow to improve routing
subsystem internals and deliver features more easily. This change is one of
the ongoing changes to eliminate direct struct rtentry field accesses.
It introduces rtfree_func() wrapper around RTFREE() and reorganises nd6 notification
code to avoid accessing most of the rtentry fields.
Brooks Davis [Tue, 14 Apr 2020 20:30:48 +0000 (20:30 +0000)]
Centralize compatability translation macros.
Copy the CP, PTRIN, etc macros from freebsd32.h into a sys/abi_compat.h
and replace existing definitation with includes where required. This
eliminates duplicate code and allows Linux and FreeBSD compatability
headers to be included in the same files.
The upstream DTS now include the thermal device node and the SID
calibration entry.
Update our driver to cope with this change and remove the DTB
overlays that aren't needed anymore.
Michael Tuexen [Tue, 14 Apr 2020 16:35:05 +0000 (16:35 +0000)]
Improve the TCP blackhole detection. The principle is to reduce the
MSS in two steps and try each candidate two times. However, if two
candidates are the same (which is the case in TCP/IPv6), this candidate
was tested four times. This patch ensures that each candidate actually
reduced the MSS and is only tested 2 times. This reduces the time window
of missclassifying a temporary outage as an MTU issue.
The fdt properties are now parsed via the help of mmc_fdt_helper functions.
This also adds card detection.
Note that on some boards (like the Pine64) card detection is broken due to
a missing resistor on the cd pin.
Those functions are here to help fdt mmc controller drivers to parse
the dts to find the supported speeds and the regulators.
Not all DTS have every settings properly defined so host controller
will still have to add some caps themselves.
It also add a mmc_fdt_gpio_setup function which will read the cd-gpios
property and register it as the CD pin.
If the pin support interrupts one will be registered and the cd_helper
function will be called.
If the pin doesn't support interrupts the internal taskqueue will poll
for change and call the same cd_helper function.
mmc_fdt_gpio_setup will also parse the wp-gpio property and MMC drivers
can know the write-protect pin value by calling the
mmc_fdt_gpio_get_readonly function.
Make sonewconn() overflow messages have per-socket rate-limits and values.
sonewconn() emits debug-level messages when a listen socket's queue
overflows. Currently, sonewconn() tracks overflows on a global basis. It
will only log one message every 60 seconds, regardless of how many sockets
experience overflows. And, when it next logs at the end of the 60 seconds,
it records a single message referencing a single PCB with the total number
of overflows across all sockets.
This commit changes to per-socket overflow tracking. The code will now
log one message every 60 seconds per socket. And, the code will provide
per-socket queue length and overflow counts. It also provides a way to
change the period between log messages using a sysctl.
Print more detail as part of the sonewconn() overflow message.
When a socket's listen queue overflows, sonewconn() emits a debug-level
log message. These messages are sometimes useful to systems administrators
in highlighting a process which is not keeping up with its listen queue.
This commit attempts to enhance the usefulness of this message by printing
more details about the socket's address. If all else fails, it will at
least print the domain name of the socket.
Andrew Gallatin [Tue, 14 Apr 2020 14:46:06 +0000 (14:46 +0000)]
KTLS: Re-work unmapped mbufs to carry ext_pgs in the mbuf itself.
While the original implementation of unmapped mbufs was a large
step forward in terms of reducing cache misses by enabling mbufs
to carry more than a single page for sendfile, they are rather
cache unfriendly when accessing the ext_pgs metadata and
data. This is because the ext_pgs part of the mbuf is allocated
separately, and almost guaranteed to be cold in cache.
This change takes advantage of the fact that unmapped mbufs
are never used at the same time as pkthdr mbufs. Given this
fact, we can overlap the ext_pgs metadata with the mbuf
pkthdr, and carry the ext_pgs meta directly in the mbuf itself.
Similarly, we can carry the ext_pgs data (TLS hdr/trailer/array
of pages) directly after the existing m_ext.
In order to be able to carry 5 pages (which is the minimum
required for a 16K TLS record which is not perfectly aligned) on
LP64, I've had to steal ext_arg2. The only user of this in the
xmit path is sendfile, and I've adjusted it to use arg1 when
using unmapped mbufs.
This change is almost entirely mechanical, except that we
change mb_alloc_ext_pgs() to no longer allow allocating
pkthdrs, the change to avoid ext_arg2 as mentioned above,
and the removal of the ext_pgs zone,
This change saves roughly 2% "raw" CPU (~59% -> 57%), or over
3% "scaled" CPU on a Netflix 100% software kTLS workload at
90+ Gb/s on Broadwell Xeons.
In a follow-on commit, I plan to remove some hacks to avoid
access ext_pgs fields of mbufs, since they will now be in
cache.
Many thanks to glebius for helping to make this better in
the Netflix tree.
Similar to mmap'ing vnodes, posixshm should count any mapping where maxprot
contains VM_PROT_WRITE (i.e. fd opened r/w with no write-seal applied) as
writable and thus blocking of any write-seal.
The memfd tests have been amended to reflect the fixes here, which notably
includes:
1. Fix for error return bug; EPERM is not a documented failure mode for mmap
2. Fix rejection of write-seal with active mappings that can be upgraded via
mprotect(2).
Rick Macklem [Tue, 14 Apr 2020 00:01:26 +0000 (00:01 +0000)]
Re-organize the NFS file handle affinity code for the NFS server.
The file handle affinity code was configured to be used by both the
old and new NFS servers. This no longer makes sense, since there is
only one NFS server.
This patch copies a majority of the code in sys/nfs/nfs_fha.c and
sys/nfs/nfs_fha.h into sys/fs/nfsserver/nfs_fha_new.c and
sys/fs/nfsserver/nfs_fha_new.h, so that the files in sys/nfs can be
deleted. The code is simplified by deleting the function callback pointers
used to call functions in either the old or new NFS server and they were
replaced by calls to the functions.
As well as a cleanup, this re-organization simplifies the changes
required for handling of external page mbufs, which is required for KERN_TLS.
This patch should not result in a semantic change to file handle affinity.
Now that armv4/v5 is gone, remove the bits that implemented atomic operations
by disabling interrupts.
Those were specific to FreeBSD and never reached upstream.
Andrew Gallatin [Mon, 13 Apr 2020 23:06:56 +0000 (23:06 +0000)]
lagg: stop double-counting output errors and counting drops as errors
Before this change, lagg double-counted errors from lagg members, and counted
every drop by a lagg member as an error. Eg, if lagg sent a packet, and the
underlying hardware driver dropped it, a counter would be incremented by both
lagg and the underlying driver.
This change attempts to fix that by incrementing lagg's counters only for
errors that do not come from underlying drivers.
Warner Losh [Mon, 13 Apr 2020 21:04:33 +0000 (21:04 +0000)]
Checks here against useracc are not useful and are racy.
copyin/copyout are sufficient to guard against bad addresses. They will return
EFAULT if the user is up to no good (by choice or ignorance). There's no point
in checking, since it doesn't even improve the error messages.
John Baldwin [Mon, 13 Apr 2020 20:59:09 +0000 (20:59 +0000)]
Export a sysctl count of RX FIFO overrun events.
uart(4) backends currently detect RX FIFO overrun errors and report
them to the uart(4) core layer. They are then reported to the generic
TTY layer which promptly ignores them. As a result, there is
currently no good way to determine if a uart is experiencing RX FIFO
overruns. One could add a generic per-tty counter, but there did not
appear to be a good way to export those. Instead, add a sysctl under
the uart(4) sysctl tree to export the count of overruns.
John Baldwin [Mon, 13 Apr 2020 20:43:57 +0000 (20:43 +0000)]
Correct baud rate error calculation.
Shifting right by 1 is not the same as dividing by 2 for signed
values. In particular, dividing a signed value by 2 gives the integer
ceiling of the (e.g. -5 / 2 == -2) whereas shifting right by 1 always
gives the floor (-5 >> 1 == -3).
An embedded board with a 25 Mhz base clock results in an error of
-30.5% when used with a baud rate of 115200. Using division, this
truncates to -30% and is permitted. Using the shift, this fails and
is rejected causing TIOCSETA requests to fail with EINVAL and breaking
getty(8).
Using division gives the same error range for both over and under baud
rates and also makes the code match the behavior documented in the
existing comment about supporting boards with 25 Mhz clocks.
It changes the size of TAILQ_ENTRY, which obviously impacts ABI in a variety of
ways. Some of these things are _Static_asserted. For now, mask the option
from LINT.
Mark Johnston [Mon, 13 Apr 2020 19:22:05 +0000 (19:22 +0000)]
Fix sendto() on unconnected SOCK_STREAM/SEQPACKET unix sockets.
Previously the unpcb pointer of the newly connected remote socket was
not initialized correctly, so attempting to lock it would result in a
null pointer dereference.
Reported by: syzkaller
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
Mark Johnston [Mon, 13 Apr 2020 19:20:39 +0000 (19:20 +0000)]
Relax restrictions on private mappings of POSIX shm objects.
When creating a private mapping of a POSIX shared memory object,
VM_PROT_WRITE should always be included in maxprot regardless of
permissions on the underlying FD. Otherwise it is possible to open a
shm object read-only, map it with MAP_PRIVATE and PROT_WRITE, and
violate the invariant in vm_map_insert() that (prot & maxprot) == prot.
Reported by: syzkaller
Reviewed by: kevans, kib
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D24398
close_range/closefrom: fix regression from close_range introduction
close_range will clamp the range between [0, fdp->fd_lastfile], but failed
to take into account that fdp->fd_lastfile can become -1 if all fds are
closed. =-( In this scenario, just return because there's nothing further we
can do at the moment.
Add a test case for this, fork() and simply closefrom(0) twice in the child;
on the second invocation, fdp->fd_lastfile == -1 and will trigger a panic
before this change.
This had been introduced to ease any pain for using slightly older kernels
with a newer libc, e.g., for bisecting a kernel across the introduction of
shm_open2(2). 6 months has passed, retire the fallback and let shm_open()
unconditionally call shm_open2().
Xin LI [Mon, 13 Apr 2020 08:42:13 +0000 (08:42 +0000)]
Sync with OpenBSD:
arc4random.c: In the incredibly unbelievable circumstance where
_rs_init() fails to allocate pages, don't call abort() because of
corefile data leakage concerns, but simply _exit(). The reasoning
is _rs_init() will only fail if someone finds a way to apply
specific pressure against this failure point, for the purpose of
leaking information into a core which they can read. We don't
need a corefile in this instance to debug that. So take this
"lever" away from whoever in the future wants to do that.
arc4random.3: reference random(4)
arc4random_uniform.c: include stdint.h over sys/types.h
Rick Macklem [Mon, 13 Apr 2020 00:07:37 +0000 (00:07 +0000)]
Delete the mbuf macros that were used for the Mac OS/X port.
When the code was ported to Mac OS/X, mbuf handling functions were
converted to using the Mac OS/X accessor functions. For FreeBSD, they
are a simple set of macros in sys/fs/nfs/nfskpiport.h.
Since r359757, r359780, r359785, r359810, r359811 have removed all uses
of these macros, this patch deleted the macros from the .h files.
My eventual goal is deleting nfskpiport.h, but that will take some more
editting to replace uses of the remaining macros.
close_range(min, max, flags) allows for a range of descriptors to be
closed. The Python folk have indicated that they would much prefer this
interface to closefrom(2), as the case may be that they/someone have special
fds dup'd to higher in the range and they can't necessarily closefrom(min)
because they don't want to hit the upper range, but relocating them to lower
isn't necessarily feasible.
sys_closefrom has been rewritten to use kern_close_range() using ~0U to
indicate closing to the end of the range. This was chosen rather than
requiring callers of kern_close_range() to hold FILEDESC_SLOCK across the
call to kern_close_range for simplicity.
The flags argument of close_range(2) is currently unused, so any flags set
is currently EINVAL. It was added to the interface in Linux so that future
flags could be added for, e.g., "halt on first error" and things of this
nature.
This patch is based on a syscall of the same design that is expected to be
merged into Linux.
Add QUEUE_MACRO_DEBUG_TRACE and QUEUE_MACRO_DEBUG_TRASH as proper kernel
options. While here, alpha-sort the debug section of sys/conf/options.
Enable QUEUE_MACRO_DEBUG_TRASH in amd64 GENERIC (but not GENERIC-NODEBUG)
kernels. It is similar in nature and cost to other use-after-free pointer
trashing we do in GENERIC. It is probably reasonable to enable in any arch
GENERIC kernel that defines INVARIANTS.