lltable: Add support for "child" LLEs holding encap for IPv4oIPv6 entries.
Currently we use pre-calculated headers inside LLE entries as prepend data
for `if_output` functions. Using these headers allows saving some
CPU cycles/memory accesses on the fast path.
However, this approach makes adding L2 header for IPv4 traffic with IPv6
nexthops more complex, as it is not possible to store multiple
pre-calculated headers inside lle. Additionally, the solution space is
limited by the fact that PCB caching saves LLEs in addition to the nexthop.
Thus, add support for creating special "child" LLEs for the purpose of holding
custom family encaps and store mbufs pending resolution. To simplify handling
of those LLEs, store them in a linked-list inside a "parent" (e.g. normal) LLE.
Such LLEs are not visible when iterating LLE table. Their lifecycle is bound
to the "parent" LLE - it is not possible to delete "child" when parent is alive.
Furthermore, "child" LLEs are static (RTF_STATIC), avoding complex state
machine used by the standard LLEs.
nd6_lookup() and nd6_resolve() now accepts an additional argument, family,
allowing to return such child LLEs. This change uses `LLE_SF()` macro which
packs family and flags in a single int field. This is done to simplify merging
back to stable/. Once this code lands, most of the cases will be converted to
use a dedicated `family` parameter.
When a prefix gets deleted from the RIB, dpdk_lpm algo needs to know
the nexthop of the "parent" prefix to update its internal state.
The glue code, which utilises RIB as a backing route store, uses
fib[46]_lookup_rt() for the prefix destination after its deletion
to fetch the desired nexthop.
This approach does not work when deleting less-specific prefixes
with most-specific ones are still present. For example, if
10.0.0.0/24, 10.0.0.0/23 and 10.0.0.0/22 exist in RIB, deleting
10.0.0.0/23 would result in 10.0.0.0/24 being returned as a search
result instead of 10.0.0.0/22. This, in turn, results in the failed
datastructure update: part of the deleted /23 prefix will still
contain the reference to an old nexthop. This leads to the
use-after-free behaviour, ending with the eventual crashes.
Fix the logic flaw by properly fetching the prefix "parent" via
newly-created rt_get_inet[6]_parent() helpers.
Allow consistency validation of the inet6 fib based on rib data.
Validation can be kicked off by loading test_lookup module and
running sysctl net.route.test.run_inet6_scan=1
Consistently use `nh` instead of always dereferencing
ro->ro_nh inside the if block.
Always use nexthop mtu, as it provides guarantee that mtu is accurate.
Pass `nh` pointer to rt_update_ro_flags() to allow upcoming uses
of updating ro flags based on different nexthop.
Factor out lltable locking logic from lltable_try_set_entry_addr()
into a separate lltable_acquire_wlock(), so the latter can be used
in other parts of the code w/o duplication.
Create nd6_try_set_entry_addr() to avoid code duplication in nd6.c
and nd6_nbr.c.
Move lle creation logic from nd6_resolve_slow() into a separate
nd6_get_llentry() to simplify the former.
These changes serve as a pre-requisite for implementing
RFC8950 (IPv4 prefixes with IPv6 nexthops).
Use newly-create llentry_request_feedback(),
llentry_mark_used() and llentry_get_hittime() to
request datapatch usage check and fetch the results
in the same fashion both in IPv4 and IPv6.
While here, simplify llentry_provide_feedback() wrapper
by eliminating 1 condition check.
Simplify ifa/ifp refcounting in the routing stack.
The routing stack control depends on quite a tree of functions to
determine the proper attributes of a route such as a source address (ifa)
or transmit ifp of a route.
When actually inserting a route, the stack needs to ensure that ifa and ifp
points to the entities that are still valid.
Validity means slightly more than just pointer validity - stack need guarantee
that the provided objects are not scheduled for deletion.
Currently, callers either ignore it (most ifp parts, historically) or try to
use refcounting (ifa parts). Even in case of ifa refcounting it's not always
implemented in fully-safe manner. For example, some codepaths inside
rt_getifa_fib() are referencing ifa while not holding any locks, resulting in
possibility of referencing scheduled-for-deletion ifa.
Instead of trying to fix all of the callers by enforcing proper refcounting,
switch to a different model.
As the rib_action() already requires epoch, do not require any stability guarantees
other than the epoch-provided one.
Use newly-added conditional versions of the refcounting functions
(ifa_try_ref(), if_try_ref()) and fail if any of these fails.
Add if_try_ref() to simplify refcount handling inside epoch.
When we have an ifp pointer and the code is running inside epoch,
epoch guarantees the pointer will not be freed.
However, the following case can still happen:
* in thread 1 we drop to refcount=0 for ifp and schedule its deletion.
* in thread 2 we use this ifp and reference it
* destroy callout kicks in
* unhappy user reports a bug
This can happen with the current implementation of ifnet_byindex_ref(),
as we're not holding any locks preventing ifnet deletion by a parallel thread.
To address it, add if_try_ref(), allowing to return failure when
referencing ifp with refcount=0.
Additionally, enforce existing if_ref() is with KASSERT to provide a
cleaner error in such scenarios.
Finally, fix ifnet_byindex_ref() by using if_try_ref() and returning NULL
if the latter fails.
Mark Johnston [Tue, 31 Aug 2021 11:43:47 +0000 (07:43 -0400)]
sctp: Fix racy UNBOUND flag check in sctp_inpcb_bind()
SCTP needs to avoid binding a given socket twice. The check used to
avoid this is racy since neither the inpcb lock nor the global info lock
is held. Fix it by synchronizing using the global info lock. In
particular, sctp_inpcb_bind() may drop the inpcb lock in some cases, but
the info lock is sufficient to prevent double insertion into PCB hash
tables.
Reported by: syzbot+548a8560d959669d0e12@syzkaller.appspotmail.com
Reviewed by: tuexen
Sponsored by: The FreeBSD Foundation
Mark Johnston [Tue, 31 Aug 2021 20:38:05 +0000 (16:38 -0400)]
itimer: Serialize access to the p_itimers array
Fix the following race between itimer_proc_continue() and process exit.
itimer_proc_continue() may be called via realitexpire(), the real
interval timer. Note that exit1() drains this timer _after_ draining
and freeing itimers. Moreover, itimers_exit() is called without the
process lock held; it only acquires the proc lock when deleting
individual itimers, so once they are drained we free p->p_itimers
without any synchronization. Thus, itimer_proc_continue() may load a
non-NULL p->p_itimers array and iterate over it after it has been freed.
Fix the problem by using the process lock when clearing p->p_itimers, to
synchronize with itimer_proc_continue(). Formally, accesses to this
field should be protected by the process lock anyway, and since the
array is allocated lazily this will not incur any overhead in the common
case.
Reported by: syzbot+c40aa8bf54fe333fc50b@syzkaller.appspotmail.com
Reported by: syzbot+929be2f32503bbc3844f@syzkaller.appspotmail.com
Reviewed by: kib
Sponsored by: The FreeBSD Foundation
Mark Johnston [Tue, 31 Aug 2021 19:35:08 +0000 (15:35 -0400)]
md: Clamp to a multiple of the sector size when resizing
We do this when creating md(4) devices, in kern_mdattach_locked(), but
not when resizing the provider. Apply the same policy when resizing, as
many GEOM classes do not expect to deal with providers for which
pp->mediasize % pp->sectorsize != 0.
Reported by: syzkaller
Sponsored by: The FreeBSD Foundation
Mark Johnston [Tue, 31 Aug 2021 21:09:52 +0000 (17:09 -0400)]
graid: Avoid tasting devices with small sector sizes
The RAID metadata parsers effectively assume a sector size of 512 bytes
or larger, but md(4) devices can be created with a sector size that's
any power of 2. Add some seatbelts to graid tasting routines to ensure
that the requested sector(s) are large enough for the device to
plausibly contain RAID metadata.
Thomas Skibo [Mon, 30 Aug 2021 20:39:20 +0000 (21:39 +0100)]
sifive_spi: Add missing case for SPIBUS_MODE_NONE
Otherwise sckmode is left uninitialised, not zero. This mode is used for
the on-board flash on the HiFive Unmatched board. Whilst here, catch
unknown modes and return an error rather than silently continuing.
Upstream's clang/CMakeLists.txt specifically (not LLVM as a whole)
passes -fno-strict-aliasing if the compiler is not Clang, and this fixes
the above issue.
This was seen when cross-building from Linux using a bootstrap
compiler, but likely also affects worlds built with a new enough
external GCC toolchain.
MFC after: 1 week
Reviewed by: dim
Differential Revision: https://reviews.freebsd.org/D31533
Jessica Clarke [Tue, 24 Aug 2021 13:59:18 +0000 (14:59 +0100)]
clang: Support building with GCC and DEBUG_FILES disabled
If MK_DEBUG_FILES=no then the Clang link rule has clang as .TARGET,
rather than clang.full, causing the implicit ${CFLAGS.${.TARGET:T}} to
be CFLAGS.clang, and thus pull in flags intended for when your compiler
is Clang, not when linking Clang itself. This doesn't matter if your
compiler is in fact Clang, but it breaks using GCC as, for example,
bsd.sys.mk adds -Qunused-arguments to CFLAGS.clang. This is seen when
trying to build a bootstrap toolchain on Linux where GCC is the system
compiler.
Thus, introduce a new internal NO_TARGET_FLAGS variable that is set by
Clang to disable the addition of these implicit flags. This is a bigger
hammer than necessary, as flags for .o files would be safe, but that is
not needed for Clang.
Note that the same problem does not arise for LDFLAGS when building LLD
with BFD, since our build produces a program called ld.lld, not plain
lld (unlike upstream, where ld.lld is a symlink to lld so they can
support multiple different flavours in one binary).
Jessica Clarke [Tue, 24 Aug 2021 13:59:04 +0000 (14:59 +0100)]
Fix bootstrapping to actually build lldb-tblgen for later use
Because MK_LLDB=no is in BSARGS, the bootstrap-tools recursive make does
not add lldb-tblgen to _clang_tblgen, causing it to not be built. This
means that the build currently always uses the host's lldb-tblgen
(which, whilst currently it appears to work, could in future break if
TableGen backends are added or altered) and, if it doesn't exist (either
because the current FreeBSD system was built with it disabled, or you're
building on macOS/Linux), fails. Linux and macOS cross-builds used to
work simply because LLDB was previously in BROKEN_OPTIONS when building
on non-FreeBSD.
Instead, move MK_LLDB=no from BSARGS to XMAKE. This ensures that the
lib/clang build in cross-tools continues to not build LLDB parts for the
bootstrap toolchain (both to save time/space on FreeBSD, and because our
vendored LLDB does not include the macOS and Linux host files so those
would fail to build).
The DIRDEPS target is updated to move MK_LLDB=no from the BSARGS block
that mirrors Makefile.inc1 to the line that disables additional
toolchain components. The DIRDEPS build likely suffers from the same
issue currently, but having never used it and not being familiar with
how it works I am leaving that as-is. If it does suffer from the same
issue it should be easily reproducible by renaming /usr/bin/lldb-tblgen
or moving it to a directory not in PATH.
Jessica Clarke [Tue, 24 Aug 2021 13:55:31 +0000 (14:55 +0100)]
Makefile.inc1: Make sure sub-makes see MK_CLANG_BOOTSTRAP=no when XCC is a path
Currently we override MK_CLANG_BOOTSTRAP to no so we don't build a
bootstrap compiler, but subdirectories don't see that and so the hack in
bsd.sys.mk to prefer our includes over Clang's resource dir for external
toolchains is not enabled unless you use -DWITHOUT_CLANG_BOOTSTRAP
explicitly on top of XCC (which tools/build/make.py does not do),
causing duplicate definition errors when building rtld-elf due to the
use of -ffreestanding (Clang's stdint.h will use the system one when
hosted, but its own when freestanding, and only has glibc's preprocessor
guards, not FreeBSD's).
This broke when dropping CLANG_BOOTSTRAP from BROKEN_OPTIONS.
Fixes: 31ba4ce8898f ("Allow bootstrapping llvm-tblgen on macOS and Linux")
MFC after: 1 week
Reviewed by: imp, arichardson
Differential Revision: https://reviews.freebsd.org/D31529
Jessica Clarke [Thu, 12 Aug 2021 22:50:48 +0000 (23:50 +0100)]
tools/build/cross-build: Fix building libllvmminimal on Linux
There is a __used member in glibc's posix_spawn_file_actions_t in
spawn.h, so we must temporarily undefine __used when including it,
otherwise Support/Unix/Program.inc fails to build. This is based on
similar handling for __unused in other headers.
Fixes: 31ba4ce8898f ("Allow bootstrapping llvm-tblgen on macOS and Linux")
MFC after: 1 week
Jessica Clarke [Mon, 9 Aug 2021 19:28:37 +0000 (20:28 +0100)]
riscv: Fix pmap_alloc_l2 when it should allocate a new L1 entry
The current code checks the RWX bits are 0 but does not check the V bit
is non-zero, meaning not-yet-allocated L1 entries that are still zero
are regarded as being allocated. This is likely due to copying the arm64
code that checks ATTR_DESC_MASK is L1_TABLE, which emcompasses both the
type and the validity in a single field, and erroneously translating
that to a check of just PTE_RWX being 0 to indicate non-leaf, forgetting
about the V bit. This then results in the following panic:
panic: Fatal page fault at 0xffffffc0005cf292: 0x00000000000050
cpuid = 1
time = 1628379581
KDB: stack backtrace:
db_trace_self() at db_trace_self
db_trace_self_wrapper() at db_trace_self_wrapper+0x38
kdb_backtrace() at kdb_backtrace+0x2c
vpanic() at vpanic+0x148
panic() at panic+0x2a
page_fault_handler() at page_fault_handler+0x1ba
do_trap_supervisor() at do_trap_supervisor+0x7a
cpu_exception_handler_supervisor() at
cpu_exception_handler_supervisor+0x70
--- exception 13, tval = 0x50
pmap_enter_l2() at pmap_enter_l2+0xb2
pmap_enter_object() at pmap_enter_object+0x15e
vm_map_pmap_enter() at vm_map_pmap_enter+0x228
vm_map_insert() at vm_map_insert+0x4ec
vm_map_find() at vm_map_find+0x474
vm_map_find_min() at vm_map_find_min+0x52
vm_mmap_object() at vm_mmap_object+0x1ba
vn_mmap() at vn_mmap+0xf8
kern_mmap() at kern_mmap+0x4c4
sys_mmap() at sys_mmap+0x38
do_trap_user() at do_trap_user+0x208
cpu_exception_handler_user() at cpu_exception_handler_user+0x72
--- exception 8, tval = 0x1dd
Instead, we should just check the V bit, as on amd64, and assert that
any valid L1 entries are not leaves, since an L1 leaf would render the
entire range allocated and thus we should not have attempted to map that
VA in the first place.
Jessica Clarke [Sat, 7 Aug 2021 20:25:36 +0000 (21:25 +0100)]
pci_dw: Drop unconditional explicit DEBUG define
This has been present since the first revision of the file. The debugf
macros have always been unused so it doesn't actually do anything
useful, and besides, debugging should not be unconditionally turned on
for a production driver. Moreover, this breaks the riscv LINT kernel
build as sys/conf/NOTES includes options DEBUG, resulting in a macro
redefinition error. This does not show up in the arm64 LINT kernel build
since that has an explicit nooptions DEBUG, which is dubious and should
be revisited. Rather than copy such a hack to riscv's NOTES, fix this
specific instance of DEBUG breaking.
Jessica Clarke [Sat, 7 Aug 2021 18:27:27 +0000 (19:27 +0100)]
fu540_prci: Rename to sifive_prci and use ocd_data for FU540 specificity
The FU740 has a very similar controller and will reuse most of the
driver. This also drops the dependency on the device-tree include for
the binding indices; the header doesn't namespace its contents (and nor
does the FU740 one) so using both would require seperate translation
units which would be unnecessarily complicated just to avoid defining
local copies of the small number of constants.
Whilst here, add the missing l to gemgxlclk's name and drop the prci_
prefix from tlclk's name as we don't prefix any of the others and it's
entirely unnecessary.
riscv: Fix pmap_kextract racing with concurrent superpage promotion/demotion
This repeats amd64's cfcbf8c6fd3b (r180498) and i386's cf3508519c5e
(r202894) but for riscv; pmap_kextract must be lock-free and so it can
race with superpage promotion and demotion, thus the L2 entry must only
be loaded once to avoid using inconsistent state.
pci_dw: Detect number of outbound regions automatically
Currently we use the num-viewports property to decide how many outbound
regions there are we can use, defaulting to 2. However, Linux has
stopped using that and so it no longer appears in new device trees, such
as for the SiFive FU740. Instead, it's possible to just probe the
hardware directly.
This supersedes the old legacy mode where a viewport register was used
to mux multiple regions behind a single set of registers, and is used on
the SiFive FU740.
Currently we assume there is only one memory and one prefetch memory
window, and ignore the latter. However, the SiFive FU740 has two normal
memory windows.
As part of this, the viewports are rearranged. Previously the viewports
were memory, config then optionally I/O. Both to simplify the config
index calculation and to ensure it can always be mapped even if we have
too many memory windows for the number of viewports, config is moved to
being the first viewport.
This generalisation now also naturally supports mapping prefetch memory
windows.
Wojciech Macek [Fri, 9 Apr 2021 07:28:44 +0000 (09:28 +0200)]
pci_dw: Trim ATU windows bigger than 4GB
The size of the ATU MEM/IO windows is implicitly casted to uint32_t.
Because of that some window sizes were silently demoted to 0 and ignored.
Check the size if its too large, trim it to 4GB and print a warning message.
Makefile: Fix MAKEOBJDIRPREFIX command-line variable check for bmake
Unlike the old fmake, running make FOO=bar when using bmake doesn't put
FOO=bar in .MAKEFLAGS at the top level, it instead just puts FOO in
.MAKEOVERRIDES and the full MAKEFLAGS will be formed for sub-makes.
Moreover, this only applies for sub-makes in rules, so this doesn't
apply to those in shell assignments. This means that the current check
does not catch make MAKEOBJDIRPREFIX=..., only those defined in config
files. Thus we must also check .MAKEOVERRIDES explicitly.
The pindex values are assigned from the L3 leaves upwards, meaning there
are NUL2E L3 tables and then NUL1E L2 tables (with a futher NUL0E L1
tables in future when we implement Sv48 support). Therefore anything
below NUL2E is an L3 table's page and anything above or equal to NUL2E
is an L2 table's page (with the threshold of NUL2E + NUL1E marking the
start of the L1 tables' pages in Sv48). Thus all the comparisons and
arithmetic operations must use NUL2E to handle the L3/L2 allocation (and
thus L2/L1 entry) transition point, not NUL1E as all but pmap_alloc_l2
were doing.
To make matters confusing, the NUL1E and NUL2E definitions in the RISC-V
pmap are based on a 4-level page hierarchy but we currently use the
3-level Sv39 format (as that's the only required one, and hardware
support for the 4-level Sv48 is not widespread). This means that, in
effect, the above bug cancels out with the bloated NULxE definitions
such that things "work" (but are still technically wrong, and thus would
break when adding Sv48 support), with one exception. pmap_enter_l2 is
currently the only function to use the correct constant, but since
_pmap_alloc_l3 uses the incorrect constant, it will do complete nonsense
when it needs to allocate a new L2 table (which is rather rare). In this
instance, _pmap_alloc_l3, whilst it would correctly determine the pindex
was for an L2 table, would only subtract NUL1E when computing l1index
and thus go way out of bounds (by 511*512*512 bytes, or 127.75 GiB) of
its own L1 table and, thanks to pmap_distribute_l1, of every other
pmap's L1 table in the whole system. This has likely never been hit as
it would presumably instantly fault and panic.
sifive_uart: Fix input character dropping in ddb and at a mountroot prompt
These use the raw console interface and poll. Unfortunately, the SiFive
UART puts the FIFO empty bit inside the FIFO data register, which means
that the act of checking whether a character is available also dequeues
any character from the FIFO, requiring the user to press each key twice.
However, since we configure the watermark to be 0 and, when the UART has
been grabbed for the console, we have interrupts off, we can abuse the
interrupt pending register to act as a substitute for the FIFO empty
bit.
This perhaps suggests that the console interface should move from having
rxready and getc to having getc_nonblock and getc (or make getc take a
bool), as all the places that call rxready do so to avoid blocking on
getc when there is no character available.
Note that currently Linux's device tree uses the FU540's compatible
string, as does upstream U-Boot, but the U-Boot shipped with the board
based on an older patch series has the correct FU740 name. Thankfully
they are the same, at least as far as software is concerned.
This is required for the SiFive FU740's PCIe controller. Copied from
arm64 with the only difference being changing pmap_mapdev_attr to
pmap_mapdev as riscv only has the latter.
makefs: Cast daddr_t to off_t before multiplication
Apparently some large-file systems out there, such as my powerpc64le
Linux box, define daddr_t as a 32-bit type, which is sad and stymies
cross-building disk images. Cast daddr_t to off_t before doing
arithmetic that overflows.
Alexander Motin [Tue, 31 Aug 2021 17:34:48 +0000 (13:34 -0400)]
nvme(4): Add MSI and single MSI-X support.
If we can't allocate more MSI-X vectors, accept using single shared.
If we can't allocate any MSI-X, try to allocate 2 MSI vectors, but
accept single shared. If still no luck, fall back to shared INTx.
This provides maximal flexibility in some limited scenarios. For
example, vmd(4) does not support INTx and can handle only limited
number of MSI/MSI-X vectors without sharing.
Mark Johnston [Mon, 30 Aug 2021 18:22:20 +0000 (14:22 -0400)]
aesni: Avoid a potential out-of-bounds load in aes_encrypt_icm()
Given a partial block at the end of a payload, aes_encrypt_icm() would
perform a 16-byte load of the residual into a temporary variable. This
is unsafe in principle since the full block may cross a page boundary.
Fix the problem by copying the residual into a stack buffer first.
Dimitry Andric [Sun, 29 Aug 2021 15:31:28 +0000 (17:31 +0200)]
Fix -Wformat errors in pfctl on 32-bit architectures
Use PRIu64 to printf(3) uint64_t quantities, otherwise this will result
in "error: format specifies type 'unsigned long' but the argument has
type 'uint64_t' (aka 'unsigned long long') [-Werror,-Wformat]" on 32-bit
architectures.
Kristof Provost [Mon, 16 Aug 2021 19:55:27 +0000 (21:55 +0200)]
pf: Introduce nvlist variant of DIOCGETSTATUS
Make it possible to extend the GETSTATUS call (e.g. when we want to add
new counters, such as for syncookie support) by introducing an
nvlist-based alternative.
Check that the bridge module is loaded before running this test.
It likely will be (as a result of running the bridge tests), but if it's
not we'll get spurious failures.
MFC after: 3 days
Sponsored by: Rubicon Communications, LLC ("Netgate")
Kyle Evans [Thu, 12 Aug 2021 02:49:17 +0000 (21:49 -0500)]
pxeboot: improve and simplify rx handling
This pushes the bulk of the rx servicing into a single loop that's only
slightly convoluted, and it addresses a problem with rx handling in the
process. If we hit a tx interrupt while we're processing, we'd
previously drop the frame on the floor completely and ultimately
timeout, increasing boot time on particularly busy hosts as we keep
having to backoff and resend.
After this patch, we don't seem to hit timeouts at all on zoo anymore
though loading a 27M kernel is still relatively slow (~1m20s).
Sponsored By: National Bureau of Economic Research
Sponsored by: Klara, Inc.
Alexander Motin [Sat, 21 Aug 2021 13:31:41 +0000 (09:31 -0400)]
cam(4): Fix quick unplug/replug for SCSI.
If some device is plugged back in after unplug before the probe periph
destroyed, it will just restart the probe process. But I've found that
PROBE_INQUIRY_CKSUM flag not cleared between the iterations may cause
AC_FOUND_DEVICE not reported on the second iteration, and because of
AC_LOST_DEVICE reported during the first iteration, the device end up
configured, but without any periphs attached.
We've found that enabled serial console and 102-disk JBOD cause enough
probe delays to easily trigger the issue for half of the disks. This
change fixes it reliably on my tests.