This patch is part of an effort to make bhyve networking (in particular TCP)
faster. The key strategy to enhance TCP throughput is to let the whole packet
datapath work with TSO/LRO packets (up to 64KB each), so that the per-packet
overhead is amortized over a large number of bytes.
This capability is supported in the guest by means of the vtnet(4) driver,
which is able to handle TSO/LRO packets leveraging the virtio-net header
(see struct virtio_net_hdr and struct virtio_net_hdr_mrg_rxbuf).
A bhyve VM exchanges packets with the host through a network backend,
which can be vale(4) or if_tap(4).
While vale(4) supports TSO/LRO packets, if_tap(4) does not.
This patch extends if_tap(4) with the ability to understand the virtio-net
header, so that a tapX interface can process TSO/LRO packets.
A couple of ioctl commands have been added to configure and probe the
virtio-net header. Once the virtio-net header is set, the tapX interface
acquires all the IFCAP capabilities necessary for TSO/LRO.
Dimitry Andric [Fri, 18 Oct 2019 19:30:12 +0000 (19:30 +0000)]
Provide a src.conf(5) description for the new WITHOUT_CAROOT option, and
rename the WITH_LOADER_VERIEXEC_PASS_MANFIEST description to its correct
name. Also correct a bunch of spelling errors in that description.
Mark Johnston [Fri, 18 Oct 2019 17:36:42 +0000 (17:36 +0000)]
Further constrain the use of per-CPU caches for free pages.
In low memory conditions a significant number of pages may end up stuck
in the caches, and currently these caches cannot be reaped, leading to
spurious memory allocation failures and OOM kills. So:
- Take into account the fact that we may cache up to two full buckets
of pages per CPU, not just one.
- Increase the amount of RAM required per CPU to enable the caches.
This is a temporary measure until the page cache management policy is
improved.
PR: 241048
Reported and tested by: Kevin Oberman <rkoberman@gmail.com>
Reviewed by: alc, kib
Discussed with: jeff
MFC after: 3 days
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D22040
Mark Johnston [Fri, 18 Oct 2019 17:01:27 +0000 (17:01 +0000)]
Abbreviate softdep lock names.
The softdep lock names were unusually long and tended to stick out in
lock profiling reports. Abbreviate them and make them consistent with
our conventional style for lock names.
Reviewed by: mckusick
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D22042
Mark Johnston [Fri, 18 Oct 2019 14:05:13 +0000 (14:05 +0000)]
Tighten mapping protections on preloaded files on amd64.
- We load the kernel at 0x200000. Memory below that address need not
be executable, so do not map it as such.
- Remove references to .ldata and related sections in the kernel linker
script. They come from ld.bfd's default linker script, but are not
used, and we now use ld.lld to link the amd64 kernel. lld does not
contain a default linker script.
- Pad the .bss to a 2MB as we do between .text and .data. This
forces the loader to load additional files starting in the following
2MB page, preserving the use of superpage mappings for kernel data.
- Map memory above the kernel image with NX. The kernel linker now
upgrades protections as needed, and other preloaded file types
(e.g., entropy, microcode) need not be mapped with execute permissions
in the first place.
Mark Johnston [Fri, 18 Oct 2019 13:56:45 +0000 (13:56 +0000)]
Apply mapping protections to preloaded kernel modules on amd64.
With an upcoming change the amd64 kernel will map preloaded files RW
instead of RWX, so the kernel linker must adjust protections
appropriately using pmap_change_prot().
Mark Johnston [Fri, 18 Oct 2019 13:53:14 +0000 (13:53 +0000)]
Apply mapping protections to .o kernel modules.
Use the section flags to derive mapping protections. When multiple
sections overlap within a page, the union of their protections must be
applied. With r353701 the .text and .rodata sections are padded to
ensure that this does not happen on amd64.
Andriy Gapon [Fri, 18 Oct 2019 12:32:01 +0000 (12:32 +0000)]
ddb: use 'textdump dump' instead of 'call doadump'
The change is for the example in textdump.4 and the default ddb.conf.
First of all, doadump now requires an argument and it won't do a
textdump if the argument is not 'true'.
And 'textdump dump' is more idiomatic anyway.
For what it's worth, ddb 'dump' command seems to always request a vmcore
dump even if a textdump was requested earlier, e.g., by 'textdump set'.
Finally, ddb 'call' command is not documented.
Yuri Pankov [Fri, 18 Oct 2019 10:28:08 +0000 (10:28 +0000)]
linux: provide just one instance of futex_list
Move futex_list definition to linux.c which is included once
in linux.ko (i386) and in linux_common.ko (amd64 and aarch64)
allowing 32/64 bit linux programs to access the same futexes
in the latter case.
Kristof Provost [Fri, 18 Oct 2019 03:36:26 +0000 (03:36 +0000)]
pf: Must be in NET_EPOCH to call icmp_error
icmp_reflect(), called through icmp_error() requires us to be in NET_EPOCH.
Failure to hold it leads to the following panic (with INVARIANTS):
panic: Assertion in_epoch(net_epoch_preempt) failed at /usr/src/sys/netinet/ip_icmp.c:742
cpuid = 2
time = 1571233273
KDB: stack backtrace:
db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfffffe00e0977920
vpanic() at vpanic+0x17e/frame 0xfffffe00e0977980
panic() at panic+0x43/frame 0xfffffe00e09779e0
icmp_reflect() at icmp_reflect+0x625/frame 0xfffffe00e0977aa0
icmp_error() at icmp_error+0x720/frame 0xfffffe00e0977b10
pf_intr() at pf_intr+0xd5/frame 0xfffffe00e0977b50
ithread_loop() at ithread_loop+0x1c6/frame 0xfffffe00e0977bb0
fork_exit() at fork_exit+0x80/frame 0xfffffe00e0977bf0
fork_trampoline() at fork_trampoline+0xe/frame 0xfffffe00e0977bf0
Note that we now enter NET_EPOCH twice if we enter ip_output() from pf_intr(),
but ip_output() will soon be converted to a function that requires epoch, so
entering NET_EPOCH directly from pf_intr() makes more sense.
Conrad Meyer [Thu, 17 Oct 2019 22:37:25 +0000 (22:37 +0000)]
gdb(4): Implement support for NoAckMode
When the underlying debugport transport is reliable, GDB's additional
checksums and acknowledgements are redundant. NoAckMode eliminates the
the acks and allows us to skip checking RX checksums. The GDB packet
framing does not change, so unfortunately (valid) checksums are still
included as message trailers.
The gdb(4) stub in FreeBSD advertises support for the feature in response to
the client's 'qSupported' request IFF the current debugport has the
gdb_dbfeatures flag GDB_DBGP_FEAT_RELIABLE set. Currently, only netgdb(4)
supports this feature.
If the remote GDB client supports the feature and does not have it disabled
via a GDB configuration knob, it may instruct our gdb(4) stub to enter
NoAckMode. Unless and until it issues that command, we must continue to
transmit acks as usual (and for now, we continue to wait until we receive
them as well, even if we know the debugport is on a reliable transport).
In the kernel sources, the sense of the flag representing the state of the
feature is reversed from that of the GDB command. (I.e., it is
'gdb_ackmode', not 'gdb_noackmode.') This is to avoid confusing double-
negative conditions.
For reference, see:
* https://sourceware.org/gdb/onlinedocs/gdb/Packet-Acknowledgment.html
* https://sourceware.org/gdb/onlinedocs/gdb/General-Query-Packets.html#QStartNoAckMode
Mark Johnston [Thu, 17 Oct 2019 21:39:23 +0000 (21:39 +0000)]
Add an ldscript for amd64 kernel modules.
Use it to pad the text and read-only data sections to a 4KB boundary.
This will be used to enforce strict memory protections for some
sections of loadable kernel modules.
Conrad Meyer [Thu, 17 Oct 2019 21:33:01 +0000 (21:33 +0000)]
Implement NetGDB(4)
NetGDB(4) is a component of a system using a panic-time network stack to
remotely debug crashed FreeBSD kernels over the network, instead of
traditional serial interfaces.
There are three pieces in the complete NetGDB system.
First, a dedicated proxy server must be running to accept connections from
both NetGDB and gdb(1), and pass bidirectional traffic between the two
protocols.
Second, the NetGDB client is activated much like ordinary 'gdb' and
similarly to 'netdump' in ddb(4) after a panic. Like other debugnet(4)
clients (netdump(4)), the network interface on the route to the proxy server
must be online and support debugnet(4).
Finally, the remote (k)gdb(1) uses 'target remote <proxy>:<port>' (like any
other TCP remote) to connect to the proxy server.
The NetGDB v1 protocol speaks the literal GDB remote serial protocol, and
uses a 1:1 relationship between GDB packets and sequences of debugnet
packets (fragmented by MTU). There is no encryption utilized to keep
debugging sessions private, so this is only appropriate for local
segments or trusted networks.
Submitted by: John Reimer <john.reimer AT emc.com> (earlier version)
Discussed some with: emaste, markj
Relnotes: sure
Differential Revision: https://reviews.freebsd.org/D21568
Gleb Smirnoff [Thu, 17 Oct 2019 20:18:07 +0000 (20:18 +0000)]
Revert two parts of r353292 that enter epoch when processing vlan capabilities.
It could be that entering epoch isn't necessary here, but better take a
conservative approach.
Loosen requirements for connecting to debugnet-type servers. Only require a
destination address; the rest can theoretically be inferred from the routing
table.
Relax corresponding constraints in netdump(4) and move ifp validation to
debugnet connection time.
Submitted by: John Reimer <john.reimer AT emc.com> (earlier version)
Reviewed by: markj
Differential Revision: https://reviews.freebsd.org/D21482
Conrad Meyer [Thu, 17 Oct 2019 19:49:20 +0000 (19:49 +0000)]
Add ddb(4) 'netdump' command to netdump a core without preconfiguration
Add a 'X -s <server> -c <client> [-g <gateway>] -i <interface>' subroutine
to the generic debugnet code. The imagined use is both netdump, shown here,
and NetGDB (vaporware). It uses the ddb(4) lexer, with some new extensions,
to parse out IPv4 addresses.
'Netdump' uses the generic debugnet routine to load a configuration and
start a dump, without any netdump configuration prior to panic.
Loosely derived from work by: John Reimer <john.reimer AT emc.com>
Reviewed by: markj
Differential Revision: https://reviews.freebsd.org/D21460
Conrad Meyer [Thu, 17 Oct 2019 18:45:11 +0000 (18:45 +0000)]
acpica: Match ID_PROBE default implementation to interface
After r339754, the additional interface parameter was accidentally left out
of the default acpi_generic_id_probe implementation. Apparently this does
not cause any real problems, so this fix is mostly stylistic.
Conrad Meyer [Thu, 17 Oct 2019 18:29:44 +0000 (18:29 +0000)]
Add a very limited DDB dumpon(8)-alike to MI dumper code
This allows ddb(4) commands to construct a static dumperinfo during
panic/debug and invoke doadump(false) using the provided dumper
configuration (always inserted first in the list).
The intended usecase is a ddb(4)-time netdump(4) command.
Conrad Meyer [Thu, 17 Oct 2019 17:48:32 +0000 (17:48 +0000)]
debugnet: Respond to broadcast ARP requests
The in-tree netdump code has always ignored non-directed ARP requests, and
that seems to work most of the time for netdump.
In my work and testing on NetGDB, it seems like sometimes the remote FreeBSD
conversant (the non-panic system) will send broadcast-destination ARP
requests to the debugnet kernel; without this change, those are dropped and
the remote will see EHOSTDOWN "Host is down" errors from the userspace
interface of the network stack.
Similar to INET checksums, lazily validate UDP checksums when the driver has
already performed the check for us. Like debugnet(4) INET checksums,
validation in software is left as future work.
Conrad Meyer [Thu, 17 Oct 2019 16:23:03 +0000 (16:23 +0000)]
Split out a more generic debugnet(4) from netdump(4)
Debugnet is a simplistic and specialized panic- or debug-time reliable
datagram transport. It can drive a single connection at a time and is
currently unidirectional (debug/panic machine transmit to remote server
only).
It is mostly a verbatim code lift from netdump(4). Netdump(4) remains
the only consumer (until the rest of this patch series lands).
The INET-specific logic has been extracted somewhat more thoroughly than
previously in netdump(4), into debugnet_inet.c. UDP-layer logic and up, as
much as possible as is protocol-independent, remains in debugnet.c. The
separation is not perfect and future improvement is welcome. Supporting
INET6 is a long-term goal.
Much of the diff is "gratuitous" renaming from 'netdump_' or 'nd_' to
'debugnet_' or 'dn_' -- sorry. I thought keeping the netdump name on the
generic module would be more confusing than the refactoring.
The only functional change here is the mbuf allocation / tracking. Instead
of initiating solely on netdump-configured interface(s) at dumpon(8)
configuration time, we watch for any debugnet-enabled NIC for link
activation and query it for mbuf parameters at that time. If they exceed
the existing high-water mark allocation, we re-allocate and track the new
high-water mark. Otherwise, we leave the pre-panic mbuf allocation alone.
In a future patch in this series, this will allow initiating netdump from
panic ddb(4) without pre-panic configuration.
No other functional change intended.
Reviewed by: markj (earlier version)
Some discussion with: emaste, jhb
Objection from: marius
Differential Revision: https://reviews.freebsd.org/D21421
Andriy Gapon [Thu, 17 Oct 2019 06:32:34 +0000 (06:32 +0000)]
provide a way to assign taskqueue threads to a kernel process
This can be used to group all threads belonging to a single logical
entity under a common kernel process.
I am planning to use the new interface for ZFS threads.
Andriy Gapon [Thu, 17 Oct 2019 06:21:09 +0000 (06:21 +0000)]
wbwd: small clean-ups and improvements
This change applies some suggestions by delphij from D21979.
A write-only variable is removed.
There is a diagnostic message if the driver does not recognize the chip.
A chained if-statement is converted to a switch.
Mark Johnston [Wed, 16 Oct 2019 22:12:34 +0000 (22:12 +0000)]
Introduce pmap_change_prot() for amd64.
This updates the protection attributes of subranges of the kernel map.
Unlike pmap_protect(), which is typically used for user mappings,
pmap_change_prot() does not perform lazy upgrades of protections.
pmap_change_prot() also updates the aliasing range of the direct map.
Mark Johnston [Wed, 16 Oct 2019 22:03:27 +0000 (22:03 +0000)]
Remove page locking from pmap_mincore().
After r352110 the page lock no longer protects a page's identity, so
there is no purpose in locking the page in pmap_mincore(). Instead,
if vm.mincore_mapped is set to the non-default value of 0, re-lookup
the page after acquiring its object lock, which holds the page's
identity stable.
The change removes the last callers of vm_page_pa_tryrelock(), so
remove it.
Chuck Silvers [Wed, 16 Oct 2019 21:49:44 +0000 (21:49 +0000)]
Make all the gnop parameters optional in the request from userland,
filling in the same defaults that the current userland module uses.
This allows an old geom_nop.so userland module to work with a new kernel.
Chuck Silvers [Wed, 16 Oct 2019 21:49:39 +0000 (21:49 +0000)]
Add a new gctl_get_paraml_opt() interface to extract optional parameters from
the request. It is the same as gctl_get_paraml() except that the request
is not marked with an error if the parameter is not present.
Kyle Evans [Wed, 16 Oct 2019 18:33:31 +0000 (18:33 +0000)]
libbe(3): Fix destroy of imported BE w/ AUTOORIGIN
Imported BE, much like the activated BE, will not have an origin that we can
fetch/examine for destruction. be_destroy should not return BE_ERR_NOORIGIN
for failure to get the origin property for BE_DESTROY_AUTOORIGIN, because
we don't really know going into it that there's even an origin to be
destroyed.
BE_DESTROY_NEEDORIGIN has been renamed to BE_DESTROY_WANTORIGIN because only
a subset of it *needs* the origin, so 'need' is too strong of verbiage.
This was caught by jenkins and the bectl tests, but kevans failed to run the
bectl tests prior to commit.
Eric Joyner [Wed, 16 Oct 2019 17:19:17 +0000 (17:19 +0000)]
ixl: report whether device received pause frames
From Jake:
When updating the device statistics, report whether or not we have
received any pause frames to the iflib stack. This allows the iflib
stack to avoid generating a Tx hang message while the device is paused.
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Submitted by: Jacob Keller <jacob.e.keller@intel.com>
Reviewed by: gallatin@
Sponsored by: Intel Corporation
Differential Revision: https://reviews.freebsd.org/D21870
Eric Joyner [Wed, 16 Oct 2019 17:16:32 +0000 (17:16 +0000)]
ix: report isc_pause_frames during stat update
From Jake:
Notify the iflib stack of whether we received any pause frames during
the timer window. This allows the stack to avoid reporting a Tx hang due
to the device being paused.
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Submitted by: Jacob Keller <jacob.e.keller@intel.com>
Reviewed by: gallatin@
Sponsored by: Intel Corporation
Differential Revision: https://reviews.freebsd.org/D21869
Eric Joyner [Wed, 16 Oct 2019 17:13:46 +0000 (17:13 +0000)]
e1000: correctly set isc_pause_frames only when XOFF increases
From Jake:
The e1000 driver sets the iflib shared context isc_pause_frames value to
the number of received xoff frames. This is done so that the iflib
watchdog timer won't trigger a Tx Hang due to pause frames.
Unfortunately, the function simply sets it to the value of the xoffrxc
counter. Once the device has received a single XOFF packet, the driver
always reports that we received pause frames. This will prevent the Tx
hang detection entirely from that point on.
Fix this by assigning isc_pause_frames to a non-zero value if we
received any XOFF packets in the last timer interval.
We could attempt to calculate the total number of received packets by
doing a subtraction, but the iflib stack only seems to check if
isc_pause_frames is non-zero.
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Submitted by: Jacob Keller <jacob.e.keller@intel.com>
Reviewed by: gallatin@
Sponsored by: Intel Corporation
Differential Revision: https://reviews.freebsd.org/D21868
Dimitry Andric [Wed, 16 Oct 2019 17:11:18 +0000 (17:11 +0000)]
Ensure lld respects the WITH/WITHOUT_SHARED_TOOLCHAIN option
Traditionally, toolchain components such as cc, as, and ld have been
built as static executables. The WITH_SHARED_TOOLCHAIN option from
src.conf(5) is meant to link these as regular executables, e.g. using
shared libraries.
The build of ld.lld did not yet check this option. Fix the Makefile so
it will do so now.
Reported by: Mike Cui <cuicui@gmail.com>
PR: 241257
MFC after: 3 days
Gleb Smirnoff [Wed, 16 Oct 2019 16:32:58 +0000 (16:32 +0000)]
do_link_state_change() is executed in taskqueue context and in
general is allowed to sleep. Don't enter the epoch for the
whole duration. If some event handlers need the epoch, they
should handle that theirselves.
Ian Lepore [Wed, 16 Oct 2019 16:26:35 +0000 (16:26 +0000)]
Update some comments; no functional changes. Some historical old comments
in this driver indicate that the SD_CAPA register is write-once and after
being set one time the values in it cannot be changed. That turns out not
to be the case -- the values written to it survive a reset, but they can
be rewritten/changed at any time.
Ian Lepore [Wed, 16 Oct 2019 16:19:21 +0000 (16:19 +0000)]
Revert r351218 (by manu). While the changes in r351218 appear to be (and
should be) correct, they lead to the eMMC on a Beaglebone failing to work
in some situations.
The TI sdhci hardware is kind of strange. The first device inherently
supports 1.8v and 3.3v and the abililty to switch between them, and the
other two devices must be set to 1.8v in the sdhci power control register to
operate correctly, but doing so actually makes them run at 3.3v (unless an
external level-shifter is present in the signal path). Even the 1.8v on the
first device may actually be 3.3v (or any other value), depending on what
voltage is fed to the VDDS1-VDDS7 power supply pins on the am335x chip.
Another strange quirk is that the convention for am335x sdhci drivers in
linux and uboot and the am335x boot ROM seems to be to set the voltage in
the sdhci capabilities register to 3.0v even though the actual voltage is
3.3v. Why this is done is a complete mystery to me, but it seems to be
required for correct operation.
If we had complete modern support for the am335x chip we could get the
actual voltages from the FDT data and the regulator framework. But our
am335x code currently doesn't have any regulator framework support.
Reverting to the prior code will get the popular Beaglebone boards working
again.
This is part of the fix for PR 241301, but also requires r353651 for a
complete fix.
Ian Lepore [Wed, 16 Oct 2019 16:03:19 +0000 (16:03 +0000)]
Relax the sdhci(4) check that filters out the 1.8v voltage option unless
the slot is flagged as 'embedded'.
The features related to embedded and shared slots were added in v3.0 of
the sdhci spec. Hardware prior to v3 sometimes supported 1.8v on non-
removable devices in embedded systems, but had no way to indicate that
via the standard sdhci registers (instead they use out of band metadata
such as FDT data).
This change adds the controller specification version to the check for
whether to filter out the 1.8v selection. On older hardware, the 1.8v
option is allowed to remain. On 3.0 or later it still requires the
embedded-slot flag to remain.
This is part of the fix for PR 241301 (eMMC not detected on Beaglebone).
Changes to the sdhci_ti driver are also needed for a full fix.
Mark Johnston [Wed, 16 Oct 2019 15:50:12 +0000 (15:50 +0000)]
Clear PGA_WRITEABLE in moea_pvo_remove().
moea_pvo_remove() might remove the last mapping of a page, in which case
it is clearly no longer writeable. This can happen via pmap_remove(),
or when a CoW fault removes the last mapping of the old page.
Reported and tested by: bdragon
Reviewed by: alc, bdragon, kib
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D22044
Kyle Evans [Wed, 16 Oct 2019 14:55:56 +0000 (14:55 +0000)]
bectl(8): destroy: use BE_DESTROY_AUTOORIGIN if -o is not specified
-o will force the origin to be destroyed unconditionally.
BE_DESTROY_AUTOORIGIN, on the other hand, will only destroy the origin if it
matches the format used by be_snapshot. This lets us clean up the snapshots
that are clearly not user-managed (because we're creating them) while
leaving user-created snapshots in place and warning that they're still
around when the BE created goes away.
Andriy Gapon [Wed, 16 Oct 2019 14:46:04 +0000 (14:46 +0000)]
wbwd: move to superio(4) bus
This allows to remove a bunch of low level code.
Also, superio(4) provides safer interaction with other drivers
that work with Super I/O configuration registers.
Tested only on PCengines APU2:
superio0: <Nuvoton NCT5104D/NCT6102D/NCT6106D (rev. B+)> at port 0x2e-0x2f on isa0
wbwd0: <Nuvoton NCT6102 (0xc4/0x53) Watchdog Timer> at WDT ldn 0x08 on superio0
The watchdog output is incorrectly wired on that system and the watchdog
does not really do it its job, but the pulse can be seen with a signal
analyzer.
Kyle Evans [Wed, 16 Oct 2019 14:43:05 +0000 (14:43 +0000)]
libbe(3): add needed bits for be_destroy to auto-destroy some origins
New BEs can be created from either an existing snapshot or an existing BE.
If an existing BE is chosen (either implicitly via 'bectl create' or
explicitly via 'bectl create -e foo bar', for instance), then bectl will
create a snapshot of the current BE or "foo" with be_snapshot, with a name
formatted like: strftime("%F-%T") and a serial added to it.
This commit adds the needed bits for libbe or consumers to determine if a
snapshot names matches one of these auto-created snapshots (with some light
validation of the date/time/serial), and also a be_destroy flag to specify
that the origin should be automatically destroyed if possible.
A future commit to bectl will specify BE_DESTROY_AUTOORIGIN by default so we
clean up the origin in the most common case, non-user-managed snapshots.
Andrew Turner [Wed, 16 Oct 2019 13:30:28 +0000 (13:30 +0000)]
Use tables to store the information to decode the arm64 ID registers.
Arm updates these with each new architecture revision. To help keep them
updated use a collection of tables to hold the needed information to
decode these registers.
Andrew Turner [Wed, 16 Oct 2019 13:21:01 +0000 (13:21 +0000)]
Stop leaking information from the kernel through timespec
The timespec struct holds a seconds value in a time_t and a nanoseconds
value in a long. On most architectures these are the same size, however
on 32-bit architectures other than i386 time_t is 8 bytes and long is
4 bytes.
Most ABIs will then pad a struct holding an 8 byte and 4 byte value to
16 bytes with 4 bytes of padding. When copying one of these structs the
compiler is free to copy the padding if it wishes.
In this case the padding may contain kernel data that is then leaked to
userspace. Fix this by copying the timespec elements rather than the
entire struct.
This doesn't affect Tier-1 architectures so no SA is expected.
Warner Losh [Wed, 16 Oct 2019 13:20:36 +0000 (13:20 +0000)]
bsd.compat.mk isn't setup to be included outside of Makefile.inc so comment it
out here until that's sorted out. Otherwise the build is broken. when
TARGET_ARCH isn't defined.
https://www.illumos.org/issues/10809
Port ZoL ee36c709c3d Performance optimization of AVL tree comparator functions
This is a followup to r337567 that imported the ZoL commit directly into
FreeBSD. It seems that at the time we did not have some of the earlier
changes, so some pieces of the ZoL change were not applicable. Also,
the illumos version got a few style cleanups. Some changes were missed
or incorrectly merged (e.g., vdev_cache_lastused_compare and
metaslab_rangesize_compare).
Fix panic in network stack due to use after free when receiving
partial fragmented packets before a network interface is detached.
When sending IPv4 or IPv6 fragmented packets and a fragment is lost
before the network device is freed, the mbuf making up the fragment
will remain in the temporary hashed fragment list and cause a panic
when it times out due to accessing a freed network interface
structure.
1) Make sure the m_pkthdr.rcvif always points to a valid network
interface. Else the rcvif field should be set to NULL.
2) Use the rcvif of the last received fragment as m_pkthdr.rcvif for
the fully defragged packet, instead of the first received fragment.
Replace rdma_is_upper_dev_rcu() with rdma_vlan_dev_real_dev() in ibcore.
This reduces the number of references to VLAN_TRUNKDEV() in ibcore.
Currently only VLAN is supported as a child interface in FreeBSD.
Remove superfluous RCU locking.
https://www.illumos.org/issues/10809
Port ZoL ee36c709c3d Performance optimization of AVL tree comparator functions
From the ZoL commit msg:
perf: 2.75x faster ddt_entry_compare()
First 256bits of ddt_key_t is a block checksum, which are expected
to be close to random data. Hence, on average, comparison only needs to
look at first few bytes of the keys. To reduce number of conditional
jump instructions, the result is computed as: sign(memcmp(k1, k2)).
Sign of an integer 'a' can be obtained as: `(0 < a) - (a < 0)` := {-1, 0, 1},
which is computed efficiently. Synthetic performance evaluation
of original and new algorithm over 1G random keys on 2.6GHz
Intel(R) Xeon(R) CPU E5-2660 v3:
old 6.85789 s
new 2.49089 s
perf: 2.8x faster vdev_queue_offset_compare() and vdev_queue_timestamp_compare()
Compute the result directly instead of using conditionals
perf: zfs_range_compare()
Speedup between 1.1x - 2.5x, depending on compiler version and
optimization level.
perf: spa_error_entry_compare()
`bcmp()` is not suitable for comparator use. Use `memcmp()` instead.
perf: 2.8x faster metaslab_compare() and metaslab_rangesize_compare()
perf: 2.8x faster zil_bp_compare()
Portions contributed by: Jerry Jelinek <jerry.jelinek@joyent.com>
Author: Gvozden Neskovic <neskovic@gmail.com>
https://www.illumos.org/issues/10841
ZoL 944a37248a0 predictive prefetch disabled on new pools until export/reboot
When a pool is initially created (by `zpool create`), predictive
prefetch is inadvertently disabled, until the pool is
export/import-ed, or the machine is rebooted.
When device removal was introduced, we added some code to disable
predictive prefetching until indirect vdevs have been loaded. This
resulted in the "default state" of prefetch being disabled, until we
proactively enable it after indirect vdevs are loaded. Unfortunately
this resulted in a few bugs where in some code paths we neglect to
enable predictive prefetch. The first of these was fixed by
https://github.com/zfsonlinux/zfs/commit/20507534d4ede14d4dd82c99fc8d461704ce7419
This commit fixes another case where we also need to explicitly enable
predictive prefetch, when the pool is initially created.
Fix assert in PowerPC pmaps after introduction of object busy.
The VM_PAGE_OBJECT_BUSY_ASSERT() in pmap_enter() implementation should
be only asserted when the code is executed as result of pmap_enter(),
not when the same code is entered from e.g. pmap_enter_quick(). This
is relevant for all PowerPC pmap variants, because mmu_*_enter() is
used as the backend, and assert is located there.
Add a PowerPC private pmap_enter() PMAP_ENTER_QUICK_LOCKED flag to
indicate that the call is not from pmap_enter(). For non-quick-locked
calls, assert that the object is locked.
Reported and tested by: bdragon
Reviewed by: alc, bdragon, markj
Sponsored by: The FreeBSD Foundation
Differential revision: https://reviews.freebsd.org/D22041
https://www.illumos.org/issues/9691
When iterating over a ZAP object, we're almost always certain to
iterate over the entire object. If there are multiple leaf blocks, we
can realize a performance win by issuing reads for all the leaf blocks
in parallel when the iteration begins.
For example, if we have 10,000 snapshots, "zfs destroy -nv
pool/fs@1%9999" can take 30 minutes when the cache is cold. This
change provides a >3x performance improvement, by issuing the reads
for all ~64 blocks of each ZAP object in parallel.
https://www.illumos.org/issues/9691
When iterating over a ZAP object, we're almost always certain to
iterate over the entire object. If there are multiple leaf blocks, we
can realize a performance win by issuing reads for all the leaf blocks
in parallel when the iteration begins.
For example, if we have 10,000 snapshots, "zfs destroy -nv
pool/fs@1%9999" can take 30 minutes when the cache is cold. This
change provides a >3x performance improvement, by issuing the reads
for all ~64 blocks of each ZAP object in parallel.
https://www.illumos.org/issues/9425
Problem Statement
ZFS Channel program scripts currently require a timeout, so that hung
or long-running scripts return a timeout error instead of causing ZFS
to get wedged. This limit can currently be set up to 100 million Lua
instructions. Even with a limit in place, it would be desirable to
have a sys admin (support engineer) be able to cancel a script that is
taking a long time.
Proposed Solution
Make it possible to abort a channel program by sending an interrupt
signal.In the underlying txg_wait_sync function, switch the cv_wait to
a cv_wait_sig to catch the signal. Once a signal is encountered, the
dsl_sync_task function can install a Lua hook that will get called
before the Lua interpreter executes a new line of code. The
dsl_sync_task can resume with a standard txg_wait_sync call and wait
for the txg to complete. Meanwhile, the hook will abort the script and
indicate that the channel program was canceled. The kernel returns a
EINTR to indicate that the channel program run was canceled.
FreeBSD note: the return value of cv_wait_sig() has inverted meaning
between us and illumos.