karels [Tue, 12 Nov 2019 01:03:08 +0000 (01:03 +0000)]
Fix netstat -gs with ip_mroute module and/or vnet
The code for "netstat -gs -f inet" failed if the kernel namelist did not
include the _mrtstat symbol. However, that symbol is not in a standard
kernel even with the ip_mroute module loaded, where the functionality is
available. It is also not in a kernel with MROUTING but also VIMAGE, as
there can be multiple sets of stats. However, when running the command
on a live system, the symbol is not used; a sysctl is used. Go ahead
and try the sysctl in any case, and complain that IPv4 MROUTING is not
present only if the sysctl fails with ENOENT. Also fail if _mrtstat is
not defined when running on a core file; netstat doesn't know about vnets,
so can only work if MROUTING was included, and VIMAGE was not.
kib [Mon, 11 Nov 2019 21:59:20 +0000 (21:59 +0000)]
amd64: Issue MFENCE on context switch on AMD CPUs when reusing address space.
On some AMD CPUs, in particular, machines that do not implement
CLFLUSHOPT but do provide CLFLUSH, the CLFLUSH instruction is only
synchronized with MFENCE.
Code using CLFLUSH typicall needs to brace it with MFENCE both before
and after flush, see for instance pmap_invalidate_cache_range(). If
context switch occurs while inside the protected region, we need to
ensure visibility of flushes done on the old CPU, to new CPU.
For all other machines, locked operation done to lock switched thread,
should be enough. For case of different address spaces, reload of
%cr3 is serializing.
Reviewed by: cem, jhb, scottph
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D22007
markj [Mon, 11 Nov 2019 20:44:30 +0000 (20:44 +0000)]
Fix handling of PIPE_EOF in the direct write path.
Suppose a writing thread has pinned its pages and gone to sleep with
pipe_map.cnt > 0. Suppose that the thread is woken up by a signal (so
error != 0) and the other end of the pipe has simultaneously been
closed. In this case, to satisfy the assertion about pipe_map.cnt in
pipe_destroy_write_buffer(), we must mark the buffer as empty.
avg [Mon, 11 Nov 2019 19:06:04 +0000 (19:06 +0000)]
db_nextframe/i386: reduce the number of special frame types
This change removes TRAP_INTERRUPT and TRAP_TIMERINT frame types.
Their names are a bit confusing: trap + interrupt, what is that?
The TRAP_TIMERINT name is too specific -- can it only be used for timer
"trap-interrupts"? What is so special about them?
My understanding of the code is that INTERRUPT, TRAP_INTERRUPT and
TRAP_TIMERINT differ only in how an offset from callee's frame pointer to a
trap frame on the stack is calculated. And that depends on a number of
arguments that a special handler passes to a callee (a function with a
normal C calling convention).
So, this change makes that logic explicit and collapses all interrupt frame
types into the INTERRUPT type.
vangyzen [Mon, 11 Nov 2019 17:41:52 +0000 (17:41 +0000)]
tip/cu: check for EOF on input on the local side
If cu reads an EOF on the input side, it goes into a tight loop
sending a garbage byte to the remote. With this change, it exits
gracefully, along with its child.
imp [Mon, 11 Nov 2019 17:36:57 +0000 (17:36 +0000)]
Add asserts for some state transitions
For the PROBEWP and PROBERC* states, add assertiosn that both the da device
state is in the right state, as well as the ccb state is the right one when we
enter dadone_probe{wp,rc}. This will ensure that we don't sneak through when
we're re-probing the size and write protection status of the device and thereby
leak a reference which can later lead to an invalidated peripheral going away
before all references are released (and resulting panic).
Reviewed by: scottl, ken
Differential Revision: https://reviews.freebsd.org/D22295
imp [Mon, 11 Nov 2019 17:36:52 +0000 (17:36 +0000)]
Update the softc state of the da driver before releasing the CCB.
There are contexts where releasing the ccb triggers dastart() to be run
inline. When da was written, there was always a deferral, so it didn't matter
much. Now, with direct dispatch, we can call dastart from the dadone*
routines. If the probe state isn't updated, then dastart will redo things with
stale information. This normally isn't a problem, because we run the probe state
machine once at boot... Except that we also run it for each open of the device,
which means we can have multiple threads racing each other to try to kick off
the probe. However, if we update the state before we release the CCB, we can
avoid the race. While it's needed only for the probewp and proberc* states, do
it everywhere because it won't hurt the other places.
The race here happens because we reprobe dozens of times on boot when drives
have lots of partitions. We should consider caching this info for 1-2 seconds
to avoid this thundering hurd.
Reviewed by: scottl, ken
Differential Revision: https://reviews.freebsd.org/D22295
avg [Mon, 11 Nov 2019 17:11:49 +0000 (17:11 +0000)]
db_nextframe/amd64: remove TRAP_INTERRUPT frame type
Besides the confusing name, this type is effectively unused.
In all cases where it could be set, the INTERRUPT type is set by the
earlier code. The conditions for TRAP_INTERRUPT are a subset of the
conditions for INTERRUPT.
dougm [Mon, 11 Nov 2019 16:59:49 +0000 (16:59 +0000)]
swap_pager_meta_free() frees allocated blocks in a way that
exploits the sparsity of allocated blocks in a range, without
issuing an "are you there?" query for every block in the range.
swap_pager_copy() is not so smart. Modify the implementation
of swap_pager_meta_free() slightly so that swap_pager_copy()
can use that smarter implementation too.
Based on an observation of: Yoshihiro Ota (ota_j.email.ne.jp)
Reviewed by: kib,alc
Tested by: pho
Differential Revision: https://reviews.freebsd.org/D22280
glebius [Mon, 11 Nov 2019 06:28:25 +0000 (06:28 +0000)]
It is unclear why in6_pcblookup_local() would require write access
to the PCB hash. The function doesn't modify the hash. It always
asserted write lock historically, but with epoch conversion this
fails in some special cases.
cognet [Mon, 11 Nov 2019 00:21:05 +0000 (00:21 +0000)]
linprocfs: Make sure to report -1 as tty when we have no controlling tty.
When reporting a process' stats, we can't just provide the tty as an
unsigned long, as if we have no controlling tty, the tty would be NODEV, or
-1. Instaed, just special-case NODEV.
Submitted by: Juraj Lutter <otis@sk.FreeBSD.org>
MFC after: 1 week
jhibbits [Sun, 10 Nov 2019 22:08:07 +0000 (22:08 +0000)]
Consolidate powerpcspe CFLAGS
Don't depend on CPUTYPE to define powerpcspe CFLAGS, they should be set
unconditionally. This reduces duplication. Also, set some CFLAGS as
gcc-only, because clang's SPE support always uses the SPE ABI, it's not an
optional feature.
kib [Sun, 10 Nov 2019 09:28:18 +0000 (09:28 +0000)]
amd64: move common_tss into pcpu.
This saves some memory, around 256K I think. It removes some code,
e.g. KPTI does not need to specially map common_tss anymore. Also,
common_tss become domain-local.
Reviewed by: jhb
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D22231
alc [Sun, 10 Nov 2019 05:22:01 +0000 (05:22 +0000)]
Eliminate a redundant pmap_load() from pmap_remove_pages().
There is no reason why the pmap_invalidate_all() in pmap_remove_pages()
must be performed before the final PV list lock release. Move it past
the lock release.
Eliminate a stale comment from pmap_page_test_mappings(). We implemented
a modified bit in r350004.
mav [Sun, 10 Nov 2019 03:37:45 +0000 (03:37 +0000)]
Add compact scraptchpad protocol for ntb_transport(4).
Previously ntb_transport(4) required at least 6 scratchpad registers,
plus 2 more for each additional memory window. That is too much for some
configurations, where several drivers have to share resources of the same
NTB hardware. This patch introduces new compact version of the protocol,
requiring only 3 scratchpad registers, plus one more for each additional
memory window. The optimization is based on fact that neither of version,
number of windows or number of queue pairs really need more then one byte
each, and window sizes of 4GB are not very useful now. The new protocol
is activated automatically when the configuration is low on scratchpad
registers, or it can be activated explicitly with loader tunable.
mav [Sun, 10 Nov 2019 03:24:53 +0000 (03:24 +0000)]
Allow splitting PLX NTB BAR2 into several memory windows.
Address Lookup Table (A-LUT) being enabled allows to specify separate
translation for each 1/128th or 1/256th of the BAR2. Previously it was
used only to limit effective window size by blocking access through some
of A-LUT elements. This change allows A-LUT elements to also point
different memory locations, providing to upper layers several (up to 128)
independent memory windows. A-LUT hardware allows even more flexible
configurations than this, but NTB KPI have no way to manage that now.
kevans [Sun, 10 Nov 2019 03:06:03 +0000 (03:06 +0000)]
bcm2835_sdhci: don't panic in DMA interrupt if curcmd went away
This is an exceptional case; generally found during controller errors.
A panic when we attempt to acess slot->curcmd->data is less ideal than
warning, and other verbiage will be emitted to indicate the exact error.
rmacklem [Sun, 10 Nov 2019 01:08:14 +0000 (01:08 +0000)]
Update copy_file_range(2) to be Linux5 compatible.
The current linux man page and testing done on a fairly recent linux5.n
kernel have identified two changes to the semantics of the linux
copy_file_range system call.
Since the copy_file_range(2) system call is intended to be linux compatible
and is only currently in head/current and not used by any commands,
it seems appropriate to update the system call to be compatible with
the current linux one.
The first of these semantic changes was changed to be compatible with
linux5.n by r354564.
For the second semantic change, the old linux man page stated that, if
infd and outfd referred to the same file, EBADF should be returned.
Now, the semantics is to allow infd and outfd to refer to the same file
so long as the byte ranges defined by the input file offset, output file offset
and len does not overlap. If the byte ranges do overlap, EINVAL should be
returned.
This patch modifies copy_file_range(2) to be linux5.n compatible for this
semantic change.
trasz [Sat, 9 Nov 2019 17:30:19 +0000 (17:30 +0000)]
Add GEOM attribute to report physical device name, and report it
via 'diskinfo -v'. This avoids the need to track it down via CAM,
and should also work for disks that don't use CAM. And since it's
inherited thru the GEOM hierarchy, in most cases one doesn't need
to walk the GEOM graph either, eg you can use it on a partition
instead of disk itself.
dougm [Sat, 9 Nov 2019 17:08:27 +0000 (17:08 +0000)]
For vm_map, #defining DIAGNOSTIC to turn on full assertion-based
consistency checking slows performance dramatically. This change
reduces the number of assertions checked by completely walking the
vm_map tree only when the write-lock is released, and only then if the
number of modifications to the tree since the last walk exceeds the
number of tree nodes.
rmacklem [Fri, 8 Nov 2019 23:39:17 +0000 (23:39 +0000)]
Update copy_file_range(2) to be Linux5 compatible.
The current linux man page and testing done on a fairly recent linux5.n
kernel have identified two changes to the semantics of the linux
copy_file_range system call.
Since the copy_file_range(2) system call is intended to be linux compatible
and is only currently in head/current and not used by any commands,
it seems appropriate to update the system call to be compatible with
the current linux one.
The old linux man page stated that, if the
offset + len exceeded file_size for the input file, EINVAL should be returned.
Now, the semantics is to copy up to at most file_size bytes and return that
number of bytes copied. If the offset is at or beyond file_size, a return
of 0 bytes is done.
This patch modifies copy_file_range(2) to be linux compatible for this
semantic change.
A separate patch will change copy_file_range(2) for the other semantic
change, which allows the infd and outfd to refer to the same file, so
long as the byte ranges do not overlap.
kevans [Fri, 8 Nov 2019 20:14:36 +0000 (20:14 +0000)]
bcm2835_sdhci: remove unused power_id field
This was once set, but I removed it by the time I committed it because both
configurations use the same POWER_ID. This can be separated back out if the
situation changes.
kevans [Fri, 8 Nov 2019 20:12:57 +0000 (20:12 +0000)]
bcm2835_sdhci: add some very basic support for rpi4
DMA is currently disabled while I work out why it's broken, but this is
enough for upstream U-Boot + rpi-firmware + our rpi3-psci-monitor to boot
with the right config.
The RPi 4 is still not in a good "supported" state, as we have no
USB/PCI-E/Ethernet drivers, but if air-gapped pies only able to operate over
cereal is your thing, here's your guy.
Submitted by: Robert Crowston (with modifications)
manu [Fri, 8 Nov 2019 20:08:44 +0000 (20:08 +0000)]
loader.efi: Default to serial if we don't have a ConOut variable
In the EFI implementation in U-Boot no ConOut efi variable is created,
this cause loader to fallback to TERM_EMU implementation which is very
very very slow (and uses the ConOut device in the system table anyway).
The UEFI spec aren't clear as if this variable needs to exists or not.
mmel [Fri, 8 Nov 2019 18:57:41 +0000 (18:57 +0000)]
Implement support for (soft)linked clocks.
This kind of clock nodes represent temporary placeholder for clocks
defined later in boot process. Also, these are necessary to break
circular dependencies occasionally occurring in complex clock graphs.
vmaffione [Fri, 8 Nov 2019 17:57:03 +0000 (17:57 +0000)]
bhyve: add support for virtio-net mergeable rx buffers
Mergeable rx buffers is a virtio-net feature that allows the hypervisor
to use multiple RX descriptor chains to receive a single receive packet.
Without this feature, a TSO-enabled guest is compelled to publish only
64K (or 32K) long chains, and each of these large buffers is consumed
to receive a single packet, even a very short one. This is a waste of
memory, as a RX queue has room for 256 chains, which means up to 16MB
of buffer memory for each (single-queue) vtnet device.
With the feature on, the guest can publish 2K long chains, and the
hypervisor will merge them as needed.
This change also enables the feature in the netmap backend, which
supports virtio-net offloads. We plan to add support for the
tap backend too.
Note that differently from QEMU/KVM, here we implement one-copy receive,
while QEMU uses two copies.
emaste [Fri, 8 Nov 2019 14:59:41 +0000 (14:59 +0000)]
elfcopy/strip: Ensure sections have required alignment on output
Object files may specify insufficient alignment on certain sections, for
example due to a bug in NASM[1]. When we detect that case in elfcopy or
strip, emit a warning and increase the alignment to the minimum
required.
The NASM bug was fixed in 2015[2], but we might as well have this fixup
(and warning) in elfcopy in case we encounter such a file for any other
reason.
This might be reworked somewhat upstream - see ELF Tool Chain
ticket 485[3].
emaste [Fri, 8 Nov 2019 14:51:09 +0000 (14:51 +0000)]
kvm: fix types for cross-debugging
As with other libkvm interfaces use maximum-sized types to support
cross-debugging (e.g. a 64-bit vmcore on a 32-bit host). See
https://lists.freebsd.org/pipermail/svn-src-all/2019-February/176051.html
for further discussion.
This is an API-breaking change, but there are few consumers of this
interface today.
Reviewed by: will
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D21945
frag6: properly handle atomic fragments according to RFCs.
RFC 8200 says:
"If the fragment is a whole datagram (that is, both the Fragment
Offset field and the M flag are zero), then it does not need
any further reassembly and should be processed as a fully
reassembled packet (i.e., updating Next Header, adjust Payload
Length, removing the Fragment header, etc.). .."
That means we should remove the fragment header and make all the adjustments
rather than just skipping over the fragment header. The difference should
be noticeable in that a properly handled atomic fragment triggering an ICMPv6
message at an upper layer (e.g. dest unreach, unreachable port) will not
include the fragment header.
Update the test cases to also test for an unfragmentable part. That is
needed so that the next header is properly updated (not just lengths).
kevans [Fri, 8 Nov 2019 14:28:39 +0000 (14:28 +0000)]
csu: Fix dynamiclib/init_test:jcr_test on !HAVE_CTORS archs
.jcr still needs a 0-entry added in crtend, even on !HAVE_CTORS archs, as
we're still getting .jcr sections added -- presumably due to the reference
in crtbegin. Without this terminal, the .jcr section (without data) overlaps
with the next section and register_classes in crtbegin will be examining the
wrong item.
PR: 241439
Reviewed by: andrew
Differential Revision: https://reviews.freebsd.org/D22132
emaste [Fri, 8 Nov 2019 14:25:26 +0000 (14:25 +0000)]
add reference to PR for sparc64 BSD_CRTBEGIN in BROKEN_OPTIONS
We will soon remove the BSD_CRTBEGIN option (and will use the new CRT
files always) as part of the GCC 4.2.1 removal. Right now BSD_CRTBEGIN
works everywhere but sparc64; add a reference to the PR in case anyone
stumbles across this and is looking for more information.
emaste [Fri, 8 Nov 2019 14:11:25 +0000 (14:11 +0000)]
makefs: avoid warning when creating FAT filesystem on existing file
Previously the mkfs_msdos function (from newfs_msdos) emitted warnings
in the case that an image size is specified and the target is not a
file, or no size is specified and the target is not a character device.
The latter warning (not a character device) doesn't make sense when this
code is used in makefs, regardless of whether an image size is specified
or not.
emaste [Fri, 8 Nov 2019 14:06:48 +0000 (14:06 +0000)]
suggest xtoolchain package if binutils and GCC bootstraps are both broken
Previously we checked for only BINUTILS_BOOTSTRAP as a broken option
and suggested installing the binutils package. This was originally done
for arm64 where we used the in-tree Clang with external binutils package.
Add a case to the warning to suggest instead the full xtoolchain package
if we have no in-tree compiler either.
trasz [Fri, 8 Nov 2019 11:09:50 +0000 (11:09 +0000)]
Humanize more columns in the vmstat(8) output and adjust widths.
The few columns that are not humanized are usually 0. This makes
the output mostly aligned.
rmacklem [Fri, 8 Nov 2019 06:40:17 +0000 (06:40 +0000)]
Fix the man page to correctly describe the use of the "len" argument.
The man page incorrectly described the use of the"len" argument, which
is updated to the number of bytes copied and not reduced by the number
of bytes copied.
jhibbits [Fri, 8 Nov 2019 04:26:19 +0000 (04:26 +0000)]
powerpc/booke: Only handle kernel page faults in KVA range
The memory range between VM_MAXUSER_ADDRESS and VM_MIN_KERNEL_ADDRESS is
reserved for devices currently, which are always mapped in TLB1, and
therefore do not exist in the kernel page table. Any page fault in this
range is therefore automatically a fatal fault.
jhibbits [Fri, 8 Nov 2019 03:45:13 +0000 (03:45 +0000)]
powerpc/booke: Make the TLB save area and mask match
Since TLB_MAXNEST is 3, the insert mask should only be 2 bits. Given that 2
bits counts to 4, and that we already have plenty of space wasted in
padding, make the nest level 4 to match the mask.
jhibbits [Fri, 8 Nov 2019 03:36:19 +0000 (03:36 +0000)]
powerpc/mpc85xx: Add MSI support for Freescale PowerPC SoCs
Freescale SoCs use a set of IRQs at the high end of the OpenPIC IRQ
list, not counted in the NIRQs of the Feature reporting register. Some
SoCs include a MSI inbound window in the PCIe controller configuration
registers as well, but some don't. Currently, this only handles the
SoCs *with* the MSI window.
There are 256 MSIs per MSI bank (32 per MSI IRQ, 8 IRQs per MSI bank).
The P5020 has 3 banks, yielding up to 768 MSIs; older SoCs have only one
bank.
kevans [Fri, 8 Nov 2019 03:27:56 +0000 (03:27 +0000)]
bcm2835_dma: Mark IRQs shareable
On the RPi4, some of these IRQs are shared. Start moving toward a mode where
we accept that shared IRQs happen and simply ignore interrupts that are
seemingly for no reason.
I would like to be more verbose here, but my 30-minute assessment of the
current world order is that mapping a resource/rid to an actual IRQ number
(as found in FDT) data is not a simple matter. Determining if more than one
handler is attached to an IRQ is closer to feasible, but it's unclear which
way is the cleaner path. Beyond that, we're only really using it to be
slightly more verbose when something's going wrong, so for now just suppress
and drop a complaint-comment.
This was originally submitted (via freebsd-arm@) by Robert Crowston; the
additional verbosity was dropped by kevans@.
Submitted by: Robert Crowston <crowston@protonmail.com>
markj [Thu, 7 Nov 2019 23:37:17 +0000 (23:37 +0000)]
iwm: Fix scheduler configuration for aux and cmd queue configuration.
- Configure the scheduler only for the management queue.
- Fix a bug when enabling the schduler: the queues are specified using a
bitmask.
- Fix style in the area.
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
markj [Thu, 7 Nov 2019 23:37:02 +0000 (23:37 +0000)]
iwm: Implement the new receive path.
This is the multiqueue receive code required for 9000-series chips.
Note that we still only configure a single RX queue for now. Multiqueue
support will require MSI-X configuration and a scheme for managing a
global pool of RX buffers.
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
markj [Thu, 7 Nov 2019 23:36:46 +0000 (23:36 +0000)]
iwm: Enable all 31 tx queues.
For now iwm only ever uses queue 0 and the management queue, but my 9560
raises a software error interrupt during initialization if this flag is
not set. iwlwifi sets it for all 7000- and 8000-series hardware, so we
might as well do it unconditionally.
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation