Jeff Roberson [Thu, 16 Jan 2020 05:01:21 +0000 (05:01 +0000)]
Simplify VM and UMA startup by eliminating boot pages. Instead use careful
ordering to allocate early pages in the same way boot pages were but only
as needed. After the KVA allocator has started up we allocate the KVA that
we consumed during boot. This also makes the boot pages freeable since they
have vm_page structures allocated with the rest of memory.
Parts of this patch were written and tested by markj.
Leandro Lupori [Wed, 15 Jan 2020 20:25:52 +0000 (20:25 +0000)]
[PPC64] memcpy/memmove/bcopy optimization
For copies shorter than 512 bytes, the data is copied using plain
ld/std instructions.
For 512 bytes or more, the copy is done in 3 phases:
Phase 1: copy from the src buffer until it's aligned at a 16-byte boundary
Phase 2: copy as many aligned 64-byte blocks from the src buffer as possible
Phase 3: copy the remaining data, if any
In phase 2, this code uses VSX instructions when available. Otherwise,
it uses ldx/stdx.
Kirk McKusick [Wed, 15 Jan 2020 18:53:32 +0000 (18:53 +0000)]
Peter Holm reports that his test that does an umount(8) on an active
mount point while numerous tests are running that are writing to
files on that mount point cause the unmount(8) to hang forever.
The unmount(8) system call is handled in the kernel by the dounmount()
function. The cause of the hang is that prior to dounmount() calling
VFS_UNMOUNT() it is calling VFS_SYNC(mp, MNT_WAIT). The MNT_WAIT
flag indicates that VFS_SYNC() should not return until all the dirty
buffers associated with the mount point have been written to disk.
Because user processes are allowed to continue writing and can do
so faster than the data can be written to disk, the call to VFS_SYNC()
can never finish.
Unlike VFS_SYNC(), the VFS_UNMOUNT() routine can suspend all processes
when they request to do a write thus having a finite number of dirty
buffers to write that cannot be expanded. There is no need to call
VFS_SYNC() before calling VFS_UNMOUNT(), because VFS_UNMOUNT() needs
to flush everything again anyway after suspending writes, to catch
anything that was dirtied between the VFS_SYNC() and writes being
suspended.
The fix is to simply remove the unnecessary call to VFS_SYNC() from
dounmount().
Reported by: Peter Holm
Analysis by: Chuck Silvers
Tested by: Peter Holm
MFC after: 7 days
Sponsored by: Netflix
Kyle Evans [Wed, 15 Jan 2020 15:59:32 +0000 (15:59 +0000)]
mips trampoline: don't bother with unwind tables
The utility here seems somewhat limited, but clang will attempt to generate
.eh_frame and actively fail in doing so. It is perhaps worth investigating
why it's being generated in the first place (GCC doesn't do so), but this
isn't a high priority.
Ed Maste [Wed, 15 Jan 2020 13:52:13 +0000 (13:52 +0000)]
Update WITHOUT_BINUTILS* descriptions
In the WITHOUT_ descriptions we don't need to mention that ld.bfd is
limited to powerpc. When WITHOUT_BINUTILS is specified ld.bfd is not
installed on any CPU architecture.
Gleb Smirnoff [Wed, 15 Jan 2020 06:05:20 +0000 (06:05 +0000)]
Introduce NET_EPOCH_CALL() macro and use it everywhere where we free
data based on the network epoch. The macro reverses the argument
order of epoch_call(9) - first function, then its argument. NFC
Gleb Smirnoff [Wed, 15 Jan 2020 03:40:32 +0000 (03:40 +0000)]
Since this code dereferences struct ifnet, it must include if_var.h
explicitly, not via header pollution. While here move TCPSTATES
declaration right above the include that is going to make use of it.
Gleb Smirnoff [Wed, 15 Jan 2020 03:34:21 +0000 (03:34 +0000)]
- Move global network epoch definition to epoch.h, as more different
subsystems tend to need to know about it, and including if_var.h is
huge header pollution for them. Polluting possible non-network
users with single symbol seems much lesser evil.
- Remove non-preemptible network epoch. Not used yet, and unlikely
to get used in close future.
Mateusz Guzik [Wed, 15 Jan 2020 01:30:32 +0000 (01:30 +0000)]
rtld: remove hand rolled memset and bzero
They were introduced to take care of ifunc, but right now no architecture
provides ifunc'ed variants. Since rtld uses memset extensively this results in
a pessmization. Should someone want to use ifunc here they should provide a
mandatory symbol (e.g., rtld_memset).
Kirk McKusick [Tue, 14 Jan 2020 22:27:46 +0000 (22:27 +0000)]
When sync'ing a mount point, the mount point's vnodes were scanned
twice. Once to update the changed inodes, and a second time to update
changed quota information. This change merges these two scans into a
single scan which does both inode and quota updates.
Kyle Evans [Tue, 14 Jan 2020 17:50:13 +0000 (17:50 +0000)]
Revert r353140: Re-add ALLOW_MIPS_SHARED_TEXTREL, sprinkle it around
arichardson has an actual fix for the same issue that this was working
around; given that we don't build with llvm today, go ahead and revert the
workaround in advance.
Ed Maste [Tue, 14 Jan 2020 17:35:34 +0000 (17:35 +0000)]
Update WITH_/WITHOUT_CLANG_IS_CC descriptions
Describe /usr/bin/cc etc. as links to the compiler, and don't conflate
WITHOUT_CLANG_IS_CC with installing GCC. Leave a reference to WITH_GCC
and WITHOUT_CLANG_IS_CC installing links to GCC, although this will be
removed in ~1.5 months when GCC 4.2.1 is removed from the tree.
Andriy Gapon [Tue, 14 Jan 2020 13:20:16 +0000 (13:20 +0000)]
storvsc: port a Linux patch, properly set residual data length on errors
This change is based on Linux commit 40630f462824ee. csio.resid should
account for transfer_len only for success and SRB_STATUS_DATA_OVERRUN
condition.
I am not sure how exactly this change works, but I have a report from a
user that they see lots of checksum errors when running a pool scrub
concurrently with iozone -l 1 -s 100G. After applying this patch the
problem cannot be reproduced.
asprintf returns -1, not an arbitrary value < 0. Also upon error the
(very sloppy specification) leaves an undefined value in *ret, so it is
wrong to inspect it, the error condition is enough.
Alexander Motin [Tue, 14 Jan 2020 03:27:57 +0000 (03:27 +0000)]
Restore loop break in vm_pageout_lowmem().
r355004 removed return statement from this loop with intention to also
call uma_reclaim_wakeup(). But in case of vm.lowmem_period=0 it causes
infinite loop.
Ryan Libby [Tue, 14 Jan 2020 02:14:15 +0000 (02:14 +0000)]
uma: split slabzone into two sizes
By allowing more items per slab, we can improve memory efficiency for
small allocs. If we were just to increase the bitmap size of the
slabzone, we would then waste slabzone memory. So, split slabzone into
two zones, one especially for 8-byte allocs (512 per slab). The
practical effect should be reduced memory usage for counter(9).
Ryan Libby [Tue, 14 Jan 2020 02:14:02 +0000 (02:14 +0000)]
malloc: remove assumptions about MINALLOCSIZE
Remove assumptions about the minimum MINALLOCSIZE, in order to allow
testing of smaller MINALLOCSIZE. A following patch will lower the
MINALLOCSIZE, but not so much that the present patch is required for
correctness at these sites.
Jeff Roberson [Tue, 14 Jan 2020 02:00:24 +0000 (02:00 +0000)]
Fix a long standing bug in journaled soft-updates. The dirrem structure
needs to handle file removal, directory removal, file move, directory move,
etc. The code in handle_workitem_remove() needs to propagate any completed
journal entries to the write that will render the change stable. In the
case of a moved directory this means the new parent. However, for an
overwrite that frees a directory (DIRCHG) we must move the jsegdep to the
removed inode to be released when it is stable in the cg bitmap or the
unlinked inode list. This case was previously unhandled and caused a
panic.
netmap: disable passthrough with no hypervisor support
The netmap passthrough subsystem requires proper support in the
hypervisor. In particular, two PCI device ids (from the Red Hat
PCI vendor id 0x1b36) need to be assigned to the two netmap
virtual devices. We then disable these devices until the ids have
not been assigned, in order to avoid conflicts with other
virtual devices emulated by upstream QEMU.
vmx: fix initialization of TSO related descriptor fields
Fix a mistake introduced by r343291, which ported the vmx(4)
driver to iflib.
In case of TSO, the hlen field of the (first) tx descriptor must
be initialized to the cumulative length of Ethernet, IP and TCP
headers. The length of the TCP header was missing.
Dimitry Andric [Mon, 13 Jan 2020 20:31:10 +0000 (20:31 +0000)]
Merge commit f46ba4f07 from llvm git (by Simon Atanasyan):
[mips] Use less registers to load address of TargetExternalSymbol
There is no pattern matched `add hi, (MipsLo texternalsym)`. As a
result, loading an address of 32-bit symbol requires two registers
and one more additional instruction:
```
addiu $1, $zero, %lo(foo)
lui $2, %hi(foo)
addu $25, $2, $1
```
This patch adds the missed pattern and enables generation more
effective set of instructions:
```
lui $1, %hi(foo)
addiu $25, $1, %lo(foo)
```
Merge commit 59bb3609f from llvm git (by Simon Atanasyan):
[mips] Fix 64-bit address loading in case of applying 32-bit mask to
the result
If result of 64-bit address loading combines with 32-bit mask, LLVM
tries to optimize the code and remove "redundant" loading of upper
32-bits of the address. It leads to incorrect code on MIPS64 targets.
MIPS backend creates the following chain of commands to load 64-bit
address in the `MipsTargetLowering::getAddrNonPICSym64` method:
```
(add (shl (add (shl (add %highest(sym), %higher(sym)),
16),
%hi(sym)),
16),
%lo(%sym))
```
If the mask presents, LLVM decides to optimize the chain of commands.
It really does not make sense to load upper 32-bits because the
0x0fffffff mask anyway clears them. After removing redundant commands
we get this chain:
```
(add (shl (%hi(sym), 16), %lo(%sym))
```
There is no patterns matched `(MipsHi (i64 symbol))`. Due a bug in
`SYM_32` predicate definition, backend incorrectly selects a pattern
for a 32-bit symbols and uses the `lui` instruction for loading
`%hi(sym)`.
As a result we get incorrect set of instructions with unnecessary
16-bit left shifting:
```
lui at,0x0
R_MIPS_HI16 foo
dsll at,at,0x10
daddiu at,at,0
R_MIPS_LO16 foo
```
This patch resolves two problems:
- Fix `SYM_32/SYM_64` predicates to prevent selection of patterns
dedicated to 32-bit symbols in case of using N64 ABI.
- Add missed patterns for 64-bit symbols for `%hi/%lo`.
Toomas Soome [Mon, 13 Jan 2020 20:02:27 +0000 (20:02 +0000)]
Backout 356693. The libsa malloc does provide necessary alignment and
memalign by 4 will reduce alignment for some platforms. Thanks for Ian for
pointing this out.
Kyle Evans [Mon, 13 Jan 2020 18:26:27 +0000 (18:26 +0000)]
tap(4): also note that we drop configured addresses
This provides a specific pointer for users of tap(4) to understand why their
interfaces are losing their addresses, and specifically how to workaround
this if they need different behavior.
This manpage received a .Dd bump earlier today in r35688, so no bump occurs
this time.
Mateusz Guzik [Mon, 13 Jan 2020 14:33:51 +0000 (14:33 +0000)]
ufs: relax an overzealous assert added in r356671
Part of i_flag can persist across a drop to hold count of 0, at which
point the vnode is taken off the lazy list. Then whoever locks and unlocks
the vnode can trip on the assert.
This trips over kyua running a test untarring character devices to ufs.
Cy Schubert [Mon, 13 Jan 2020 06:55:31 +0000 (06:55 +0000)]
Unbound's config.h is manually maintained, using a ./configure produced
config.h as a guide. In practice contributed software maintains a copy
of config.h within its build directory tree containing its Makefile.
usr.sbin/unbound is the home for its config.h.
Mitchell Horne [Mon, 13 Jan 2020 03:39:02 +0000 (03:39 +0000)]
RISC-V: fix global symbol lookups for mpentry with lld
This is a follow up to r356481. In locore.S, before virtual memory is
set up, we should avoid using indirect address lookups through the GOT.
Therefore we need to convert uses of the la instruction to lla, which
always generates an auipc/addi pair of instructions. This conversion was
done for the BSP case, but not the AP case, resulting in a fault
somewhere before mpva and a failure to bring APs online.
Reported by: lwhsu
Reviewed by: lwhsu, jrtc27 (accepted in a comment)
Differential Revision: https://reviews.freebsd.org/D23138
Mateusz Guzik [Mon, 13 Jan 2020 02:39:41 +0000 (02:39 +0000)]
vfs: per-cpu batched requeuing of free vnodes
Constant requeuing adds significant lock contention in certain
workloads. Lessen the problem by batching it.
Per-cpu areas are locked in order to synchronize against UMA freeing
memory.
vnode's v_mflag is converted to short to prevent the struct from
growing.
Sample result from an incremental make -s -j 104 bzImage on tmpfs:
stock: 122.38s user 1780.45s system 6242% cpu 30.480 total
patched: 144.84s user 985.90s system 4856% cpu 23.282 total
Reviewed by: jeff
Tested by: pho (in a larger patch, previous version)
Differential Revision: https://reviews.freebsd.org/D22998
Mateusz Guzik [Mon, 13 Jan 2020 02:37:25 +0000 (02:37 +0000)]
vfs: rework vnode list management
The current notion of an active vnode is eliminated.
Vnodes transition between 0<->1 hold counts all the time and the
associated traversal between different lists induces significant
scalability problems in certain workloads.
Introduce a global list containing all allocated vnodes. They get
unlinked only when UMA reclaims memory and are only requeued when
hold count reaches 0.
Sample result from an incremental make -s -j 104 bzImage on tmpfs:
stock: 118.55s user 3649.73s system 7479% cpu 50.382 total
patched: 122.38s user 1780.45s system 6242% cpu 30.480 total
Reviewed by: jeff
Tested by: pho (in a larger patch, previous version)
Differential Revision: https://reviews.freebsd.org/D22997
Conrad Meyer [Sun, 12 Jan 2020 20:47:38 +0000 (20:47 +0000)]
getrandom(2): Add Linux GRND_INSECURE API flag
Treat it as a synonym for GRND_NONBLOCK. The reasoning is this:
We have two choices for handling Linux's GRND_INSECURE API flag.
1. We could ignore it completely (like GRND_RANDOM). However, this might
produce the surprising result of GRND_INSECURE requests blocking, when the
Linux API does not block.
2. Alternatively, we could treat GRND_INSECURE requests as requests for
GRND_NONBLOCk. Here, the surprising result for Linux programs is that
invocations with unseeded random(4) will produce EAGAIN, rather than
garbage.
Honoring the flag in the way Linux does seems fraught. If we actually use
the output of a random(4) implementation prior to seeding, we leak some
entropy (in an information theory and also practical sense) from what will
be the initial seed to attackers (or allow attackers to arbitrary DoS
initial seeding, if we don't leak). This seems unacceptable -- it defeats
the purpose of blocking on initial seeding.
Secondary to that concern, before seeding we may have arbitrarily little
entropy collected; producing output from zero or a handful of entropy bits
does not seem particularly useful to userspace.
If userspace can accept garbage, insecure, non-random bytes, they can create
their own insecure garbage with srandom(time(NULL)) or similar. Any program
which would be satisfied with a 3-bit key CTR stream has no need for CSPRNG
bytes. So asking the kernel to produce such an output from the secure
getrandom(2) API seems inane.
For now, we've elected to emulate GRND_INSECURE as an alternative spelling
of GRND_NONBLOCK (2). Consider this API not-quite stable for now. We
guarantee it will never block. But we will attempt to monitor actual port
uptake of this bizarre API and may revise our plans for the unseeded
behavior (prior stable/13 branching).
Approved by: csprng(markm), manpages(bcr)
See also: https://lwn.net/ml/linux-kernel/cover.1577088521.git.luto@kernel.org/
See also: https://lwn.net/ml/linux-kernel/20200107204400.GH3619@mit.edu/
Differential Revision: https://reviews.freebsd.org/D23130
Fix the way 'factor' behaves when using OpenSSL to match the description
of how it works when not compiled with OpenSSL.
Also, allow users to specify a hexadecimal number by using a prefix of
'0x'. Before this, users could only specify a hexadecimal value if that
value included a hex digit ('a'-'f') in the value.
Michael Tuexen [Sun, 12 Jan 2020 17:52:32 +0000 (17:52 +0000)]
Fix race when accepting TCP connections.
When expanding a SYN-cache entry to a socket/inp a two step approach was
taken:
1) The local address was filled in, then the inp was added to the hash
table.
2) The remote address was filled in and the inp was relocated in the
hash table.
Before the epoch changes, a write lock was held when this happens and
the code looking up entries was holding a corresponding read lock.
Since the read lock is gone away after the introduction of the
epochs, the half populated inp was found during lookup.
This resulted in processing TCP segments in the context of the wrong
TCP connection.
This patch changes the above procedure in a way that the inp is fully
populated before inserted into the hash table.
Thanks to Paul <devgs@ukr.net> for reporting the issue on the net@
mailing list and for testing the patch!
Bjoern A. Zeeb [Sun, 12 Jan 2020 17:41:09 +0000 (17:41 +0000)]
nd6_rtr: constantly use __func__ for nd6log()
Over time one or two hard coded function names did not match the
actual function anymore. Consistently use __func__ for nd6log() calls
and re-wrap/re-format some messages for consitency.
Xin LI [Sun, 12 Jan 2020 06:13:52 +0000 (06:13 +0000)]
Tighten FAT checks and fix off-by-one error in corner case.
sbin/fsck_msdosfs/fat.c:
- readfat:
* Only truncate out-of-range cluster pointers (1, or greater than
NumClusters but smaller than CLUST_RSRVD), as the current cluster
may contain some data. We can't fix reserved cluster pointers at
this pass, because we do no know the potential cluster preceding
it.
* Accept valid cluster for head bitmap. This is a no-op, and mainly
to improve code readability, because the 1 is already handled in
the previous else if block.
- truncate_at: absorbed into checkchain.
- checkchain: save the previous node we have traversed in case that we
have a chain that ends with a special (>= CLUST_RSRVD) cluster, or is
free. In these cases, we need to truncate at the cluster preceding the
current cluster, as the current cluster contains a marker instead of
a next pointer and can not be changed to CLUST_EOF (the else case can
happen if the user answered "no" at some point in readfat()).
- clearchain: correct the iterator for next cluster so that we don't
stop after clearing the first cluster.
- checklost: If checkchain() thinks the chain have no cluster, it
doesn't make sense to reconnect it, so don't bother asking.
Kyle Evans [Sun, 12 Jan 2020 04:18:36 +0000 (04:18 +0000)]
Makefile.inc1: push /usr/libexec into the BPATH/TMPPATH
${WORLDTMP}/legacy/usr/libexec will only have libexec/ bits that we've
pushed as bootstrap tools, so this is generally safe to include prior to
PATH. The following are the ramifications of this change:
- BPATH addition gets us at least bootstrap flua in WMAKEENV path for
buildenv, for those earlier systems where it's bootstrapped still
- Reworked the sysent target to just set PATH and let it get worked out in
src.lua.mk or individual sysent makefiles -- this gives us back the
ability to overwrite LUA_CMD and use a different/external lua for these
targets. sysent can also now work cleanly in buildenv.
- tools/build/Makefile will now symlink the host flua into build's host
tools so that the above can work without needing to add the host's
/usr/libexec explicitly into TMPPATH.
Kyle Evans [Sun, 12 Jan 2020 04:07:03 +0000 (04:07 +0000)]
regulator: small enhancements to regulator_shutdown
Highlights:
- Exit early if we're not disabling unused regulators; there's no need to
take the regulator topology lock and re-evaluate this every iteration, as
it's not going to change.
- Don't emit a notice that we're shutting down a regulator if it's not
enabled, to reduce noise.
- Mention the outcome of the shutdown, to aide debugging and easily let
developer/user collect list of regulators we actually shutdown to
determine problematic one.
Reviewed by: manu
MFC after: 3 days
Differential Revision: https://reviews.freebsd.org/D22213