asomers [Mon, 23 Oct 2017 23:12:01 +0000 (23:12 +0000)]
Remove artificial restriction on lio_listio's operation count
In r322258 I made p1003_1b.aio_listio_max a tunable. However, further
investigation shows that there was never any good reason for that limit to
exist in the first place. It's used in two completely different ways:
* To size a UMA zone, which globally limits the number of concurrent
aio_suspend calls.
* To artifically limit the number of operations in a single lio_listio call.
There doesn't seem to be any memory allocation associated with this limit.
This change does two things:
* Properly names aio_suspend's UMA zone, and sizes it based on a new constant.
* Eliminates the artifical restriction on lio_listio. Instead, lio_listio
calls will now be limited by the more generous max_aio_queue_per_proc. The
old p1003_1b.aio_listio_max is now an alias for
vfs.aio.max_aio_queue_per_proc, so sysconf(3) will still work with
_SC_AIO_LISTIO_MAX.
asomers [Mon, 23 Oct 2017 23:05:29 +0000 (23:05 +0000)]
Fix the error message when creating a zpool on a too-small device
Don't check for SPA_MINDEVSIZE in vdev_geom_attach when opening by path.
It's redundant with the check in vdev_open, and failing to attach here
results in the wrong error message being printed. However, still check for
it in some other situations:
* When opening by guids, so we don't get bogged down reading from slow
devices like floppy drives.
* In vdev_geom_read_pool_label for the same reason, because we iterate over
all providers.
* If the caller requests that we verify the guid, because then we'll have to
read from the device before vdev_open verifies the size.
dim [Mon, 23 Oct 2017 21:31:04 +0000 (21:31 +0000)]
After jemalloc was updated to version 5.0.0 in r319971, i386 executables
linked with AddressSanitizer (even those linked on earlier versions of
FreeBSD, or with external versions of clang) started failing with errors
similar to:
This is because AddressSanitizer expects all the TLS data in the program
to be aligned to at least 8 bytes.
Before the jemalloc 5.0.0 update, all the TLS data in the i386 version
of libc.so added up to 80 bytes (a multiple of 8), but 5.0.0 made this
grow to 2404 bytes (not a multiple of 8). This is due to added caching
data in jemalloc's internal struct tsd_s.
To fix AddressSanitizer, ensure this struct is aligned to at least 16
bytes, which can be done unconditionally for all architectures. (An
earlier version of the fix aligned the struct to 8 bytes, but only for
ILP32 architectures. This was deemed unnecessarily complicated.)
shurd [Mon, 23 Oct 2017 20:50:08 +0000 (20:50 +0000)]
Some cache related optimizations
1. prefetch 128 bytes of mbufs.
2. Re-order filling the pkt_info so cache stalls happen at the end
3. Define empty prefetch2cachelines() macro when the function isn't present.
Provides small performance improvments on some hardware
kib [Mon, 23 Oct 2017 16:14:55 +0000 (16:14 +0000)]
Expand explanation of atomicity.
Mention per-location total order, out of thin air, and torn writes
guarantees. Mention C11 standard' memory model and one most important
FreeBSD additional requirement, that is aligned ordinary loads and
stores are atomic on processors.
The text is introductional and informal. Reference the C11 and
C++1{1,4,7} standards for authorative description.
In collaboration with: alc
Sponsored by: The FreeBSD Foundation (kib)
MFC after: 1 week
mjoras [Mon, 23 Oct 2017 15:43:38 +0000 (15:43 +0000)]
Move clear_unrhdr to tmpfs_free_tmp.
Clearing the unr in tmpfs_unmount is not correct. In the case of
multiple references to the tmpfs mount (e.g. when there are lookup
threads using it) it will not be the one to finish tmpfs_free_tmp. In
those cases tmpfs_free_node_locked will be the final one to execute
tmpfs_free_tmp, and until then the unr must be valid.
imp [Sun, 22 Oct 2017 22:52:23 +0000 (22:52 +0000)]
Create a shell script to build sys/boot on all the architectures.
One could run this from any directory, but it's designed to do
regression testing on sys/boot (it only tests on a subset of
architectures since all of them would take a lot longer and not help).
This will also ensure that future commits to sys/boot compile
everywhere.
jilles [Sun, 22 Oct 2017 20:01:07 +0000 (20:01 +0000)]
libc: Do not refer to _DefaultRuneLocale in ctype inlines
Referring to _DefaultRuneLocale causes this >4KB structure to be copied to
all executables that use <ctype.h> inlines (except PIE executables).
This only affects the case where thread local storage is available.
_CurrentRuneLocale cannot be NULL, so the check can be removed entirely.
_DefaultRuneLocale needs to remain available for now since libc++ uses it.
The __isctype inline in include/_ctype.h also refers to _DefaultRuneLocale
and remains available because it may still be used by third party software.
markj [Sun, 22 Oct 2017 19:17:25 +0000 (19:17 +0000)]
Address some miscellaneous issues in the CTF man page.
- Fix a number of typos.
- Replace some illumos-specific references.
- Note that a type definition of kind CTF_K_FUNCTION may be followed by
a null type identifier in order to provide 4-byte alignment for the
next type definition.
https://www.illumos.org/issues/8300
Prior to integrating the mdocml update to 1.14.1, fix issues found by
new version, especially the "new sentence, new line" style rule.
FreeBSD note: this revision merges only the changes to the CTF manual
page. The changes to the ZFS pages cannot be applied directly.
Reviewed by: Robert Mustacchi <rm@joyent.com>
Reviewed by: Toomas Soome <tsoome@me.com>
Approved by: Gordon Ross <gwr@nexenta.com>
Author: Yuri Pankov <yuri.pankov@nexenta.com>
trasz [Sun, 22 Oct 2017 10:35:29 +0000 (10:35 +0000)]
Add OID for the vm.overcommit sysctl. This makes it possible to remove
one call to sysctl(2) from jemalloc startup code. (That also requires
changes to jemalloc, but I plan to push those to upstream first.)
kib [Sun, 22 Oct 2017 08:11:45 +0000 (08:11 +0000)]
Remove the support for mknod(S_IFMT), which created dummy vnodes with
VBAD type.
FFS ffs_write() VOP catches such vnodes and panics, other VOPs do not
check for the type and their behaviour is really undefined. The
comment claims that this support was done for 'badsect' to flag bad
sectors, we do not have such facility in kernel anyway.
Reported by: Dmitry Vyukov <dvyukov@google.com>
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
imp [Sun, 22 Oct 2017 03:52:12 +0000 (03:52 +0000)]
Stopgap fix to the mistmatch between LOADER_GELI_SUPPORT and
LOADER_NO_GELI_SUPPORT. To disable geli support in the loader, define
LOADER_GELI_SUPPORT=no. Proper warnings for for old build options to
follow.
dim [Sat, 21 Oct 2017 19:14:45 +0000 (19:14 +0000)]
Pull in r316035 from upstream llvm trunk (by Tim Northover):
AArch64: account for possible frame index operand in compares.
If the address of a local is used in a comparison, AArch64 can fold
the address-calculation into the comparison via "adds".
Unfortunately, a couple of places (both hit in this one test) are not
ready to deal with that yet and just assume the first source operand
is a register.
This should fix an assertion failure while building the test suite of
www/firefox for AArch64.
dim [Sat, 21 Oct 2017 18:21:44 +0000 (18:21 +0000)]
After the import of libc++ 5.0.0, there is no need to disable building
libc++experimental.a on arm (r318654) and mips (r318859) anymore, since
upstream fixed the static assertions which would occur.
Noticed by: George Abdelmalik <gabdelmalik@uniridge.com.au>
PR: 223119
MFC after: 3 days
kib [Sat, 21 Oct 2017 17:28:12 +0000 (17:28 +0000)]
Check that the page which is freed as zeroed, indeed has all-zero content.
This catches some rare mysterious failures at the source. The check
is only performed on architectures which implement direct map, and
only enabled with option DIAGNOSTIC, similar to other costly
consistency checks.
Reviewed by: alc, markj
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
se [Sat, 21 Oct 2017 16:55:52 +0000 (16:55 +0000)]
Mention sysrc(8) as scripting interface for the modification of config
files. This is a follow up commit to r324721, which added sysrc(8) to
the SEE ALSO list.
Submitted by: Kurt Jaeger (lists at opsec.eu)
MFC after: 1 week
mmel [Sat, 21 Oct 2017 12:06:18 +0000 (12:06 +0000)]
Make elf_aux_info() as public libc function.
- Teach elf aux vector functions about newly added AT_HWCAP and AT_HWCAP2
vectors.
- Export _elf_aux_info() as new public libc function elf_aux_info(3)
The elf_aux_info(3) should be considered as FreeBSD counterpart of glibc
getauxval() with more robust interface.
Note:
We cannot name this new function as getauxval(), with glibc compatible
interface. Some ports autodetect its existence and then expects that all
Linux specific AT_<*> vectors are defined and implemented.
mmel [Sat, 21 Oct 2017 12:05:01 +0000 (12:05 +0000)]
Add AT_HWCAP2 ELF auxiliary vector.
- allocate value for new AT_HWCAP2 auxiliary vector on all platforms.
- expand 'struct sysentvec' by new 'u_long *sv_hwcap2', in exactly
same way as for AT_HWCAP.
bz [Fri, 20 Oct 2017 21:40:59 +0000 (21:40 +0000)]
With r181803 on 2008-08-17 23:27:27Z the first VIMAGE commit went into
HEAD. Enable VIMAGE in GENERIC kernels and some others (where GENERIC does
not exist) on HEAD.
Disable building LINT-VIMAGE with VIMAGE being default.
This should give it a lot more exposure in the run-up to 12 to help
us evaluate whether to keep it on by default or not.
We are also hoping to get better performance testing.
The feature can be disabled using nooptions.
Requested by: many
Reviewed by: kristof, emaste, hiren
X-MFC after: never
Relnotes: yes
Differential Revision: https://reviews.freebsd.org/D12639
markj [Fri, 20 Oct 2017 14:56:13 +0000 (14:56 +0000)]
Avoid the nbp lookup in the final loop iteration in flushbuflist().
The end of the loop must re-lookup the next buf since the bufobj lock
is dropped in the loop body. If the lookup fails, the loop is restarted.
This mechanism non-obviously also terminates the loop when the end of
the buf list is reached. Split up the two loops termination cases to
make the code a bit less fragile. No functional change intended.
kib [Fri, 20 Oct 2017 08:32:37 +0000 (08:32 +0000)]
Do not overwrite clean blocks on pageout.
If filesystem block size is less than the page size, it is possible
that the page-out run contains partially clean pages. E.g., the chunk
of the page might be bdwrite()-ed, or some thread performed bwrite()
on a buffer which references a chunk of the paged out page. As
result, the assertion added in r319975, which checked that all pages
in the run are dirty, does not hold on such filesystems.
One solution is to remove the assert, but it is undesirable, because
we do overwrite the valid on-disk content. I cannot provide a scenario
where such write would corrupt the file data, but I do not like it on
principle. Another, in my opinion proper, solution is to only write
parts of the pages still marked dirty. The patch implements this, it
skips clean blocks and only writes the dirty block runs.
Note that due to clustering, write one page might clean other pages in
the run, so the next write range must be calculated only after the
current range is written out.
More, due to a possible invalidation, and the fact that the object
lock is dropped and reacquired before the checks, it is possible that
the whole page-out pages run appears to consist of only clean pages.
For this reason, it is impossible to assert that there is some work
for the pageout method to do (i.e. assert that there is at least one
dirty page in the run). But such clearing can only occur due to
invalidation, and not due to a parallel write, because we own the
vnode lock exclusive.
Reported by: fsu
In collaboration with: pho
Reviewed by: alc, markj
Sponsored by: The FreeBSD Foundation
MFC after: 3 weeks
Differential revision: https://reviews.freebsd.org/D12668
hselasky [Fri, 20 Oct 2017 08:20:15 +0000 (08:20 +0000)]
The remote DMA TCP portspace selector, RDMA_PS_TCP, is used for both
iWarp and RoCE in ibcore. The selection of RDMA_PS_TCP can not be used
to indicate iWarp protocol use. Backport the proper IB device
capabilities from Linux upstream to distinguish between iWarp and
RoCE. Only allocate the additional socket required for iWarp for RDMA
IDs when at least one iWarp device present. This resolves
interopability issues between iWarp and RoCE in ibcore
mjg [Fri, 20 Oct 2017 03:38:58 +0000 (03:38 +0000)]
amd64: __exclusive_cache_line pv_chunks_mutex and pv_list_locks
Note that pv_list_locks is an array and currently it fits 2 locks per line.
Resizing it and/or putting more locks in different lines requires several tests.
mjg [Fri, 20 Oct 2017 00:30:35 +0000 (00:30 +0000)]
mtx: clean up locking spin mutexes
1) shorten the fast path by pushing the lockstat probe to the slow path
2) test for kernel panic only after it turns out we will have to spin,
in particular test only after we know we are not recursing
emaste [Thu, 19 Oct 2017 16:40:17 +0000 (16:40 +0000)]
psci: change bootverbose string to 'PSCI 0.2 compatible'
Prior to r324754 we treated PSCI 0.2 and 1.0 as identical, and r324754
extended that to include all PSCI 1.x revisions. Change the string
emitted under bootverbose to reference '0.2 compatible' to avoid
confusion when the system includes a later PSCI rev.
Discussed with: andrew
Sponsored by: The FreeBSD Foundation
avg [Thu, 19 Oct 2017 16:36:07 +0000 (16:36 +0000)]
remove spa_sync_on assert from spa_async_thread_vd
Unlike spa_async_thread that can get started only from spa_sync()
spa_async_thread_vd can get started from other contexts.
Additionally, spa_async_thread_vd does not really depend on
spa sync being enabled.
The incorrect assert could be triggered by importing a pool in the
read-only mode and then disconnecting one of its disks.
In this case spa_sync_on was false because the pool was read-only
and spa_async_thread_vd was started to handle SPA_ASYNC_REMOVE event.
Note: spa_async_thread_vd() currently exists only in FreeBSD, it was
split out of spa_async_thread() in r253990.
andrew [Thu, 19 Oct 2017 13:22:52 +0000 (13:22 +0000)]
Allow later PSCI revisions to also work. The latest ARM Trusted Firmware
reports version 1.1 so the check was failing. As thjis is a minor change
from 1.0, and future 1.x revisions are also expected to be backwards
compatible just ignore the minor revision in the init handler.
mav [Thu, 19 Oct 2017 09:01:15 +0000 (09:01 +0000)]
Relax per-ifnet cif_vrs list double locking in carp(4).
In all cases where cif_vrs list is modified, two locks are held: per-ifnet
CIF_LOCK and global carp_sx. It means to read that list only one of them
is enough to be held, so we can skip CIF_LOCK when we already have carp_sx.
This fixes kernel panic, caused by attempts of copyout() to sleep while
holding non-sleepable CIF_LOCK mutex.
alc [Thu, 19 Oct 2017 04:13:47 +0000 (04:13 +0000)]
Batch atomic updates to the number of active, inactive, and laundry
pages by vm_object_terminate_pages(). For example, for a "buildworld"
workload, this batching reduces vm_object_terminate_pages()'s average
execution time by 12%. (The total savings were about 11.7 billion
processor cycles.)
jhibbits [Thu, 19 Oct 2017 03:38:53 +0000 (03:38 +0000)]
Add some more devices to the MPC85XX-based configs
These devices bring the configs closer to a desktop-like (GENERIC) kernel
config.
* The Freescale DIU support was added to the config in r306358.
Without keyboard support video support is nearly pointless, so add ukbd and
ums.
* The AmigaOne X5000, and P1022 devboard, both use a variant of the ds1307 RTC
* cpufreq scaling is currently supported by the p1022. More SoCs will be added
eventually.
cy [Thu, 19 Oct 2017 03:17:50 +0000 (03:17 +0000)]
Anticongestion refinements for ntpd rc script. This reverts r324681
and checks if ntp leapfile needs fetching before entering into the
anticongestion sleep.
Unfortunately some ports still use their own sleeps so, this commit
doesn't address the complete problem which is compounded by every
port that uses its own anticongestion mechanism.
mjg [Thu, 19 Oct 2017 01:38:31 +0000 (01:38 +0000)]
sysctl: only take mem lock if oldlen is > 4 * PAGE_SIZE
The previous limit of just one page is hit by ps.
The entire mechanism should be reworked, if not whacked. It seems the intent
is to reduce kernel dos-ability - some handlers wire the amount of memory
passed here. Handlers should probably stop wiring in the first place or in
the worst case indicate they are doing so so that the check is done only if
necessary. It should also probably be a counter, not a lock.