kib [Wed, 30 Jan 2019 02:07:13 +0000 (02:07 +0000)]
i386: Merge PAE and non-PAE pmaps into same kernel.
Effectively all i386 kernels now have two pmaps compiled in: one
managing PAE pagetables, and another non-PAE. The implementation is
selected at cold time depending on the CPU features. The vm_paddr_t is
always 64bit now. As a result, the nx bit can be used on all capable CPUs.
Option PAE only affects the bus_addr_t: it is still 32bit for non-PAE
configs, for drivers compatibility. Kernel layout, esp. max kernel
address, low memory PDEs and max user address (same as trampoline
start) are now same for PAE and for non-PAE regardless of the type of
page tables used.
Non-PAE kernel (when using PAE pagetables) can handle physical memory
up to 24G now. Larger memory requires re-tuning the KVA consumers, so
the code instead caps the maximum at 24G. Unfortunately, a lot of
drivers do not use busdma(9) properly, so by default even the 4G barrier
is not easy to cross. Two tunables are added: hw.above4g_allow and
hw.above24g_allow. The first one is kept enabled for now to evaluate
the status on HEAD, the second is only for development use.
i386 now creates three freelists if there is any memory above 4G, to
allow proper bounce pages allocation. Also, VM_KMEM_SIZE_SCALE changed
from 3 to 1.
The PAE_TABLES kernel config option is retired.
In collaboration with: pho
Discussed with: emaste
Reviewed by: markj
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential revision: https://reviews.freebsd.org/D18894
kib [Tue, 29 Jan 2019 22:46:44 +0000 (22:46 +0000)]
Untangle jemalloc and mutexes initialization.
The need to use libc malloc(3) from some places in libthr always
caused issues. For instance, per-thread key allocation was switched to
use plain mmap(2) to get storage, because some third party mallocs
used keys for implementation of calloc(3).
Even more important, libthr calls calloc(3) during initialization of
pthread mutexes, and jemalloc uses pthread mutexes. Jemalloc provides
some way to both postpone the initialization, and to make
initialization to use specialized allocator, but this is very fragile
and often breaks. See the referenced PR for another example.
Add the small malloc implementation used by rtld, to libthr. Use it in
thr_spec.c and for mutexes initialization. This avoids the issues with
mutual dependencies between malloc and libthr in principle. The
drawback is that some more allocations are not interceptable for
alternate malloc implementations. There should be not too much memory
use from this allocator, and the alternative, direct use of mmap(2) is
obviously worse.
PR: 235211
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential revision: https://reviews.freebsd.org/D18988
mav [Tue, 29 Jan 2019 20:35:09 +0000 (20:35 +0000)]
Reimplement BIO_ORDERED handling in nvd(4).
This fixes BIO_ORDERED semantics while also improving performance by:
- sleeping also before BIO_ORDERED bio, as defined, not only after;
- not queueing BIO_ORDERED bio to taskqueue if no other bios running;
- waking up sleeping taskqueue explicitly rather than relying on polling.
On Samsung SSD 970 PRO this shows sync write latency, measured with
`diskinfo -wS`, reduction from ~2ms to ~1.1ms by not sleeping without
reason till next HZ tick.
On the same device ZFS pool with 8 ZVOLs synchronously writing 4KB blocks
shows ~950 IOPS instead of ~750 IOPS before. I suspect ZFS does not need
BIO_ORDERED on BIO_FLUSH at all, but that will be next question.
ae [Tue, 29 Jan 2019 11:18:41 +0000 (11:18 +0000)]
Fix the bug introduced in r342908 that causes problems with dynamic
handling for protocols without port numbers.
Since port numbers were uninitialized for protocols like ICMP/ICMPv6,
ipfw_chk() used some non-zero values to create dynamic states, and due
to this it failed to match replies with created states.
Reported by: Oliver Hartmann, Boris Lytochkin
Obtained from: Yandex LLC
X-MFC after: r342908
andrew [Tue, 29 Jan 2019 11:04:17 +0000 (11:04 +0000)]
Extract the coverage sanitizer KPI to a new file.
This will allow multiple consumers of the coverage data to be compiled
into the kernel together. The only requirement is that only one can be
registered at a given point in time; however, it is expected they will
only register when the coverage data is needed.
A new kernel config option COVERAGE is added. This will allow kcov to
become a module that can be loaded as needed, or compiled into the
kernel.
kevans [Tue, 29 Jan 2019 04:08:49 +0000 (04:08 +0000)]
bectl(8) test: Force destroy the zpool in cleanup
This is a wild guess as to why bectl tests failed once upon a time in CI,
given no apparent way to see a transcript of cleanup routines with Kyua. The
bectl tests construct a new, clean zpool for every test. The failure
indicated was because of a mount that was leftover from a previous test, but
the previous test had succeeded so it's not clear how the mount remained
leftover unless the `zpool get health ${pool}` had somehow failed.
mckusick [Mon, 28 Jan 2019 21:36:45 +0000 (21:36 +0000)]
This bug was introduced with the change to use softdep_bp_to_mp() in
January 2018 changes -r327723 and -r327821. The softdep_bp_to_mp()
function failed to include VFIFO as one of the valid cases.
Although fifos do not allocate blocks in the filesystem, they will
allocate blocks if they use extended attributes (such as ACLs). Thus,
softdep_bp_to_mp() needs to return a non-NULL mount pointer when
presented with a fifo vnode so that the soft updates write complete
will properly process the soft updates structures associated with the
extended attribute blocks. It was the failure to process these soft
updates structures, thus leaving them hanging off the buffer, which
led to the "panic: softdep_deallocate_dependencies: dangling deps"
when trying to clean up the buffer after it was written.
pkelsey [Mon, 28 Jan 2019 20:30:04 +0000 (20:30 +0000)]
Speed up non-status operations applied to a single interface
When performing a non-status operation on a single interface, it is
not necessary for ifconfig to build a list of all addresses in the
system, sort them, then iterate through them looking for the entry for
the single interface of interest. Doing so becomes increasingly
expensive as the number of interfaces in the system grows (e.g., in a
system with 1000+ vlan(4) interfaces).
pkelsey [Mon, 28 Jan 2019 20:26:09 +0000 (20:26 +0000)]
Don't re-evaluate ALTQ kernel configuration due to events on non-ALTQ interfaces
Re-evaluating the ALTQ kernel configuration can be expensive,
particularly when there are a large number (hundreds or thousands) of
queues, and is wholly unnecessary in response to events on interfaces
that do not support ALTQ as such interfaces cannot be part of an ALTQ
configuration.
bcr [Mon, 28 Jan 2019 19:54:58 +0000 (19:54 +0000)]
A few corrections and clarifications to r343406.
- Use "in" instead of "on" when referring to directory and UFS partition.
- Switch from hw.physmem to hw.realmem and add a description to
distinguish the two.
- Explain why the "df" command is having trouble displaying ZFS sizes
correctly. Add a bit more descriptive text to help explain why the output of
"zfs list -o space" should be used.
- Switch to vmstat instead of iostat display for systat(1) as it shows
more information on one screen. Describe what is displayed based on the
text of the man page. Change the list of the other values accordingly.
- Sort the flags to "zfs destroy" alphabetically.
tuexen [Mon, 28 Jan 2019 12:45:31 +0000 (12:45 +0000)]
Fix the detection of ECN-setup SYN-ACK packets.
RFC 3168 defines an ECN-setup SYN-ACK packet as one with the ECE flag
set and the CWR flag not set. The code was only checking if the ECE flag
is set. This patch adds the check to verify that the CWR flag is not
set.
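As an illustration only, the rule from RFC 3168 can be sketched as a
single mask comparison. The helper name is_ecn_setup_synack() is
hypothetical and is not the function touched by this commit; the flag
values are the standard TCP header bits.

```c
#include <stdbool.h>
#include <stdint.h>

/* Standard TCP header flag bits. */
#define TH_SYN 0x02
#define TH_ACK 0x10
#define TH_ECE 0x40
#define TH_CWR 0x80

/*
 * Hypothetical helper: true only for an ECN-setup SYN-ACK,
 * i.e. SYN and ACK set, ECE set, and CWR clear.
 */
static bool
is_ecn_setup_synack(uint8_t flags)
{
	if ((flags & (TH_SYN | TH_ACK)) != (TH_SYN | TH_ACK))
		return (false);
	/* Checking ECE alone would also accept ECE|CWR. */
	return ((flags & (TH_ECE | TH_CWR)) == TH_ECE);
}
```

Comparing against the combined ECE|CWR mask covers both conditions at
once, so a SYN-ACK with both bits set is rejected.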
avos [Mon, 28 Jan 2019 11:39:54 +0000 (11:39 +0000)]
rsu(4): do not ignore mgmtrate / mcastrate / ucastrate.
Enforce net80211 rates for control / management / multicast / EAPOL frames
and allow to override rate for unicast frames via ifconfig(8) 'ucastrate'
option; by default it still uses f/w rate adaptation for unicast frames.
kp [Mon, 28 Jan 2019 08:36:10 +0000 (08:36 +0000)]
pfctl: Point users to net.pf.request_maxcount if large requests are rejected
The kernel will reject very large tables to avoid resource exhaustion
attacks. Some users run into this limit with legitimate table
configurations.
The error message in this case was not very clear:
pf.conf:1: cannot define table nets: Invalid argument
pfctl: Syntax error in config file: pf rules not loaded
If a table definition fails we now check the request_maxcount sysctl,
and if we've tried to create more than that, point the user at
net.pf.request_maxcount:
pf.conf:1: cannot define table nets: too many elements.
Consider increasing net.pf.request_maxcount.
pfctl: Syntax error in config file: pf rules not loaded
kib [Sun, 27 Jan 2019 00:46:06 +0000 (00:46 +0000)]
Bump SPECNAMELEN to MAXNAMLEN.
This includes the bump for cdevsw d_version. Otherwise, the impact on
the ABI (not KBI) is surprisingly low. The most important affected
interfaces are devname(3) and ttyname(3) which already correctly handle
long names (and ttyname(3) should not be affected at all).
Still, due to the d_version bump, I argue that the change is not MFC-able.
Requested by: mmacy
Reviewed by: jhb
Sponsored by: The FreeBSD Foundation
Differential revision: https://reviews.freebsd.org/D18932
kib [Sun, 27 Jan 2019 00:37:52 +0000 (00:37 +0000)]
Remove now redundant ifunc relocation code which should have been
removed as part of r341441.
This call to reloc_non_plt() may crash if ifunc resolvers use the
needed libraries symbols since the pass over the needed libs
relocation is not yet done. The change in r341441 ensures the right
relocation order otherwise.
se [Sat, 26 Jan 2019 22:24:15 +0000 (22:24 +0000)]
Slightly improve previous commit that silenced a Clang Scan warning.
The strdup() call does not take advantage of the known length of the
source string. Replace by malloc() and memcpy() utilizing the
pre-calculated string length.
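A minimal sketch of the pattern, assuming the caller already holds the
length of the source string (excluding the terminating NUL). The name
dup_known_len() is made up for this example and does not appear in the
commit.

```c
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical helper: duplicate a string whose length is already
 * known, so no second strlen() pass (as done inside strdup()) is needed.
 */
static char *
dup_known_len(const char *src, size_t len)
{
	char *dst;

	dst = malloc(len + 1);		/* +1 for the terminating NUL */
	if (dst == NULL)
		return (NULL);
	memcpy(dst, src, len);
	dst[len] = '\0';
	return (dst);
}
```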
marius [Sat, 26 Jan 2019 21:35:51 +0000 (21:35 +0000)]
- In _iflib_fl_refill(), don't mark an RX buffer as available in the
corresponding bitmap before adding an mbuf has actually succeeded.
Previously, m_gethdr(M_NOWAIT, ...) failing caused a "hole" in the
RX ring but not in its bitmap. One implication of such a hole was
that in a subsequent call to _iflib_fl_refill() with the RX buffer
accounting still indicating another reclaimable buffer, bit_ffc(3)
nevertheless returned -1 in frag_idx which in turn caused havoc
when used as an index. Thus, additionally assert that frag_idx is
0 or greater.
Another possible consequence of a hole in the RX ring was a NULL-
dereference when trying to use the unallocated mbuf, for example
in iflib_rxd_pkt_get().
While at it, make the variable declarations in _iflib_fl_refill()
conform to style(9) and remove redundant checks already performed
by bit_ffc{,_at}(3).
- In iflib_queues_alloc(), don't pass redundant M_ZERO to bit_alloc(3).
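The ordering described in the first item can be sketched in userland
as follows. This is not iflib code: struct ring, first_clear() and
refill_one() are hypothetical stand-ins, malloc() stands in for
m_gethdr(), and an 8-bit mask stands in for the bitstring(3) bitmap.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

#define RING_SIZE 8

/* Hypothetical stand-in for a free list with its allocation bitmap. */
struct ring {
	void	*bufs[RING_SIZE];
	uint8_t	 used;		/* bit i set => bufs[i] is populated */
};

/* Return the index of the first clear bit, or -1 if none is clear. */
static int
first_clear(uint8_t map)
{
	for (int i = 0; i < RING_SIZE; i++)
		if ((map & (1u << i)) == 0)
			return (i);
	return (-1);
}

/*
 * Refill one slot. The bitmap is updated only after the allocation
 * has succeeded, so a failure cannot leave a "hole" that the bitmap
 * claims is populated.
 */
static bool
refill_one(struct ring *r, bool simulate_failure)
{
	void *buf;
	int idx;

	idx = first_clear(r->used);
	if (idx < 0)
		return (false);
	buf = simulate_failure ? NULL : malloc(64);
	if (buf == NULL)
		return (false);		/* bitmap left untouched */
	r->bufs[idx] = buf;
	r->used |= (uint8_t)(1u << idx);
	return (true);
}
```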
se [Sat, 26 Jan 2019 20:43:28 +0000 (20:43 +0000)]
Fix potential buffer overflow and undefined behavior.
The buffer allocated in read_chat() could be 1 element too short, if the
chatstr parameter passed in is 1 or 3 characters long (e.g. "a" or "a b").
The allocation of the pointer array does not account for the terminating
NULL pointer in that case.
Overlapping source and destination strings are undefined in strcpy().
Instead of moving a string to the left by one character just increment the
char pointer before it is assigned to the results array.
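The sizing problem can be sketched as below, assuming words are
separated by at least one character, so a string of length len holds
at most (len + 1) / 2 words. The helper chat_slots_needed() is
hypothetical and is not the code in read_chat().

```c
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper: number of char * slots needed for the result
 * array, including one slot for the terminating NULL pointer.
 */
static size_t
chat_slots_needed(const char *chatstr)
{
	size_t len, max_words;

	len = strlen(chatstr);
	max_words = (len + 1) / 2;	/* e.g. "a b" (len 3) holds 2 words */
	return (max_words + 1);		/* +1 for the NULL terminator */
}
```

Under that assumption "a" needs 2 slots and "a b" needs 3, which are
the two cases named above.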
bz [Sat, 26 Jan 2019 17:52:12 +0000 (17:52 +0000)]
Fix logic errors in iwm_pcie_load_firmware_chunk introduced in r314065.
* There's no reason to have a while() loop here, because:
- if msleep returns 0, that means we were woken up by the interrupt handler,
and we are going to exit immediately as sc_fw_chunk_done will now be 1
(there is nothing else that sleeps on sc_fw.)
- if msleep doesn't return 0 (i.e. it returned ETIMEDOUT) then we will
exit immediately because of the if-test.
So, just use a single msleep() and then check sc_fw_chunk_done as before.
* The comment said we were sleeping for 5 seconds, but the msleep was only
for 1. Before r314065, this was 1 second and so was the comment,
and in that commit the comment was changed and the function call wasn't.
Possibly fixes failures to initialize uCode on certain devices.
oshogbo [Sat, 26 Jan 2019 14:10:49 +0000 (14:10 +0000)]
libcasper: do not run registered exit functions
The Casper library should not use the exit(3) function, because
applications may register exit functions before setting Casper up.
Casper doesn't depend on any registered exit function, so it is safe to
change this.
mckusick [Sat, 26 Jan 2019 05:35:24 +0000 (05:35 +0000)]
Expand DDB's set of printable soft dependency data structures. The
set of known soft dependency data structures now includes: sd_worklist,
sd_inodedep, sd_allocdirect, sd_allocindir, and sd_mkdir. DDB can
also print lists of sd_allinodedeps, sd_mkdir_list, and sd_workhead.
The sd_workhead script is useful for listing all the dependencies
associated with a buffer, e.g. bp->b_dep.
Prefix the soft dependency show names with sd_ so that they sort
together when listed by DDB's "show help" and to distinguish them
from other data structures printable by DDB.
pfg [Fri, 25 Jan 2019 22:22:29 +0000 (22:22 +0000)]
ext2fs: Add some extra consistency checks for the superblock.
Maliciously formed, or badly corrupted, filesystems can cause kernel
panics. In general, such acts of foot-shooting can only be accomplished
by root, but in a world with VM images that is moving towards automated
mounts it is important to have some form of prevention.
Reported by: Christopher Krah, Thomas Barabosch, and Jan-Niclas Hilgert
of Fraunhofer FKIE.
Incidentally this should also fix a memory corruption issue reported by
Dr Silvio Cesare of InfoSect.
Huge thanks to all researchers for making us aware of the issue.
admbug: 872, 891
Reviewed by: fsu
Obtained from: NetBSD (with minor changes)
MFC after: 3 days
jhb [Fri, 25 Jan 2019 20:54:18 +0000 (20:54 +0000)]
Fix a few more places to handle ofld tx queues for RATELIMIT.
- Drain offload transmit queues when RATELIMIT is enabled but
TCP_OFFLOAD is not.
- Expose the per-VI nofldtxq and first_ofld_txq sysctls when
RATELIMIT is enabled but TCP_OFFLOAD is not.
- Clear offload transmit queue stats as part of a 'cxgbetool clearstats'
request when RATELIMIT is enabled but TCP_OFFLOAD is not.
mckusick [Fri, 25 Jan 2019 20:07:18 +0000 (20:07 +0000)]
Allow tunefs to include '_' as a legal character in label names
to make it consistent with newfs. Document the legality of '_'
in label names in both tunefs(8) and newfs(8).
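A sketch of the character rule only, assuming a label may consist of
alphanumerics and '_'. The function label_is_valid() is hypothetical,
and any length limit enforced by tunefs or newfs is not modeled here.

```c
#include <ctype.h>
#include <stdbool.h>

/*
 * Hypothetical check: every character of the label must be
 * alphanumeric or '_'. Length limits are deliberately not checked.
 */
static bool
label_is_valid(const char *label)
{
	for (const char *p = label; *p != '\0'; p++) {
		if (!isalnum((unsigned char)*p) && *p != '_')
			return (false);
	}
	return (true);
}
```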
trasz [Fri, 25 Jan 2019 17:09:26 +0000 (17:09 +0000)]
Comment out the default sh(1) aliases for root, introduced in r343416.
The rest of this stuff is still to be discussed, but I think at this
point we have the agreement that the aliases should go.
gallatin [Fri, 25 Jan 2019 15:02:18 +0000 (15:02 +0000)]
Fix an iflib driver unload panic introduced in r343085
The new loop to sync and unload descriptors was indexed
by "i", rather than "j". The panic was caused by "i"
being advanced rather than "j", and eventually becoming
out of bounds.
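The shape of the bug can be illustrated with a generic nested loop.
The types and names below are hypothetical and are not taken from
iflib.

```c
#include <stddef.h>

#define NQUEUES	2
#define NDESC	4

/* Hypothetical stand-in for per-queue descriptor state. */
struct queue {
	int loaded[NDESC];
};

/* Unload every loaded descriptor; returns how many were unloaded. */
static size_t
unload_all(struct queue *qs)
{
	size_t n = 0;

	for (size_t i = 0; i < NQUEUES; i++) {
		/*
		 * The inner loop must test and advance j. Advancing i
		 * here instead would walk qs[] past its end.
		 */
		for (size_t j = 0; j < NDESC; j++) {
			if (qs[i].loaded[j]) {
				qs[i].loaded[j] = 0;
				n++;
			}
		}
	}
	return (n);
}
```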
Reviewed by: kib
MFC after: 3 days
Sponsored by: Netflix
emaste [Fri, 25 Jan 2019 14:46:13 +0000 (14:46 +0000)]
clang: default to DWARF 4 as of FreeBSD 13
FreeBSD previously defaulted to DWARF 2 because several tools (gdb,
ctfconvert, etc.) did not support later versions. These have either
been fixed or are deprecated.
Note that gdb 6 still exists but has been moved out of $PATH into
/usr/libexec and is intended only for use by crashinfo(8). The kernel
build sets the DWARF version explicitly via -gdwarf2, so this should
have no effect there.
PR: 234887 [exp-run]
Reviewed by: markj
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D17930
tuexen [Fri, 25 Jan 2019 13:57:09 +0000 (13:57 +0000)]
Fix a bug in the restart window computation of TCP New Reno
When implementing support for IW10, an update in the computation
of the restart window used after an idle phase was missed. To
minimize code duplication, implement the logic in tcp_compute_initwnd()
and call it. This fixes a bug in NewReno, which was not aware of
IW10.
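For reference, the window sizes involved can be sketched from the RFC
formulas: RFC 6928 gives IW = min(10*MSS, max(2*MSS, 14600)) and RFC
5681 gives the restart window as RW = min(IW, cwnd). The functions
below are hypothetical and are not tcp_compute_initwnd(), which may
also honor system configuration.

```c
#include <stdint.h>

static uint32_t
min_u32(uint32_t a, uint32_t b)
{
	return (a < b ? a : b);
}

static uint32_t
max_u32(uint32_t a, uint32_t b)
{
	return (a > b ? a : b);
}

/* RFC 6928 initial window in bytes for a given MSS. */
static uint32_t
iw10_bytes(uint32_t mss)
{
	return (min_u32(10 * mss, max_u32(2 * mss, 14600)));
}

/* RFC 5681 restart window after an idle period: min(IW, cwnd). */
static uint32_t
restart_window(uint32_t cwnd, uint32_t mss)
{
	return (min_u32(iw10_bytes(mss), cwnd));
}
```

Routing both the initial and the restart computation through one
helper is what keeps the two values from drifting apart.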
kp [Fri, 25 Jan 2019 01:06:06 +0000 (01:06 +0000)]
pf: Fix use-after-free of counters
When cleaning up a vnet we free the counters in V_pf_default_rule and
V_pf_status from shutdown_pf(), but we can still use them later, for example
through pf_purge_expired_src_nodes().
Free them as the very last operation, as they rely on nothing else themselves.
ngie [Thu, 24 Jan 2019 20:35:58 +0000 (20:35 +0000)]
Fix a typo/wordsmith a description modified in r343407
r343407 accidentally introduced a typo (folling -> following). While
reading the change out loud, I realized that the original sentence was
wordy, almost sounding like a run-on sentence.
Improve the flow by splitting up the two thoughts into two distinct sentence
fragments.
se [Thu, 24 Jan 2019 18:39:45 +0000 (18:39 +0000)]
Silence Clang Scan warnings regarding the use of strcpy().
While these warnings are false positives, the use of strdup() instead of
malloc() and strcpy() simplifies and clarifies the code.
While checking the remaining uses of strcpy and strcat I noticed an
assignment of a strlen() to a variable "s", whose value needs to be
preserved for use in later output routines (where it is used to allocate
a buffer). I do not think that the value of "s" will come out lower than
its correct value and thus there is no risk of a buffer overflow, in the
general case, but a specially crafted argument might lead to an overflow.
The bogus assignment to "s" is removed since this value was only used a
single time in the following malloc() call, which has been removed.
bcr [Thu, 24 Jan 2019 18:13:23 +0000 (18:13 +0000)]
Add ZFS usage tips to freebsd-tips.
Add a bunch of examples on how to use ZFS features like:
- listing available space,
- setting and displaying a userquota,
- displaying pool I/O statistics and pool history,
- displaying the compression ratio for a dataset,
- various list options (sorting, removing headers),
- performing a dry-run of a snapshot delete,
- removing a range of snapshots,
- setting a custom property,
- preventing removal of a snapshot with ZFS holds,
- permission sets for zfs send/receive.
Additionally, clarify the existing examples a bit when
it comes to displaying space by mentioning UFS explicitly.
Other examples include displaying I/O in top(1), querying
sysctl(8) for active CPUs and available RAM. Mention systat(1)
and its options, too.
While here, reformat the example to upload a dmesg(8) a bit
to wrap properly.
Thanks to Allan Jude for his help with some of the ZFS examples.
hselasky [Thu, 24 Jan 2019 08:34:13 +0000 (08:34 +0000)]
Fix refcounting leaks in IPv6 MLD code leading to loss of IPv6
connectivity.
Looking at past changes in this area like r337866, some refcounting
bugs have been introduced, one by one. For example, calling
in6m_disconnect() and in6m_rele_locked() in mld_v1_process_group_timer()
where previously no disconnect nor refcount decrement was done.
Calling in6m_disconnect() when it shouldn't causes IPv6 solicitation to no
longer work, because all the multicast addresses receiving the solicitation
messages are now deleted from the network interface.
This patch reverts some recent changes while improving the MLD
refcounting and concurrency model after the MLD code was converted
to using EPOCH(9).
List changes:
- All CK_STAILQ_FOREACH() macros are now properly enclosed into
EPOCH(9) sections. This simplifies assertion of locking inside
in6m_ifmultiaddr_get_inm().
- Corrected bad use of in6m_disconnect() leading to loss of IPv6
connectivity for MLD v1.
- Factored out checks for valid inm structure into
in6m_ifmultiaddr_get_inm().
hselasky [Thu, 24 Jan 2019 08:25:02 +0000 (08:25 +0000)]
When detaching a network interface drain the workqueue freeing the inm's
because the destructor will access the if_ioctl() callback in the ifnet
pointer which is about to be freed. This prevents use-after-free.