jtl [Tue, 14 Aug 2018 17:59:42 +0000 (17:59 +0000)]
MFC r337781:
Make the IPv6 fragment limits be global, rather than per-VNET, limits.
The IPv6 reassembly fragment limit is based on the number of mbuf clusters,
which are a global resource. However, the limit is currently applied
on a per-VNET basis. Given enough VNETs (or given sufficient customization
on enough VNETs), it is possible that the sum of all the VNET fragment
limits will exceed the number of mbuf clusters available in the system.
Given the fact that the fragment limits are intended (at least in part) to
regulate access to a global resource, the IPv6 fragment limit should
be applied on a global basis.
Note that it is still possible to disable fragmentation for a particular
VNET by setting the net.inet6.ip6.maxfragpackets sysctl to 0 for that
VNET. In addition, it is now possible to disable fragmentation globally
by setting the net.inet6.ip6.maxfrags sysctl to 0.
Approved by: so
Security: FreeBSD-SA-18:10.ip
Security: CVE-2018-6923
jtl [Tue, 14 Aug 2018 17:54:39 +0000 (17:54 +0000)]
MFC r337780:
Implement a limit on on the number of IPv4 reassembly queues per bucket.
There is a hashing algorithm which should distribute IPv4 reassembly
queues across the available buckets in a relatively even way. However,
if there is a flaw in the hashing algorithm which allows a large number
of IPv4 fragment reassembly queues to end up in a single bucket, a per-
bucket limit could help mitigate the performance impact of this flaw.
Implement such a limit, with a default of twice the maximum number of
reassembly queues divided by the number of buckets. Recalculate the
limit any time the maximum number of reassembly queues changes.
However, allow the user to override the value using a sysctl
(net.inet.ip.maxfragbucketsize).
Approved by: so
Security: FreeBSD-SA-18:10.ip
Security: CVE-2018-6923
jtl [Tue, 14 Aug 2018 17:52:06 +0000 (17:52 +0000)]
MFC r337778:
Add a global limit on the number of IPv4 fragments.
The IP reassembly fragment limit is based on the number of mbuf clusters,
which are a global resource. However, the limit is currently applied
on a per-VNET basis. Given enough VNETs (or given sufficient customization
of enough VNETs), it is possible that the sum of all the VNET limits
will exceed the number of mbuf clusters available in the system.
Given the fact that the fragment limit is intended (at least in part) to
regulate access to a global resource, the fragment limit should
be applied on a global basis.
VNET-specific limits can be adjusted by modifying the
net.inet.ip.maxfragpackets and net.inet.ip.maxfragsperpacket
sysctls.
To disable fragment reassembly globally, set net.inet.ip.maxfrags to 0.
To disable fragment reassembly for a particular VNET, set
net.inet.ip.maxfragpackets to 0.
Approved by: so
Security: FreeBSD-SA-18:10.ip
Security: CVE-2018-6923
jtl [Tue, 14 Aug 2018 17:46:54 +0000 (17:46 +0000)]
MFC r337776:
Improve IPv6 reassembly performance by hashing fragments into buckets.
Currently, all IPv6 fragment reassembly queues are kept in a flat
linked list. This has a number of implications. Two significant
implications are: all reassembly operations share a common lock,
and it is possible for the linked list to grow quite large.
Improve IPv6 reassembly performance by hashing fragments into buckets,
each of which has its own lock. Calculate the hash key using a Jenkins
hash with a random seed.
Approved by: so
Security: FreeBSD-SA-18:10.ip
Security: CVE-2018-6923
jtl [Tue, 14 Aug 2018 17:43:11 +0000 (17:43 +0000)]
MFC r337775:
Improve hashing of IPv4 fragments.
Currently, IPv4 fragments are hashed into buckets based on a 32-bit
key which is calculated by (src_ip ^ ip_id) and combined with a random
seed. However, because an attacker can control the values of src_ip
and ip_id, it is possible to construct an attack which causes very
deep chains to form in a given bucket.
To ensure more uniform distribution (and lower predictability for
an attacker), calculate the hash based on a key which includes all
the fields we use to identify a reassembly queue (dst_ip, src_ip,
ip_id, and the ip protocol) as well as a random seed.
brooks [Tue, 14 Aug 2018 16:22:04 +0000 (16:22 +0000)]
MFC r337508:
Terminate filter_create_ext() args with NULL, not 0.
filter_create_ext() is documented to take a NULL terminated set of
arguments. 0 is promoted to an int so this would fail on 64-bit
systems if the value was not passed in a register. On all currently
supported 64-bit architectures it is.
kib [Sun, 12 Aug 2018 08:45:23 +0000 (08:45 +0000)]
MFC r336569:
Move mostly useless examples binaries from OFED, as well as the Subnet
Manager, under the new option WITH_OFED_EXTRA, disabled by default.
truckman [Sun, 12 Aug 2018 03:22:28 +0000 (03:22 +0000)]
MFC r336855
Fix the long term ULE load balancer so that it actually works. The
initial call to sched_balance() during startup is meant to initialize
balance_ticks, but does not actually do that since smp_started is
still zero at that time. Since balance_ticks does not get set,
there are no further calls to sched_balance(). Fix this by setting
balance_ticks in sched_initticks() since we know the value of
balance_interval at that time, and eliminate the useless startup
call to sched_balance(). We don't need to randomize the intial
value of balance_ticks.
Since there is now only one call to sched_balance(), we can hoist
the tests at the top of this function out to the caller and avoid
the overhead of the function call when running a SMP kernel on UP
hardware.
kevans [Sun, 12 Aug 2018 00:33:24 +0000 (00:33 +0000)]
MFC r337331: efirt: Don't enter EFI context early, convert addrs to KVA
efi_enter here was needed because efi_runtime dereference causes a fault
outside of EFI context, due to runtime table living in runtime service
space. This may cause problems early in boot, though, so instead access it
by converting paddr to KVA for access.
While here, remove the other direct PHYS_TO_DMAP calls and the explicit DMAP
requirement from efidev.
kevans [Fri, 10 Aug 2018 01:43:05 +0000 (01:43 +0000)]
MFC r337549: libnv: Remove -I${SRCTOP}/sys
This should have been done as part of r336019 -- including ${SRCTOP}/sys is
not a good business model for something that's build in legacy/bootstrap
stages.
Beyond that, libnv seems to build quite alright as legacy, part of
buildworld, and standalone without. Axe it.
davidcs [Thu, 9 Aug 2018 01:17:35 +0000 (01:17 +0000)]
MFC r336695
Remove support for QLNX_RCV_IN_TASKQ - i.e., Rx only in TaskQ.
Added support for LLDP passthru
Upgrade ECORE to version 8.33.5.0
Upgrade STORMFW to version 8.33.7.0
Added support for SRIOV
davidcs [Thu, 9 Aug 2018 00:39:39 +0000 (00:39 +0000)]
MFC r336438
Fixes for the following issues:
1. Fix taskqueues drain/free to fix panic seen when interface is being
bought down and in parallel asynchronous link events happening.
2. Fix bxe_ifmedia_status()
Submitted by: Vaishali.Kulkarni@cavium.com and Anand.Khoje@cavium.com
r320280:
packages: Allow stageworld/stagekernel to run with make jobs.
r320281:
packages: Allow staging world/kernel in parallel.
r320282:
packages: Allow creating kernel/world packages in parallel.
r320283:
packages: Allow actually building individual world packages in parallel.
r320284:
packages: Parallelize individual kernel packaging.
r320285:
Expose only the create-packages-* targets since they set needed
DEST/DIRDIR.
r320692:
Fix create-kernel-packages with multiple BUILDKERNELS after r320284
r322362:
Indent nested conditionals for readability.
r322401:
Avoid creating kernel-dbg.txz distribution sets and kernel-debug packages
when MK_DEBUG_FILES is 'no'.
r322402:
Fix indentation from r322401.
r336181:
Fix parsing of create-kernel-packages
MFC r331098 (by melifaro):
Fix outgoing TCP/UDP packet drop on arp/ndp entry expiration.
Current arp/nd code relies on the feedback from the datapath indicating
that the entry is still used. This mechanism is incorporated into the
arpresolve()/nd6_resolve() routines. After the inpcb route cache
introduction, the packet path for the locally-originated packets changed,
passing cached lle pointer to the ether_output() directly. This resulted
in the arp/ndp entry expire each time exactly after the configured max_age
interval. During the small window between the ARP/NDP request and reply
from the router, most of the packets got lost.
Fix this behaviour by plugging datapath notification code to the packet
path used by route cache. Unify the notification code by using single
inlined function with the per-AF callbacks.
MFC r336132:
Add "record-state", "set-limit" and "defer-action" rule options to ipfw.
"record-state" is similar to "keep-state", but it doesn't produce implicit
O_PROBE_STATE opcode in a rule. "set-limit" is like "limit", but it has the
same feature as "record-state", it is single opcode without implicit
O_PROBE_STATE opcode. "defer-action" is targeted to be used with dynamic
states. When rule with this opcode is matched, the rule's action will
not be executed, instead dynamic state will be created. And when this
state will be matched by "check-state", then rule action will be executed.
This allows create a more complicated rulesets.
dab [Tue, 7 Aug 2018 14:39:00 +0000 (14:39 +0000)]
MFC r336761 & r336781:
Allow a EVFILT_TIMER kevent to be updated.
If a timer is updated (re-added) with a different time period
(specified in the .data field of the kevent), the new time period has
no effect; the timer will not expire until the original time has
elapsed. This violates the documented behavior as the kqueue(2) man
page says (in part) "Re-adding an existing event will modify the
parameters of the original event, and not result in a duplicate
entry."
This modification, adapted from a patch submitted by cem@ to PR214987,
fixes the kqueue system to allow updating a timer entry. The kevent
timer behavior is changed to:
* When a timer is re-added, update the timer parameters to and
re-start the timer using the new parameters.
* Allow updating both active and already expired timers.
* When the timer has already expired, dequeue any undelivered events
and clear the count of expirations.
All of these changes address the original PR and also bring the
FreeBSD and macOS kevent timer behaviors into agreement.
A few other changes were made along the way:
* Update the kqueue(2) man page to reflect the new timer behavior.
* Fix man page style issues in kqueue(2) diagnosed by igor.
* Update the timer libkqueue system test to test for the updated
timer behavior.
* Fix the (test) libkqueue common.h file so that it includes
config.h which defines various HAVE_* feature defines, before the
#if tests for such variables in common.h. This enables the use of
the actual err(3) family of functions.
* Fix the usages of the err(3) functions in the tests for incorrect
type of variables. Those were formerly undiagnosed due to the
disablement of the err(3) functions (see previous bullet point).
jtl [Mon, 6 Aug 2018 17:41:53 +0000 (17:41 +0000)]
MFC r337384:
Address concerns about CPU usage while doing TCP reassembly.
Currently, the per-queue limit is a function of the receive buffer
size and the MSS. In certain cases (such as connections with large
receive buffers), the per-queue segment limit can be quite large.
Because we process segments as a linked list, large queues may not
perform acceptably.
The better long-term solution is to make the queue more efficient.
But, in the short-term, we can provide a way for a system
administrator to set the maximum queue size.
We set the default queue limit to 100. This is an effort to balance
performance with a sane resource limit. Depending on their
environment, goals, etc., an administrator may choose to modify this
limit in either direction.
Reviewed by: jhb
Approved by: so
Security: FreeBSD-SA-18:08.tcp
Security: CVE-2018-6922
kevans [Mon, 6 Aug 2018 03:58:56 +0000 (03:58 +0000)]
MFC r336919, r336924
r336919:
efirt: Add tunable to allow disabling EFI Runtime Services
Leading up to enabling EFIRT in GENERIC, allow runtime services to be
disabled with a new tunable: efi.rt_disabled. This makes it so that EFIRT
can be disabled easily in case we run into some buggy UEFI implementation
and fail to boot.
r336924:
Follow up to r336919 and r336921: s/efi.rt_disabled/efi.rt.disabled/
The latter matches the rest of the tree better [0]. The UPDATING entry has
been updated to reflect this, and the new tunable is now documented in
loader(8) [1].
Relevant vendor changes:
Fix issue #948: out-of-bounds read in lha_read_data_none()
MFH r336854:
Sync libarchive with vendor.
Important vendor changes:
PR #993: Chdir to -C directory for metalog processing
OSS-Fuzz #4969: Check size of the extended time field in zip archives
PR #973: Record informational compression level in gzip header
kevans [Sat, 4 Aug 2018 22:15:05 +0000 (22:15 +0000)]
MFC r336152-r336154, r336157
r336152:
subr_hints: Use goto/label instead of series of conditionals
r336153:
subr_hints: Convert some bool-like ints to bools
r336154:
subr_hints: Skip static_env and static_hints if they don't contain hints
This is possible because, well, they're static. Both the dynamic environment
and the MD-environment (generally loader(8) environment) can potentially
have room for new variables to be set, and thus do not receive this
treatment.
r336157:
kern_environment: bool'itize dynamic_kenv; fix small style(9) nit
As an aside- this has been slightly altered from the version in head to keep
the MD and config-static environments mutually exclusive by default.
This difference is a one-line change in init_static_kenv to setup the MD
environment if the config-static environment is empty or if
loader_env.disabled is explicitly set to 0.
r335998:
kern_environment: use any provided environments, evict hintmode/envmode
At the moment, hintmode and envmode are used to indicate whether static
hints or static env have been provided in the kernel config(5) and the
static versions are mutually exclusive with loader(8)-provided environment.
hintmode *can* be reconfigured later to pull from the dynamic environment,
thus taking advantage of the loader(8) or post-kmem environment setting.
This changeset fixes both problems at once to move us from a semi-confusing
state to a consistent state: if an environment file, hints file, or
loader(8) environment are provided, we use them in a well-known order of
precedence:
Once the dynamic environment is setup this becomes a moot point. The
loader(8) and static environments are merged (respecting the above order of
precedence), and the static hints are merged in on an as-needed basis after
the dynamic environment has been setup.
Hints lookup are changed to respect all of the above. Before the dynamic
environment is setup, lookups use the above-mentioned order and fallback to
the next environment if a matching hint is not found. Once the dynamic
environment is setup, that is used on its own since it captures all of the
above information plus any dynamic kenv settings that came up later in boot.
The following tangentially related changes were made to res_find:
- A hintp cookie is now passed in so that related searches continue using
the chain of environments (or dynamic environment) without relying on
global state
- All three environments will be searched if they actually have valid hints
to use, rather than just choosing the first environment that actually had
a hint and rolling with that only
The hintmode sysctl has been ripped out. static_{env,hints}.disabled are
still honored and will disable their respective environments from being used
for hint lookups and from being merged into the dynamic environment, as
expected.
r336019:
config(8): De-dupe hint/env vars within a single file
r335653 flipped the order in which hints/env files are concatenated to match
the order in which vars are processed by the kernel. This is the other
hammer to drop.
Use nv(9) to de-dupe entries within a single `hint` or `env` file, using the
latest value specified for a key. This leaves some duplicates if a variable
is specified in multiple hint/env files or via `envvar` in a kernel config,
but the reversed order of concatenation (from r335653) makes this a
non-issue as the latest-specified version will be seen first.
This change also silently rewrote hint bits to use the same sanitization
process that ian@ wrote for r335642. To the kernel, hints and env vars are
basically the same thing through early boot, then get merged into the
dynamic environment once kmem becomes available and the dynamic environment
is created. They should be subjected to the same restrictions.
libnv has been added to -legacy for the time being to support the build of
config(8) with the new cnvlist API.
r336026:
config(8): Fix broken ABI
r336019 introduced ${SRCTOP}/sys to the include paths in order to pull in a
new sys/{c,}nv.h. This is wrong, because the build tree's ABI isn't
guaranteed to match what's running on the host system.
Fix instead by removing -I${SRCTOP}/sys and installing the libnv headers
with `make -C lib/libnv includes`... this may or may not get re-worked in
the future so that a userland lib isn't installing includes from sys/.
r336036:
kern_environment: Fix SYSINIT ordering
The dynamic environment was being initialized at SI_SUB_KMEM, SI_ORDER_ANY.
I added the hint-merging at SI_SUB_KMEM, SI_ORDER_ANY as well in r335998 -
this can only work by coincidence.
Re-do both to operate at SI_SUB_KMEM + 1, SI_ORDER_FIRST and SI_ORDER_SECOND
respectively to be safe. It's sufficiently obfuscated away as to when in
SU_SUB_KMEM malloc will be available, and the dynamic environment cannot be
relied upon there anyways since it's initialized at SI_ORDER_ANY.
r336217:
kern_environment: Give the static environment a chance to disable MD env
This variable has been given the name "loader_env.disabled" as it's the
primary way most people will have an MD environment. This restores the
previously-default behavior of ignoring the loader(8) environment, which may
be useful for vendor distributions or other scenarios where inheriting the
loader environment may be considered a security issue or potentially
breaking of a more locked-down environment.
As the change to config(5) indicates, disabling the loader environment
should not be a choice made lightly since it may provide ACPI hints and
other useful things that the system can rely on to boot.
An UPDATING entry has been added to mention an upgrade path for those that
may have relied on the previous behavior.
r336335 by arichardson:
No longer install sys/nv.h and sys/cnv.h in lib/libnv/Makefile
Use tools/build/Makefile to install the headers into ${WORLDTMP}/legacy
instead. Compared to r336026 this has the minor advantage that it avoids
unncessary header installation when building the non-bootstrap libnv.
r336337:
Unconditionally build libnv in legacy
Rather than using a config(8) built from new tree linking libnv built on
host.
r336415:
config(8): Add compatibility shims for r335998
Plumb the %VERSREQ from Makefile.<arch> through to the rest of config(8).
We've recorded the config(8) version that we're calling "the end of
envmode and hintmode," and we'll write them out for earlier versions. Later
kernel version bumps will remove envmode/hintmode from the kernel as needed,
which is OK since the current kernel does not use them at all.
These compatibility shims really need to go away when the major version
rolls over...
r336416:
Fix GCC 4.2 build after r336415, proper declaration and prototype
Relnotes: yes (maybe) [The loader environment may now be used with
the config-static environment by specifying loader_env.disabled=0 in the
config-static environment]
mav [Sat, 4 Aug 2018 00:34:15 +0000 (00:34 +0000)]
MFC r336590:
Stop further SCSI recovery attempts after one has failed.
We've got a set of probably damaged hard disks, reporting 0x04,0x02
("Logical unit not ready, initializing command required") in response
to READ CAPACITY(16), where attempts to use START STOP UNIT for recovery
results in 0x44,0x00 ("Internal target failure") after ~1 second delay.
As result of all recovery retries, device open attempt took ~3 seconds
before finally reporting to GEOM that device is opened, but has no media.
If the open was for writing and since it hasn't formally failed, following
close triggered GEOM retaste, opening device few more times with respective
delays.
This change reduces whole time of this cycle from ~12 seconds to ~3 by
giving up on recovery after the first failure.
manu [Fri, 3 Aug 2018 21:59:01 +0000 (21:59 +0000)]
MFC r336997:
release: Restore copy of boot.scr for some board
This is not a problem for 12-CURRENT as EFI boot works but it doesn't
for 11.
While here some board arm_install_uboot also copy ubldr.bin et create
firstboot files but it's already done in arm_install_boot
asomers [Fri, 3 Aug 2018 14:05:22 +0000 (14:05 +0000)]
MFC r336205:
Don't acquire evclass_lock with a spinlock held
When the "pc" audit class is enabled and auditd is running, witness will
panic during thread exit because au_event_class tries to lock an rwlock
while holding a spinlock acquired upstack by thread_exit.
To fix this, move AUDIT_SYSCALL_EXIT futher upstack, before the spinlock is
acquired. Of thread_exit's 16 callers, it's only necessary to call
AUDIT_SYSCALL_EXIT from two, exit1 (for exiting processes) and kern_thr_exit
(for exiting threads). The other callers are all kernel threads, which
needen't call AUDIT_SYSCALL_EXIT because since they can't make syscalls
there will be nothing to audit. And exit1 already does call
AUDIT_SYSCALL_EXIT, making the second call in thread_exit redundant for that
case.
asomers [Fri, 3 Aug 2018 14:03:50 +0000 (14:03 +0000)]
MFC r335899:
auditd(8): register signal handlers interrutibly
auditd_wait_for_events() relies on read(2) being interrupted by signals,
but it registers signal handlers with signal(3), which sets SA_RESTART.
That breaks asynchronous signal handling. It means that signals don't
actually get handled until after an audit(8) trigger is received.
Symptoms include:
* Sending SIGTERM to auditd doesn't kill it right away; you must send
SIGTERM and then send a trigger with auditon(2).
* Same with SIGHUP
* Zombie child processes don't get reaped until auditd receives a trigger
sent by auditon. This includes children created by expiring audit trails
at auditd startup.
hselasky [Thu, 2 Aug 2018 08:55:19 +0000 (08:55 +0000)]
MFC r336450:
Do not inline transmit headers and use HW VLAN tagging if supported by mlx5en(4).
Query the minimal inline mode supported by the card.
When creating a send queue, cache the queried mode and optimize the transmit
if no inlining is required. In this case, we can avoid touching the headers
cache line and avoid dirtying several more lines by copying headers into
the send WQEs. Also, if no inline headers are used, hardware assists in
the VLAN tag framing.
hselasky [Thu, 2 Aug 2018 08:51:55 +0000 (08:51 +0000)]
MFC r336410:
Add module parameter to limit number of MSIX EQ vectors in mlx5en(4).
For setups having a large amount of PCI devices, it makes sense to limit the
number of MSIX vectors per PCI device, in order to avoid running out of IRQ
vectors.