Ian Lepore [Fri, 10 May 2019 02:30:16 +0000 (02:30 +0000)]
Allow dcons(4) to be unloaded when loaded as a module.
When the module is unloaded, the tty devices are destroyed. That requires
implementing the tsw_free callback to avoid a panic. This driver requires
no particular cleanup to be done from the callback, but the module itself
must remain in memory until the deferred tsw_free callbacks are invoked.
These changes implement that by incrementing a reference count variable in
the detach routine, and decrementing it in the tsw_free callback. The
MOD_UNLOAD event handler doesn't return until the count drops to zero.
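A minimal user-space sketch of the counting scheme described above, using C11 atomics. All names here (pending_free, sketch_detach, sketch_tsw_free, sketch_can_unload) are hypothetical and are not taken from the dcons(4) source; the real driver runs in the kernel and may use different primitives.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical counter: deferred tsw_free callbacks still outstanding. */
static atomic_int pending_free;

/* Detach path: count one pending callback per tty being torn down. */
static void
sketch_detach(void)
{
	atomic_fetch_add(&pending_free, 1);
	/* The tty teardown would be requested here. */
}

/* tsw_free callback: no cleanup needed, just drop the count. */
static void
sketch_tsw_free(void)
{
	int prev = atomic_fetch_sub(&pending_free, 1);

	assert(prev > 0);	/* must pair with an earlier detach */
}

/* The unload handler may return only once this is true. */
static bool
sketch_can_unload(void)
{
	return (atomic_load(&pending_free) == 0);
}
```

In this sketch the unload handler would sleep briefly and re-check sketch_can_unload() in a loop, so the module text stays mapped until the last callback has run.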
Eric Joyner [Fri, 10 May 2019 00:41:42 +0000 (00:41 +0000)]
iflib: use default ntxd and nrxd when user value is not power of 2
From Jake:
A user may set a sysctl to override the default number of Tx or Rx
descriptors. However, certain calculations in the iflib core expect the
number of descriptors to be a power of 2.
Update _iflib_assert to verify that all of the shared context parameters
for the number of descriptors are powers of 2.
Modify iflib_reset_qvalues to check that the provided isc_nrxd value is
a power of 2. If it's not, print a warning message and then use the
default value.
An alternative might be to try rounding the number down instead.
However, this creates problems in case the rounded down value is below
the minimum value that the driver would support.
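The check-and-fall-back behaviour can be sketched in plain C as follows. The helper names (is_pow2, pick_ndesc) and the convention that 0 means "not set by the user" are assumptions made for illustration, not the iflib code itself.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/* True when n is a nonzero power of 2 (exactly one bit set). */
static bool
is_pow2(unsigned int n)
{
	return (n != 0 && (n & (n - 1)) == 0);
}

/*
 * Hypothetical helper: choose the descriptor count to use.
 * Assumption: user_val == 0 means the sysctl was left unset.
 */
static unsigned int
pick_ndesc(const char *name, unsigned int user_val, unsigned int def_val)
{
	assert(is_pow2(def_val));	/* the default itself must be valid */

	if (user_val == 0)
		return (def_val);
	if (!is_pow2(user_val)) {
		fprintf(stderr, "%s: %u is not a power of 2, using default %u\n",
		    name, user_val, def_val);
		return (def_val);
	}
	return (user_val);
}
```

For example, pick_ndesc("nrxd", 1000, 1024) warns and yields 1024, whereas rounding 1000 down would give 512, which might be below what a driver supports.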
Enji Cooper [Fri, 10 May 2019 00:03:32 +0000 (00:03 +0000)]
Refactor tests/sys/opencrypto/runtests
* Convert from plain to TAP for slightly improved introspection when skipping
the tests due to requirements not being met.
* Test for the net/py-dpkt (origin) package being required when running the
tests, instead of relying on a copy of the dpkt.py module from 2014. This
enables the tests to work with py3. Subsequently, remove
`tests/sys/opencrypto/dpkt.py(c)?` via `make delete-old`.
* Parameterize out `python2` as `$PYTHON`.
Andrew Gallatin [Thu, 9 May 2019 22:38:15 +0000 (22:38 +0000)]
Remove IPSEC from GENERIC due to performance issues
Having IPSEC compiled into the kernel imposes a non-trivial
performance penalty on multi-threaded workloads due to IPSEC
refcounting. In my benchmarks of multi-threaded UDP
transmit (connected sockets), I've seen a roughly 20% performance
penalty when the IPSEC option is included in the kernel (16.8Mpps
vs 13.8Mpps with 32 senders on a 14 core / 28 HTT Xeon
2697v3). This is largely due to key_addref() incrementing and
decrementing an atomic reference count on the default
policy. This causes all CPUs to stall on the same cacheline, as it
bounces between different CPUs.
Given that relatively few users use ipsec, and that it can be
loaded as a module, it seems reasonable to ask those users to
load the ipsec module so as to avoid imposing this penalty on the
GENERIC kernel. It's my hope that this will make FreeBSD look
better in "out of the box" benchmark comparisons with other
operating systems.
Many thanks to ae for fixing auto-loading of ipsec.ko when
ifconfig tries to configure ipsec, and to cy for volunteering
to ensure that the racoon ports will load the ipsec.ko module.
- Remove Tn macros
- Reference sysctl(8) instead of sysctl(1)
- Start new sentences on new lines
- Capitalize NFS where needed
- Use Fx for FreeBSD
- Remove a list block (Bl) that was added to the manual page
by accident in r335174
Kyle Evans [Thu, 9 May 2019 12:58:33 +0000 (12:58 +0000)]
ifconfig(8): Partial revert of r347241
r347241 introduced an ifname <-> kld mapping table, mostly so tun/tap/vmnet
can autoload the correct module on use. It also inadvertently made bogus
some previously valid uses of sizeof().
Revert to ifkind on the stack for simplicity's sake. This reduces the
diff from the previous version of ifmaybeload for easier auditing.
Toomas Soome [Thu, 9 May 2019 11:04:10 +0000 (11:04 +0000)]
loader: ptable_print() needs two tabs sometimes
Since the partition/slice names vary in length, check the length
of the fixed part of the line against 3 * 8. If the length is less than
3 tab stops, print out an extra tab.
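A small sketch of the padding rule, assuming 8-column tab stops as in the message. The function name format_row and the split into a "fixed" part and a "rest" part are hypothetical and only illustrate the length check, not the loader's actual ptable_print().

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define TABSTOP	8	/* column width of one tab stop */

/*
 * Hypothetical helper: join the fixed part of a row and the remaining
 * columns.  A fixed part shorter than three tab stops gets one extra
 * tab so the following column lines up with longer rows.
 */
static int
format_row(char *out, size_t outsz, const char *fixed, const char *rest)
{
	const char *pad;

	assert(out != NULL && fixed != NULL && rest != NULL);
	pad = (strlen(fixed) < 3 * TABSTOP) ? "\t\t" : "\t";
	return (snprintf(out, outsz, "%s%s%s", fixed, pad, rest));
}
```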
In mld_v2_cancel_link_timers() check the number of references and disconnect
inm before releasing the last reference. This fixes possible panics and
assertion failures.
Gleb Smirnoff [Wed, 8 May 2019 23:39:24 +0000 (23:39 +0000)]
The existence of PCB route caching doesn't allow us to use the new fast route
lookup KPI in ip_output() like it is already used in ip_forward().
However, when there is no PCB provided we can use the fast KPI, gaining
a performance advantage.
A typical case where ip_output() is called without a PCB pointer is a
sendto(2) on an unconnected UDP socket. In practice DNS servers do
this.
Mateusz Guzik [Wed, 8 May 2019 16:30:38 +0000 (16:30 +0000)]
Reduce umtx-related work on exec and exit
- there is no need to take the process lock to iterate the thread
list after single-threading is enforced
- typically there are no mutexes to clean up (testable without taking
the global umtx lock)
- typically there is no need to adjust the priority (testable without
taking thread lock)
Reviewed by: kib
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D20160
Justin Hibbits [Wed, 8 May 2019 16:15:28 +0000 (16:15 +0000)]
powerpc/booke: Rewrite pmap_sync_icache() a bit
* Make mmu_booke_sync_icache() use the DMAP on 64-bit processes, no need to
map the page into the user's address space. This removes the
pvh_global_lock from the equation on 64-bit.
* Don't map the page with user-readability on 32-bit. I don't know what the
chance of a given user process being able to access the NULL page when
another process's page is added there, but it doesn't seem like a good
idea to map it to NULL with user read permissions.
* Only sync as much as we need to. There are only two significant places
where pmap_sync_icache is used: proc_rwmem(), and the SIGILL second-chance
for powerpc. The SIGILL second chance is likely the most common, and only
syncs 4 bytes, so avoid the other 127 loop iterations (4096 / 32 byte
cacheline) in __syncicache().
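To make the iteration arithmetic in the last item concrete, here is a sketch that counts how many cache lines a range touches, assuming the 32-byte line size quoted above. The function sync_iterations is hypothetical and does not reproduce __syncicache().

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CACHELINE	32u	/* line size taken from the message above */

/*
 * Hypothetical helper: number of cache lines covered by [va, va + len),
 * i.e. how many loop iterations a per-line sync would need.
 */
static size_t
sync_iterations(uintptr_t va, size_t len)
{
	uintptr_t start, end;

	assert((CACHELINE & (CACHELINE - 1)) == 0);	/* mask trick needs pow2 */
	if (len == 0)
		return (0);
	start = va & ~(uintptr_t)(CACHELINE - 1);	/* round down */
	end = (va + len + CACHELINE - 1) & ~(uintptr_t)(CACHELINE - 1);
	return ((end - start) / CACHELINE);
}
```

A full 4096-byte page needs 128 iterations, while an aligned 4-byte instruction needs one, which is where the 127 saved iterations come from. A 4-byte range that straddles a line boundary needs two.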
Justin Hibbits [Wed, 8 May 2019 16:05:18 +0000 (16:05 +0000)]
powerpc/booke: Do as much work outside of TLB locks as possible
Reduce the surface area of the TLB locks. Unfortunately the same trick for
serializing the tlbie instruction on OEA64 cannot be used here to reduce the
scope of the tlbivax mutex to the tlbsync only, as the mutex also serializes
the TLB miss lock as a side effect, so contention on this lock may not be
reducible any further.
Ruslan Bukin [Wed, 8 May 2019 15:22:27 +0000 (15:22 +0000)]
o Implement a bounce buffer based on device reserved memory.
Grab device reserved physical memory regions from FDT using standard
"memory-region" property and use vmem(9) to allocate buffers from it.
The same vmem could be used by DMA engine drivers to allocate memory for
DMA descriptors.
This is required for platforms that provide uncached memory region
reserved exclusively for DMA operations.
o Change the sleepable sx(9) lock type to a non-sleepable mutex(9), since
network drivers usually hold a mutex during DMA operations. This way we
never take a sleepable lock after a non-sleepable one.
Tested on U.S. Government Furnished Equipment (GFE) 64-bit RISC-V cores.
Conrad Meyer [Wed, 8 May 2019 14:54:32 +0000 (14:54 +0000)]
random(4): Don't complain noisily when an entropy source is slow
Mjg@ reports that RDSEED (r347239) causes a lot of logspam from this printf,
and I don't feel that it is especially useful (even ratelimited). There are
many other quality/quantity checks we're not performing on entropy sources;
lack of high frequency availability does not disqualify a good entropy
source.
There is some discussion in the linked Differential about what logging might
be appropriate and/or polling policy for slower TRNG sources. Please feel
free to chime in if you have opinions.
There is no reason to re-create the command workqueue during healthcare.
This also fixes an issue where a previous work struct may refer to a
destroyed workqueue.
MFC after: 3 days
Sponsored by: Mellanox Technologies
Fix race between driver unload and dumping firmware in mlx5core.
The present code uses lock-less accesses to the dump data to prevent top
level ioctls from blocking bottom-level call to dump. Unfortunately, this
depends on the type stability of the dump data structure, which makes it
non-functional during driver teardown.
Switch to the mutex locking scheme where top levels use the mutex in the
bound regions, while copyouts and drain for completion utilize condvars.
The mutex lifetime is guaranteed to be strictly larger than the time
interval where the driver can initiate a dump, and most of the control fields
of the old struct mlx5_dump_data are directly embedded into struct
mlx5_core_dev.
Flush command workqueue when command completion is triggered in mlx5core.
Avoid a race for command completion when triggering a command completion event.
Serialize operation by queueing all commands on the same work queue.
This can happen when healthcare triggers.
MFC after: 3 days
Sponsored by: Mellanox Technologies
The command timeout is terribly long, a whole two hours. Make it 60s so if
things do go wrong, the user gets feedback in relatively short time, so
they can take corrective actions and/or investigate using tools and such.
Allow more macro arguments and split the variable type and name into
separate arguments. This allows simple and powerful copy and extraction
of values from IFC based structures into SYSCTLs with the use of a single
macro.
MFC after: 3 days
Sponsored by: Mellanox Technologies
Implement a watchdog as part of the healthcare subsystem which
reads the PCI power status during startup and upon the PCI
power status change event and stores it into the core device
structure. This value is then exported to user-space via a
read-only SYSCTL. A dmesg print has been added to inform
the admin about the PCI power status.
MFC after: 3 days
Sponsored by: Mellanox Technologies
Always return success for RoCE modify port in mlx5ib.
CM layer calls ib_modify_port() regardless of the link layer.
For the Ethernet ports, qkey violation and Port capabilities
are meaningless. Therefore, always return success for ib_modify_port
calls on the Ethernet ports.
Correct check for the calibration generation in mlx5en(4).
If generation is cleared due to hardware clock failure, check for it before
the divisor is used. Actually clear generation when failure occurs.
While there, stop doing the calculations inside the generation loop. Since
all members of mlx5e_clbr_point are used for calculations, take a
local copy of the structure and use it once the generation has stabilized.
Use software counters for rx_packets and rx_bytes in mlx5en(4).
The physical- and virtual- port counters might not reflect the amount
of data received after address filtering. Use the software counters
instead for rx_packets and rx_bytes to know exactly how much data
was received.
MFC after: 3 days
Sponsored by: Mellanox Technologies
The current mapping of driver counters to netstat counters is wrong.
For example, a single jabber packet will cause the Ierrs counter to
count three times.
The work of mapping the hardware and software counters to their right
place in netstat counters was already done in Linux, so take that as is
for the FreeBSD driver.
ib_req_notify_cq may return a negative value, which indicates a
failure. In the case of an uncorrectable error, we will end up in an
endless loop. Fix that by going to another loop with poll_more
only if there is anything left to poll.
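The loop condition can be illustrated with a stand-alone sketch. Everything here (struct fake_cq, fake_poll, fake_req_notify, drain_cq) is a hypothetical stand-in for the verbs API, and the sketch assumes the convention that the notify call returns a positive value when completions may have been missed, 0 on success, and a negative value on error.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model of a completion queue. */
struct fake_cq {
	int pending;	/* completions ready to be polled */
	int late;	/* completions arriving before notify is armed */
	int error;	/* nonzero: notify fails permanently */
	int polls;	/* number of poll passes performed */
};

/* Drain everything currently pending; return how many were handled. */
static int
fake_poll(struct fake_cq *cq)
{
	int n = cq->pending;

	cq->pending = 0;
	cq->polls++;
	return (n);
}

/* Return < 0 on error, > 0 if more work showed up, 0 otherwise. */
static int
fake_req_notify(struct fake_cq *cq)
{
	if (cq->error)
		return (-1);
	if (cq->late > 0) {
		cq->pending += cq->late;
		cq->late = 0;
		return (1);
	}
	return (0);
}

/* Poll again only when notify reports work left, never on error. */
static int
drain_cq(struct fake_cq *cq)
{
	int done = 0, rc;

	assert(cq != NULL);
	do {
		done += fake_poll(cq);
		rc = fake_req_notify(cq);
	} while (rc > 0);	/* "rc != 0" would spin forever on error */
	return (done);
}
```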