hselasky [Wed, 8 May 2019 10:42:33 +0000 (10:42 +0000)]
Fix netstat counters mapping in mlx5en(4).
The current mapping of driver counters to netstat counters is wrong.
For example, a single jabber packet, will cause the Ierrs counter to
count three times.
The work for mapping the hardware and software counters to their right
place in netstat counters were already done in Linux, take that as is
to the FreeBSD driver.
hselasky [Wed, 8 May 2019 10:42:05 +0000 (10:42 +0000)]
Fix endless loop in ipoib_poll().
ib_req_notify_cq may return negative value which will indicate a
failure. In the case of uncorrectable error, we will end up in an
endless loop. Fix that, by going to another loop with poll_more
only if there is anything left to poll.
hselasky [Wed, 8 May 2019 10:39:53 +0000 (10:39 +0000)]
Query and cache PCAM, MCAM registers on initialization in mlx5core.
On load_one, we now cache our capabilities registers internally, similar
to QUERY_HCA_CAP. Capabilities can later be queried using macros
introduced in this patch.
hselasky [Wed, 8 May 2019 10:39:25 +0000 (10:39 +0000)]
Implement PCAM, MCAM access register commands in mlx5core.
Introduced registers will expose capabilities of new registers and
features related to port/management.
Driver will query MCAM and PCAM in order to avoid failing on old
firmwares with lack of support.
PCAM and MCAM registers will provide information regarding firmware
support for different features, in order to avoid cases where new driver
combined with old firmware results in syndromes (for ex. PCIe counters
before this patchset).
hselasky [Wed, 8 May 2019 10:38:06 +0000 (10:38 +0000)]
Add Fast teardown support to mlx5core.
Today mlx5 devices support two teardown modes:
1- Regular teardown
2- Force teardown
This change introduces the enhanced version of the "Force teardown" that
allows SW to perform teardown in a faster way without the need to reclaim
all the pages.
Fast teardown provides the following advantages:
1- Fix a FW race condition that could cause command timeout
2- Avoid moving to polling mode
3- Close the vport to prevent PCI ACK to be sent without been
scattered to memory
hselasky [Wed, 8 May 2019 10:35:55 +0000 (10:35 +0000)]
Configure firmware to use RX hash format in mini CQE in mlx5en(4).
When using CQE zipping, one can choose between RX hash and Checksum.
This will indicate the parameter on which a zipping session should be
stopped.
While porting the Linux code, Checksum was chosen. However, the value
of Checksum is not being used anywhere.
For the FreeBSD driver, we prefer to use the RX hash format which will
guarantee the RX hash value for all the mini CQEs.
While at it, make sure to initialize the Checksum value in the
decompressed CQE.
hselasky [Wed, 8 May 2019 10:35:35 +0000 (10:35 +0000)]
Disable CQE zipping by default in mlx5en(4).
After doing performance measurements, it seems like CQE zipping doesn't
have any significant benefit.
Moreover, we know that this feature is disabled by default on other
operating systems (Linux for example).
hselasky [Wed, 8 May 2019 10:35:14 +0000 (10:35 +0000)]
Split mlx5e_update_stats_work() in mlx5en(4).
Split the function into the mlx5e_update_stats_locked() core and make
mlx5e_update_stats_work() call the _locked helper, similar to many other
places in the kernel. This improves the code structure, making the
locking clean.
hselasky [Wed, 8 May 2019 10:34:42 +0000 (10:34 +0000)]
Implement fast close of RX channel in mlx5en(4).
Instead of waiting for all jobs to be cancelled, simply close the completion
queue to prevent more completion events and let mlx5e_destroy_rq() cleanup
the remaining mbufs.
MFC after: 3 days
Sponsored by: Mellanox Technologies
hselasky [Wed, 8 May 2019 10:32:03 +0000 (10:32 +0000)]
Fix tx_jumbo_packets counter in mlx5en(4).
Instead of reading Ethernet RFC 2819 pXtoYoctets counters from
hardware which counts RX octets, count tx_stat_pXtoYoctets from
Ethernet extended counters which counts TX octets.
TX jumbo counters should be accumulated only after the PPCNT
counters were fetched from hardware with their latest value.
hselasky [Wed, 8 May 2019 10:30:47 +0000 (10:30 +0000)]
Protect from infinite sw-reset loop in mlx5core.
Avoid an infinite software firmware reset loop that may be caused by a
hardware bug by limiting the maximum number of resets.
The counter between resets is reset by request for reset, and not by a
successful reset.
The interval between two resets can be configured via sysctl:
hw.mlx5.sw_reset_timeout
which is global to all mlx5 devices in the system.
hselasky [Wed, 8 May 2019 10:28:18 +0000 (10:28 +0000)]
Add temperature warning event to log in mlx5core.
Temperature warning event is sent by FW to indicate high temperature
as detected by one of the sensors on the board.
Add handling of this event by writing the numbers of the alert sensors
to the kernel log.
hselasky [Wed, 8 May 2019 10:27:29 +0000 (10:27 +0000)]
Correctly define the interface state bits in mlx5en(4).
While at it remove unused interface state bits. This also fixes and issue
during shutdown:
There is an issue where the firmware fails during mlx5_load_one,
the health_care timer detects the issue and schedules a health_care call.
Then the mlx5_load_one detects the issue, cleans up and quits. Then
the health_care starts and calls mlx5_unload_one to clean up the resources
that no longer exist and causes kernel panic.
The root cause is that the bit MLX5_INTERFACE_STATE_DOWN is not set
after mlx5_load_one fails. The solution is removing the bit
MLX5_INTERFACE_STATE_DOWN and quit mlx5_unload_one if the
bit MLX5_INTERFACE_STATE_UP is not set. The bit MLX5_INTERFACE_STATE_DOWN
is redundant and we can use MLX5_INTERFACE_STATE_UP instead.
dim [Wed, 8 May 2019 05:45:00 +0000 (05:45 +0000)]
Pull in r360099 from upstream llvm trunk (by Eli Friedman):
[ARM] Glue register copies to tail calls.
This generally follows what other targets do. I don't completely
understand why the special case for tail calls existed in the first
place; even when the code was committed in r105413, call lowering
didn't work in the way described in the comments.
Stack protector lowering breaks if the register copies are not glued
to a tail call: we have to insert the stack protector check before
the tail call, and we choose the location based on the assumption
that all physical register dependencies of a tail call are adjacent
to the tail call. (See FindSplitPointForStackProtector.) This is sort
of fragile, but I don't see any reason to break that assumption.
I'm guessing nobody has seen this before just because it's hard to
convince the scheduler to actually schedule the code in a way that
breaks; even without the glue, the only computation that could
actually be scheduled after the register copies is the computation of
the call address, and the scheduler usually prefers to schedule that
before the copies anyway.
This should fix several instances of "Bad machine code: Using an
undefined physical register", when compiling ports such as
multimedia/vlc, audio/alsa-lib and devel/avro-c for armv6, with
-fstack-protector-strong.
kevans [Wed, 8 May 2019 02:32:11 +0000 (02:32 +0000)]
tun/tap: merge and rename to `tuntap`
tun(4) and tap(4) share the same general management interface and have a lot
in common. Bugs exist in tap(4) that have been fixed in tun(4), and
vice-versa. Let's reduce the maintenance requirements by merging them
together and using flags to differentiate between the three interface types
(tun, tap, vmnet).
This fixes a couple of tap(4)/vmnet(4) issues right out of the gate:
- tap devices may no longer be destroyed while they're open [0]
- VIMAGE issues already addressed in tun by kp
[0] emaste had removed an easy-panic-button in r240938 due to devdrn
blocking. A naive glance over this leads me to believe that this isn't quite
complete -- destroy_devl will only block while executing d_* functions, but
doesn't block the device from being destroyed while a process has it open.
The latter is the intent of the condvar in tun, so this is "fixed" (for
certain definitions of the word -- it wasn't really broken in tap, it just
wasn't quite ideal).
ifconfig(8) also grew the ability to map an interface name to a kld, so
that `ifconfig {tun,tap}0` can continue to autoload the correct module, and
`ifconfig vmnet0 create` will now autoload the correct module. This is a
low overhead addition.
(MFC commentary)
This may get MFC'd if many bugs in tun(4)/tap(4) are discovered after this,
and how critical they are. Changes after this are likely easily MFC'd
without taking this merge, but the merge will be easier.
mav [Wed, 8 May 2019 01:35:43 +0000 (01:35 +0000)]
Fix dataset name comparison in zfs_compare().
The code never returned match comparing two datasets (not snapshots).
As result, uu_avl_find(), called from zfs_callback(), never succeeded,
allowing to add same dataset into the list multiple times, for example:
# zfs get name pers pers pers@z pers@z
NAME PROPERTY VALUE SOURCE
pers name pers -
pers name pers -
pers@z name pers@z -
With the patch:
# zfs get name pers pers pers@z pers@z
NAME PROPERTY VALUE SOURCE
pers name pers -
pers@z name pers@z -
cem [Wed, 8 May 2019 00:45:16 +0000 (00:45 +0000)]
random: x86 driver: Prefer RDSEED over RDRAND when available
Per
https://software.intel.com/en-us/blogs/2012/11/17/the-difference-between-rdrand-and-rdseed
, RDRAND is a PRNG seeded from the same source as RDSEED. The source is
more suitable as PRNG seed material, so prefer it when the RDSEED intrinsic
is available (indicated in CPU feature bits).
tuexen [Tue, 7 May 2019 20:28:12 +0000 (20:28 +0000)]
Remove non-functional SCTP checksum offload support for virtio.
Checksum offloading for SCTP is not currently specified for virtio.
If the hypervisor announces checksum offloading support, it means TCP
and UDP checksum offload. If an SCTP packet is sent and the host announced
checksum offload support, the hypervisor inserts the IP checksum (16-bit)
at the correct offset, but this is not the right checksum, which is a CRC32c.
This results in all outgoing packets having the wrong checksum and therefore
breaking SCTP based communications.
This patch removes SCTP checksum offloading support from the virtio
network interface.
Thanks to Felix Weinrank for making me aware of the issue.
trasz [Tue, 7 May 2019 19:06:41 +0000 (19:06 +0000)]
Support PTRACE_GETREGSET w/ NT_PRSTATUS in Linux ptrace(2).
While Linux strace(1) doesn't strictly require it - it has a fallback
to PTRACE_GETREGS - it's a newer interface, so we better support it
before the old one is deprecated.
Reviewed by: dchagin
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D20152
cem [Tue, 7 May 2019 17:47:20 +0000 (17:47 +0000)]
device_printf: Use sbuf for more coherent prints on SMP
device_printf does multiple calls to printf allowing other console messages to
be inserted between the device name, and the rest of the message. This change
uses sbuf to compose to two into a single buffer, and prints it all at once.
It exposes an sbuf drain function (drain-to-printf) for common use.
Update documentation to match; some unit tests included.
emaste [Tue, 7 May 2019 16:17:33 +0000 (16:17 +0000)]
makesyscalls: use @generated tag in generated files
Multiple tools use @generated to identify generated files (for example,
in a review Phabricator will by default hide diffs in generated files).
Use the @generated tag in makesyscalls.sh as we've done for other
generated files.
Reviewed by: cem
MFC after: 1 month
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D20183
Disable interrupts first and then set spinlock_count to 1.
Otherwise interrupt can be generated just after setting spinlock_count
and before disabling interrupts.
emaste [Tue, 7 May 2019 13:04:26 +0000 (13:04 +0000)]
Use @generated tag in generated files
Multiple tools use @generated to identify generated files (for example,
in a review Phabricator will by default hide diffs in generated files).
Use the @generated tag in makeobjops.awk and vnode_if.awk as we've done
for other generated files.
marius [Tue, 7 May 2019 08:31:54 +0000 (08:31 +0000)]
o Avoid determining the MAC class (LEM/EM or IGB) - possibly even multiple
times - on every interrupt by using an own set of device methods for the
IGB class. This translates to introducing igb_if_intr_{disable,enable}()
and igb_if_{rx,tx}_queue_intr_enable() with that IGB-specific code moved
out of their EM counterparts and otherwise continuing to use the EM IFDI
methods also for IGB.
Note that igb_if_intr_{disable,enable}() also issue E1000_WRITE_FLUSH as
lost with the conversion of igb(4) to iflib(4).
Also note, that the em_if_{disable,enable}_intr() methods are renamed to
em_if_intr_{disable,enable}() for consistency with the names used in the
interface declaration.
o In em_intr():
- Don't bother to bail out if the interrupt type is "legacy", i. e. INTx
or MSI, as iflib(4) doesn't use ift_legacy_intr methods for MSI-X. All
other iflib(4)-based drivers avoid this check, too.
- Given that only the MSI-X interrupts have one-shot behavior (by taking
advantage of the EIAC register), explicitly disable interrupts. Hence,
em_intr() now matches what {em,igb}_irq_fast() previously did (in case
of igb(4) supposedly also to work around MSI message reordering errata
on certain systems).
o In em_if_intr_disable():
- Clear the EIAC register unconditionally for 82574 and not just in case
of MSI-X, matching em_if_intr_enable() and bringing back the last hunk
of r206437 lost with the iflib(4) conversion.
- Write to EM_EIAC for clearing said register instead of to the IGB-only
E1000_EIAC used ever since the iflib(4) conversion.
marius [Tue, 7 May 2019 08:28:35 +0000 (08:28 +0000)]
o Use iflib_fast_intr_rxtx() also for "legacy" interrupts, i. e. INTx and
MSI. Unlike as with iflib_fast_intr_ctx(), the former will also enqueue
_task_fn_tx() in addition to _task_fn_rx() if appropriate, bringing TCP
TX throughput of EM-class devices on par with the MSI-X case and, thus,
close to wirespeed/pre-iflib(4) times again. [1]
Note that independently of the interrupt type, the UDP performance with
these MACs still is abysmal and nowhere near to where it was before the
conversion of em(4) to iflib(4).
o In iflib_init_locked(), announce which free list failed to set up.
o In _task_fn_tx() when running netmap(4), issue ifdi_intr_enable instead
of the ifdi_tx_queue_intr_enable method in case of a "legacy" interrupt
as the latter is valid with MSI-X only.
o Instead of adding the missing - and apparently convoluted enough that a
DBG_COUNTER_INC was put into a wrong spot in _task_fn_rx() - checks for
ifdi_{r,t}x_queue_intr_enable being available in the MSI-X case also to
iflib_fast_intr_rxtx(), factor these out to iflib_device_register() and
make the checks fail gracefully rather than panic. This avoids invoking
the checks at runtime over and over again in iflib_fast_intr_rxtx() and
_task_fn_{r,t}x() - even if it's just in case of INVARIANTS - and makes
these functions more readable.
o In iflib_rx_structures_setup(), only initialize LRO resources if device
and driver have LRO capability in order to not waste memory. Also, free
the LRO resources again if setting them up fails for one of the queues.
However, don't bother invoking iflib_rx_sds_free() in that case because
iflib_rx_structures_setup() doesn't call iflib_rxsd_alloc() either (and
iflib_{device,pseudo}_register() will issue iflib_rx_sds_free() in case
of failure via iflib_rx_structures_free(), but there definitely is some
asymmetry left to be fixed, though).
o Similarly, free LRO resources again in iflib_rx_structures_free().
o In iflib_irq_set_affinity(), handle get_core_offset() errors gracefully
instead of panicing (but only in case of INVARIANTS). This is a follow-
up to r344132, as such driver bugs shouldn't be fatal.
o Likewise, handle unknown iflib_intr_type_t in iflib_irq_alloc_generic()
gracefully, too.
o Bring yet more sanity to iflib_msix_init():
- If the device doesn't provide enough MSI-X vectors or not all vectors
can be allocate so the expected number of queues in addition to admin
interrupts can't be supported, try MSI next (and then INTx) as proper
MSI-X vector distribution can't be assured in such cases. In essence,
this change brings r254008 forward to iflib(4). Also, this is the fix
alluded to in the commit message of r343934.
- If the MSI-X allocation has failed, don't prematurely announce MSI is
going to be used as the latter in fact may not be available either.
- When falling back to MSI, only release the MSI-X table resource again
if it was allocated in iflib_msix_init(), i. e. isn't supplied by the
driver, in the first place.
o In mp_ndesc_handler(), handle unknown type arguments gracefully, too.
dougm [Mon, 6 May 2019 22:12:15 +0000 (22:12 +0000)]
The intention of the blist cursor is for the search for free blocks to
resume where the last search left off. Suppose that there are no free
blocks of size 32, but plenty of size 16. If we repeatedly request
size 32 blocks, fail, and retry with size 16 blocks, then the failures
all reset the cursor to the beginning of memory, making the 16 block
allocation use a first fit, rather than next fit, strategy.
This change has blist_alloc make a copy of the cursor for its own
decision making, and only updates the real blist cursor after a
successful allocation, making those 16 block searches behave like
next-fit searches.
marius [Mon, 6 May 2019 20:56:41 +0000 (20:56 +0000)]
- Remove the unused ifc_link_irq and ifc_mtx_name members of struct iflib_ctx.
- Remove the only ever written to ift_db_mtx_name member of struct iflib_txq.
- Remove the unused or only ever written to ifr_size, ifr_cq_pidx, ifr_cq_gen
and ifr_lro_enabled members of struct iflib_rxq.
- Consistently spell DMA, RX and TX uppercase in comments, messages etc.
instead of mixing with some lowercase variants.
- Consistently use if_t instead of a mix of if_t and struct ifnet pointers.
- Bring the function comments of _iflib_fl_refill(), iflib_rx_sds_free() and
iflib_fl_setup() in line with reality.
- Judging problem reports, people are wondering what on earth messages like:
"TX(0) desc avail = 1024, pidx = 0"
are trying to indicate. Thus, extend this string to be more like that of
non-iflib(4) Ethernet MAC drivers, notifying about a watchdog timeout due
to which the interface will be reset.
- Take advantage of the M_HAS_VLANTAG macro.
- Use false/true rather than FALSE/TRUE for variables of type bool.
- Use FALLTHROUGH as advocated by style(9).
imp [Mon, 6 May 2019 18:39:22 +0000 (18:39 +0000)]
We only ever need one devinfo per handle. So allocate it outside of
looping over the filesystem modules rather than doing a malloc + free
each time through the loop. In addition, nothing changes from loop to
loop, so setup the new devinfo outside the loop as well.
cem [Mon, 6 May 2019 18:24:07 +0000 (18:24 +0000)]
List-ify kernel dump device configuration
Allow users to specify multiple dump configurations in a prioritized list.
This enables fallback to secondary device(s) if primary dump fails. E.g.,
one might configure a preference for netdump, but fallback to disk dump as a
second choice if netdump is unavailable.
This change does not list-ify netdump configuration, which is tracked
separately from ordinary disk dumps internally; only one netdump
configuration can be made at a time, for now. It also does not implement
IPv6 netdump.
savecore(8) is already capable of scanning and iterating multiple devices
from /etc/fstab or passed on the command line.
This change doesn't update the rc or loader variables 'dumpdev' in any way;
it can still be set to configure a single dump device, and rc.d/savecore
still uses it as a single device. Only dumpon(8) is updated to be able to
configure the more complicated configurations for now.
As part of revving the ABI, unify netdump and disk dump configuration ioctl
/ structure, and leave room for ipv6 netdump as a future possibility.
Backwards-compatibility ioctls are added to smooth ABI transition,
especially for developers who may not keep kernel and userspace perfectly
synced.
hselasky [Mon, 6 May 2019 16:17:38 +0000 (16:17 +0000)]
Disabling a PCI device should only disable busmaster in the LinuxKPI.
As Linux comment for this function point:
Signal to the system that the PCI device is not in use by the system
anymore. This only involves disabling PCI bus-mastering, if active.
hselasky [Mon, 6 May 2019 16:00:20 +0000 (16:00 +0000)]
Allow controlling pr_debug at runtime in the LinuxKPI.
Turning on pr_debug at compile time make it non-optional at runtime.
This often means that the amount of the debugging is unbearable.
Allow developer to turn on pr_debug output only when needed.
royger [Mon, 6 May 2019 09:48:34 +0000 (09:48 +0000)]
geom: fix initialization order
There's a race between the initialization of devsoftc.mtx (by devinit)
and the creation of the geom worker thread g_run_events, which calls
devctl_queue_data_f. Both of those are initialized at SI_SUB_DRIVERS
and SI_ORDER_FIRST, which means the geom worked thread can be created
before the mutex has been initialized, leading to the panic below:
wpanic: mtx_lock() of spin mutex (null) @ /usr/home/osstest/build.135317.build-amd64-freebsd/freebsd/sys/kern/subr_bus.c:620
cpuid = 3
time = 1
KDB: stack backtrace:
db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfffffe003b968710
vpanic() at vpanic+0x19d/frame 0xfffffe003b968760
panic() at panic+0x43/frame 0xfffffe003b9687c0
__mtx_lock_flags() at __mtx_lock_flags+0x145/frame 0xfffffe003b968810
devctl_queue_data_f() at devctl_queue_data_f+0x6a/frame 0xfffffe003b968840
g_dev_taste() at g_dev_taste+0x463/frame 0xfffffe003b968a00
g_load_class() at g_load_class+0x1bc/frame 0xfffffe003b968a30
g_run_events() at g_run_events+0x197/frame 0xfffffe003b968a70
fork_exit() at fork_exit+0x84/frame 0xfffffe003b968ab0
fork_trampoline() at fork_trampoline+0xe/frame 0xfffffe003b968ab0
--- trap 0, rip = 0, rsp = 0, rbp = 0 ---
KDB: enter: panic
[ thread pid 13 tid 100029 ]
Stopped at kdb_enter+0x3b: movq $0,kdb_why
Fix this by initializing geom at SI_ORDER_SECOND instead of
SI_ORDER_FIRST.
kib [Mon, 6 May 2019 08:49:43 +0000 (08:49 +0000)]
Do not flush NFS node from NFS VOP_SET_TEXT().
The more appropriate place to do the flushing is VOP_OPEN(). This was
uncovered because VOP_SET_TEXT() is now called with the vnode'
vm_object rlocked, which is incompatible with the flush operations.
After the move, there is no need for NFS-specific VOP_SET_TEXT
overload.
Sponsored by: The FreeBSD Foundation
MFC after: 30 days
jhibbits [Sun, 5 May 2019 20:23:43 +0000 (20:23 +0000)]
powerpc/booke: Use #ifdef __powerpc64__ instead of hw_direct_map in places
Since the DMAP is only available on powerpc64, and is *always* available on
Book-E powerpc64, don't penalize either side (32-bit or 64-bit) by always
checking hw_direct_map to perform operations. This saves 5-10% time on
various ports builds, and on buildworld+buildkernel on Book-E hardware.
jhibbits [Sun, 5 May 2019 20:05:50 +0000 (20:05 +0000)]
powerpc/booke: Fix size check for phys_avail in pmap bootstrap
Use the nitems() macro instead of the expansion, a'la r298352. Also, fix the
location of this check to after initializing availmem_regions_sz, so that the
check isn't always against 0, thus always failing (nitems(phys_avail) is always
more than 0).
kib [Sun, 5 May 2019 11:20:43 +0000 (11:20 +0000)]
Switch to use shared vnode locks for text files during image activation.
kern_execve() locks text vnode exclusive to be able to set and clear
VV_TEXT flag. VV_TEXT is mutually exclusive with the v_writecount > 0
condition.
The change removes VV_TEXT, replacing it with the condition
v_writecount <= -1, and puts v_writecount under the vnode interlock.
Each text reference decrements v_writecount. To clear the text
reference when the segment is unmapped, it is recorded in the
vm_map_entry backed by the text file as MAP_ENTRY_VN_TEXT flag, and
v_writecount is incremented on the map entry removal
The operations like VOP_ADD_WRITECOUNT() and VOP_SET_TEXT() check that
v_writecount does not contradict the desired change. vn_writecheck()
is now racy and its use was eliminated everywhere except access.
Atomic check for writeability and increment of v_writecount is
performed by the VOP. vn_truncate() now increments v_writecount
around VOP_SETATTR() call, lack of which is arguably a bug on its own.
nullfs bypasses v_writecount to the lower vnode always, so nullfs
vnode has its own v_writecount correct, and lower vnode gets all
references, since object->handle is always lower vnode.
On the text vnode' vm object dealloc, the v_writecount value is reset
to zero, and deadfs vop_unset_text short-circuit the operation.
Reclamation of lowervp always reclaims all nullfs vnodes referencing
lowervp first, so no stray references are left.
Reviewed by: markj, trasz
Tested by: mjg, pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 month
Differential revision: https://reviews.freebsd.org/D19923
kib [Sun, 5 May 2019 11:04:01 +0000 (11:04 +0000)]
imgact_elf: do not relock the text vnode if possible.
We unlock the vnode around malloc(M_WAITOK), to make it possible for
pagedaemon to flush vnode pages for us. Instead of doing it
unconditionally, first try M_NOWAIT allocation, which typically
succeed. Only on failure, unlock the vnode and retry with M_WAITOK.
Reviewed by: markj, trasz
Tested by: mjg, pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D19923
adrian [Sun, 5 May 2019 06:32:40 +0000 (06:32 +0000)]
[ath_rate_sample] Have the final attempted rate in 11n modes to be the lowest one.
Right now ath_rate_sample has a fixed rate schedule, rather than the minstrel_ht
style "best, good, most reliable" triplet. So, if higher rates are tried then
it'll not fail back to a lower MCS rate in that transmission schedule.
This means that in low SNR situations it'll not easily drop to MCS0 unless enough
transmissions occur to allow rate control to eventually decide to drop; and if
it's TCP traffic it'll get slowed down because of packet loss.
It's worse for 2-stream and 3-stream rates; it doesn't ever fall back to lower
stream rates, and these higher stream rates required higher SNR to work.
So instead let's (for now?) have each of the 11n transmit rates use MCS0 as
the last attempt. ath_rate_sample will quickly see that rate succeeds more
and will move to it much quicker.
adrian [Sun, 5 May 2019 04:56:37 +0000 (04:56 +0000)]
[ath] [ath_rate] Fix ANI calibration during non-ACTIVE states; start poking at rate control
These are some fun issues I've found with my upstairs wifi link at such a ridiculous
low signal level (like, < 5dB.)
* Add per-station tx/rx rssi statistics, in potential preparation to use that
in the RX rate control.
* Call the rate control on each received frame to let it potentially use
it as a hint for what rates to potentially use. It's a no-op right now.
* Do ANI calibration during scan as well. The ath_newstate() call was disabling the
ANI timer and only re-enabling it during transitions to _RUN. This has the
unfortunate side-effect that if ANI deafened the NIC because of interference
and it disassociated, it wouldn't be reset and the scan would never hear beacons.
The ANI configuration is stored at least globally on some HALs and per-channel
on others. Because of this a NIC reset wouldn't help; the ANI parameters would
simply be programmed back in.
Now, I have a feeling I also need to do this during AUTH/ASSOC too and maybe,
if I'm feeling clever, I need to reset the ANI parameters on a given channel
during a transition through INIT or if the VAP is destroyed/re-created.
However for now this gets me out of the immediate weeds with connectivity
upstairs (and thus I /can/ commit); I'll keep chipping away at tidying this
stuff up in subsequent commits.
cem [Sat, 4 May 2019 20:34:26 +0000 (20:34 +0000)]
x86: Implement MWAIT support for stopping a CPU
IPI_STOP is used after panic or when ddb is entered manually. MONITOR/
MWAIT allows CPUs that support the feature to sleep in a low power way
instead of spinning. Something similar is already used at idle.
It is perhaps especially useful in oversubscribed VM environments, and is
safe to use even if the panic/ddb thread is not the BSP. (Except in the
presence of MWAIT errata, which are detected automatically on platforms with
known wakeup problems.)
It can be tuned/sysctled with "machdep.stop_mwait," which defaults to 0
(off). This commit also introduces the tunable
"machdep.mwait_cpustop_broken," which defaults to 0, unless the CPU has
known errata, but may be set to "1" in loader.conf to signal that mwait
wakeup is broken on CPUs FreeBSD does not yet know about.
Unfortunately, Bhyve doesn't yet support MONITOR extensions, so this doesn't
help bhyve hypervisors running FreeBSD guests.
Submitted by: Anton Rang <rang AT acm.org> (earlier version)
Reviewed by: kib
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D20135