np [Wed, 24 May 2017 20:28:48 +0000 (20:28 +0000)]
MFC r313318:
cxgbe(4): Allow tunables that control the number of queues to be set to
'-n' to tell the driver to create _up to_ 'n' queues if enough cores are
available. For example, setting hw.cxgbe.nrxq10g="-32" will result in
16 queues if the system has 16 cores, 32 if it has 32.
There is no change in the default number of queues of any type.
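The '-n' rule above can be sketched as follows. This is a minimal illustration, not the driver's code: resolve_nqueues() is a hypothetical helper, and treating a positive value as an exact count is an assumption.

```c
/*
 * Hypothetical helper (not from cxgbe): resolve a queue-count tunable
 * against the number of cores.  A negative value -n means "up to n
 * queues, limited by the core count".  A positive value is assumed
 * here to be used as given.
 */
static int
resolve_nqueues(int tunable, int ncores)
{
	if (tunable < 0) {
		int limit = -tunable;

		return (limit < ncores ? limit : ncores);
	}
	return (tunable);
}
```

With a tunable of -32 this yields 16 on a 16-core system and 32 on a 32-core system, matching the example in the commit message.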
np [Wed, 24 May 2017 19:57:22 +0000 (19:57 +0000)]
MFC r313346:
cxgbe/t4_tom: Fix CLIP entry refcounting on the passive side. Every
IPv6 connection being handled by the TOE should have a reference on its
CLIP entry.
r311880:
The iw_cxgb and iw_cxgbe drivers should not use a FreeBSD device_t where
a linuxkpi style device is expected. If OFED/linuxkpi actually starts
using this field then we'll have to figure out whether to create fake
devices for these drivers or have linuxkpi deal with NULL device.
This mismatch was first reported as part of D6585.
r314167:
cxgbe/iw_cxgbe: Minor changes for T6.
r316118:
cxgbe/iw_cxgbe: T6 has no limit on the amount of memory that can be
registered in one ib_reg_phys_mr.
r316571:
cxgbe/iw_cxgbe: Remove bad cast that resulted in incorrect length for
memory regions larger than 4GB.
r316573:
cxgbe/iw_cxgbe: Replace a magic constant with something more readable
(and accurate).
T4 and later have an extra bit for page shift so the maximum page size
is 8TB (shift of 12 + 31) instead of 128MB (12 + 15). This saves space
in the chip's PBL (physical buffer list) when registering very large
memory regions.
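The page size arithmetic above can be checked with a short sketch. max_page_size() is a hypothetical helper, and reading 15 and 31 as the largest values of a 4-bit and a 5-bit shift field is an inference from "an extra bit", not something taken from the driver.

```c
#include <stdint.h>

/*
 * Hypothetical helper: largest page size in bytes for a base shift
 * (12 in the commit message) plus the largest extra shift the
 * hardware field can hold (15 or 31).
 */
static uint64_t
max_page_size(unsigned int base_shift, unsigned int extra_shift)
{
	return (UINT64_C(1) << (base_shift + extra_shift));
}
```

max_page_size(12, 15) is 2^27 bytes (128MB) and max_page_size(12, 31) is 2^43 bytes (8TB). The 64-bit constant matters: shifting a 32-bit 1 by 43 would not give the intended value.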
r316580:
cxgbe/iw_cxgbe: Remove another bad cast. This should have been
included in r316571.
ae [Wed, 24 May 2017 09:03:46 +0000 (09:03 +0000)]
MFC r318399:
Set M_BCAST and M_MCAST flags on mbuf sent via divert socket.
r290383 changed how mbufs sent by divert socket are handled.
Previously they were always handled by slow path processing in ip_input().
Now ip_tryforward() is invoked from ip_input() before the in_broadcast() check.
Since a diverted packet has lost all mbuf flags, it passes the broadcast check
in ip_tryforward() due to the missing M_BCAST flag. As a result, the broadcast
packet is forwarded to the wire instead of being consumed by the network stack.
Add an in_broadcast() check to the div_output() function and restore the
M_BCAST flag if the destination address is broadcast for the given network
interface.
mav [Tue, 23 May 2017 17:00:56 +0000 (17:00 +0000)]
MFC r309321:
Add `gmirror create` subcommand, similar to gstripe, gconcat, etc.
It is a rather specific mode of operation that does not store on-disk metadata.
It can be useful in some cases in combination with external control
tools handling mirror creation and disk hot-plug.
badger [Tue, 23 May 2017 12:40:50 +0000 (12:40 +0000)]
move p_sigqueue to the end of struct proc
In order to preserve KBI in stable branches, replace the existing
p_sigqueue slot with padding and move the expanded (as of r315949)
p_sigqueue to the end of the struct.
This is a repeat of r317529 (which concerned td_sigqueue in struct
thread) for p_sigqueue in struct proc.
Virtualbox modules (and possibly others) are affected without this fix.
mmel [Tue, 23 May 2017 12:03:59 +0000 (12:03 +0000)]
MFC r318021,r318251:
r318021:
Introduce pmap_remap_vm_attr(), which allows remapping one VM memattr class to
another.
r318251:
Clarify usage rules for pmap_remap_vm_attr(). Not a functional change.
trasz [Tue, 23 May 2017 08:09:44 +0000 (08:09 +0000)]
MFC r318138:
Revert to pre-r318116 wording to not give the false impression
that setting the kernel's idea of terminal size is somehow an
alternative to environment variables.
trasz [Tue, 23 May 2017 08:07:39 +0000 (08:07 +0000)]
MFC r317934:
Add resizewin -z. It makes resizewin not do anything if the terminal
size is already set to something other than zero. It's supposed to be
called from e.g. /etc/profile. It's not necessary to query the terminal
size when logging in over the network, because the protocol used already
takes care of this, but it's necessary when logging in over a serial line.
trasz [Tue, 23 May 2017 08:04:36 +0000 (08:04 +0000)]
MFC r317909:
Make resizewin(1) discard the terminal queues, to lower the chance
of an "unable to parse response" error, which happens when you're typing
too fast for the machine you're running it on.
rmacklem [Mon, 22 May 2017 21:41:34 +0000 (21:41 +0000)]
MFC: r317931
Fix mount_nfs so that it doesn't create mounttab entries for NFSv4 mounts.
The NFSv4 protocol doesn't use the Mount protocol, so it doesn't make sense
to add an entry for an NFSv4 mount to /var/db/mounttab. Also, r308871
modified umount so that it doesn't remove any entry created by mount_nfs.
rmacklem [Mon, 22 May 2017 19:34:37 +0000 (19:34 +0000)]
MFC: r317906
Stop the client side krpc from doing TCP reconnects for ERESTART from sosend().
When sosend() returns ERESTART in the client side krpc, it indicates that
the RPC message hasn't yet been sent and that the send queue is full or
locked while a signal is posted for the process.
Without this patch, this would result in an RPC_CANTSEND reply from
clnt_vc_call(), which would cause clnt_reconnect_call() to create a new
TCP transport connection. For most NFS servers, this wasn't a serious problem,
although it did imply retries of outstanding RPCs, which could possibly
have missed the DRC.
For an NFSv4.1 mount to AmazonEFS, this caused a serious problem, since
AmazonEFS often didn't retain the NFSv4.1 session and would reply with
NFS4ERR_BAD_SESSION. This implies to the client a crash/reboot which
requires open/lock state recovery.
Three options were considered to fix this:
- Return the ERESTART all the way up to the system call boundary and then
have the system call redone. This is fraught with risk, due to convoluted
code paths, asynchronous I/O RPCs etc. cperciva@ worked on this, but it
is still a work in progress and may not be feasible.
- Set SB_NOINTR for the socket buffer. This fixes the problem, but makes
the sosend() completely non interruptible, which kib@ considered
inappropriate. It also would break forced dismount when a thread
was blocked in sosend().
- Modify the retry loop in clnt_vc_call(), so that it loops for this case
for up to 15sec. Testing showed that the sosend() usually succeeded by
the 2nd retry. The extreme case observed was 111 loop iterations, or
about 100msec of delay.
This third alternative is what is implemented in this patch, since the
change is:
- localized
- straightforward
- forced dismount is not broken by it.
This patch has been tested by cperciva@ extensively against AmazonEFS.
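The third alternative can be sketched as a bounded retry loop. This is a minimal userland illustration, not the clnt_vc_call() code: SKETCH_ERESTART, send_with_bounded_retry(), and the fake_link test double are all hypothetical names, the clock is injected so the loop can be exercised without waiting, and any pause between retries is left out.

```c
#include <stddef.h>

#define SKETCH_ERESTART	(-1)	/* stand-in for the kernel's ERESTART */
#define RETRY_BUDGET_MS	15000	/* "up to 15sec" from the commit text */

typedef int  (*send_fn)(void *ctx);	/* hypothetical send attempt */
typedef long (*now_ms_fn)(void *ctx);	/* hypothetical monotonic clock, ms */

/*
 * Retry the send while it reports SKETCH_ERESTART and the time budget
 * has not run out.  Any other result, success or failure, ends the loop.
 */
static int
send_with_bounded_retry(send_fn send, now_ms_fn now_ms, void *ctx,
    int *attempts)
{
	long start = now_ms(ctx);
	int error;

	*attempts = 0;
	do {
		error = send(ctx);
		(*attempts)++;
	} while (error == SKETCH_ERESTART &&
	    now_ms(ctx) - start < RETRY_BUDGET_MS);
	return (error);
}

/* Test double: fails a fixed number of times, clock ticks 1 ms per read. */
struct fake_link {
	int	failures_left;
	long	clock_ms;
};

static int
fake_send(void *ctx)
{
	struct fake_link *l = ctx;

	if (l->failures_left > 0) {
		l->failures_left--;
		return (SKETCH_ERESTART);
	}
	return (0);
}

static long
fake_now(void *ctx)
{
	struct fake_link *l = ctx;

	return (l->clock_ms++);
}
```

Because the loop gives up once the budget is spent and returns the last error, the caller can still fall back to its normal error handling, which keeps the change local to the retry site.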
davidcs [Mon, 22 May 2017 19:22:06 +0000 (19:22 +0000)]
MFC r318382
1. Move Rx Processing to fp_taskqueue(). With this CPU utilization for
processing interrupts drops to around 1% for 100G and under 1% for
other speeds.
2. Use sysctls for TRACE_LRO_CNT and TRACE_TSO_PKT_LEN
3. remove unused mtx tx_lock
4. bind taskqueue kernel thread to the appropriate cpu core
5. when tx_ring is full, stop further transmits till at least 1/16th of
the Tx Ring is empty. In our case 1K entries. Also if there are
rx_pkts to process, put the taskqueue thread to sleep for 100ms,
before enabling interrupts.
6. Use rx_pkt_threshold of 128.
gjb [Mon, 22 May 2017 16:07:17 +0000 (16:07 +0000)]
MFC r307469 (imp):
Allow root_rw_mount to be both lower and upper case. Before, if it was
upper case, you'd sometimes wind up with a read-only filesystem when you
shouldn't.
MSDOS and Windows GNU grep uses -u to mean "print byte offsets as if
running on an UNIX system." The option has no effect on systems that
do not use CRLF line endings.
asomers [Mon, 22 May 2017 15:12:49 +0000 (15:12 +0000)]
MFC r318189:
vdev_geom may associate multiple vdevs per g_consumer
vdev_geom.c currently uses the g_consumer's private field to point to a
vdev_t. That way, a GEOM event can cause a change to a ZFS vdev. For
example, when you remove a disk, the vdev's status will change to REMOVED.
However, vdev_geom will sometimes attach multiple vdevs to the same GEOM
consumer. If this happens, then geom events will only be propagated to one
of the vdevs.
Fix this by storing a linked list of vdevs in g_consumer's private field.
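The idea of hanging a list of vdevs off one consumer can be sketched in plain C. All names below (sketch_vdev, sketch_consumer, sketch_attach, sketch_orphan) are hypothetical and only model the shape of the fix; they are not the GEOM or ZFS structures.

```c
#include <stdlib.h>

struct sketch_vdev {
	int	removed;		/* set when the provider goes away */
};

struct sketch_elem {
	struct sketch_vdev	*vd;
	struct sketch_elem	*next;
};

struct sketch_consumer {
	void	*private;		/* head of the sketch_elem list */
};

/* Attach one more vdev to the consumer by pushing it onto the list. */
static int
sketch_attach(struct sketch_consumer *cp, struct sketch_vdev *vd)
{
	struct sketch_elem *e = malloc(sizeof(*e));

	if (e == NULL)
		return (-1);
	e->vd = vd;
	e->next = cp->private;
	cp->private = e;
	return (0);
}

/* Propagate a removal event to every attached vdev; return how many. */
static int
sketch_orphan(struct sketch_consumer *cp)
{
	int n = 0;

	for (struct sketch_elem *e = cp->private; e != NULL; e = e->next) {
		e->vd->removed = 1;
		n++;
	}
	return (n);
}
```

With a single pointer in the private field, only the last attached vdev would see the event. Walking the list reaches all of them. Freeing the list on detach is omitted here for brevity.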
Fix the output of very large rebind, renew and lease time options in
lease file.
Some routers set very large values for rebind time (Netgear) and these
are erroneously reported as negative in the leasefile. This was due to a
wrong printf format specification of %ld for an unsigned long on 32-bit
platforms.
They would overflow a signed 32-bit time_t on 32 bit architectures. This
was taken care of, but a compiler optimisation makes this behave
erratically. This could be resolved by adding a -fwrapv flag, but
instead we can check the value before adding the current timestamp to
it.
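Checking the value before the addition, rather than relying on wraparound, might look like the sketch below. lease_expiry() is a hypothetical helper, int32_t stands in for a signed 32-bit time_t, and saturating at INT32_MAX is an assumption about what the check should do on overflow.

```c
#include <stdint.h>

/*
 * Hypothetical helper: add a lease duration to the current time
 * without ever overflowing a signed 32-bit timestamp.  Assumes
 * now >= 0.  Saturates at INT32_MAX instead of wrapping.
 */
static int32_t
lease_expiry(int32_t now, uint32_t duration)
{
	if (duration > (uint32_t)INT32_MAX ||
	    (int32_t)duration > INT32_MAX - now)
		return (INT32_MAX);
	return (now + (int32_t)duration);
}
```

The comparison is done on the operands, so no signed overflow ever occurs and the result does not depend on compiler flags such as -fwrapv.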
hselasky [Mon, 22 May 2017 08:17:07 +0000 (08:17 +0000)]
MFC r318531:
mlx4: Use the CQ quota for SRIOV when creating completion EQs
When creating EQs to handle CQ completion events for the PF or for
VFs, we create enough EQE entries to handle completions for the max
number of CQs that can use that EQ.
When SRIOV is activated, the max number of CQs a VF (or the PF) can
obtain is its CQ quota (determined by the Hypervisor resource
tracker). Therefore, when creating an EQ, the number of EQE entries
that the VF should request for that EQ is the CQ quota value (and not
the total number of CQs available in the firmware).
Under SRIOV, the PF also must use its CQ quota, because the resource
tracker also controls how many CQs the PF can obtain.
Using the firmware total CQs instead of the CQ quota when creating EQs
resulted in wasted MTT entries, due to allocating more EQEs than were
needed.
ngie [Mon, 22 May 2017 06:24:43 +0000 (06:24 +0000)]
MFC r317594:
usb(4): manpage cleanup
1. Wrap at <80 columns for readability when editing. Rewrap some lines
prematurely wrapped to better fit in <80 columns and not waste
vertical space.
2. Fix SEE ALSO sorting (sort by section first, then manpage name).
3. Tweak the compound device description slightly by adding soft stops
via commas.
ngie [Mon, 22 May 2017 06:07:09 +0000 (06:07 +0000)]
MFC r315793:
intro(3): fix markup
- Use `Em` with `.It` macro when referring to other libraries, instead of
`Xr`.
- Use `.Em` instead of `.Xr` when referring to libraries.
- Remove commented out lines.
jhibbits [Sat, 20 May 2017 05:12:32 +0000 (05:12 +0000)]
MFC r314370,r318130,r318167:
DTrace related fixes for PowerPC.
r314370:
Unbreak kernel breakpoints, broken for ~4 years now
r318130:
Fix the encoded instruction for FBT traps on powerpc
r318167:
Fix stack tracing in dtrace for powerpc
gjb [Sat, 20 May 2017 01:04:19 +0000 (01:04 +0000)]
Document r316613, C standard library has been updated to make use
of reallocarray(3).
Document r318121, system libraries have been updated to make use
of reallocarray(3).
Document r315282, GNU __nonnull__ attributes have been replaced with
the more benign Clang nullability attributes.
Submitted by: pfg
Sponsored by: The FreeBSD Foundation
gjb [Sat, 20 May 2017 01:04:18 +0000 (01:04 +0000)]
Document r316613, C standard library has been updated to make use
of reallocarray(3).
Document r318121, system libraries have been updated to make use
of reallocarray(3).
Submitted by: pfg
Sponsored by: The FreeBSD Foundation
jhb [Fri, 19 May 2017 23:01:55 +0000 (23:01 +0000)]
MFC 318360: Fix p_endcopy.
For p_endcopy to work correctly, it must be the name of the next element
in struct proc after the end of the copy region, not the name of the
last element in the copy region. Currently, the last element
(p_elf_flags) is not being copied. In addition, the p_xexit and
p_xsig fields should not be copied on fork, so move them out of the
copy region.
Note that for stable/11 the fix is a bit simpler than in HEAD as it
merely consists of formally moving p_xexit and p_xsig out of the
copy region by shrinking the end of the copy region.
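Why the end marker has to name the first element after the copy region can be shown with a small model. struct sketch_proc and its fields are hypothetical, and the two marker macros merely imitate the naming used for struct proc.

```c
#include <stddef.h>
#include <string.h>

struct sketch_proc {
	int	p_private;	/* before the region, not copied */
	int	p_first;	/* first field of the copy region */
	int	p_last;		/* last field of the copy region */
	int	p_after;	/* first field after the region, not copied */
};

/* Markers: the end marker names the element AFTER the region. */
#define	p_startcopy	p_first
#define	p_endcopy	p_after

/* Copy only the bytes in [p_startcopy, p_endcopy). */
static void
sketch_copy(struct sketch_proc *dst, const struct sketch_proc *src)
{
	size_t start = offsetof(struct sketch_proc, p_startcopy);
	size_t end = offsetof(struct sketch_proc, p_endcopy);

	memcpy((char *)dst + start, (const char *)src + start, end - start);
}
```

The copied length is the difference between two offsets, so the range is half-open. If p_endcopy were defined as p_last, the length would stop at the start of p_last and that field would silently be left out of the copy.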
gjb [Fri, 19 May 2017 20:11:35 +0000 (20:11 +0000)]
Add a 'Userland Debugging' section.
Document r306786, core dumps now include the process ID (PID) and
command line arguments.
Document r304499, ptrace(2) now supports events for vfork(2).
Submitted by: jhb
Sponsored by: The FreeBSD Foundation
gjb [Fri, 19 May 2017 20:11:34 +0000 (20:11 +0000)]
Document r306520, PCI pass through with bhyve resets functions via
FLR.
Document r306471, PCI pass through with bhyve supports more dynamic
configurations.
Submitted by: jhb
Sponsored by: The FreeBSD Foundation
gjb [Fri, 19 May 2017 20:11:33 +0000 (20:11 +0000)]
Document r306660, Virtual Function devices on T4 and T5 adapters.
Document r306661, TCP Offload Engine on Chelsio T4+ adapters.
Document r306664, cxgbev(4) addition.
Document r309560, several cxgbe(4) and cxgbev(4) updates.
Submitted by: jhb
Sponsored by: The FreeBSD Foundation
hselasky [Fri, 19 May 2017 12:51:13 +0000 (12:51 +0000)]
MFC r313555:
Flexible and asymmetric allocation of EQs and MSI-X vectors for PF/VFs.
Previously, the mlx4 driver queried the firmware in order to get the
number of supported EQs. Under SRIOV, since this was done before the
driver notified the firmware how many VFs it actually needs, the
firmware had to take into account a worst case scenario and always
allocated four EQs per VF, where one was used for events while the
others were used for completions. Now, when the firmware supports the
asymmetric allocation scheme, denoted by exposing num_sys_eqs > 0 (-->
MLX4_DEV_CAP_FLAG2_SYS_EQS), we use the QUERY_FUNC command to query
the firmware before enabling SRIOV. Thus we can get more EQs and MSI-X
vectors per function. Moreover, when running in the new
firmware/driver mode, the limitation that the number of EQs should be
a power of two is lifted.
Obtained from: Linux (dual BSD/GPLv2 licensed)
Submitted by: Dexuan Cui @ microsoft . com
Differential Revision: https://reviews.freebsd.org/D8867
Sponsored by: Mellanox Technologies
hselasky [Fri, 19 May 2017 12:35:23 +0000 (12:35 +0000)]
MFC r313556:
Change mlx4 QP allocation scheme.
When using Blue-Flame, BF, the QPN overrides the VLAN, CV, and SV
fields in the WQE. Thus, BF may only be used for QPNs with bits 6,7
unset.
The current ethernet driver code reserves a TX QP range with 256b
alignment.
This is wrong because if there are more than 64 TX QPs in use, QPNs >=
base + 64 will have bits 6/7 set.
This problem is not specific to the Ethernet driver; any entity that
tries to reserve more than 64 BF-enabled QPs should fail. Also, using
ranges is not necessary here and is wasteful.
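The bit constraint can be made concrete with a short sketch. The mask value follows directly from "bits 6,7 unset". qpn_bf_eligible() and count_bf_eligible() are hypothetical helpers, not mlx4 functions.

```c
#include <stdbool.h>
#include <stdint.h>

#define	SKETCH_BF_MASK	UINT32_C(0xc0)	/* bits 6 and 7 */

/* A QPN can use Blue-Flame only if bits 6 and 7 are both clear. */
static bool
qpn_bf_eligible(uint32_t qpn)
{
	return ((qpn & SKETCH_BF_MASK) == 0);
}

/* Count the BF-eligible QPNs in the contiguous range [base, base + nqps). */
static uint32_t
count_bf_eligible(uint32_t base, uint32_t nqps)
{
	uint32_t count = 0;

	for (uint32_t i = 0; i < nqps; i++)
		if (qpn_bf_eligible(base + i))
			count++;
	return (count);
}
```

For a 256-aligned base such as 0x100, only the first 64 QPNs of each 256-entry window pass the check, which is why a contiguous range cannot hold more than 64 BF-enabled QPs.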
The new mechanism introduced here will support reservation for "Eth
QPs eligible for BF" for all drivers: bare-metal, multi-PF, and VFs
(when hypervisors support WC in VMs). The flow we use is:
1. In mlx4_en, allocate Tx QPs one by one instead of a range allocation,
and request "BF enabled QPs" if BF is supported for the function
2. In the ALLOC_RES FW command, change param1 to:
a. param1[23:0] - number of QPs
b. param1[31:24] - flags controlling QPs reservation
Bit 31 refers to Eth blueflame supported QPs. Those QPs must have bits
6 and 7 unset in order to be used in Ethernet.
Bits 24-30 of the flags are currently reserved.
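The param1 layout described in step 2 can be sketched as a pair of pack and unpack helpers. The macro and function names are hypothetical. Only the bit positions (count in bits 23:0, Eth BF flag in bit 31, bits 24-30 reserved) come from the text above.

```c
#include <stdbool.h>
#include <stdint.h>

#define	SKETCH_QP_COUNT_MASK	UINT32_C(0x00ffffff)	/* param1[23:0] */
#define	SKETCH_QP_FLAG_ETH_BF	(UINT32_C(1) << 31)	/* param1 bit 31 */

/* Build param1: QP count in the low 24 bits, BF request in bit 31. */
static uint32_t
alloc_res_param1(uint32_t nqps, bool want_eth_bf)
{
	uint32_t param1 = nqps & SKETCH_QP_COUNT_MASK;

	if (want_eth_bf)
		param1 |= SKETCH_QP_FLAG_ETH_BF;
	return (param1);	/* bits 24-30 stay zero (reserved) */
}

/* Extract the QP count from param1. */
static uint32_t
param1_nqps(uint32_t param1)
{
	return (param1 & SKETCH_QP_COUNT_MASK);
}

/* Test whether the Eth BF flag is requested in param1. */
static bool
param1_wants_eth_bf(uint32_t param1)
{
	return ((param1 & SKETCH_QP_FLAG_ETH_BF) != 0);
}
```

A single BF-enabled QP would be requested as 0x80000001 under this layout.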
When a function tries to allocate a QP, it states the required
attributes for this QP. Those attributes are considered "best-effort".
If an attribute, such as Ethernet BF enabled QP, is a must-have
attribute, the function has to check that attribute is supported
before trying to do the allocation.
In a lower layer of the code, mlx4_qp_reserve_range masks out the bits
which are unsupported. If SRIOV is used, the PF validates those
attributes and masks out unsupported attributes as well. In order to
notify VFs which attributes are supported, the VF uses QUERY_FUNC_CAP
command. This command's mailbox is filled by the PF, which notifies
which QP allocation attributes it supports.
Obtained from: Linux (dual BSD/GPLv2 licensed)
Submitted by: Dexuan Cui @ microsoft . com
Differential Revision: https://reviews.freebsd.org/D8868
Sponsored by: Mellanox Technologies