r315370:
The adj_free and max_free values of new_entry will be calculated and
assigned by subsequent vm_map_entry_link(), therefore, remove the
pointless copying.
asomers [Tue, 30 May 2017 22:54:52 +0000 (22:54 +0000)]
MFC r318189:
vdev_geom may associate multiple vdevs per g_consumer
vdev_geom.c currently uses the g_consumer's private field to point to a
vdev_t. That way, a GEOM event can cause a change to a ZFS vdev. For
example, when you remove a disk, the vdev's status will change to REMOVED.
However, vdev_geom will sometimes attach multiple vdevs to the same GEOM
consumer. If this happens, then geom events will only be propagated to one
of the vdevs.
Fix this by storing a linked list of vdevs in g_consumer's private field.
asomers [Tue, 30 May 2017 22:45:01 +0000 (22:45 +0000)]
MFC r317755, r317758
r317755:
Various Coverity fixes in ifconfig(8)
* Exit early if kldload(2) fails (1011259). This is the only change that
affects ifconfig's behavior.
* Close memory and resource leaks (1305624, 1305205, 1007100)
* Mark usage() as _Noreturn (1305806, 1305750)
* Fix some dereference after null checks (1011474, 270774)
asomers [Tue, 30 May 2017 22:43:08 +0000 (22:43 +0000)]
MFC r316856:
MFV 316855
7900 zdb shouldn't print the path of a znode at verbosity < 5
Reviewed by: Paul Dagnelie <pcd@delphix.com>
Reviewed by: Matt Ahrens <mahrens@delphix.com>
Approved by: Dan McDonald <danmcd@omniti.com>
Author: Alan Somers <asomers@freebsd.org>
asomers [Tue, 30 May 2017 22:41:19 +0000 (22:41 +0000)]
MFC r316760:
Fix vdev_geom_attach_by_guids for partitioned disks
When opening a vdev whose path is unknown, vdev_geom must find a geom
provider with a label whose guids match the desired vdev. However, due to
partitioning, it is possible that two non-synonomous providers will share
some labels. For example, if the first partition starts at the beginning of
the drive, then ada0 and ada0p1 will share the first label. More troubling,
if the last partition runs to the end of the drive, then ada0p3 and ada0
will share the last label. If vdev_geom opens ada0 when it should've opened
ada0p3, then the pool won't be readable. If it opens ada0 when it should've
opened ada0p1, then it will corrupt some other partition when it writes the
3rd and 4th labels.
The easiest way to reproduce this problem is to install a mirrored root pool
with the default partition layout, then swap the positions of the two boot
drives and reboot. Whether the bug manifests depends on the order in which
geom lists its providers, which is arbitrary.
Fix this situation by modifying the search algorithm to prefer geom
providers that have all four labels intact. If no such provider exists, then
open whichever provider has the most.
asomers [Tue, 30 May 2017 22:34:43 +0000 (22:34 +0000)]
MFC r316548:
Quiet 450.status-security when *_inline="YES"
Previously, 450.status-security would always set rc=3 in inline mode,
because it doesn't know whether "periodic security" is going to find
anything interesting. But this annoyingly results in daily reports that
simply say "Security check: \n\n-- End of daily output --".
This change fixes that by testing whether "periodic security" printed
anything, and setting 450.status-security's exit status to 3 if it did. An
alternative would be to change the exit status of periodic(8) to be the
worst of its scripts' exit statuses, but that would be a more intrusive
change.
Reviewed by: brian
Differential Revision: https://reviews.freebsd.org/D10267
r315034:
Document that the msun tests require WARNS=0
ATF tests have a default WARNS of 0, unlike other usermode programs. This
change is technically a noop, but it documents that the msun tests don't
work with any warnings enabled, at least not on all architectures.
asomers [Tue, 30 May 2017 16:15:52 +0000 (16:15 +0000)]
MFC r314148, r314150
r314148:
Misc Coverity fixes in xnb(4)
Most of these are null pointer dereferences or missing error checks in the
unit tests. One is a missing error check in xnb_attach_failed. None can
cause real problems in running systems.
asomers [Tue, 30 May 2017 16:09:54 +0000 (16:09 +0000)]
MFC r313069:
Allow 999.local to run scripts in any language
If one of the scripts listed in (daily|weekly|monthly)_local is executable,
999.local should simply execute it. Only if the script isn't executable
should 999.local assume it needs /bin/sh.
Reviewed by: brian
Sponsored by: Spectra Logic Corp
2. Ensure that nsswitch.conf hosts line contains something like:
hosts: files cache dns
Note that cache must be specified before dns.
3. Start nscd.
4. Run the following command:
while true; do nc -z -w 3 www.google.com 80; sleep 5; done
5. While running the command, remove or comment out all nameserver
statements in /etc/resolv.conf. After a short while you will notice
non-recoverable name rsolution failures.
6. Uncomment or replace all nameserver statements back into
/etc/resolv.conf. Take note that name resolution never recovers.
To recover nscd must be restarted. This patch fixes this.
ngie [Mon, 29 May 2017 06:25:59 +0000 (06:25 +0000)]
MFC r316179,r316180,r316181,r316260:
r316179 (by cem):
t_msgsnd: Use msgsnd()'s msgsz argument correctly to avoid overflow
msgsnd's msgsz argument is the size of the message following the 'long'
message type. Don't include the message type in the size of the message
when invoking msgsnd(2).
ngie [Sun, 28 May 2017 01:14:59 +0000 (01:14 +0000)]
MFC r309412,r316109,r316132:
r309412 (by imp):
dd is currently a bootstrap tool. It really doesn't have any business
being a bootstrap tool. However, for reproducible build output,
FreeBSD added dd status=none because it was otherwise difficult to
suppress the status information, but retain any errors that might
happen. There's no real reason that dd has to be a build tool, other
than we use status=none unconditional. Remove dd from a bootstrap tool
entirely by only using status=none when available. This may also help
efforts to build the system on non-FreeBSD hosts as well.
r316109:
Don't hardcode input files for stage 1/2 bootloaders; use .ALLSRC instead
This is a better pattern to follow when creating the bootloaders and doing
the relevant space checks to make sure that the sizes aren't exceeded (and
thus, copy-pasting is a bit less error prone).
r316132:
Parameterize out 7680 (15 * 512) as BOOT2SIZE, similar to sys/boot/i386/zfsboot/...
This is being done to make it easier to change in the future--this action might be
needed sooner rather than later because of gcc 6.3.0 bailing, stating that there
is negative free space left (deficit) in the boot2 bootloader.
ngie [Sun, 28 May 2017 00:47:02 +0000 (00:47 +0000)]
MFC r316131:
Fix up r316081 by using nitems(cam_errbuf) instead of sizeof(cam_errbuf)
Part of my original reasoning as far as converting the snprintf
calls was to permit switching over from char[] to wchar_t[] in the
future, as well as futureproof in case cam_errbuf's size was ever
changed.
Unfortunately, my approach was bugged because it conflated the
number of items with the size of the buffer, instead of the number of
elements being a fixed size != 1 byte.
Use nitems(..) instead which counts the quantity of items of a specific
type, as opposed to an unqualified sizeof(..) (which assumes that the
number of characters is equal to the buffer size).
Fix -Wimplicit-function-declaration compilation warning by moving libgeom.h
#include below the stdio.h #include.
gctl_dump(3) needs stdio.h, per reasoning noted in r317289.
PR: 218809
r317291:
Rename gctl.t to gctl_test.t and test.c to gctl_test_helper.c
This is being done to reduce ambiguity and to make the tests more portable
in the future to other locations in the source tree.
r317292:
gctl_test.t: use make to compile gctl_test_helper instead of calling cc directly
r317293:
gctl_test_helper: apply polish
- Staticize variables to fix warnings.
- Sprinkle asserts around for calls that can fail
- Apply style(9) for main(..) definition.
- ANSIify usage(..) definition.
r317294:
Bump WARNS to 6 per previous commits which fixed warnings
Tested with: clang (4.0), gcc (4.2.1, 6.3.0)
r317295:
The GPT class no longer exists; use the PART class instead
r317304:
gctl_test_helper: add diagnostic output for parse_retval(..)
This will help end-users better diagnose issues with the function.
r317306:
gctl_test.t: minor tweaks
- Declare $count with the `my` scope operator to permit `use strict`.
- Add `use strict`.
- Use `use warnings` instead of using `-w` in the shebang.
- Don't unlink $cmd when done (prevents unnecessary rebuilding).
- Improve the error message when running with insufficient permissions, e.g.,
non-root.
r317307:
Use verb=delete not verb=remove
The `remove` verb hasn't been present in geom_part*(4) for well
over a decade, if ever. I couldn't find any references to it in
^/stable/5 at least, which is around the timeframe that this test
was written.
r317308:
gctl_test.t: more tweaks to try and update the code and get it functional (again?)
- Make the logfile for $out be built off the basename for $cmd, instead of $cmd.
(r317292 broke this assumption).
- Rename $mntpt to $mntpt_prefix for clarity, as this variable is a prefix for
mountpoints.
- Reindent the umount directive block while here to match the rest of the code.
r317309:
gctl_test.t: improve error reporting with mdcfg and mount directives
If the commands had failed previously, it would press on and result in a
series of cascading failures. Fail early and continue on to the next case
instead of executing additional commands after a previously failed series
of steps.
ngie [Sat, 27 May 2017 23:26:10 +0000 (23:26 +0000)]
MFC r316099:
lib/libkvm: start adding basic tests for kvm(3)
- kvm_close: add a testcase to verify support for errno = EINVAL / -1
(see D10065) when kd == NULL is provided to the libcall.
- kvm_geterr:
-- Add a negative testcase for kd == NULL returning "" (see D10022).
-- Add two positive testcases:
--- test the error case using kvm_write on a O_RDONLY descriptor.
--- test the "no error" case using kvm_read(3) and kvm_nlist(3) as
helper routines and by injecting a bogus error message via
_kvm_err (an internal API) _kvm_err was used as there isn't a
formalized way to clear the error output, and because
kvm_nlist always returns ENOENT with the NULL terminator today.
- kvm_open, kvm_open2:
-- Add some basic negative tests for kvm_open(3) and kvm_open2(3).
Testing positive cases with a specific
`corefile`/`execfile`/`resolver` requires more work and would require
user intervention today in order to reliably test this out.
MFC note:
lib/libkvm/kvm_open2_test is not compiled/tested because ^/stable/10
lacks the kvm_open2(3) libcall.
ngie [Sat, 27 May 2017 23:04:48 +0000 (23:04 +0000)]
MFC r318546:
sys/fs/tmpfs/vnd_test: make md(4) allocation dynamic
The previous logic was flawed in the sense that it assumed that /dev/md3
was always available. This was a caveat I noted in r306038, that I hadn't
gotten around to solving before now.
Cache the device for the mountpoint after executing mdmfs, then use the
cached value in basic_cleanup(..) when unmounting/disconnecting the md(4)
device.
Apply sed expressions to use reuse logic in the NetBSD code that could
also be applied to FreeBSD, just with different tools.
ngie [Sat, 27 May 2017 22:57:07 +0000 (22:57 +0000)]
MFC r317288,r317289:
r317288:
libgeom(3): apply minor polish
- Use .Dv when mentioning NULL per mdoc(7).
- Reword `g_device_path`, `g_open_by_ident`, and `g_providername`'s descriptions
so they're less wordy.
- Fix a typo in `g_device_path` (can not -> cannot).
Tested with: igor, make manlint
r317289:
libgeom(3): note that stdio.h is required when referencing gctl_dump(3)
gctl_dump(3) is only exposed when stdio.h is #include'd first, per its
addition in r112510. The reasoning noted for the conditional "exposure"
of the function was to "limit #include pollution".
This addresses an issue I found with the documentation when looking at
bug 218809, which in turn addresses a -Wimplicit-function-declaration
compiler warning in `tools/regression/geom_gpt/test.c` (it uses
gctl_dump(3)).
hselasky [Sat, 27 May 2017 08:27:11 +0000 (08:27 +0000)]
MFC r318820:
Increase the allowed maximum number of audio channels from 31 to 127
in the PCM feeder mixer. Without this change a value of 32 channels is
treated like zero, due to using a mask of 0x1f, causing a kernel
assert when trying to playback bitperfect 32-channel audio. Also
update the AWK script which is generating the division tables to
handle more than 18 channels. This commit complements r282650.
hselasky [Sat, 27 May 2017 08:17:59 +0000 (08:17 +0000)]
MFC r318353:
Avoid use of contiguous memory allocations in busdma when possible.
This patch improves the boundary checks in busdma to allow more cases
using the regular page based kernel memory allocator. Especially in
the case of having a non-zero boundary in the parent DMA tag. For
example AMD64 based platforms set the PCI DMA tag boundary to
PCI_DMA_BOUNDARY, 4GB, which before this patch caused contiguous
memory allocations to be preferred when allocating more than PAGE_SIZE
bytes. Even if the required alignment was less than PAGE_SIZE bytes.
This patch also fixes the nsegments check for using kmem_alloc_attr()
when the maximum segment size is less than PAGE_SIZE bytes.
Updated some comments describing the code in question.
gjb [Fri, 26 May 2017 19:02:46 +0000 (19:02 +0000)]
MFC r314935 (thompsa):
Change ec2.conf to use the pkg tool from a chroot rather than trying
to bootstrap it and fail from the livecd readonly filesystem.
truckman [Thu, 25 May 2017 22:41:34 +0000 (22:41 +0000)]
MFC r318527
Fix the queue delay estimation in PIE/FQ-PIE when the timestamp
(TS) method is used. When packet timestamp is used, the "current_qdelay"
keeps storing the last queue delay value calculated in the dequeue
function. Therefore, when a burst of packets arrives followed by
a pause, the "current_qdelay" will store a high value caused by the
burst and stick to that value during the pause because the queue
delay measurement is done inside the dequeue function. This causes
the drop probability calculation function to calculate high drop
probability value instead of zero and prevents the burst allowance
mechanism from working properly. Fix this problem by resetting
"current_qdelay" inside the drop probability calculation function
when the queue length is zero and TS option is used.
truckman [Thu, 25 May 2017 17:23:26 +0000 (17:23 +0000)]
MFC r318511
The result of right shifting a negative signed value is implementation
defined. On machines without arithmetic shift instructions, zero bits
may be shifted in from the left, giving a large positive result instead
of the desired divide-by power-of-2. Fix this by operating on the
absolute value and compensating for the possible negation later.
Reverse the order of the underflow/overflow tests and the exponential
decay calculation to avoid the possibility of an erroneous overflow
detection if p is a sufficiently small non-negative value. Also
check for negative values of prob before doing the exponential decay
to avoid another instance of of right shifting a negative value.
np [Thu, 25 May 2017 02:00:37 +0000 (02:00 +0000)]
MFC r318014, r318091, r318125, and r318263.
r318014:
cxgbe(4): Fixes related to the knob that controls link autonegotiation.
- Do not leak the adapter lock in sysctl_autoneg.
- Accept only 0 or 1 as valid settings for autonegotiation.
- A fixed speed must be requested by the driver when autonegotiation is
disabled otherwise the firmware will reject the l1cfg command. Use
the top speed supported by the port for now.
r318091:
cxgbe(4): Do not assume that if_qflush is always followed by inteface-down.
r318125:
Adjust whitespace and fix a comment. No functional change.
r318263:
cxgbe(4): netmap-only interrupts for a VI do not have an associated rxq
or ofld_rxq and should be ignored by vi_intr_iq.
np [Thu, 25 May 2017 01:43:28 +0000 (01:43 +0000)]
MFC r317702, r317847, r318307
r317702:
cxgbe(4): Support routines for Tx traffic scheduling.
- Create a new file, t4_sched.c, and move all of the code related to
traffic management from t4_main.c and t4_sge.c to this file.
- Track both Channel Rate Limiter (ch_rl) and Class Rate Limiter (cl_rl)
parameters in the PF driver.
- Initialize all the cl_rl limiters with somewhat arbitrary default
rates and provide routines to update them on the fly.
- Provide routines to reserve and release traffic classes.
r317847:
cxgbe(4): The Tx scheduler initialization either works or doesn't. It
doesn't need a refresh in either case.
r318307:
cxgbe(4): Avoid an out of bounds access when an attempt to unbind a tx
queue from a traffic class fails.
gjb [Thu, 25 May 2017 01:31:12 +0000 (01:31 +0000)]
MFC r308737, r308779:
r308737:
Pass SWAPSIZE in env(1) when invoking mk-vmimage.sh, otherwise
mkimg(1) does not create the second partition after r307008.
r308779:
Pass SWAPSIZE in env(1) when invoking mk-vmimage.sh for the
vm-image target, missed in r308737.
np [Thu, 25 May 2017 00:16:41 +0000 (00:16 +0000)]
MFC r316971:
cxgbe: Add a tunable to configure the SGE time scaler, which is
available starting with T6. The values in the timer holdoff registers
are multiplied by the scaling factor before use.
dev.<nexus>.<n>.holdoff_timers shows the final values of the
timers in microseconds.
np [Wed, 24 May 2017 20:29:20 +0000 (20:29 +0000)]
MFC r313318:
cxgbe(4): Allow tunables that control the number of queues to be set to
'-n' to tell the driver to create _up to_ 'n' queues if enough cores are
available. For example, setting hw.cxgbe.nrxq10g="-32" will result in
16 queues if the system has 16 cores, 32 if it has 32.
There is no change in the default number of queues of any type.
np [Wed, 24 May 2017 20:01:12 +0000 (20:01 +0000)]
MFC r313346:
cxgbe/t4_tom: Fix CLIP entry refcounting on the passive side. Every
IPv6 connection being handled by the TOE should have a reference on its
CLIP entry.
r311880:
The iw_cxgb and iw_cxgbe drivers should not use a FreeBSD device_t where
a linuxkpi style device is expected. If OFED/linuxkpi actually starts
using this field then we'll have to figure out whether to create fake
devices for these drivers or have linuxkpi deal with NULL device.
This mismatch was first reported as part of D6585.
r314167:
cxgbe/iw_cxgbe: Minor changes for T6.
r316118:
cxgbe/iw_cxgbe: T6 has no limit on the amount of memory that can be
registered in one ib_reg_phys_mr.
r316571:
cxgbe/iw_cxgbe: Remove bad cast that resulted in incorrect length for
memory regions larger than 4GB.
r316573:
cxgbe/iw_cxgbe: Replace a magic constant with something more readable
(and accurate).
T4 and later have an extra bit for page shift so the maximum page size
is 8TB (shift of 12 + 31) instead of 128MB (12 + 15). This saves space
in the chip's PBL (physical buffer list) when registering very large
memory regions.
r316580:
cxgbe/iw_cxgbe: Remove another bad cast. This should have been
included in r316571.
badger [Tue, 23 May 2017 12:40:50 +0000 (12:40 +0000)]
move p_sigqueue to the end of struct proc
In order to preserve KBI in stable branches, replace the existing
p_sigqueue slot with padding and move the expanded (as of r315949)
p_sigqueue to the end of the struct.
This is a repeat of r317529 (which concerned td_sigqueue in struct
thread) for p_sigqueue in struct proc.
Virtualbox modules (and possibly others) are affected without this fix.
rmacklem [Mon, 22 May 2017 21:52:06 +0000 (21:52 +0000)]
MFC: r317931
Fix mount_nfs so that it doesn't create mounttab entries for NFSv4 mounts.
The NFSv4 protocol doesn't use the Mount protocol, so it doesn't make sense
to add an entry for an NFSv4 mount to /var/db/mounttab. Also, r308871
modified umount so that it doesn't remove any entry created by mount_nfs.
rmacklem [Mon, 22 May 2017 19:57:20 +0000 (19:57 +0000)]
MFC: r317906
Fix the client side krpc from doing TCP reconnects for ERESTART from sosend().
When sosend() replies ERESTART in the client side krpc, it indicates that
the RPC message hasn't yet been sent and that the send queue is full or
locked while a signal is posted for the process.
Without this patch, this would result in a RPC_CANTSEND reply from
clnt_vc_call(), which would cause clnt_reconnect_call() to create a new
TCP transport connection. For most NFS servers, this wasn't a serious problem,
although it did imply retries of outstanding RPCs, which could possibly
have missed the DRC.
For an NFSv4.1 mount to AmazonEFS, this caused a serious problem, since
AmazonEFS often didn't retain the NFSv4.1 session and would reply with
NFS4ERR_BAD_SESSION. This implies to the client a crash/reboot which
requires open/lock state recovery.
Three options were considered to fix this:
- Return the ERESTART all the way up to the system call boundary and then
have the system call redone. This is fraught with risk, due to convoluted
code paths, asynchronous I/O RPCs etc. cperciva@ worked on this, but it
is still a work in prgress and may not be feasible.
- Set SB_NOINTR for the socket buffer. This fixes the problem, but makes
the sosend() completely non interruptible, which kib@ considered
inappropriate. It also would break forced dismount when a thread
was blocked in sosend().
- Modify the retry loop in clnt_vc_call(), so that it loops for this case
for up to 15sec. Testing showed that the sosend() usually succeeded by
the 2nd retry. The extreme case observed was 111 loop iterations, or
about 100msec of delay.
This third alternative is what is implemented in this patch, since the
change is:
- localized
- straightforward
- forced dismount is not broken by it.
This patch has been tested by cperciva@ extensively against AmazonEFS.
davidcs [Mon, 22 May 2017 19:36:26 +0000 (19:36 +0000)]
MFC r318382
1. Move Rx Processing to fp_taskqueue(). With this CPU utilization for
processing interrupts drops to around 1% for 100G and under 1% for
other speeds.
2. Use sysctls for TRACE_LRO_CNT and TRACE_TSO_PKT_LEN
3. remove unused mtx tx_lock
4. bind taskqueue kernel thread to the appropriate cpu core
5. when tx_ring is full, stop further transmits till at least 1/16th of
the Tx Ring is empty. In our case 1K entries. Also if there are
rx_pkts to process, put the taskqueue thread to sleep for 100ms,
before enabling interrupts.
6. Use rx_pkt_threshold of 128.
MSDOS and Windows GNU grep uses -u to mean "print byte offsets as if
running on an UNIX system." The option has no effect on systems that
do not use CRLF line endings.
hselasky [Mon, 22 May 2017 08:19:08 +0000 (08:19 +0000)]
MFC r318531:
mlx4: Use the CQ quota for SRIOV when creating completion EQs
When creating EQs to handle CQ completion events for the PF or for
VFs, we create enough EQE entries to handle completions for the max
number of CQs that can use that EQ.
When SRIOV is activated, the max number of CQs a VF (or the PF) can
obtain is its CQ quota (determined by the Hypervisor resource
tracker). Therefore, when creating an EQ, the number of EQE entries
that the VF should request for that EQ is the CQ quota value (and not
the total number of CQs available in the firmware).
Under SRIOV, the PF, also must use its CQ quota, because the resource
tracker also controls how many CQs the PF can obtain.
Using the firmware total CQs instead of the CQ quota when creating EQs
resulted wasting MTT entries, due to allocating more EQEs than were
needed.
ngie [Mon, 22 May 2017 06:06:48 +0000 (06:06 +0000)]
MFC r315793:
intro(3): fix markup
- Use `Em` with `.It` macro when referring to other libraries, instead of
`Xr`.
- Use `.Em` instead of `.Xr` when referring to libraries.
- Remove commented out lines.
hselasky [Fri, 19 May 2017 12:53:50 +0000 (12:53 +0000)]
MFC r313555:
Flexible and asymmetric allocation of EQs and MSI-X vectors for PF/VFs.
Previously, the mlx4 driver queried the firmware in order to get the
number of supported EQs. Under SRIOV, since this was done before the
driver notified the firmware how many VFs it actually needs, the
firmware had to take into account a worst case scenario and always
allocated four EQs per VF, where one was used for events while the
others were used for completions. Now, when the firmware supports the
asymmetric allocation scheme, denoted by exposing num_sys_eqs > 0 (-->
MLX4_DEV_CAP_FLAG2_SYS_EQS), we use the QUERY_FUNC command to query
the firmware before enabling SRIOV. Thus we can get more EQs and MSI-X
vectors per function. Moreover, when running in the new
firmware/driver mode, the limitation that the number of EQs should be
a power of two is lifted.
Obtained from: Linux (dual BSD/GPLv2 licensed)
Submitted by: Dexuan Cui @ microsoft . com
Differential Revision: https://reviews.freebsd.org/D8867
Sponsored by: Mellanox Technologies
hselasky [Fri, 19 May 2017 12:39:35 +0000 (12:39 +0000)]
MFC r313556:
Change mlx4 QP allocation scheme.
When using Blue-Flame, BF, the QPN overrides the VLAN, CV, and SV
fields in the WQE. Thus, BF may only be used for QPNs with bits 6,7
unset.
The current ethernet driver code reserves a TX QP range with 256b
alignment.
This is wrong because if there are more than 64 TX QPs in use, QPNs >=
base + 65 will have bits 6/7 set.
This problem is not specific for the Ethernet driver, any entity that
tries to reserve more than 64 BF-enabled QPs should fail. Also, using
ranges is not necessary here and is wasteful.
The new mechanism introduced here will support reservation for "Eth
QPs eligible for BF" for all drivers: bare-metal, multi-PF, and VFs
(when hypervisors support WC in VMs). The flow we use is:
1. In mlx4_en, allocate Tx QPs one by one instead of a range allocation,
and request "BF enabled QPs" if BF is supported for the function
2. In the ALLOC_RES FW command, change param1 to:
a. param1[23:0] - number of QPs
b. param1[31-24] - flags controlling QPs reservation
Bit 31 refers to Eth blueflame supported QPs. Those QPs must have bits
6 and 7 unset in order to be used in Ethernet.
Bits 24-30 of the flags are currently reserved.
When a function tries to allocate a QP, it states the required
attributes for this QP. Those attributes are considered "best-effort".
If an attribute, such as Ethernet BF enabled QP, is a must-have
attribute, the function has to check that attribute is supported
before trying to do the allocation.
In a lower layer of the code, mlx4_qp_reserve_range masks out the bits
which are unsupported. If SRIOV is used, the PF validates those
attributes and masks out unsupported attributes as well. In order to
notify VFs which attributes are supported, the VF uses QUERY_FUNC_CAP
command. This command's mailbox is filled by the PF, which notifies
which QP allocation attributes it supports.
Obtained from: Linux (dual BSD/GPLv2 licensed)
Submitted by: Dexuan Cui @ microsoft . com
Differential Revision: https://reviews.freebsd.org/D8868
Sponsored by: Mellanox Technologies