hselasky [Thu, 2 Aug 2018 08:55:19 +0000 (08:55 +0000)]
MFC r336450:
Do not inline transmit headers and use HW VLAN tagging if supported by mlx5en(4).
Query the minimal inline mode supported by the card.
When creating a send queue, cache the queried mode and optimize the transmit
if no inlining is required. In this case, we can avoid touching the headers
cache line and avoid dirtying several more lines by copying headers into
the send WQEs. Also, if no inline headers are used, hardware assists in
the VLAN tag framing.
hselasky [Thu, 2 Aug 2018 08:51:55 +0000 (08:51 +0000)]
MFC r336410:
Add module parameter to limit number of MSIX EQ vectors in mlx5en(4).
For setups having a large amount of PCI devices, it makes sense to limit the
number of MSIX vectors per PCI device, in order to avoid running out of IRQ
vectors.
hselasky [Thu, 2 Aug 2018 08:47:24 +0000 (08:47 +0000)]
MFC r336403:
Add context numbers for HW elements in mlx5en(4).
To access the data, set sysctl dev.mce.N.conf.debug_stats to 1.
This enables the sysctl node dev.mce.N.hw_ctx_debug. Its content is
the mapping of each channel' number to used receive queue and associated
completion queue, set of the transmit queues numbers and corresponding
completion queues.
hselasky [Thu, 2 Aug 2018 08:43:54 +0000 (08:43 +0000)]
MFC r336398:
Make sure the state variable is set atomically instead of using a mutex in mlx5core.
Device detach and setting error state may deadlock over the interface mutex
like this:
a) Detach code in mlx5en waits until error state is set while the interface
mutex is locked.
b) The set error handler needs to lock the interface mutex before it can
set the error state.
The solution is to use atomics to set the error state.
hselasky [Thu, 2 Aug 2018 08:37:44 +0000 (08:37 +0000)]
MFC r336393:
Use static device naming instead of dynamic one in mlx5ib.
When resetting mlx5core instances it can happen that the order of attach and
detach for mlx5ib instances is changed. Take the unit number for mlx5_%d from
the parent PCI device, similarly to what is done in mlx5en(4), so that there
is a direct relationship between mce<N> and mlx5_<N>.
hselasky [Thu, 2 Aug 2018 08:35:32 +0000 (08:35 +0000)]
MFC r336964:
Only NULL check the VNET pointer when VIMAGE is enabled in ibcore.
Else a NULL VNET pointer should be ignored. This fixes address resolving
when VIMAGE is disabled.
hselasky [Thu, 2 Aug 2018 08:30:44 +0000 (08:30 +0000)]
MFC r336388:
Add support for RoCEv2 multicast in ibcore.
When creating address handle from multicast GID, set MAC according to
the appropriate formula instead of searching for it in the GID table:
- For IPv4 multicast GID use ip_eth_mc_map().
- For IPv6 multicast GID use ipv6_eth_mc_map().
hselasky [Thu, 2 Aug 2018 08:29:40 +0000 (08:29 +0000)]
MFC r336387:
Honor return status of ib_init_ah_from_mcmember() in ibcore.
The return status of ib_init_ah_from_mcmember() is ignored by
cma_ib_mc_handler(). Honor it and return error event if ah attribute
initialization failed.
hselasky [Thu, 2 Aug 2018 08:25:48 +0000 (08:25 +0000)]
MFC r336383:
Check port number supplied by user verbs cmds in ibcore.
The ib_uverbs_create_ah() ind ib_uverbs_modify_qp() calls receive
the port number from user input as part of its attributes and assumes
it is valid. Down on the stack, that parameter is used to access kernel
data structures. If the value is invalid, the kernel accesses memory
it should not. To prevent this, verify the port number before using it.
hselasky [Thu, 2 Aug 2018 08:23:54 +0000 (08:23 +0000)]
MFC r336381:
Fix kernel crash during fail to initialize device in ibcore.
This patch fixes the kernel crash that occurs during ib_dealloc_device()
called due to provider driver fails with an error after
ib_alloc_device() and before it can register using ib_register_device().
This crashed seen in tha lab as below which can occur with any IB device
which fails to perform its device initialization before invoking
ib_register_device().
This patch avoids touching cache and port immutable structures if device
is not yet initialized.
It also releases related memory when cache and port immutable data
structure initialization fails during register_device() state.
hselasky [Thu, 2 Aug 2018 08:22:53 +0000 (08:22 +0000)]
MFC r336380:
Check AF family prior resolving address and introduce safer rdma_addr_size() variants in ibcore.
Garbage supplied by user will cause to UCMA module provide zero
memory size for memcpy(), because it wasn't checked, it will
produce unpredictable results in rdma_resolve_addr().
There are several places in the ucma ABI where userspace can pass in a
sockaddr but set the address family to AF_IB. When that happens,
rdma_addr_size() will return a size bigger than sizeof struct sockaddr_in6,
and the ucma kernel code might end up copying past the end of a buffer
not sized for a struct sockaddr_ib.
Fix this by introducing new variants
int rdma_addr_size_in6(struct sockaddr_in6 *addr);
int rdma_addr_size_kss(struct __kernel_sockaddr_storage *addr);
that are type-safe for the types used in the ucma ABI and return 0 if the
size computed is bigger than the size of the type passed in. We can use
these new variants to check what size userspace has passed in before
copying any addresses.
hselasky [Thu, 2 Aug 2018 08:21:55 +0000 (08:21 +0000)]
MFC r336379:
Check for a cm_id->device in all user calls that need it in ibcore.
This was done by auditing all callers of ucma_get_ctx and switching the
ones that unconditionally touch ->device to ucma_get_ctx_dev. This covers
a little less than half of the call sites.
The 11 remaining call sites to ucma_get_ctx() were manually audited.
hselasky [Thu, 2 Aug 2018 08:21:04 +0000 (08:21 +0000)]
MFC r336377:
Fix kernel panic while using XRC_TGT QP type in ibcore.
Attempt to modify XRC_TGT QP type from the user space (ibv_xsrq_pingpong
invocation) will trigger the following kernel panic. It is caused by the
fact that such QPs missed uobject initialization.
hselasky [Thu, 2 Aug 2018 08:20:11 +0000 (08:20 +0000)]
MFC r336376:
Fix NULL pointer dereference during device removal in ibcore.
As part of ib_uverbs_remove_one which might be triggered upon
reset flow, we trigger IB_EVENT_DEVICE_FATAL event to userspace
application.
If device was removed after uverbs fd was opened but before
ib_uverbs_get_context was called, the event file will be accessed
before it was allocated, result in NULL pointer dereference:
hselasky [Thu, 2 Aug 2018 08:17:09 +0000 (08:17 +0000)]
MFC r336373:
Ensure that CM_ID exists prior to access it in ibcore.
Prior to access UCMA commands, the context should be initialized
and connected to CM_ID with ucma_create_id(). In case user skips
this step, he can provide non-valid ctx without CM_ID and cause
to multiple NULL dereferences.
Also there are situations where the create_id can be raced with
other user access, ensure that the context is only shared to
other threads once it is fully initialized to avoid the races.
hselasky [Thu, 2 Aug 2018 08:15:05 +0000 (08:15 +0000)]
MFC r336372:
Add support for prio-tagged traffic for RDMA in ibcore.
When receiving a PCP change all GID entries are reloaded.
This ensures the relevant GID entries use prio tagging,
by setting VLAN present and VLAN ID to zero.
The priority for prio tagged traffic is set using the regular
rdma_set_service_type() function.
Fake the real network device to have a VLAN ID of zero
when prio tagging is enabled. This is logic is hidden inside
the rdma_vlan_dev_vlan_id() function which must always be used
to retrieve the VLAN ID throughout all of ibcore and the
infiniband network drivers.
The VLAN presence information then propagates through all
of ibcore and so incoming connections will have the VLAN
bit set. The incoming VLAN ID is then checked against the
return value of rdma_vlan_dev_vlan_id().
hselasky [Thu, 2 Aug 2018 08:12:52 +0000 (08:12 +0000)]
MFC r336370:
Set RoCEv2 MGID according to spec in ibcore.
RoCEv2 Annex states that for RoCEv2 over IPv4, the corresponding
IPv4 address is encoded into the GID according to the following rule:
GID= :ffff:<IPv4 address>
Remove the 0xff0e prefix for RoCEv2 packets with IPv4 and leave it
zeroed and change rdma_is_multicast_addr() to consider the new logic.
hselasky [Thu, 2 Aug 2018 08:12:01 +0000 (08:12 +0000)]
MFC r336369:
For multicast functions in ibcore, verify that LIDs are multicast LIDs.
The Infiniband spec defines "A multicast address is defined by a
MGID and a MLID" (section 10.5).
Add check to verify that the MLID value is in the correct address
range.
RoCE Annex (A16.9.10/11) declares that during attach (detach) QP to a
multicast group, if the QP is associated with a RoCE port, the
multicast group MLID is unused and is ignored.
During attach or detach multicast, when the QP is associated with a
port, it is enough to check the port's link layer and validate the
LID only if it is Infiniband. Otherwise, avoid validating the
multicast LID.
hselasky [Thu, 2 Aug 2018 08:10:54 +0000 (08:10 +0000)]
MFC r336368:
Fix for RDMA loopback over VLAN in ibcore.
Implement a more generic solution for detecting loopback.
The problem was that the default netdevice was resolved
for loopback also when VLAN was used. Use real network
device instead of loopback device for bound device
interface.
How to test:
ucmatose -b 127.0.0.1 -p 20090
ucmatose -s 5.6.5.1 -p 20090
Note that RDMA treats the IPv4 and IPv6 loopback
addresses like any address.
hselasky [Thu, 2 Aug 2018 08:08:02 +0000 (08:08 +0000)]
MFC r336366:
If the MGID/MLID pair is not on the list return an error in ibcore.
A list of MGID/MLID pairs is built when doing a multicast attach. When
the multicast detach is called, the list is searched, and regardless of
the search outcome, the driver detach is called.
If an MGID/MLID pair is not on the list, driver detach should not be
called, and an error should be returned. Calling the driver without
removing an MGID/MLID pair from the list can leave the core and driver
out of sync.
hselasky [Thu, 2 Aug 2018 08:07:10 +0000 (08:07 +0000)]
MFC r336365:
Add lock to multicast handlers in ibcore.
When two handlers used the same object in the old schema, we blocked
the process in the kernel. The new schema just returns -EBUSY. This
could lead to different behaviour in applications between the old
schema and the new schema. In most cases, using such handlers
concurrently could lead to crashing the process. For example, if
thread A destroys a QP and thread B modifies it, we could have the
destruction happens before the modification. In this case, we are
accessing freed memory which could lead to crashing the process.
This is true for most cases. However, attaching and detaching
a multicast address from QP concurrently is safe. Therefore, we
preserve the original behaviour by adding a lock there.
hselasky [Thu, 2 Aug 2018 08:06:17 +0000 (08:06 +0000)]
MFC r336364:
Only update source address when resolving is successful in ibcore.
When resolving an IP address in ibcore, only update the source address
upon normal completion. The ibcore address resolve function does not
care about the scope ID value of the IPv6 link-local addresses and expects
this information has already been extracted into the bound_dev_if field.
Because the same IPv6 link-local address can exist on multiple interfaces
the ibcore address resolver gets confused and returns ENETUNREACH.
Instead of updating both source address and bound_dev_if just keep the
address set to any address until resolving completes. For the sake of code
symmetry a similar change has been applied to the IPv4 address resolve path.
hselasky [Thu, 2 Aug 2018 08:05:20 +0000 (08:05 +0000)]
MFC r336363:
Process address resolve requests at least one time per second in ibcore.
When setting a large address resolve timeout it was observed that the
address resolving would succeed at the timeout and not when the address
was available. Make sure the address resolving requests are processed no
slower than one time every second.
While at it use "int" for jiffies instead of "unsigned long" to match
FreeBSD ticks.
rmacklem [Thu, 2 Aug 2018 03:13:59 +0000 (03:13 +0000)]
MFC: r336357
Modify the reasons for not issuing a delegation in the NFSv4.1 server.
The ESXi NFSv4.1 client will generate warning messages when the reason for
not issuing a delegation is two. Two refers to a resource limit and I do
not see why it would be considered invalid. However it probably was not the
best choice of reason for not issuing a delegation.
This patch changes the reasons used to ones that the ESXi client doesn't
complain about. This change does not affect the FreeBSD client and does
not appear to affect behaviour of the Linux NFSv4.1 client.
RFC5661 defines these "reasons" but does not give any guidance w.r.t. which
ones are more appropriate to return to a client.
MFC: r336215
Ignore the cookie verifier for NFSv4.1 when the cookie is 0.
RFC5661 states that the cookie verifier should be 0 when the cookie is 0.
However, the wording is somewhat unclear and a recent discussion on the
nfsv4@ietf.org mailing list indicated that the NFSv4 server should ignore
the cookie verifier's value when the dirctory offset cookie is 0.
This patch deletes the check for this that would return NFSERR_BAD_COOKIE
when the verifier was not 0.
This was found during testing of the ESXi client against the NFSv4.1 server.
r336662: Deprecate jedec_ts(4) and point users to jedec_dimm(4) instead
jedec_dimm(4) is a superset of the functionality of jedec_ts(4). Mark
jedec_ts(4) as removed in FreeBSD 12, and include a pointer to the migration
instructions in the jedec_dimm(4) manpage, in both the jedec_ts(4) code and
the jedec_ts(4) manpage. Add a note to the jedec_dimm(4) manpage about the
fact that it is a superset of jedec_ts(4).
MFC r336683:
Extend ranges of the critical sections to ensure that context switch
code never sees FPU pcb flags not consistent with the hardware state.
r307967: Allow config to be compiled from another source directory, such as
one for building tools. This boils down to replacing ${.CURDIR} with
${SRCDIR}, where the latter is the directory in which this makefile
lives.
Also allow overriding where file2c comes from using ${FILE2C}.
r324082: Typo in filename in comment.
r325955: Fix 'local' to not look in the source tree for the file.
Usually 'local' is used along with other rules such as 'no-implicit-rule' or
'dependency' which avoids this problem. It's possible to need to use
'local' while relying on the default rules though for a file which is not in
the source tree nor generated in the kernel.
MFC: r335866
Fix the server side krpc so that the kernel nfsd threads terminate.
Occationally the kernel nfsd threads would not terminate when a SIGKILL
was posted for the kernel process (called nfsd (slave)). When this occurred,
the thread associated with the process (called "ismaster") had returned from
svc_run_internal() and was sleeping waiting for the other threads to terminate.
The other threads (created by kthread_start()) were still in svc_run_internal()
handling NFS RPCs.
The only way this could occur is for the "ismaster" thread to return from
svc_run_internal() without having called svc_exit().
There was only one place in the code where this could happen and this patch
stops that from happening.
Since the problem is intermittent, I cannot be sure if this has fixed the
problem, but I have not seen an occurrence of the problem with this patch
applied.
MFC: r334966
Add a couple of safety belt checks to the NFSv4.1 client related to sessions.
There were a couple of cases in newnfs_request() that it assumed that it
was an NFSv4.1 mount with a session. This should always be the case when
a Sequence operation is in the reply or the server replies NFSERR_BADSESSION.
However, if a server was broken and sent an erroneous reply, these safety
belt checks should avoid trouble.
The one check required a small tweak to nfsmnt_mdssession() so that it
returns NULL when there is no session instead of the offset of the field
in the structure (0x8 for i386).
This patch should have no effect on normal operation of the client.
Found by inspection during pNFS server development.
MFC: r334492
Add the BindConnectiontoSession operation to the NFSv4.1 server.
Under some fairly unusual circumstances, the Linux NFSv4.1 client is
doing a BindConnectiontoSession operation for TCP connections.
It is also used by the ESXi6.5 NFSv4.1 client.
This patch adds this operation to the NFSv4.1 server.
MFC r308296 (by scottl):
asc/ascq 44/0 is typically a non-transient, permanent error (at least until
the components are reset). Therefore retries are pointless. This is very
visible in SATL systems, for example an LSI SAS controller and a SATA HDD/SSD.
dim [Fri, 27 Jul 2018 17:39:36 +0000 (17:39 +0000)]
MFC r327400 (by eadler):
cacos(3): correct spelling of 'I'
In some cases we had 'i' instead of 'I'.
PR: 195517
Submitted by: stephen
MFC r329259 (by eadler):
msun: signed overflow in atan2
As a component of atan2(y, x), the case of x == 1.0 is farmed out to
atan(y). The current implementation of this comparison is vulnerable
to signed integer underflow (that is, undefined behavior), and it's
performed in a somewhat more complicated way than it need be. Change
it to not be quite so cute, rather directly comparing the high/low
bits of x to the specific IEEE-754 bit pattern that encodes 1.0.
Note that while there are three different e_atan* files in the
relevant directory, only this one needs fixing. e_atan2f.c already
compares against the full bit pattern encoding 1.0f, while
e_atan2l.cuses bitwise-ands/ors/nots and so doesn't require a change.
clog.3, complex.3: Fix typos and igor style issues
PR: 228783
Reported by: Karsten <freebsd-bugzilla AT kkoenig.net>
MFC r336299 (by mmacy):
msun: add ld80/ld128 powl, cpow, cpowf, cpowl from openbsd
This corresponds to the latest status (hasn't changed in 9+
years) from openbsd of ld80/ld128 powl, and source cpowf, cpow,
cpowl (the complex power functions for float complex, double
complex, and long double complex) which are required for C99
compliance and were missing from FreeBSD. Also required for
some numerical codes using complex numbered Hamiltonians.
Thanks to jhb for tracking down the issue with making
weak_reference compile on powerpc.
When asked to review, bde said "I don't like it" - but
provided no actionable feedback or superior implementations.
Recommit r336497: Fix powl, cpow, cpowf, and cpowl imports from OpenBSD
This is a follow-up to r336299.
* lib/msun/Makefile:
. Remove polevll.c
* lib/msun/ld80/e_powl.c:
. Copy contents of polevll.c to here. This is the only consumer of
these functions. Make functions 'static inline'.
. Make reducl a 'static inline' function.
* lib/msun/man/exp.3:
. Remove BUGS section that no longer applies.
* lib/msun/src/math_private.h:
. Remove prototypes of __p1evll() and __polevll()
* lib/msun/src/s_cpow.c:
* lib/msun/src/s_cpowf.c:
* lib/msun/src/s_cpowl.c
. Include math_private.h.
. Use the CMPLX macro from either C99 or math_private.h (depends on
compiler support) instead of the problematic use of complex I.
Submitted by: Steve Kargl <sgk@troutmask.apl.washington.edu>
PR: 229876
Following r336726, explicitly invoke the 'obj' target when
setting BOOTFILES. On stable/11, without this change, the
.OBJDIR expands to /usr/src/stand instead /usr/obj/<foo>.
This is a piece of duct tape for now until I figure out why
the correct directory is not being located.
r336599:
release: Add arm_install_boot to install the commit boot bits
This reduce the per-board arm_install_uboot to just install u-boot.
While here remove the installation of rpi.dtb and rpi2.dtb as we load
them from the UFS partition via ubldr.
The bhyve(8) exit status indicates how the VM was terminated:
0 rebooted
1 powered off
2 halted
3 triple fault
The problem is when we have wrappers around bhyve that parses the exit
error code and gets an exit(1) for an error but interprets it as "powered off".
So to mitigate this issue and makes it less error prone for third part
applications, I have added a new exit code 4 that is "exited due to an error".
For now the bhyve(8) exit status are:
0 rebooted
1 powered off
2 halted
3 triple fault
4 exited due to an error
Reviewed by: @jhb
Sponsored by: iXsystems Inc.
Differential Revision: https://reviews.freebsd.org/D16161
dab [Mon, 23 Jul 2018 18:35:58 +0000 (18:35 +0000)]
MFC r336457:
Make the definition of struct kevent in event.h match what the man page for kevent(2) says.
This is a trivial comment-only fix. The man page for kevent(2) gives
the definition of struct kevent, including a comment on each
field. The actual definition in sys/event.h omitted the comments on
some fields. Add the comments in. Not only does this make the man page
and include file agree, but the comments are useful in and of
themselves.
When shutting down a vnet jail pf_shutdown() clears the remaining states, which
through pf_clear_states() calls pf_unlink_state().
For synproxy states pf_unlink_state() will send a TCP RST, which eventually
tries to schedule the pf swi in pf_send(). This means we can't remove the
software interrupt until after pf_shutdown().
Synproxy was accidentally broken by r335569. The 'return (action)' must be
executed for every non-PF_PASS result, but the error packet (TCP RST or ICMP
error) should only be sent if the packet was dropped (i.e. PF_DROP) and the
return flag is set.
PR: 229477
Submitted by: Andre Albsmeier <mail AT fbsd.e4m.org>