arm64 support for pmu-events

8cc3815f:
hwpmc_arm64: accept raw event codes for PMC_OP_PMCALLOCATE

Make it possible to specify event codes without an offset of
PMC_EV_ARMV8_FIRST, by setting a machine-dependent flag. This is
required to make use of event definitions from pmu-events.

Reviewed by: ray (slightly earlier version)
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D30602

28dd6730:
libpmc: enable pmu_utils on arm64

This allows supported libpmc to query/select from the pmu-events table,
which may have a more complete set of events than what we define
manually. A future update to these definitions should greatly improve
this support. The alias table is empty for now, until this future import
is complete.

Add the Foundation's copyright for recent work on this file.

Reviewed by: ray (slightly earlier version)
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D30603

27ea55fc:
libpmc/hwpmc: fix issues with arm64 pmu-events support

Due to a mis-merge, the changes committed to libpmc never called
pmu_parse_event(), or set pm->pm_ev. However, this field shouldn't be
used to carry the actual pmc event code anyway, as it is expected to
contain the index into the pmu event array (otherwise, it breaks event
name lookup in pmclog_get_event()). Add a new MD field,
pm_md.pm_md_config, to pass the raw event code to arm64_allocate_pmc().

Additionally, the change made to pmc_md_op_pmcallocate was incorrect, as
this is a union, not a struct. Restore the proper padding size.

Reviewed by: luporl, ray, andrew
Fixes: 28dd6730a5d6 ("libpmc: enable pmu_utils on arm64")
Fixes: 8cc3815f02be ("hwpmc_arm64: accept raw event codes...")
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D31221

(cherry picked from commit 8cc3815f02be9fa2a96e47713ad989e6d787e12a)
(cherry picked from commit 28dd6730a5d6bc73aca4c015c0ff875a9def25ac)
(cherry picked from commit 27ea55fc655b0081f760a34ff5dd5526ba02a0fb)
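The point about unions and padding in the last commit can be illustrated with a minimal sketch. All type and field names below are hypothetical stand-ins, not the real hwpmc definitions; the sketch only shows that a pad member sized for the whole union does not need to shrink when one of the other members grows.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical machine-dependent argument blocks; not the real hwpmc types. */
struct md_args_a {
	uint32_t	a_flags;
};

struct md_args_b {
	uint64_t	b_config;	/* stands in for a newly added field */
	uint32_t	b_flags;
};

/*
 * Every member of a union starts at offset 0, so the pad member alone
 * determines the overall size as long as it is the largest member.
 * Adding b_config above does not call for a smaller pad, as it would
 * if this were a struct with the pad placed after the other fields.
 */
union md_op_allocate {
	struct md_args_a	md_a;
	struct md_args_b	md_b;
	uint64_t		md_pad[4];
};

static size_t
md_op_allocate_size(void)
{
	return (sizeof(union md_op_allocate));
}
```

With these definitions the union stays at four 64-bit words regardless of which argument block is in use.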

hwpmc_arm64: fill kern.hwpmc.cpuid

This will be used to detect supported pmu events. The expected format is
the MIDR register with the revision and variant fields masked. See also:
lib/libpmc/pmu-events/arch/arm64/mapfile.csv.

MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D30601

(cherry picked from commit 5867cccdc49df3e7eb3147d6516b488dd8384afe)
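A minimal sketch of the masking described above, assuming the standard MIDR_EL1 layout (revision in bits [3:0], variant in bits [23:20]). The macro and function names are illustrative and are not taken from the kernel sources.

```c
#include <assert.h>
#include <stdint.h>

/* MIDR_EL1 field positions; the names here are illustrative only. */
#define	MIDR_REVISION_MASK	UINT64_C(0x0000000f)	/* bits [3:0] */
#define	MIDR_VARIANT_MASK	UINT64_C(0x00f00000)	/* bits [23:20] */

/* Clear the revision and variant fields, keeping implementer and part. */
static uint64_t
midr_to_cpuid(uint64_t midr)
{
	return (midr & ~(MIDR_REVISION_MASK | MIDR_VARIANT_MASK));
}
```

For instance, a MIDR value of 0x410fd034 would be reduced to 0x410fd030, so every revision of the same part matches a single table entry.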

hwpmc_arm64.c: fix return style

In accordance with style(9).

MFC after: 3 days
Sponsored by: The FreeBSD Foundation

(cherry picked from commit 2129c8f6771a9a33253a1fe2d4e9d3494bc77f10)

libpmc: make libpmc_pmu_utils.c more amenable to porting

The current version has every function stubbed out for !x86. Only two
functions (pmu_alias_get() and pmc_pmu_pmcallocate()) are really
platform-dependent, so reduce the width of the ifdefs and remove some of
the stubs.

Reviewed by: ray
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D30532

(cherry picked from commit 0024f1aa7720d5f4f587a6c5911fc5238195ae49)

libpmc: limit pmu-events to 64-bit powerpc

Although currently unused, there are only pmu event definitions for
POWER8 and POWER9. There is no sense in building these on 32-bit
platforms.

Sponsored by: The FreeBSD Foundation

(cherry picked from commit 507d68984a010dab0c3ecc5477c36526c3a7fa48)

libpmc: use $MACHINE_CPUARCH

This is preferred over $MACHINE_ARCH for these types of checks, although
it makes no difference for amd64 or i386. No functional change intended.

Sponsored by: The FreeBSD Foundation

(cherry picked from commit 3864da302af34853ddb2c33a42de5668a0d68cdd)

pmccontrol: improve -L with pmu-events

Check if the pmu utils are supported rather than carrying a
machine-dependent #ifdef.

Reviewed by: gnn, ray, emaste
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D30526

(cherry picked from commit 167cdaa7e30093215a753d4f788d921b1f7c1474)

libpmc: eliminate pmc_pmu_stat_mode()

There is a single consumer, the pmc utility, that clearly has knowledge
of which counters it is expecting. Remove this function and have it
use common counter aliases instead.

Reviewed by: gnn
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D30528

(cherry picked from commit ec66cc955b629e614cf493bf168048de033f6a2c)

libpmc: remove pe->alias

It has never been a part of upstream's struct pmu_event. The jevents
utility will not fill this field, so remove it.

Reviewed by: gnn
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D30530

(cherry picked from commit 0c915023dbb7000cd30bb768eb84f6dc757adcc5)

libpmc: always generate libpmc_events.c

The jevents build tool will create an empty table if it doesn't find any
events, so we can remove the extra $MACHINE_CPUARCH checks.

Reviewed by: gnn, ray, emaste
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D30531

(cherry picked from commit 689c7e7975cdee38ca6fd60ad3372d55c43c948c)

libpmc: remove unused 'isfixed' variable

Reviewed by: gnn, emaste
MFC after: 5 days
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D30529

(cherry picked from commit 0092642f863946ee1edc88fa634887d7c8a54656)

libpmc: fix "instructions" alias on Intel

The typo prevents the counter from being allocated.

This fixes e.g. pmcstat -s instructions sleep 5

Reviewed by: mizhka, gnn, ray, emaste
MFC after: 5 days
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D30527

(cherry picked from commit bc1a6a9d692a1f827514144b6bce0654a8be4f4d)

hwpmc: fix PMC_CPU_LAST

It is unused, but incorrect.

MFC after: 3 days
Sponsored by: The FreeBSD Foundation

(cherry picked from commit f59127dac5ca0be3648ecc0a031a21e472afb133)

libpmc: fall-back to kernel tables if pmu-events fails

On x86, the pmu_events table is the source of truth for finding
supported events. However, events not found there may still be present
in the kernel's static event tables. For example, the pmc.soft(3) events
will never be available from pmu-events.

Update pmc_allocate() to search the legacy event tables if
pmc_pmu_pmcallocate() fails to return a result. This allows both event
sources to be consulted before giving up, thus restoring pmc.soft(3) and
pmc.tsc(3) on x86.

Reviewed by: emaste
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D30216

(cherry picked from commit dfdc57e8aa8ba4b4e4484f736e8c7645ab69b54a)
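The two-stage lookup described above can be sketched as follows. This is not the libpmc code: the tables, event codes, and function names are hypothetical, and the real lookup is more involved. It only shows the ordering, with the pmu-events table consulted first and the legacy table used as a fallback.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

struct event {
	const char	*name;
	int		 code;
};

/* Hypothetical tables with made-up codes, for illustration only. */
static const struct event pmu_table[] = {
	{ "inst_retired.any", 0xc0 },
};
static const struct event legacy_table[] = {
	{ "CLOCK.HARD", 0x01 },
};

static int
table_lookup(const struct event *tbl, size_t n, const char *name, int *code)
{
	for (size_t i = 0; i < n; i++) {
		if (strcmp(tbl[i].name, name) == 0) {
			*code = tbl[i].code;
			return (0);
		}
	}
	return (-1);
}

/* Try the pmu-events table first, then fall back to the legacy table. */
static int
event_lookup(const char *name, int *code)
{
	if (table_lookup(pmu_table,
	    sizeof(pmu_table) / sizeof(pmu_table[0]), name, code) == 0)
		return (0);
	return (table_lookup(legacy_table,
	    sizeof(legacy_table) / sizeof(legacy_table[0]), name, code));
}
```

An event that exists only in the legacy table is still found, and the lookup fails only when both sources have been searched.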

libpmc: remove PMC_MDEP_TABLE logic

This logic was added for handling some of the complicated relationships
between events and x86 CPU models. Since that logic has been mostly
removed from libpmc(3) in favor of pmu-events, this no longer serves
much of a purpose. Mapping CPU types to event tables is already handled
by the switch statement in pmc_init().

Reviewed by: ray, emaste
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D30196

(cherry picked from commit da13ef6aa0565c8d79326bba5606671062033bbf)

libpmc: remove unused PMC_MDEP_INIT_INTEL_V2

All uses of this macro were removed in e92a1350b50e. Remove
cpu_has_iaf_counters as well.

Reviewed by: ray, emaste
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D30195

(cherry picked from commit 4d8d74a4f52efd078bd6298e0adbdd476ed70de9)

arm64: Fix finding the pmc event ID

The lower pmc event bits were masked off to find the PMC event ID.
This doesn't work when there are more events. Switch it to use the
offset relative to the first event while also checking the ID is
in the expected range.

Reviewed by: gnn, ray
Sponsored by: Innovate UK
Differential Revision: https://reviews.freebsd.org/D29600

(cherry picked from commit 24b2f4ea49229618c5608846acfc10be2eb0d567)
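A minimal sketch of the difference between masking and using a range-checked offset. The constants below are placeholder values chosen for the example, not the real PMC_EV_ARMV8_* definitions, and the function name is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Placeholder values; the real constants live in the kernel headers. */
#define	EV_ARMV8_FIRST	0x14100u
#define	EV_ARMV8_LAST	(EV_ARMV8_FIRST + 0x3ffu)

/*
 * Derive the hardware event ID as an offset from the first event,
 * rejecting anything outside the expected range.  Masking the low
 * bits instead (e.g. pe & 0xff) would alias IDs above 0xff.
 */
static int
armv8_event_id(uint32_t pe, uint32_t *idp)
{
	if (pe < EV_ARMV8_FIRST || pe > EV_ARMV8_LAST)
		return (-1);
	*idp = pe - EV_ARMV8_FIRST;
	return (0);
}
```

With these placeholder values, event EV_ARMV8_FIRST + 0x123 yields ID 0x123, whereas an 8-bit mask would have produced 0x23.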

eli: Zero pad bytes that arise when certain auth algorithms are used

When authentication is configured, GELI ensures that the amount of data
per sector is a multiple of 16 bytes.  This is done in
eli_metadata_softc().  When the digest size is not a multiple of 16
bytes, this leaves some extra pad bytes at the end of every sector, and
they were not being zeroed before being written to disk.  In particular,
this happens with the HMAC/SHA1, HMAC/RIPEMD160 and HMAC/SHA384 data
authentication algorithms.

This change ensures that they are zeroed before being written to disk.

Reported by: KMSAN
Reviewed by: delphij, asomers
Sponsored by: The FreeBSD Foundation

(cherry picked from commit 0fcafe8516d170852aa73f029a6a28bed1e29292)
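The arithmetic behind the pad bytes can be sketched as below. This assumes a layout in which each on-disk sector holds the digest followed by the data, with the data length rounded down to a multiple of 16. The function names are hypothetical and the sketch is not the GELI code.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Data bytes per sector: what is left after the digest, rounded down to 16. */
static size_t
auth_data_per_sector(size_t sectorsize, size_t digestlen)
{
	size_t n = sectorsize - digestlen;

	return (n - n % 16);
}

/*
 * Zero whatever lies between the end of the data and the end of the
 * sector, and return the number of pad bytes that were cleared.
 */
static size_t
auth_zero_pad(unsigned char *sector, size_t sectorsize, size_t digestlen)
{
	size_t used = digestlen + auth_data_per_sector(sectorsize, digestlen);

	memset(sector + used, 0, sectorsize - used);
	return (sectorsize - used);
}
```

Under these assumptions, a 512-byte sector with a 20-byte digest carries 480 bytes of data and leaves 12 trailing pad bytes to be cleared.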

Assert that valid PTEs are not overwritten when installing a new PTP

amd64 and 32-bit ARM already had assertions to this effect. Add them to
other pmaps.

Reviewed by: alc, kib
Sponsored by: The FreeBSD Foundation

(cherry picked from commit b092c58c006fd5c5c051b30ab097f5c1655e0d53)

pf: Constify tag name and queue name helper functions

No functional change intended.

Reviewed by: kp
Sponsored by: The FreeBSD Foundation

(cherry picked from commit 81f95106b8c14d7ce935864b4705d54a8e437ed6)

mmc: Drain the intrhook in mmc_detach()

Buggy SD card drivers may attach and detach a mmc(4) driver instance in
quick succession. In this case mmc(4) must disestablish its intrhook
callback during detach. Thus, this change adds a call to
config_intrhook_drain(), which blocks or does nothing if the intrhook is
running or has already run (the SD card was plugged in), and
disestablishes the hook if it hasn't run yet (the SD card was not
plugged in).

PR: 254373
Reviewed by: imp, manu, markj
Sponsored by: The FreeBSD Foundation

(cherry picked from commit d5341d72a11be200e536ac7d8967449a3f521792)

man9: Update guarantees for userspace fetch/store operations

Platforms may either silently handle unaligned accesses or return an
error. Atomicity is not guaranteed in this case, however.

Reviewed by: kib
Sponsored by: The FreeBSD Foundation

(cherry picked from commit fd5827b1785a9363abe601cbd9d8558b0fc8d3e8)

man9: Remove stray .In macros

Fixes: 9c11d8d483c4

(cherry picked from commit 18c696c00159d1071ed17e3bed1863e412dd5cb5)

nfsclient: Avoid copying uninitialized bytes into statfs

hst will be nul-terminated but the remaining space in the buffer is left
uninitialized. Avoid copying the entire buffer to ensure that
uninitialized bytes are not leaked via statfs(2).

Reported by: KMSAN
Reviewed by: rmacklem
Sponsored by: The FreeBSD Foundation

(cherry picked from commit 44de1834b53f4654cc2f6d76406f5705f8167927)
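A minimal sketch of the general technique, using a hypothetical helper rather than the NFS client code: copy only the string and zero-fill the rest of the destination, so that stale bytes after the terminator in the source buffer never reach the caller.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Copy a nul-terminated name into a fixed-size field.  Only the string
 * itself is copied; the remainder of dst is zero-filled, so whatever
 * follows the terminator in the source buffer is never propagated.
 */
static void
copy_name(char *dst, size_t dstlen, const char *src)
{
	size_t len = strlen(src);

	if (len >= dstlen)
		len = dstlen - 1;	/* truncate, keeping room for the nul */
	memset(dst, 0, dstlen);
	memcpy(dst, src, len);
}
```

A plain copy of the entire source buffer would instead carry its uninitialized tail into the destination.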

arm64: Print CPU features slightly earlier

In particular, print them before we release APs. Otherwise they tend to
get mixed with other kernel messages.

Reviewed by: andrew, manu
Sponsored by: The FreeBSD Foundation

(cherry picked from commit fa46a46a82498532f547be5f6b5a94d05f53b0be)

bsdinstall: Only show menu if there are more items to be installed

Obtained from: Rubicon Communications, LLC ("Netgate")
Sponsored by: Rubicon Communications, LLC ("Netgate")

(cherry picked from commit 95f0da5be1e3456c930f5f9538cbc099c65f2014)

Fix the 'linux' rc script on aarch64.

Previously it would try to load linux.ko instead of linux64.ko
and fail. While here, don't try to match 'linuxaout'; even if
implemented, it's the same module as `linuxelf`.

Reviewed By: emaste
Sponsored By: EPSRC
Differential Revision: https://reviews.freebsd.org/D29288

(cherry picked from commit e026f4243c5a65d19a63d98f55be17e8294a1e87)

freebsd-tips: Use a fetchable URL as example

MFC after: 3 days

(cherry picked from commit ffe6afc4f0121f1909f2fa88694228f771dd3fcb)

freebsd-tips: Fix the description of fetch(1) to match the command

Reported by: jrtc27
MFC with: ffe6afc4f0121f1909f2fa88694228f771dd3fcb

(cherry picked from commit 167897510919a76740eca0d79713abbd088660fe)

man7: Update FreeBSD.org URLs

MFC after: 3 days

(cherry picked from commit 8dfb00245701a4f9290cd3a24e9bdcafa55a075b)

share/misc: Update FreeBSD.org URLs

MFC after: 3 days

(cherry picked from commit 89c0640c7090d5dfbe46adbe50186399923f360f)

freebsd-update: Update URL of supported platforms information

MFC after: 3 days

(cherry picked from commit 86d0d3aadb48a9a917167078ed197a061179fa4f)

[skip ci] correct a few SPDX license tags

These were all incorrectly labeled as 2-clause BSD licenses by a
semi-automated process, when in fact they are 3-clause.

Discussed with: pfg, imp
Sponsored by: Axcient

(cherry picked from commit 3874c0abb0afaea6adc24ac96dc9dc5043f2b69e)

fusefs: correctly set lock owner during FUSE_SETLK

During FUSE_SETLK, the owner field should uniquely identify the calling
process. The fusefs module now sets it to the process's pid.
Previously, it expected the calling process to set it directly, which
was wrong.

libfuse also apparently expects the owner field to be set during
FUSE_GETLK, though I'm not sure why.

PR: 256005
Reported by: Agata <chogata@moosefs.pro>
Reviewed by: pfg
Differential Revision: https://reviews.freebsd.org/D30622

(cherry picked from commit 18b19f8c6e04935a63a951afe0e540674bc94455)

pf: clean up syncookie callout on vnet shutdown

Ensure that we cancel any outstanding callouts for syncookies when we
terminate the vnet.

MFC after: 1 week
Sponsored by: Modirum MDPay

(cherry picked from commit 32271c4d383effeac7878201ef5cbdfaeedc3755)

pf: remove stray debug line

MFC after: 1 week
Sponsored by: Modirum MDPay

(cherry picked from commit 84db87b8dafd9e9970fd36ac48c11ffdc89d31d0)

pf: fix LINT build

We failed to list the new pf_syncookies.c file in sys/conf/files. This
worked for the usual configurations, where pf is a module, but not for
LINT builds.

Reported by: lwhsu
MFC after: 1 week
Sponsored by: Modirum MDPay

(cherry picked from commit b972a7fa9e1e01367435a5699b71cc7b5e494fee)

pf tests: ensure syncookie does not create state

Test that with syncookies enabled pf does not create state for
connections before the remote peer has replied to the SYN|ACK message.

MFC after: 1 week
Sponsored by: Modirum MDPay
Differential Revision: https://reviews.freebsd.org/D31142

(cherry picked from commit 27ab791a55191c0b6503391d411303b042b41047)

pf tests: Forwarding syncookie test

Test syncookies on a forwarding host. That is, in a setup where the
machine (or vnet) running pf is not the same as the machine (or vnet)
running the server it's protecting.

MFC after: 1 week
Sponsored by: Modirum MDPay
Differential Revision: https://reviews.freebsd.org/D31141

(cherry picked from commit 3be9301a7e4fbd630cbde1bd3e1b59ac726e21af)

pfctl: syncookie configuration

pfctl and libpfctl code required to enable/disable the syncookie
feature.

MFC after: 1 week
Sponsored by: Modirum MDPay
Differential Revision: https://reviews.freebsd.org/D31140

(cherry picked from commit c69121c473d75abab55f9ade8e8138ac09c0942c)

pf: syncookie ioctl interface

Kernel side implementation to allow switching between on and off modes,
and allow this configuration to be retrieved.

MFC after: 1 week
Sponsored by: Modirum MDPay
Differential Revision: https://reviews.freebsd.org/D31139

(cherry picked from commit 231e83d3422ff58fe94de8375a9532a1726056ed)

pf: syncookie support

Import OpenBSD's syncookie support for pf. This feature helps pf resist
TCP SYN floods by only creating states once the remote host completes
the TCP handshake rather than when the initial SYN packet is received.

This is accomplished by using the initial sequence numbers to encode a
cookie (hence the name) in the SYN+ACK response and verifying this on
receipt of the client ACK.

Reviewed by: kbowling
Obtained from: OpenBSD
MFC after: 1 week
Sponsored by: Modirum MDPay
Differential Revision: https://reviews.freebsd.org/D31138

(cherry picked from commit 8e1864ed07121b479b95d7e3a5931a9e0ffd4713)
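The cookie idea can be sketched as below. This is a toy illustration and not the pf or OpenBSD implementation: the mixing function is not cryptographically secure, the struct and function names are hypothetical, and real syncookies also encode items such as the MSS and a time counter.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical connection 4-tuple. */
struct conn {
	uint32_t	saddr, daddr;
	uint16_t	sport, dport;
};

/* Toy mixing step; NOT secure, used only to keep the sketch short. */
static uint32_t
mix(uint32_t h, uint32_t v)
{
	h ^= v;
	h *= 0x9e3779b1u;
	return (h ^ (h >> 15));
}

/* Derive our initial sequence number from the tuple, client ISN and secret. */
static uint32_t
cookie_make(const struct conn *c, uint32_t client_isn, uint32_t secret)
{
	uint32_t h = secret;

	h = mix(h, c->saddr);
	h = mix(h, c->daddr);
	h = mix(h, ((uint32_t)c->sport << 16) | c->dport);
	h = mix(h, client_isn);
	return (h);
}

/*
 * On the client's final ACK, seq is the client ISN + 1 and ack is our
 * ISN + 1.  Recompute the cookie and compare; no state was kept.
 */
static int
cookie_check(const struct conn *c, uint32_t seq, uint32_t ack, uint32_t secret)
{
	return (cookie_make(c, seq - 1, secret) == ack - 1);
}
```

The server sends cookie_make() as the sequence number of its SYN+ACK and creates state only once cookie_check() succeeds on the returning ACK.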

pf: factor out pf_synproxy()

MFC after: 1 week
Sponsored by: Modirum MDPay
Differential Revision: https://reviews.freebsd.org/D31137

(cherry picked from commit ee9c3d38039eb29966e1f0b8f617bc564c078289)

Revert "loader: support.4th resets the read buffer incorrectly"

This reverts commit 9c1c02093b90ae49745a174eb26ea85dd1990eec. It seems
to have broken all old nextboot.conf files causing hangs on boot.

PR: 239315

(cherry picked from commit 4783fb730fa1cfdbe5c905bb23ac74f681e2df6b)

gmirror: Zero the metadata block before writing

The mirror metadata fields contain string buffers and pad bytes, neither
of which was being zeroed before the metadata was written to disk. Also,
the metadata structure is smaller than the sector size, and in one case
gmirror was failing to zero-fill the full buffer before writing.

Fix these problems by pre-zeroing the metadata structure and the sector
buffer.

Reported by: KMSAN
Sponsored by: The FreeBSD Foundation

(cherry picked from commit 7f053a44aef75eab395ce15a1c8a1399a2f89cad)

blist: Correct the node count computed in blist_create()

Commit bb4a27f927a1 added the ability to allocate a span of blocks
crossing a meta node boundary.  To ensure that blst_next_leaf_alloc()
does not walk past the end of the tree, an extra all-zero meta node
needs to be present at the end of the allocation, and
blst_next_leaf_alloc() is implemented such that the presence of this
node terminates the search.

blist_create() computes the number of nodes required.  It had two
problems:
1. When the size of the blist is a power of BLIST_RADIX, we would
   unnecessarily allocate an extra level in the tree.
2. When the size of the blist is a multiple of BLIST_RADIX, we would
   fail to allocate a terminator node.  In this case,
   blst_next_leaf_alloc() could scan beyond the bounds of the
   allocation.  This was found using KASAN.

Modify blist_create() to handle these cases correctly.

Reported by: pho
Reviewed by: dougm

(cherry picked from commit 2783335caeae964bd8a1aa15726b523876613e45)

graid3: Zero the metadata block before writing

Ensure that string buffers and pad bytes are zero-filled before writing
graid3 metadata.

Reported by: KMSAN
Sponsored by: The FreeBSD Foundation

(cherry picked from commit 39552dff7bb5463a74e5195d65a3252c583d9414)

fifo: Explicitly initialize generation numbers when opening

The fi_rgen and fi_wgen fields are generation numbers used when sleeping
waiting for the other end of the fifo to be opened. The fields were not
explicitly initialized after allocation, but this was harmless. To
avoid false positives from KMSAN, though, ensure that they get
initialized to zero.

Reported by: KMSAN
Sponsored by: The FreeBSD Foundation

(cherry picked from commit b9ca419a21d109948bf0fcea5c59725f1fe0cd7b)

uart: Fix an out-of-bounds read in ns8250_bus_probe()

The problem is that ns8250_bus_probe() accesses a field from the
ns8250_softc, which embeds the generic UART softc, but the ns8250_softc
hasn't yet been allocated because we're still probing.

This is a regression from commit 0aefb0a63c50. That commit fixed a problem
where one of the upper four IER bits, which are usually reserved, needs
to be set in order to get RX interrupts before the RX FIFO is full. At
the same time, we avoid clearing those reserved bits (see commit
58957d87173, though other UART drivers I looked at do not bother with
this).

So, copy what ns8250_init() does to disable interrupts, since we don't
know what the "right" mask is at this point.

Reported by: syzbot+f256beefd0df9eb796e7@syzkaller.appspotmail.com
Reviewed by: imp
Sponsored by: The FreeBSD Foundation

(cherry picked from commit 4a9a41650c909706bc0b9a3f29817c11b262b0a0)

cxgbe(4): Remove some dead code.

(cherry picked from commit 3965469eaa33aca03837baf5f88a55fa89f3f987)

Fix mismerge in OFED update

When OFED was upgraded to Linux v4.9, a bunch of Linux-specific
netlink changes were dropped. Unfortunately, there was a mismerge
in this process and as a result ib_sa_cancel_query() would fail to
cancel an outstanding MAD.

This was causing rdma_destroy_id() to hang indefinitely waiting
for the MAD to complete and release the final reference.

Sponsored by: Dell Inc.
Differential Revision: https://reviews.freebsd.org/D28421
Reviewed by: hselasky, kib

(cherry picked from commit 8a06ca2f734c726799688d65a7dad67284275438)

ipoib: Fix for accessing uninitialized pointers and freed memory during attach and detach.

Call infiniband_ifdetach() early to stop ifioctl(9) calls from user-space
during device removal. Also make sure that ifioctl(9) calls are blocked from
executing until the device is fully initialized. Ideally we would delay the
infiniband_ifattach() call, but because part of the initialization is to update
the link level address, that is not possible without more significant changes.

Reviewed by: kib
Sponsored by: Mellanox Technologies // NVIDIA Networking

(cherry picked from commit cd2c05d323d272163d04dd94caabe018ca2d4dc5)

mlx5: Numa domain improvements.

Properly allocate all mlx5en(4) structures from the correct NUMA domain.

While at it, clean up unused NUMA domain integers deriving from the
Linux version of mlx5en(4).

Reviewed by: kib
Sponsored by: Mellanox Technologies // NVIDIA Networking

(cherry picked from commit 7c3eff94bda8bb746bfa7a5edc81b014e2dc97f6)

mlx5: Fix for uninitialized "uid" field.

Make sure the "uid" field gets properly set when destroying DCT and QP
objects by making a copy of the field when creating such objects.

Reviewed by: kib
Sponsored by: Mellanox Technologies // NVIDIA Networking

(cherry picked from commit cbf6911e10d7ed4e772affdd03cb4d439669acbd)

mlx4: Map core_clock page to user space only when allowed

Currently, when we map the hca_core_clock page to user space, the page
also contains vulnerable registers, one of which is a semaphore. If a
user reads the wrong offset, it can modify that semaphore and hang the
device.

Hence, map the hca_core_clock page to user space only when the user
specifically requests it.

After this patch, mlx4 core_clock won't be mapped to user space by
default, as opposed to the current state, where mlx4 core_clock is
always mapped to user space.

Reviewed by: kib
Sponsored by: Mellanox Technologies // NVIDIA Networking

(cherry picked from commit c8301cbb0fa25d03c1b6b2d056497d5a1580a8b4)

mlx5en: Allow binding channels to CPUs when RSS is not enabled.

Submitted by: Netflix
Reviewed by: kib
Sponsored by: Mellanox Technologies // NVIDIA Networking

(cherry picked from commit c8d16d1e084dc14191491e95ce226d3ce8b39275)

ibcore: Add some functions and definitions for selecting and querying retryable ucontext cleanup.

Linux commit:
1c77483e4c50339b0306572167ccbff6b55d051b

Reviewed by: kib
Sponsored by: Mellanox Technologies // NVIDIA Networking

(cherry picked from commit f60da09dbb152d7c8ee1719197d98149a8b0c017)

mlx5en: Allocate per-channel doorbells.

To avoid congestion on the same PCI memory register space when
traffic consists mostly of small packets.

Reviewed by: kib
Sponsored by: Mellanox Technologies // NVIDIA Networking

(cherry picked from commit 9dfa21486e1db730305abd63df449bcc1e76127b)

mlx5en: Wait for all TLS connections to terminate when unloading driver.

The driver expects all TLS tags to be returned to the driver before
it can free the UMA zone where the TLS tags reside.

Reviewed by: kib
Sponsored by: Mellanox Technologies // NVIDIA Networking

(cherry picked from commit 3a934ba7a30831dda104e9faad9412f9743c9bae)

mlx4ib and mlx5ib: Set slid to zero in Ethernet completion struct

IB spec says that a lid should be ignored when link layer is Ethernet,
for example when building or parsing a CM request message (CA17-34).
However, since ib_lid_be16() and ib_lid_cpu16() validate the slid,
not only when link layer is IB, we set the slid to zero to prevent
false warnings in the kernel log.

Linux commit:
65389322b28f81cc137b60a41044c2d958a7b950

Reviewed by: kib
Sponsored by: Mellanox Technologies // NVIDIA Networking

(cherry picked from commit 30416d4e827341be32c3e415f16c73f252a68d85)

mlx5en: Configure relaxed PCI read and write ordering for ethernet.

This may improve performance in some configurations.

Reviewed by: kib
Sponsored by: Mellanox Technologies // NVIDIA Networking

(cherry picked from commit de2437f19950f6758159abbde93200468d1327fa)

mlx5en: Check for pci_channel_offline() when draining sendqueue.

This speeds up detach in hypervisor environments.

Reviewed by: kib
Sponsored by: Mellanox Technologies // NVIDIA Networking

(cherry picked from commit 4692d9808e61958675d91ec595b5732c8d1fa700)

mlx5ib: Implement support for enabling and disabling RoCE ECN.

RoCE is short for Remote direct memory access over Converged Ethernet.
ECN is short for Explicit Congestion Notification.

Reviewed by: kib
Sponsored by: Mellanox Technologies // NVIDIA Networking

(cherry picked from commit 8abf5ac0e6ddaeddf49cf39193bbe0c3ebf7209b)

mlx5ib: Extend parameter macros so that more arguments may be added.

Reviewed by: kib
Sponsored by: Mellanox Technologies // NVIDIA Networking

(cherry picked from commit 42f719d611413f3bf9c7914e008fe22c916e1ac5)

mlx5core: Don't query the PCI config space for offline during a firmware command.

Querying the PCI config space for offline for every firmware command blocks
the PCI bus and affects performance, especially for packet pacing and TLS,
where objects are frequently created and destroyed.

Reviewed by: kib
Sponsored by: Mellanox Technologies // NVIDIA Networking

(cherry picked from commit e787b5acb1bdf97fc04ec3ebebb7a8dd40d85199)

Fix LINT kernel build issues after c3987b8ea793c11f61fecb14ef93195a23e3522c.

Fixes the IPOIB_CM and SDP kernel options.

Reported by: lwhsu @
Sponsored by: NVIDIA Networking

(cherry picked from commit 693ddf4dc4b9c1ffbe6373e186e1b121e3635f53)

ibcore: Declare ib_post_send() and ib_post_recv() arguments const

Since neither ib_post_send() nor ib_post_recv() modify the data structure
their second argument points at, declare that argument const. This change
makes it necessary to declare the 'bad_wr' argument const too and also to
modify all ULPs that call ib_post_send(), ib_post_recv() or
ib_post_srq_recv(). This patch does not change any functionality but makes
it possible for the compiler to verify whether the
ib_post_(send|recv|srq_recv) really do not modify the posted work request.

Linux commit:
f696bf6d64b195b83ca1bdb7cd33c999c9dcf514
7bb1fafc2f163ad03a2007295bb2f57cfdbfb630
d34ac5cd3a73aacd11009c4fc3ba15d7ea62c411

Reviewed by: kib
Sponsored by: Mellanox Technologies // NVIDIA Networking

(cherry picked from commit c3987b8ea793c11f61fecb14ef93195a23e3522c)

mlx5: Set default timestamp format for mlx5en(4) and mlx5ib.

Reviewed by: kib
Sponsored by: Mellanox Technologies // NVIDIA Networking

(cherry picked from commit 4fb0a74e081772fc6fc2a43222ee072fd089bf5f)

mlx5: Add new timestamp mode bits.

These fields declare which timestamp mode is supported
by the device per RQ/SQ/QP.

In addition, add the ts_format field to select the mode
for RQ/SQ/QP.

Linux commit:
a6a217dddcd544f6b75f0e2a60b6e84c1d494b7e

Reviewed by: kib
Sponsored by: Mellanox Technologies // NVIDIA Networking

(cherry picked from commit 915fc66cb59faa543b852083233729c270d5aa3b)

ibcore: Implement ib_uverbs_get_ucontext_file().

Expose ib_ucontext from a given ib_uverbs_file. Drivers that use the ioctl(9)
API may have the ib_uverbs_file and need a way to get the related ib_ucontext
from it; this patch makes that possible.

Downstream patches from this series will use it.

Linux commit:
7dc08dcfc8c86cb4457e383734ff6844ddaff876

Reviewed by: kib
Sponsored by: Mellanox Technologies // NVIDIA Networking

(cherry picked from commit 79b817084ca891e465fe1a868ef1d9f1a3f33a69)

ibcore: Clean up INIT_UDATA() and INIT_UDATA_BUF_OR_NULL() macro usage.

We get a harmless warning about the fact that we use the result of a
multiplication as a condition in INIT_UDATA_BUF_OR_NULL():

uverbs_main.c: In function 'ib_uverbs_write':
error: '*' in boolean context, suggest '&&' instead [-Werror=int-in-bool-context]

This avoids the problem by using an inline function in place of
the macro.

After changing INIT_UDATA_BUF_OR_NULL() to an inline function,
do the same change to INIT_UDATA() for consistency.

Using an inline function gives us better type safety here among other
issues with macros. I'm using u64_to_user_ptr() to convert the user
pointer to simplify the logic rather than adding lots of new type casts.

Linux commit:
12f727721eee61b3d19dedb95cb893b2baa9fe41
40a203396cc1c239f2e71c47c66ed03097123d2c

Reviewed by: kib
Sponsored by: Mellanox Technologies // NVIDIA Networking

(cherry picked from commit 05f4691919d6d0219795a1ca8ad84dd82d87b1cf)
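A minimal sketch of the macro-to-inline-function change described above. The struct, macro, and function names are simplified stand-ins, not the ibcore definitions. When a caller passes a product such as `words * 4` as the length, the macro expands it directly into the condition of `?:`, which is what triggers the "'*' in boolean context" warning. The inline function receives an ordinary value and makes the comparison explicit.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-in for the user data descriptor. */
struct udata {
	const void	*inbuf;
	size_t		 inlen;
};

/* Macro form: (ilen) is pasted into the ?: condition as written. */
#define	INIT_BUF_OR_NULL(u, ibuf, ilen) do {				\
	(u)->inbuf = (ilen) ? (const void *)(ibuf) : NULL;		\
	(u)->inlen = (ilen);						\
} while (0)

/* Inline form: typed parameters and an explicit comparison against 0. */
static inline void
init_buf_or_null(struct udata *u, const void *ibuf, size_t ilen)
{
	u->inbuf = ilen != 0 ? ibuf : NULL;
	u->inlen = ilen;
}
```

Both forms store the same values; the inline version additionally lets the compiler check argument types at each call site.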

ibcore: Simplify ib_modify_qp_is_ok().

All callers of ib_modify_qp_is_ok() provide an enum ib_qp_state, which
makes the out-of-range checks redundant. Remove them and update the
function signature to return a boolean result.

While at it remove unused "ll" parameter from ib_modify_qp_is_ok().

Linux commit:
19b1f54099b6ee334acbfbcfbdffd1d1f057216d
d31131bba5a1630304c55ea775c48cc84912ab59

Reviewed by: kib
Sponsored by: Mellanox Technologies // NVIDIA Networking

(cherry picked from commit d92a9e5604d7302c349f77c0fde160256aec56ed)

ibcore: Support rate limit for packet pacing

Add new member rate_limit to ib_qp_attr which holds the packet pacing
rate in kbps, 0 means unlimited.

IB_QP_RATE_LIMIT is added to ib_attr_mask and could be used by RAW
QPs when changing QP state from RTR to RTS, RTS to RTS.

Linux commit:
528e5a1bd3f0e9b760cb3a1062fce7513712a15d

Reviewed by: kib
Sponsored by: Mellanox Technologies // NVIDIA Networking

(cherry picked from commit 0c13880cccd75655c878ce31e767bce04b1d6e85)

ibcore: Add new IB rates.

Add the new rates that were added to Infiniband spec as part of
HDR and 2x support.

Linux commit:
a5a5d1993696419e7d5357fc3128e53d219d382e

Reviewed by: kib
Sponsored by: Mellanox Technologies // NVIDIA Networking

(cherry picked from commit 3d2fb36a9ce72053e8865852caad30044dbd1059)

ibcore: Don't allocate method table, if already present.

This commit aligns the code in question with upstream Linux.

Linux commit:
2468b82d69e3a53d024f28d79ba0fdb8bf43dfbf

Reviewed by: kib
Sponsored by: Mellanox Technologies // NVIDIA Networking

(cherry picked from commit 721b795b721b349db5e6198f8681d5992447c775)

ibcore: Fix a use-after-free in ucma_resolve_ip().

There is a race condition between ucma_close() and ucma_resolve_ip():

CPU0                              CPU1
ucma_resolve_ip():                ucma_close():

ctx = ucma_get_ctx(file, cmd.id);

                                  list_for_each_entry_safe(ctx, tmp, &file->ctx_list, list) {
                                          mutex_lock(&mut);
                                          idr_remove(&ctx_idr, ctx->id);
                                          mutex_unlock(&mut);
                                          ...
                                          mutex_lock(&mut);
                                          if (!ctx->closing) {
                                                  mutex_unlock(&mut);
                                                  rdma_destroy_id(ctx->cm_id);
                                          ...
                                          ucma_free_ctx(ctx);
                                  }

ret = rdma_resolve_addr();
ucma_put_ctx(ctx);

Before idr_remove(), ucma_get_ctx() could still find the ctx
and after rdma_destroy_id(), rdma_resolve_addr() may still
access id_priv pointer. Also, ucma_put_ctx() may use ctx after
ucma_free_ctx() too.

ucma_close() should call ucma_put_ctx() too, which tests the
refcnt and waits for the last reference to be released. A similar
pattern is already used by ucma_destroy_id().

Linux commit:
5fe23f262e0548ca7f19fb79f89059a60d087d22

Reviewed by: kib
Sponsored by: Mellanox Technologies // NVIDIA Networking

(cherry picked from commit c6ccb08686f3b92c12778b4b903431b2ce71ec2c)

ibcore: Define option to set ack timeout.

Define new option in 'rdma_set_option' to override calculated QP timeout
when requested to provide QP attributes to modify a QP.

At the same time, pack tos_set to be bitfield.

Linux commit:
2c1619edef61a03cb516efaa81750784c3071d10

Reviewed by: kib
Sponsored by: Mellanox Technologies // NVIDIA Networking

(cherry picked from commit 20fea7ac64683b064ffe4cefa750e46ba20de4f9)

ibcore: Do not overreact to SM LID change event.

When IPoIB receives an SM LID change event, it reacts by flushing its
path record cache and rejoining multicast groups. This is the same
behavior it performs when it receives a reregistration event. This
behavior is unnecessary as an SM may have database backup or
synchronization mechanisms which permit the SM location or LID to change
without loss of multicast membership and without impact to path records.

Both opensm and the OPA FM issue reregistration events if a new SM is
started (or restarted with a new config) or an SM event occurs which
results in loss of multicast membership records by the SM (such as
opensm failover) or the SM encounters new nodes with Active ports (such
as after joining 2 fabrics by connecting switches via ISLs). Hence this
event can be depended on as the trigger for IPoIB cache and multicast
flushing.

It appears that some drivers, such as qib and hfi1, issue the
IB_EVENT_SM_CHANGE but other drivers such as mlx4 and mlx5 do not.
Empirical testing on Mellanox EDR using ibv_asyncwatch has confirmed
that Mellanox EDR HCAs do not generate SM change events and that opensm
does generate reregistration.

An SM LID change event is generated by the mentioned drivers to reflect
that sm_lid and/or sm_sl in the local port info has changed. The intent
of this event is to permit applications and ULPs which have a local copy
of this information (or an address handle using it) to update their
information.

The intent is that the reregistration event (caused by the SM via a bit
in Set(PortInfo)) be used to inform nodes that they need to rejoin
multicast groups, resubscribe for notices and potentially update path
records.

When an SM migrates or fails over, an SM LID change event can occur. In
response IPoIB discards path records and multicast membership and loses
connectivity until these records are restored via SA requests. In very
large fabrics, it may take minutes for the SM to be ready and for the SA
responses to be supplied.  This can result in undesirable and
unnecessary IPoIB connectivity impacts. It also can result in an
unnecessary storm of SA queries from all nodes in a cluster potentially
followed by yet another storm if the SM issues the reregistration
request.

The fact that Mellanox HCAs do not even generate this event is further
evidence that on modern IB fabrics there will be no ill side effects
from the proposed changes below to reduce the reaction by 3 kernel
components to this event. So these changes should be benign for Mellanox
IB fabrics and will benefit OPA fabrics while also making ib_core and
ULP behavior "correct" as intended by the IBTA spec and kernel RDMA event
APIs.

Address these issues by removing IB_EVENT_SM_CHANGE handling from ipoib.
IPoIB does not locally store sm_lid nor sm_sl, so it does not need to do
anything on SM LID change. IPoIB makes use of other ib_core components
to issue SA requests for it and those components correctly track SM LID
and SM LID changes.

Also, in ib_core multicast handling, remove the test for
IB_EVENT_SM_CHANGE. This code is moving all multicast groups to the
error state, which will trigger rejoins. This code is used by IPoIB as
well as the connection manager and other clients of multicast groups.
This kernel module centralizes group membership status and joins since a
node can only join a given group once but multiple ULPs or applications
may want to join the same group. It makes use of the sa_query.c
component in ib_core, which correctly tracks SM LID and SL. This
component does not track SM LID nor SL itself and hence need not react
to their changes.

Similarly, in the ib_core cache code, remove the handling for
IB_EVENT_SM_CHANGE. The ib_cache_update function, which is ultimately
called, updates local copies of the pkey table, gid table and lmc. It
neither updates nor retains sm_lid or sm_sl. As such it does not need
to be called on an SM LID change. It technically also does not need to
be called on a reregistration. The LID_CHANGE, PKEY_CHANGE, GID_CHANGE
and port state change events (PORT_ERR, PORT_ACTIVE) should be
sufficient triggers.
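
The resulting event filter for the cache code can be sketched as below.
The enum is a local stand-in whose names mirror the events in the text
and whose numeric values are arbitrary; sketch_cache_needs_update() is a
hypothetical helper, not a function from ib_core. Reregistration is kept
as a trigger here on the assumption that the change removes only the SM
change case.

```c
#include <stdbool.h>

/* Names mirror the events in the text; the values are arbitrary. */
enum sketch_event {
	SKETCH_EVENT_PORT_ACTIVE,
	SKETCH_EVENT_PORT_ERR,
	SKETCH_EVENT_LID_CHANGE,
	SKETCH_EVENT_PKEY_CHANGE,
	SKETCH_EVENT_GID_CHANGE,
	SKETCH_EVENT_SM_CHANGE,
	SKETCH_EVENT_CLIENT_REREGISTER,
};

/* Decide whether an event should refresh the pkey/gid/lmc cache. */
bool sketch_cache_needs_update(enum sketch_event ev)
{
	switch (ev) {
	case SKETCH_EVENT_PORT_ACTIVE:
	case SKETCH_EVENT_PORT_ERR:
	case SKETCH_EVENT_LID_CHANGE:
	case SKETCH_EVENT_PKEY_CHANGE:
	case SKETCH_EVENT_GID_CHANGE:
	case SKETCH_EVENT_CLIENT_REREGISTER: /* assumed to be retained */
		return true;
	case SKETCH_EVENT_SM_CHANGE:         /* cache holds no sm_lid/sm_sl */
	default:
		return false;
	}
}
```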

It is worth noting that the alternative of simply having the hfi1 and
qib drivers not generate the SM LID change event was explored. While
this would duplicate what Mellanox drivers do now, it is not the correct
behavior and removes the ability for an SM to migrate without requiring
reregistration. Since both opensm and OPA SM have mechanisms to back up
or synchronize registration information, it is desirable to let them
perform SM migrations (with LID or SL changes) without requiring
reregistration when they deem it appropriate.

Linux commit:
ba7d8117f3cca8eb70d579fde3f9ec8cd6a28f39

Reviewed by: kib
Sponsored by: Mellanox Technologies // NVIDIA Networking

(cherry picked from commit df1df0c742ec22de6ee7656a91ea8773c05f3b81)

ibcore: Remove debug prints after allocation failure.

The prints after [k|v][m|z|c]alloc() functions are not needed,
because in case of failure, the allocator prints its own
error messages anyway.

Linux commit:
2716243212241855cd9070883779f6e58967dec5

Reviewed by: kib
Sponsored by: Mellanox Technologies // NVIDIA Networking

(cherry picked from commit 26646ba5bcdadbd513dc4dfccc772952732aea5c)

ibcore: Fix use-after-free in IB mad completion handling.

We encountered a use-after-free bug when unloading the driver:

BUG: KASAN: use-after-free in ib_mad_post_receive_mads+0xddc/0xed0 [ib_core]
Read of size 4 at addr ffff8882ca5aa868 by task kworker/u13:2/23862

Workqueue: ib-comp-unb-wq ib_cq_poll_work [ib_core]
Call Trace:
dump_stack+0x9a/0xeb
print_address_description+0xe3/0x2e0
ib_mad_post_receive_mads+0xddc/0xed0 [ib_core]
__kasan_report+0x15c/0x1df
ib_mad_post_receive_mads+0xddc/0xed0 [ib_core]
kasan_report+0xe/0x20
ib_mad_post_receive_mads+0xddc/0xed0 [ib_core]
find_mad_agent+0xa00/0xa00 [ib_core]
qlist_free_all+0x51/0xb0
mlx4_ib_sqp_comp_worker+0x1970/0x1970 [mlx4_ib]
quarantine_reduce+0x1fa/0x270
kasan_unpoison_shadow+0x30/0x40
ib_mad_recv_done+0xdf6/0x3000 [ib_core]
_raw_spin_unlock_irqrestore+0x46/0x70
ib_mad_send_done+0x1810/0x1810 [ib_core]
mlx4_ib_destroy_cq+0x2a0/0x2a0 [mlx4_ib]
_raw_spin_unlock_irqrestore+0x46/0x70
debug_object_deactivate+0x2b9/0x4a0
__ib_process_cq+0xe2/0x1d0 [ib_core]
ib_cq_poll_work+0x45/0xf0 [ib_core]
process_one_work+0x90c/0x1860
pwq_dec_nr_in_flight+0x320/0x320
worker_thread+0x87/0xbb0
__kthread_parkme+0xb6/0x180
process_one_work+0x1860/0x1860
kthread+0x320/0x3e0
kthread_park+0x120/0x120
ret_from_fork+0x24/0x30
...
Freed by task 31682:
save_stack+0x19/0x80
__kasan_slab_free+0x11d/0x160
kfree+0xf5/0x2f0
ib_mad_port_close+0x200/0x380 [ib_core]
ib_mad_remove_device+0xf0/0x230 [ib_core]
remove_client_context+0xa6/0xe0 [ib_core]
disable_device+0x14e/0x260 [ib_core]
__ib_unregister_device+0x79/0x150 [ib_core]
ib_unregister_device+0x21/0x30 [ib_core]
mlx4_ib_remove+0x162/0x690 [mlx4_ib]
mlx4_remove_device+0x204/0x2c0 [mlx4_core]
mlx4_unregister_interface+0x49/0x1d0 [mlx4_core]
mlx4_ib_cleanup+0xc/0x1d [mlx4_ib]
__x64_sys_delete_module+0x2d2/0x400
do_syscall_64+0x95/0x470
entry_SYSCALL_64_after_hwframe+0x49/0xbe

The problem was that the MAD PD was deallocated before the MAD CQ.
There was completion work pending for the CQ when the PD got deallocated.
When the mad completion handling reached procedure
ib_mad_post_receive_mads(), we got a use-after-free bug in the following
line of code in that procedure:
sg_list.lkey = qp_info->port_priv->pd->local_dma_lkey;
(the pd pointer in the above line is no longer valid, because the
pd has been deallocated).

We fix this by allocating the PD before the CQ in procedure
ib_mad_port_open(), and deallocating the PD after freeing the CQ
in procedure ib_mad_port_close().

Since the CQ completion work queue is flushed during ib_free_cq(),
no completions will be pending for that CQ when the PD is later
deallocated.
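
The ordering argument can be modelled with a small user-space sketch.
All sketch_* names are hypothetical: the PD is reduced to a validity
flag, pending completion work is a counter, and freeing the CQ is
assumed to flush that work, as the text states for ib_free_cq().

```c
#include <stdbool.h>

struct sketch_pd {
	bool valid;        /* false once the PD has been deallocated */
};

struct sketch_cq {
	struct sketch_pd *pd;
	int pending;       /* queued completion work items */
	int stale_reads;   /* work items that ran against a freed PD */
};

/* One completion work item: it dereferences the PD. */
static void sketch_run_completion(struct sketch_cq *cq)
{
	if (!cq->pd->valid)
		cq->stale_reads++; /* would be a use-after-free in real code */
	cq->pending--;
}

/* Freeing the CQ flushes all pending completion work first. */
void sketch_free_cq(struct sketch_cq *cq)
{
	while (cq->pending > 0)
		sketch_run_completion(cq);
}

void sketch_dealloc_pd(struct sketch_pd *pd)
{
	pd->valid = false;
}

/* Fixed close order: CQ first, PD last. */
void sketch_port_close(struct sketch_cq *cq, struct sketch_pd *pd)
{
	sketch_free_cq(cq);
	sketch_dealloc_pd(pd);
}
```

Because the flush happens while the PD is still valid, stale_reads stays
at zero no matter how much work was pending at close time.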

Note that freeing the CQ before deallocating the PD is the practice
in the ULPs.

Linux commit:
770b7d96cfff6a8bf6c9f261ba6f135dc9edf484

Reviewed by: kib
Sponsored by: Mellanox Technologies // NVIDIA Networking

(cherry picked from commit 468a6b5055f0b6ea0bdb1ee8cbdf749204cb3b25)

ibcore: Fail early if unsupported QP is provided.

When requested QP type is not supported for a {device, port}, return the
error right away before validating all parameters during mad agent
registration time.

Linux commit:
798bba01b44b0ddf8cd6e542635b37cc9a9b739c

Reviewed by: kib
Sponsored by: Mellanox Technologies // NVIDIA Networking

(cherry picked from commit 507389a35a41f5f15592d2156d34039e3ee1c3e5)

ibcore: Use inline function to validate port

Linux commit:
24dc831b77eca9361cf835be59fa69ea0e471afc

Reviewed by: kib
Sponsored by: Mellanox Technologies // NVIDIA Networking

(cherry picked from commit e2ae502d28605db0ca2af6e6261513106fe2b57d)

ibcore: Validate port number in query_pkey verb.

Before calling the driver's function let's make sure port is valid.
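
A minimal sketch of such a check, assuming a device that exposes a
contiguous range of port numbers. The structure and function names are
hypothetical and do not reproduce the ib_core definitions.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical device with a contiguous range of valid port numbers. */
struct sketch_device {
	unsigned first_port;
	unsigned last_port;
};

/* Inline range check shared by all callers. */
static inline bool sketch_port_is_valid(const struct sketch_device *dev,
    unsigned port)
{
	return port >= dev->first_port && port <= dev->last_port;
}

/* Verb wrapper: reject a bad port before reaching the driver. */
int sketch_query_pkey(const struct sketch_device *dev, unsigned port)
{
	if (!sketch_port_is_valid(dev, port))
		return -EINVAL;
	return 0; /* the driver callback would be invoked here */
}
```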

Linux commit:
9af3f5cf9d64a056eca53bc643f6288ad28bbbb5

Reviewed by: kib
Sponsored by: Mellanox Technologies // NVIDIA Networking

(cherry picked from commit 31525faed8b43c579767cf3ac211f41764b2a46c)

ibcore: Protect against concurrent access to hardware stats.

Currently access to the hardware stats buffer isn't protected, which can
result in multiple writes and reads at the same time to the same
memory location. This can lead to providing an incorrect value to
the user. Add a mutex to protect against it.

Linux commit:
e945130b52bea65d15f9bdf54949d4cb7a88db7f

Reviewed by: kib
Sponsored by: Mellanox Technologies // NVIDIA Networking

(cherry picked from commit 912e98cedee2590748a9893d3152b11694de3379)

mlx5core: Make sure error code is propagated on error.

If mlx5_init_once() fails, mlx5_load_one() should fail too, else the
device instance remains attached causing problems at reboot.
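
The intended control flow can be sketched as follows. The sketch_* names
and the 'attached' flag are hypothetical; they only illustrate that a
failing one-time initialization must make the load fail as well.

```c
#include <errno.h>
#include <stdbool.h>

struct sketch_dev {
	bool attached;
};

/* Hypothetical one-time init that can be forced to fail. */
static int sketch_init_once(bool fail)
{
	return fail ? -ENOMEM : 0;
}

int sketch_load_one(struct sketch_dev *dev, bool init_fails)
{
	int err;

	err = sketch_init_once(init_fails);
	if (err)
		return err;   /* propagate; the device is never attached */

	dev->attached = true;
	return 0;
}
```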

Reviewed by: kib
Sponsored by: Mellanox Technologies // NVIDIA Networking

(cherry picked from commit d8cbfa101cbe3a9ece41120af93884177aff728a)

ibcore: Do not expose unsupported counters.

If the provider driver (such as rdma_rxe) doesn't support PMA counters,
avoid exposing its directory similar to optional hw_counters directory.
If core fails to read the PMA counter, return an error so that user can
retry later if needed.

Linux commit:
0f6ef65d1c6ec8deb5d0f11f86631ec4cfe8f22e

Reviewed by: kib
Sponsored by: Mellanox Technologies // NVIDIA Networking

(cherry picked from commit ac4174e064f447ed3b19a8b395625b7d2b3739d1)

ibcore: Introduce ib_port_phys_state enum.

In order to improve readability, add ib_port_phys_state enum to replace
the use of magic numbers.
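
As an illustration of the readability gain, the sketch below uses a
local enum in place of a bare number. The names carry a SKETCH_ prefix
to mark them as stand-ins, and the values are assumed to follow the
PortPhysicalState encoding of the InfiniBand specification; check them
against the actual header before relying on them.

```c
#include <stdbool.h>

/* Stand-in enum; values assumed to match PortPhysicalState. */
enum sketch_port_phys_state {
	SKETCH_PHYS_STATE_SLEEP = 1,
	SKETCH_PHYS_STATE_POLLING = 2,
	SKETCH_PHYS_STATE_DISABLED = 3,
	SKETCH_PHYS_STATE_TRAINING = 4,
	SKETCH_PHYS_STATE_LINK_UP = 5,
	SKETCH_PHYS_STATE_LINK_ERROR_RECOVERY = 6,
	SKETCH_PHYS_STATE_PHY_TEST = 7,
};

/* Before: the meaning of 5 has to be looked up. */
bool sketch_link_is_up_magic(int phys_state)
{
	return phys_state == 5;
}

/* After: the comparison documents itself. */
bool sketch_link_is_up(int phys_state)
{
	return phys_state == SKETCH_PHYS_STATE_LINK_UP;
}
```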

Linux commit:
72a7720fca37fec0daf295923f17ac5d88a613e1

Reviewed by: kib
Sponsored by: Mellanox Technologies // NVIDIA Networking

(cherry picked from commit 4238b4a7a2cfe44e95f3553ff4a3f6f813fb81f5)

ibcore: Fix unable to change lifespan entry for hw_counters.

This patch fixes the case where the 'lifespan' entry of the hw_counters
is not writable. Currently the write callback is not exposed for
the hw_counters sysfs operation. Due to this, modifying the lifespan
value results in a permission denied error, as in the example below.

echo 10 > /sys/class/infiniband/mlx5_0/ports/1/hw_counters/lifespan
-bash: /sys/class/infiniband/mlx5_0/ports/1/hw_counters/lifespan:
Permission denied

This patch adds the hook to modify any attribute which implements
store() operation.

Linux commit:
79c4d80b43b8e43684894574a508a871f0c196bf

Reviewed by: kib
Sponsored by: Mellanox Technologies // NVIDIA Networking

(cherry picked from commit d7d833e20ba33f5b9f3052a534af7ecdd602f152)

ibcore: Issue DREQ when receiving REQ/REP for stale QP.

From "InfiniBand Architecture Specifications Volume 1":

  A QP is said to have a stale connection when only one side has
  connection information. A stale connection may result if the remote CM
  had dropped the connection and sent a DREQ but the DREQ was never
  received by the local CM. Alternatively the remote CM may have lost
  all record of past connections because its node crashed and rebooted,
  while the local CM did not become aware of the remote node's reboot
  and therefore did not clean up stale connections.

And:

  A local CM may receive a REQ/REP for a stale connection. It shall
  abort the connection issuing REJ to the REQ/REP. It shall then issue
  DREQ with "DREQ:remote QPN" set to the remote QPN from the REQ/REP.

This patch solves a problem with reuse of QPN. The current codebase,
that is IPoIB, relies on a REAP mechanism to clean up the structures
in CM. A problem with this is the time constants governing this
mechanism; they are up to 768 seconds and the interface may look
unresponsive in that period. Issuing a DREQ (and receiving a DREP)
does the necessary cleanup and the interface comes up.

Linux commit:
9315bc9a133011fdb084f2626b86db3ebb64661f

Reviewed by: kib
Sponsored by: Mellanox Technologies // NVIDIA Networking

(cherry picked from commit 12a913d207d287733da75784b3aa35f5e48d0cef)

ibcore: Fix memory leak in cm_req_handler error flows.

In the cm_req_handler() error flows, sometimes cm_id_priv->timewait_info
isn't free'd.

Linux commit:
8b00914654ef56ff5473f4fe1f1168254dbb8a17

Reviewed by: kib
Sponsored by: Mellanox Technologies // NVIDIA Networking

(cherry picked from commit 2f05418748585dcbd5912f335040f652a8cc9703)

ibcore: Move debug counters to be under relevant IB device

The sysfs layout created by CM incorrectly presented RDMA devices with
InfiniBand link layer. The layout of such devices represents the device
tree of connections. By moving CM statistics under the relevant port of
the IB device, we fix the following issues:

* Symlink name - It used device name instead of specific identifier.
* Target location - It was supposed to point to PCI-ID/infiniband_cm/
  instead of PCI-ID/infiniband/
* Target name - It created extra device file under already existing
  device folder, e.g. mlx5_0/mlx5_0
* Crash during boot with RDMA persistent naming patches.

sysfs: cannot create duplicate filename '/class/infiniband_cm/mlx5_0'
CPU: 29 PID: 433 Comm: modprobe Not tainted 5.0.0-rc5+ #178
Call Trace:
dump_stack+0xcc/0x180
sysfs_warn_dup.cold.3+0x17/0x2d
sysfs_do_create_link_sd.isra.2+0xd0/0xf0
device_add+0x7cb/0x1450
device_create_groups_vargs+0x1ae/0x220
device_create+0x93/0xc0
cm_add_one+0x38f/0xf60 [ib_cm]
add_client_context+0x167/0x210 [ib_core]
enable_device_and_get+0x230/0x3f0 [ib_core]
ib_register_device+0x823/0xbf0 [ib_core]
__mlx5_ib_add+0x45/0x150 [mlx5_ib]
mlx5_ib_add+0x1b3/0x5e0 [mlx5_ib]
mlx5_add_device+0x130/0x3a0 [mlx5_core]
mlx5_register_interface+0x1a9/0x270 [mlx5_core]
do_one_initcall+0x14f/0x5de
do_init_module+0x247/0x7c0
load_module+0x4c2f/0x60d0
entry_SYSCALL_64_after_hwframe+0x49/0xbe

After this change:
[leonro@server ~]$ ls -al /sys/class/infiniband/ibp0s12f0/ports/1/
drwxr-xr-x  2 root root    0 Mar 11 11:17 cm_rx_duplicates
drwxr-xr-x  2 root root    0 Mar 11 11:17 cm_rx_msgs
drwxr-xr-x  2 root root    0 Mar 11 11:17 cm_tx_msgs
drwxr-xr-x  2 root root    0 Mar 11 11:17 cm_tx_retries

Linux commit:
c87e65cfb97c7f325132a68288ed76ba7bdcd2c6

Reviewed by: kib
Sponsored by: Mellanox Technologies // NVIDIA Networking

(cherry picked from commit f48e85dfe2ece188eaf44ee6cb626de1f7ae65e9)

ibcore: Fix memory leak in cm_add/remove_one.

In the process of moving the debug counters sysfs entries, the commit
mentioned below eliminated the cm_infiniband sysfs directory.

This sysfs directory was tied to the cm_port object allocated in procedure
cm_add_one().

Before the commit below, this cm_port object was freed via a call to
kobject_put(port->kobj) in procedure cm_remove_port_fs().

Since port no longer uses its kobj, kobject_put(port->kobj) was eliminated.
This, however, meant that kfree was never called for the cm_port buffers.

Fix this by adding explicit kfree(port) calls to functions cm_add_one()
and cm_remove_one().

Note that the kfree call in the first chunk below, in the cm_add_one error
flow, fixes an old, undetected memory leak.

Linux commit:
94635c36f3854934a46d9e812e028d4721bbb0e6

Reviewed by: kib
Sponsored by: Mellanox Technologies // NVIDIA Networking

(cherry picked from commit 8d04583de542dcd087b401f6b830b8e6ab43d696)

ibcore: Block processing of alternate path handling in RoCE RX CM messages.

For the reasons below, it is better not to support alternate path receive
messages for RoCE in the near term.

1. Alternate path for RoCE is not supported at the rdmacm layer.
2. It is not supported in the uverbs/core layer for RoCE.
3. Alternate path for an IPv6 link-local address cannot resolve the route
deterministically without a valid incoming interface ID, whose use case
makes sense only in dual port mode.
4. init_av_from_path, while processing LAP messages for IB and RoCE, can
add a duplicate AV entry to the port list, which leads to list
corruption.
5. rdma-core, a well-known userspace implementation, has removed
support for libucm, which uses the ucm.ko module, the only module that
can trigger alternate path related messages.
6. The ucm kernel module is requested to be removed from the IB core in
the following patch, https://patchwork.kernel.org/patch/10268503/ .

Linux commit:
97c45c2c28cd291e06778d9d36a0f60ee74726bc

Reviewed by: kib
Sponsored by: Mellanox Technologies // NVIDIA Networking

(cherry picked from commit 0f2e5b432dcdd19955285169f1603473219ac130)

ibcore: Store and restore ah_attr during LAP msg processing.

During CM LAP processing, ah_attr is reinitialized on receiving
a LAP request; it was most likely first initialized during CM request
processing.

ah_attr might get zeroed out if LAP processing fails.
Therefore, try to create a new ah_attr for the LAP message.
If the initialization fails, continue with older ah_attr.
If the initialization passes, consider the new ah_attr by
overwriting the older one.
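
The store-and-restore approach amounts to building the new attribute in
a temporary and committing it only on success. The sketch below uses a
hypothetical two-field structure and an initializer that is assumed to
fail for a zero DLID; neither reflects the real ah_attr layout.

```c
/* Hypothetical, reduced address handle attribute. */
struct sketch_ah_attr {
	unsigned dlid;
	unsigned sl;
};

/*
 * Hypothetical initializer. It zeroes the output before validating,
 * which models how a failed attempt can wipe the destination.
 */
static int sketch_init_ah_attr(struct sketch_ah_attr *out,
    unsigned dlid, unsigned sl)
{
	out->dlid = 0;
	out->sl = 0;
	if (dlid == 0)
		return -1;
	out->dlid = dlid;
	out->sl = sl;
	return 0;
}

/* LAP handling: initialize a temporary, overwrite only on success. */
int sketch_handle_lap(struct sketch_ah_attr *cur, unsigned dlid,
    unsigned sl)
{
	struct sketch_ah_attr tmp;
	int err;

	err = sketch_init_ah_attr(&tmp, dlid, sl);
	if (err)
		return err;   /* 'cur' keeps the older attribute */

	*cur = tmp;
	return 0;
}
```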

Linux commit:
0e225dcb7681c0a8e52fb9dc68bd8ab973de4ca2

Reviewed by: kib
Sponsored by: Mellanox Technologies // NVIDIA Networking

(cherry picked from commit bf5075e41e6a29232a1c5aef07e69b9173494187)

ibcore: Add rdma_reject_msg() helper function.

rdma_reject_msg() returns a pointer to a string message associated with
the transport reject reason codes.

Linux commit:
77a5db13153906a7e00740b10b2730e53385c5a8

Reviewed by: kib
Sponsored by: Mellanox Technologies // NVIDIA Networking

(cherry picked from commit e25bcf8d775c61e9bb67246545d2ee6717f4ca89)

mlx5ib: Fix XRC QP support after introducing extended atomic.

Extended atomics are supported with RC and XRC QP types, but Linux commit
a60109dc9a95 added an unneeded check to to_mlx5_access_flags().
This broke XRC QPs.

The following ib_atomic_bw invocation over XRC reproduces the issue:
ib_atomic_bw -d mlx5_1 --connection=XRC --atomic_type=FETCH_AND_ADD

It is safe to remove such checks because the QP type was already checked
in ib_modify_qp_is_ok(), which was previously called from
mlx5_ib_modify_qp().

Linux commit:
13f8d9c16693afb908ead3d2a758adbe6a79eccd

Reviewed by: kib
Sponsored by: Mellanox Technologies // NVIDIA Networking

(cherry picked from commit cf88b86e4954215eb447729042dab8dea722c044)

ibcore: Remove unused and erroneous msg sequence encoding.

In cm_form_tid(), a two-bit message sequence number is OR'ed into bits
31-30 of the lower TID value.

After Linux commit f06d26537559 ("IB/cm: Randomize starting comm ID"), the
local_id is XOR'ed with a 32-bit random value. Hence, bits 31-30 in the
lower TID now have an arbitrary value and it makes no sense to OR in
the message sequence number.

Adding to that, the evolution in use of IDR routines in cm_alloc_id()
has always had the possibility of returning a value with bit 30 set.

In addition, said bits are never checked.

Hence, remove the encoding and the corresponding enum.
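
The bit arithmetic behind this argument can be sketched as follows. The
function names are hypothetical and the expressions are a simplified
reading of the description above, not a copy of cm_form_tid().

```c
#include <stdint.h>

#define SKETCH_SEQ_SHIFT 30

/* Old scheme: OR a two-bit sequence number into bits 31-30. */
uint32_t sketch_tid_low_old(uint32_t local_id, uint32_t random_operand,
    uint32_t seq)
{
	return (local_id ^ random_operand) |
	    ((seq & 0x3u) << SKETCH_SEQ_SHIFT);
}

/* New scheme: the randomized local ID is used as is. */
uint32_t sketch_tid_low_new(uint32_t local_id, uint32_t random_operand)
{
	return local_id ^ random_operand;
}

/* What a receiver would read back from bits 31-30. */
uint32_t sketch_tid_seq(uint32_t tid_low)
{
	return tid_low >> SKETCH_SEQ_SHIFT;
}
```

If the random operand happens to have bits 31-30 set, a sequence number
of 1 reads back as 3, so the encoded value carries no usable
information.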

Linux commit:
87a37ce9e400e40daee537ff95343e3c94743c6d

Reviewed by: kib
Sponsored by: Mellanox Technologies // NVIDIA Networking

(cherry picked from commit 315627b7191dd4fe30a9293609feaf7eeb62e478)

ipoib: Destroying a CQ should never fail.

Remove unneeded error handling when destroying a CQ. The function in
question will later on be updated to return "void".

Reviewed by: kib
Sponsored by: Mellanox Technologies // NVIDIA Networking

(cherry picked from commit eafc89853835147bcbd019a974ebfa9d3a8b00a7)

mlx5ib: Limit mkey page size to 2GB

The maximum page size in the mkey context is 2GB.

Until now, this requirement was not enforced in the code, and therefore,
if we got a page size larger than 2GB, we passed zeros in the
log_page_shift instead of the actual value and the registration failed.

This patch limits the driver to use compound pages of 2GB for mkeys.
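
The failure mode and the clamp can be sketched as below, under the
assumption that the page shift is stored in a 5-bit field, which is
consistent with a 2GB (2^31) maximum. The field width and all sketch_*
names are assumptions and are not taken from the driver.

```c
/* Assumed width of the log page size field (hypothetical). */
#define SKETCH_LOG_PAGE_FIELD_BITS 5
#define SKETCH_MAX_PAGE_SHIFT \
	((1u << SKETCH_LOG_PAGE_FIELD_BITS) - 1)   /* 31, i.e. 2GB */

/* Writing a shift into the field keeps only the low 5 bits. */
unsigned sketch_encode_page_shift(unsigned shift)
{
	return shift & SKETCH_MAX_PAGE_SHIFT;
}

/* Limit the shift so that it always fits the field. */
unsigned sketch_clamp_page_shift(unsigned shift)
{
	return shift > SKETCH_MAX_PAGE_SHIFT ? SKETCH_MAX_PAGE_SHIFT : shift;
}
```

Under that assumption, a 4GB page (shift 32) is encoded as 0 without the
clamp and as 31 with it.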

Linux commit:
762f899ae7875554284af92b821be8c083227092

Reviewed by: kib
Sponsored by: Mellanox Technologies // NVIDIA Networking

(cherry picked from commit 565cb4e8cc5efb4c493efe5cf2cb2ec36f69a413)

mlx5ib: Simplify mlx5_ib_cont_pages()

The patch simplifies mlx5_ib_cont_pages and fixes the following
issues in the original implementation:

The first issue is related to alignment of the PFNs. After the check
base + p != PFN, the alignment of the PFN wasn't checked. So the PFN
sequence 0, 1, 1, 2 would result in a page_shift of 13 even though
the 3rd PFN is not 8KB aligned.

This wasn't actually a bug because it was supported by all the
existing mlx5 compatible devices, but we don't want to require
this support in all future devices.

Another issue is that the inner loop didn't advance PFN, so
the test "if (base + p != pfn)" always failed for SGE with
len > (1<<page_shift).
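
A user-space sketch of a contiguity scan that does check alignment is
shown below. It is not the driver's implementation: the function names,
the flat PFN array and the max_order parameter are assumptions made for
illustration, and the base page size is taken to be 4KB.

```c
#include <stddef.h>
#include <stdint.h>

#define SKETCH_BASE_PAGE_SHIFT 12   /* 4KB base pages (assumption) */

/* Count trailing zero bits of v, capped at 'limit' (also for v == 0). */
static unsigned sketch_tz(uint64_t v, unsigned limit)
{
	unsigned n = 0;

	if (v == 0)
		return limit;
	while ((v & 1) == 0 && n < limit) {
		v >>= 1;
		n++;
	}
	return n;
}

/*
 * Return the largest page shift such that the PFN list can be covered
 * by naturally aligned, physically contiguous blocks of that size.
 */
unsigned sketch_cont_page_shift(const uint64_t *pfn, size_t n,
    unsigned max_order)
{
	unsigned m, a;
	uint64_t base, p;
	size_t i;

	if (n == 0)
		return SKETCH_BASE_PAGE_SHIFT;

	m = sketch_tz(pfn[0], max_order);   /* alignment of the first PFN */
	base = pfn[0];
	p = 0;
	for (i = 0; i < n; i++) {
		if (base + p != pfn[i]) {
			/* Run broken: both the length of the finished
			 * run and the new base limit the block order. */
			a = sketch_tz(pfn[i] | p, max_order);
			if (a < m)
				m = a;
			base = pfn[i];
			p = 0;
		}
		p++;
	}

	/* A block cannot exceed the region rounded up to a power of two. */
	while (m > 0 && ((uint64_t)1 << (m - 1)) >= n)
		m--;

	return SKETCH_BASE_PAGE_SHIFT + m;
}
```

For the sequence 0, 1, 1, 2 from the text this yields a shift of 12,
while four contiguous, aligned PFNs such as 0, 1, 2, 3 yield 14.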

Linux commit:
d67bc5d4e3e100d762c0f57ea67f28bc219698a6

Reviewed by: kib
Sponsored by: Mellanox Technologies // NVIDIA Networking

(cherry picked from commit 21bc3710a4b46655067cbad54d1f21952c871dd2)

mlx5en: Add more error checks in the transmit path.

- Upon error more completion events than requested may be generated,
particularly when using the completion event factor feature.
- Count number of event errors in the transmit path.

Reviewed by: kib
Sponsored by: Mellanox Technologies // NVIDIA Networking

(cherry picked from commit 4f4739a77b0e69dae57fd1687926d6e48a698fe4)