jhb [Mon, 31 Oct 2016 22:45:11 +0000 (22:45 +0000)]
MFC 291665,291685,291856,297467,302110,302263: Add support for VIs.
291665:
Add support for configuring additional virtual interfaces (VIs) on a port.
Each virtual interface has its own MAC address, queues, and statistics.
The dedicated netmap interfaces (ncxgbeX / ncxlX) were already implemented
as additional VIs on each port. This change allows additional non-netmap
interfaces to be configured on each port. Additional virtual interfaces
use the naming scheme vcxgbeX or vcxlX.
Additional VIs are enabled by setting the hw.cxgbe.num_vis tunable to a
value greater than 1 before loading the cxgbe(4) or cxl(4) driver.
NB: The first VI on each port is the "main" interface (cxgbeX or cxlX).
T4/T5 NICs provide a limited number of MAC addresses for each physical port.
As a result, a maximum of six VIs can be configured on each port (including
the "main" interface and the netmap interface when netmap is enabled).
One user-visible result is that when netmap is enabled, packets received
or transmitted via the netmap interface are no longer counted in the stats
for the "main" interface, but are now accounted to the netmap interface.
The netmap interfaces now also have a new-bus device and export various
information sysctl nodes via dev.n(cxgbe|cxl).X.
The cxgbetool 'clearstats' command clears the stats for all VIs on the
specified port along with the port's stats. There is currently no way to
clear the stats of an individual VI.
291685:
Fix build for !TCP_OFFLOAD case.
291856:
Fix RSS build.
297467:
Remove #ifdef's from various structures used in the cxgbe/cxl driver.
This provides a constant ABI and layout for these structures (especially
struct adapter) avoiding some foot shooting.
302110:
cxgbe(4): Merge netmap support from the ncxgbe/ncxl interfaces to the
vcxgbe/vcxl interfaces and retire the 'n' interfaces. The main
cxgbe/cxl interfaces and tunables related to them are not affected by
any of this and will continue to operate as usual.
The driver used to create an additional 'n' interface for every
cxgbe/cxl interface if "device netmap" was in the kernel. The 'n'
interface shared the wire with the main interface but was otherwise
autonomous (with its own MAC address, etc.). It did not have normal
tx/rx but had a specialized netmap-only data path. r291665 added
another set of virtual interfaces (the 'v' interfaces) to the driver.
These had normal tx/rx but no netmap support.
This revision consolidates the features of both the interfaces into the
'v' interface which now has a normal data path, TOE support, and native
netmap support. The 'v' interfaces need to be created explicitly with
the hw.cxgbe.num_vis tunable. This means "device netmap" will not
result in the automatic creation of any virtual interfaces.
The following tunables can be used to override the default number of
queues allocated for each 'v' interface. nofld* = 0 will disable TOE on
the virtual interface and nnm* = 0 will disable native netmap
support.
# number of normal NIC queues
hw.cxgbe.ntxq_vi
hw.cxgbe.nrxq_vi
# number of TOE queues
hw.cxgbe.nofldtxq_vi
hw.cxgbe.nofldrxq_vi
# number of netmap queues
hw.cxgbe.nnmtxq_vi
hw.cxgbe.nnmrxq_vi
hw.cxgbe.nnm{t,r}xq{10,1}g tunables have been removed.
--- tl;dr version ---
The workflow for netmap on cxgbe starting with FreeBSD 11 is:
1) "device netmap" in the kernel config.
2) "hw.cxgbe.num_vis=2" in loader.conf. num_vis > 2 is ok too, you'll
end up with multiple autonomous netmap-capable interfaces for every
port.
3) "dmesg | grep vcxl | grep netmap" to verify that the interface has
netmap queues.
4) Use any of the 'v' interfaces for netmap. pkt-gen -i vcxl<n>... .
One major improvement is that the netmap interface has a normal data
path as expected.
5) Just ignore the cxl interfaces if you want to use netmap only. No
need to bring them up. The vcxl interfaces are completely independent
and everything should just work.
---------------------
302263:
cxgbe(4): Do not bring up an interface when IFCAP_TOE is enabled on it.
The interface's queues are functional after VI_INIT_DONE (which is short
of interface-up) and that's all that's needed for t4_tom to communicate
with the chip.
jhb [Mon, 31 Oct 2016 22:03:44 +0000 (22:03 +0000)]
MFC 289401: cxgbe(4): support for the kernel RSS option.
You need PCBGROUP and RSS in the kernel config to use this.
Note: Since RSS is not present in 10.x this is mostly a no-op and is
stubbed out by removing the #include of opt_rss.h. This is merged
primarily to reduce conflicts in future merges; however, it does add a
couple of diagnostic messages related to RSS buckets vs RX queue
counts.
dim [Mon, 31 Oct 2016 18:37:44 +0000 (18:37 +0000)]
Pull in r228705 from upstream libc++ trunk (by Eric Fiselier):
[libcxx] Fix PR 22468 - std::function<void()> does not accept
non-void-returning functions
Summary:
The bug can be found here: https://llvm.org/bugs/show_bug.cgi?id=22468
`__invoke_void_return_wrapper` is needed to properly handle calling a
function that returns a value but where the std::function return type
is void. Without this '-Wsystem-headers' will cause
`function::operator()(...)` to not compile.
sbruno [Mon, 31 Oct 2016 16:48:16 +0000 (16:48 +0000)]
MFC r308038:
The buffer address is always overwritten in the extended descriptor format,
so we have to refresh it ... always. This fixes problems reported in NetMap
with em(4) devices after conversion to extended descriptor format in
svn r293331.
mav [Mon, 31 Oct 2016 07:21:37 +0000 (07:21 +0000)]
MFC r307523: Make pass driver better support CAM_CDB_POINTER flag.
Previously the pass driver just ignored the flag, making random kernel code
access a user-space pointer, sometimes causing crashes even for correctly
written applications if the user-level context was switched or swapped out.
This patch tries to copyin the CDB into kernel space to avoid that.
ed [Sat, 29 Oct 2016 15:04:24 +0000 (15:04 +0000)]
Add posix_tnode to <search.h>.
In r307227 I've refactored the binary search tree functions to use the
posix_tnode type. As this change does not apply cleanly to this version
of FreeBSD, only make the change that matters: add the definition of the
newly introduced type.
This will ease source-level compatibility going forward.
Until we can resolve the numerous hole_birth bugs that have cropped up
recently, and come up with a way going forwards to protect users from
corruption, we should disable the hole_birth feature. Using a tunable
allows those who are confident that their data is correct to continue to
take advantage of the feature.
Closes #188
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Author: Paul Dagnelie <pcd@delphix.com>
dsl_dataset_space is looking at the ds_bp's fill count while
dmu_objset_write_ready() is concurrently modifying it. This fix adds an
rrwlock to protect the ds_bp.
Closes #180
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Steve Gonczi <steve.gonczi@delphix.com>
Author: Paul Dagnelie <pcd@delphix.com>
mav [Sat, 29 Oct 2016 08:48:01 +0000 (08:48 +0000)]
MFC r307507, r307509, r307515:
Consider device as clean even if SYNCHRONIZE CACHE failed.
If device reservation was preempted by other initiator, our sync request
will always fail. Without this change CAM tried to sync cache on every
following device close, including numerous GEOM tasting opens/closes,
causing lots of useless noise in logs.
mav [Sat, 29 Oct 2016 08:45:06 +0000 (08:45 +0000)]
MFC r307350: Add LUN options to limit UNMAP and WRITE SAME sizes.
CTL itself has no limits on UNMAP and WRITE SAME sizes. But depending
on the backend, large requests may take too much time. To avoid that, new
configuration options allow hinting to the initiator the maximal sizes it
should not exceed.
mav [Fri, 28 Oct 2016 18:25:32 +0000 (18:25 +0000)]
MFC r300881, r302058 (by asomers):
Avoid issuing spa config updates for physical path when not necessary
ZFS's configuration needs to be updated whenever the physical path for a
device changes, but not when a new device is introduced. This is because new
devices necessarily cause config updates, but only if they are actually
accepted into the pool.
sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev_geom.c
Split vdev_geom_set_physpath out of vdev_geom_attrchanged. When
setting the vdev's physical path, only request a config update if
the physical path has changed. Don't request it when opening a
device for the first time, because the config sync will happen
anyway upstack.
sys/geom/geom_dev.c
Split g_dev_set_physpath and g_dev_set_media out of
g_dev_attrchanged
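The update rule described in the entry above (request a config update only when the physical path actually changes, and never on first open) can be sketched as follows. This is a Python illustration only, not the kernel code: the function name, the dict-based vdev, and the first_open flag are hypothetical stand-ins.

```python
def set_physpath(vdev, new_path, first_open=False):
    """Record new_path on the vdev; return True if a config update is needed.

    Hypothetical model: `vdev` is a plain dict and `first_open` marks the
    initial open, where the config sync happens anyway further up the stack.
    """
    changed = vdev.get("physpath") != new_path
    vdev["physpath"] = new_path
    # Skip the update request on first open or when nothing changed.
    return changed and not first_open


vdev = {}
print(set_physpath(vdev, "enc@1/slot@3", first_open=True))
print(set_physpath(vdev, "enc@1/slot@4"))
```

The path is always stored; only the decision to request an update depends on whether it differs from the previous value.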
mav [Fri, 28 Oct 2016 18:24:05 +0000 (18:24 +0000)]
MFC r300059 (by asomers): Speed up vdev_geom_open_by_guids
Speedup is hard to measure because the only time vdev_geom_open_by_guids
gets called on many drives at the same time is during boot. But with
vdev_geom_open hacked to always call vdev_geom_open_by_guids, operations
like "zpool create" speed up by 65%.
* Read all of a vdev's labels in parallel instead of sequentially.
* In vdev_geom_read_config, don't read the entire label, including
the uberblock. That's a waste of RAM. Just read the vdev config
nvlist. Reduces the IO and RAM involved with tasting from 1MB to
448KB.
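As a rough illustration of the first bullet above, the sketch below issues all label reads at once instead of one after another. It is a user-space Python analogy using a thread pool; the kernel change does not use threads like this, and read_label is a hypothetical callback. The label count of four reflects the number of label copies a ZFS vdev carries.

```python
from concurrent.futures import ThreadPoolExecutor

VDEV_LABELS = 4  # a ZFS vdev carries four copies of its label


def read_labels_parallel(read_label, count=VDEV_LABELS):
    """Call read_label(i) for every label index concurrently.

    `read_label` is a hypothetical callable standing in for the real I/O.
    Results come back in label order because Executor.map preserves order.
    """
    with ThreadPoolExecutor(max_workers=count) as pool:
        return list(pool.map(read_label, range(count)))


labels = read_labels_parallel(lambda index: "label-%d" % index)
print(labels)
```

With slow devices the total wait approaches the time of the slowest single read rather than the sum of all reads.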
mav [Fri, 28 Oct 2016 18:22:00 +0000 (18:22 +0000)]
MFC r298814 (by asomers): Fix a use-after-free when "zpool import" fails
Clear vd->vdev_tsd in vdev_geom_close_locked instead of vdev_geom_detach.
In the latter function, it would fail to happen in certain circumstances
where cp->private was unset. Ideally, the latter should never happen, but
it can happen when vdev open fails, or where spares are involved.
mav [Fri, 28 Oct 2016 18:20:14 +0000 (18:20 +0000)]
MFC r298786 (by asomers):
Refactor vdev_geom_attach and friends to reduce code duplication
sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev_geom.c
Move checks for provider's sectorsize and mediasize into a single
location in vdev_geom_attach. Remove the zfs::vdev::taste class;
it's ok to use the regular vdev class for tasting. Consolidate guid
checks into a single location in vdev_attach_ok. Consolidate some
error handling code from vdev_geom_attach into vdev_geom_detach,
closing a resource leak of geom consumers in the process.
Using zvols as backing devices for ZFS pools is fraught with panics and
deadlocks. For example, attempting to online a missing device in the
presence of a zvol can cause a panic when vdev_geom tastes the zvol. Better
to completely disable vdev_geom from ever opening a zvol. The solution
relies on setting a thread-local variable during vdev_geom_open, and
returning EOPNOTSUPP during zvol_open if that thread-local variable is set.
Remove the check for MUTEX_HELD(&zfsdev_state_lock) in zvol_open. Its intent
was to prevent a recursive mutex acquisition panic. However, the new check
for the thread-local variable also fixes that problem.
Also, fix a panic in vdev_geom_taste_orphan. For an unknown reason, this
function was set to panic. But it can occur that a device disappears during
tasting, and it causes no problems to ignore this departure.
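The thread-local guard described above can be sketched in Python as follows. The function names echo the commit message, but the bodies are hypothetical simplifications: the real code is C in the kernel, and only the shape of the technique (set a per-thread flag around the open, refuse with EOPNOTSUPP while it is set) is taken from the text.

```python
import errno
import threading

_tls = threading.local()  # per-thread state, invisible to other threads


def zvol_open(name):
    """Open a zvol unless the calling thread is inside vdev_geom_open."""
    if getattr(_tls, "in_vdev_geom_open", False):
        raise OSError(errno.EOPNOTSUPP, "zvol may not back a pool vdev")
    return "opened " + name


def vdev_geom_open(open_provider, name):
    """Open a provider with the guard flag set for this thread only."""
    _tls.in_vdev_geom_open = True
    try:
        return open_provider(name)
    finally:
        # Always clear the flag so later opens on this thread succeed.
        _tls.in_vdev_geom_open = False
```

Because the flag is thread-local, an ordinary zvol open on another thread is unaffected while a vdev open is in progress.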
kib [Fri, 28 Oct 2016 12:58:40 +0000 (12:58 +0000)]
MFC r306807:
When making a pause after detecting hard kill of the single-user
shell, ensure that we do sleep for at least the specified time, in
the presence of signals.
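A common way to guarantee a minimum pause when individual sleeps may be cut short is to loop against a deadline. The sketch below shows that pattern in Python. It is not the code from the change: the function name and the injectable sleep and clock parameters are assumptions made so the logic can be exercised without real signals.

```python
import time


def sleep_at_least(seconds, sleep=time.sleep, clock=time.monotonic):
    """Return only after at least `seconds` have elapsed on `clock`.

    `sleep` may return early (for example when interrupted); the loop
    then sleeps again for whatever time is still owed.
    """
    deadline = clock() + seconds
    while True:
        remaining = deadline - clock()
        if remaining <= 0:
            return
        sleep(remaining)
```

Measuring against a monotonic clock rather than counting requested sleep time keeps the total pause correct no matter how many times a sleep is interrupted.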
jhb [Fri, 28 Oct 2016 03:54:19 +0000 (03:54 +0000)]
MFC 303002: Include process IDs in core dumps.
When threads were added to the kernel, the pr_pid member of the
NT_PRSTATUS note was repurposed to store LWP IDs instead of process
IDs. However, the process ID was no longer recorded in core dumps.
This change adds a pr_pid field to prpsinfo (NT_PRSINFO). Rather than
bumping the prpsinfo version number, note parsers can use the note's
payload size to determine if pr_pid is present.
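A note parser could apply the size check described above roughly as follows. This is a sketch under stated assumptions: the byte layout is a simplified, hypothetical one (little-endian, no padding), not the real struct prpsinfo, whose layout depends on the ABI. Only the idea of detecting pr_pid from the payload length comes from the commit message.

```python
import struct

# Hypothetical simplified layout: version, struct size, fname, psargs.
LEGACY = struct.Struct("<iQ17s81s")
# Hypothetical trailing pr_pid field present only in the longer payload.
PID = struct.Struct("<i")


def parse_prpsinfo(payload):
    """Parse a note payload; 'pid' is None when the field is absent."""
    version, _size, fname, psargs = LEGACY.unpack_from(payload, 0)
    info = {
        "version": version,
        "fname": fname.split(b"\0", 1)[0].decode(),
        "psargs": psargs.split(b"\0", 1)[0].decode(),
        "pid": None,
    }
    # The version number did not change, so the payload length is the
    # only signal that pr_pid follows the legacy fields.
    if len(payload) >= LEGACY.size + PID.size:
        (info["pid"],) = PID.unpack_from(payload, LEGACY.size)
    return info
```

Older core dumps parse exactly as before, and newer ones yield the process ID without any version bump.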
davidcs [Wed, 26 Oct 2016 18:13:30 +0000 (18:13 +0000)]
MFC r307578
1. Use taskqueue_create() instead of taskqueue_create_fast() for both
fastpath and slowpath taskqueues.
2. Service all transmits in taskqueue threads.
3. additional stats counters for keeping track of
- bd availability
- tx buf ring not emptied in the fp task queue.
These are drained via timeout taskqueue.
- tx attempts during link down.
jch [Tue, 25 Oct 2016 12:58:36 +0000 (12:58 +0000)]
MFC r307551:
Fix a double-free when an inp transitions to INP_TIMEWAIT state
after having been dropped.
This change enforces in_pcbdrop() logic in tcp_input():
"in_pcbdrop() is used by TCP to mark an inpcb as unused and avoid future packet
delivery or event notification when a socket remains open but TCP has closed."
sephe [Wed, 19 Oct 2016 08:45:19 +0000 (08:45 +0000)]
MFC 307261
hyperv/stor: Fix off-by-one bug; this brings back TRIM support.
Submitted by: Hongjiang Zhang <honzhan microsoft com>
Reported by: Lili Deng <v-lide microsoft com>
Sponsored by: Microsoft
Differential Revision: https://reviews.freebsd.org/D8238
sevan [Sun, 16 Oct 2016 23:44:02 +0000 (23:44 +0000)]
MFC r306732:
Document the history of fdisk based on the original post to comp.unix.bsd by Julian Elischer [1] and the Mach 2.5 Installation notes [2].
I was unable to pinpoint the exact version of Mach in which the fdisk utility appeared, as I could not find documentation older than version 2.5 & no source code or repo history.
fdisk utility appears as a separate utility[3] in v2.5. Due to this, I have avoided stating the exact version fdisk first appeared in Mach.
Add authors section.
sevan [Sun, 16 Oct 2016 23:39:15 +0000 (23:39 +0000)]
MFC r306731:
Document the history of fdisk based on the original post to comp.unix.bsd by Julian Elischer [1] and the Mach 2.5 Installation notes [2].
I was unable to pinpoint the exact version of Mach in which the fdisk utility appeared, as I could not find documentation older than version 2.5 & no source code or repo history.
fdisk utility appears as a separate utility[3] in v2.5. Due to this, I have avoided stating the exact version fdisk first appeared in Mach.
Add authors section.
Make correction pointed out by igor
[1] https://groups.google.com/d/topic/comp.unix.bsd/Hhi45vAHxDg/discussion
[2] ftp://ftp.mcs.vuw.ac.nz/doc/misc/mach-i386-doc/i386_install.ps
[3] ftp://ftp.mcs.vuw.ac.nz/doc/misc/mach-i386-doc/i386_manpages.ps
PR: 212469
Approved by: bcr (mentor)
Differential Revision: https://reviews.freebsd.org/D8104
sevan [Sun, 16 Oct 2016 23:28:58 +0000 (23:28 +0000)]
MFC r306724:
Add history section for bsdlabel(8)
http://minnie.tuhs.org/cgi-bin/utree.pl?file=4.3BSD-Tahoe/usr/man/cat8/disklabel.0
Remove tab after space, highlighted by igor
sevan [Sun, 16 Oct 2016 23:09:04 +0000 (23:09 +0000)]
MFC r306718:
Add history section for echo(1)
Sourced using the draft copy of the second edition manual
http://www.tuhs.org/Archive/PDP-11/Distributions/research/1972_stuff/unix_2nd_edition_manual.pdf
sevan [Sun, 16 Oct 2016 22:22:46 +0000 (22:22 +0000)]
MFC r306611:
Amend history to mention predecessor originated from 386BSD[1] & current implementation from NetBSD[2].
Reword history since the utility was renamed once more in FreeBSD 5.0.
Separate out author & historical information regarding character code conversion.
Add AUTHORS section.
arybchik [Sat, 15 Oct 2016 13:45:12 +0000 (13:45 +0000)]
MFC r307038
sfxge(4): update external port mapping for Medford
Extend the mapping table for external port numbering to support port modes
which output to the second external port only. Where supported, map from
the current port mode rather than inferring from all the available modes.
Updated comments for clarity.
Submitted by: Richard Houldsworth <rhouldsworth at solarflare.com>
Sponsored by: Solarflare Communications, Inc.
bapt [Sat, 15 Oct 2016 12:38:21 +0000 (12:38 +0000)]
MFC r306852
Incorporate a change from OpenBSD by millert@OpenBSD.org
Don't warn about valid time zone abbreviations. POSIX
through 2000 says that an abbreviation cannot start with ':', and
cannot contain ',', '-', '+', NUL, or a digit. POSIX from 2001
on changes this rule to say that an abbreviation can contain only
'-', '+', and alphanumeric characters from the portable character
set in the current locale. To be portable to both sets of rules,
an abbreviation must therefore use only ASCII letters. Adapted
from tzcode2015f.
This is needed to be able to update tzdata to a newer version.
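The two POSIX rule sets quoted above can be written as predicates to show why their intersection is "ASCII letters only". This is an illustrative Python sketch, not the tzcode implementation: the function names are hypothetical, and the 2001 rule is modelled by reading "alphanumeric characters from the portable character set" as ASCII letters and digits.

```python
import string


def valid_posix_2000(abbr):
    """Pre-2001 rule: no leading ':', and no ',', '-', '+', NUL or digit."""
    if abbr.startswith(":"):
        return False
    return not any(ch in ",-+\0" or ch.isdigit() for ch in abbr)


def valid_posix_2001(abbr):
    """2001 rule: only '-', '+' and portable alphanumeric characters."""
    allowed = string.ascii_letters + string.digits + "-+"
    return all(ch in allowed for ch in abbr)


def is_portable(abbr):
    """Non-empty and acceptable under both rule sets."""
    return bool(abbr) and valid_posix_2000(abbr) and valid_posix_2001(abbr)


for sample in ("EST", "+03", "UTC3", "E_T"):
    print(sample, is_portable(sample))
```

An abbreviation such as "+03" satisfies only the newer rule and "E_T" only the older one, so neither is portable to both.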
5959 clean up per-dataset feature count code
Reviewed by: Toomas Soome <tsoome@me.com>
Reviewed by: George Wilson <george@delphix.com>
Reviewed by: Alex Reece <alex@delphix.com>
Approved by: Richard Lowe <richlowe@richlowe.net>
This patch simply removes this macro from dsl_dataset.h.
Reviewed by: Dan Kimmel <dan.kimmel@delphix.com>
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Dan McDonald <danmcd@omniti.com>
Reviewed by: Igor Kozhukhov <ikozhukhov@gmail.com>
Author: Matthew Ahrens <mahrens@delphix.com>
mav [Fri, 14 Oct 2016 07:45:10 +0000 (07:45 +0000)]
MFC r305561: MFV r305560:
7278 tuning zfs_arc_max does not impact arc_c_min
When changing zfs_arc_max (e.g. as zdb does), it may be set to less
than the default arc_c_min. arc_c_min should decrease to not be more than
arc_c_max, but it doesn't; therefore tuning of arc_c_max is ineffective.
Reviewed by: Dan Kimmel <dan.kimmel@delphix.com>
Reviewed by: Paul Dagnelie <paul.dagnelie@delphix.com>
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Igor Kozhukhov <ikozhukhov@gmail.com>
Author: Matthew Ahrens <mahrens@delphix.com>
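The relationship that issue 7278 above asks for (lowering the maximum must also lower the minimum when needed) can be sketched as a small Python function. The name apply_arc_max is hypothetical and the real ARC tuning code does considerably more; only the clamping step is shown.

```python
def apply_arc_max(arc_c_min, new_max):
    """Return (arc_c_min, arc_c_max) after tuning the maximum to new_max."""
    arc_c_max = new_max
    # Pull the minimum down with the maximum so min <= max always holds.
    arc_c_min = min(arc_c_min, arc_c_max)
    return arc_c_min, arc_c_max


GIB = 1 << 30
print(apply_arc_max(1 * GIB, GIB // 2))
print(apply_arc_max(1 * GIB, 4 * GIB))
```

Without the clamp, a maximum set below the default minimum would have no effect, which is the behavior the entry describes.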
mav [Fri, 14 Oct 2016 07:40:20 +0000 (07:40 +0000)]
MFC r305340: MFC r305337:
7004 dmu_tx_hold_zap() does dnode_hold() 7x on same object
Using a benchmark which has 32 threads creating 2 million files in the
same directory, on a machine with 16 CPU cores, I observed poor
performance. I noticed that dmu_tx_hold_zap() was using about 30% of
all CPU, and doing dnode_hold() 7 times on the same object (the ZAP
object that is being held).
dmu_tx_hold_zap() keeps a hold on the dnode_t the entire time it is
running, in dmu_tx_hold_t:txh_dnode, so it would be nice to use the
dnode_t that we already have in hand, rather than repeatedly calling
dnode_hold(). To do this, we need to pass the dnode_t down through
all the intermediate calls that dmu_tx_hold_zap() makes, making these
routines take the dnode_t* rather than an objset_t* and a uint64_t
object number. In particular, the following routines will need to have
analogous *_by_dnode() variants created:
This can improve performance on the benchmark described above by 100%,
from 30,000 file creations per second to 60,000. (This improvement is on
top of that provided by working around the object allocation issue. Peak
performance of ~90,000 creations per second was observed with 8 CPUs;
adding CPUs past that decreased performance due to lock contention.) The
CPU used by dmu_tx_hold_zap() was reduced by 88%, from 340 CPU-seconds
to 40 CPU-seconds.
Sponsored by: Intel Corp.
Closes #109
Reviewed by: Steve Gonczi <steve.gonczi@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: Ned Bass <bass6@llnl.gov>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Author: Matthew Ahrens <mahrens@delphix.com>
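The percentages quoted in the 7004 entry above follow directly from the raw figures given there. A quick check in Python (the helper name is just for illustration):

```python
def percent_change(before, after):
    """Signed change from `before` to `after`, as a percentage of `before`."""
    return (after - before) / before * 100


# File creations per second: 30,000 -> 60,000.
print(percent_change(30000, 60000))
# CPU-seconds used by dmu_tx_hold_zap(): 340 -> 40.
print(round(percent_change(340, 40)))
```

The first figure is a 100% increase in throughput; the second is a reduction of about 88% in CPU time, matching the numbers stated in the message.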
This resolves two 'zfs recv' issues. First, when receiving into an
existing filesystem, a snapshot created during the receive process is
not added to the guid->dataset map for the stream, resulting in failed
lookups for deduped streams when a WRITE_BYREF record refers to a
snapshot received earlier in the stream. Second, the newly created
snapshot was also not set properly, referencing the snapshot before the
new receiving dataset rather than the existing filesystem.
Closes #159
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Dan Kimmel <dan.kimmel@delphix.com>
Author: Chris Williamson <chris.williamson@delphix.com>
mav [Fri, 14 Oct 2016 07:35:43 +0000 (07:35 +0000)]
MFC r305338: MFV r305335: 7003 zap_lockdir() should tag hold
zap_lockdir() / zap_unlockdir() should take a "void *tag" argument which
tags the hold on the zap. This will help diagnose programming errors
which misuse the hold on the ZAP.
Sponsored by: Intel Corp.
Closes #108
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: Steve Gonczi <steve.gonczi@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Author: Matthew Ahrens <mahrens@delphix.com>
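The value of tagging a hold, as proposed for zap_lockdir() / zap_unlockdir() above, is that a release by the wrong party can be detected at the point of misuse. The Python sketch below illustrates the general technique with a hypothetical TaggedHold class; it does not reproduce the ZAP code or its actual signatures.

```python
class TaggedHold:
    """Hypothetical hold tracker that records who took each hold."""

    def __init__(self):
        self.holders = []

    def lock(self, tag):
        # Remember the tag so the matching release can be verified.
        self.holders.append(tag)

    def unlock(self, tag):
        if tag not in self.holders:
            raise RuntimeError("release by %r without a matching hold" % (tag,))
        self.holders.remove(tag)


hold = TaggedHold()
hold.lock("reader_a")
hold.unlock("reader_a")
print(hold.holders)
```

With an untagged counter the same mistake would only show up later as a wrong count, far from the code that caused it.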
https://www.illumos.org/issues/7230
A test failure occurred where a send stream had only a BEGIN record. This
should not be possible if the send returns without error. Prevented this from
happening in the future by adding an assertion to dmu_send_impl() to verify
that if the function returns 0 (success) both a BEGIN and END record are
present. Did this by adding flags to dmu_sendarg_t (indicating whether BEGIN or
END records were sent), having dump_record() set the flags appropriately, and
adding a VERIFY statement to dmu_send_impl().
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Paul Dagnelie <pcd@delphix.com>
Reviewed by: Igor Kozhukhov <ikozhukhov@gmail.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Author: Matt Krantz <matt.krantz@delphix.com>
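The flag-and-verify approach described in the entry above can be modelled in a few lines of Python. This is a hypothetical sketch: SendStream and send are illustrative names, and the real change lives in dmu_sendarg_t, dump_record() and dmu_send_impl().

```python
class SendStream:
    """Hypothetical stream that tracks whether BEGIN and END were emitted."""

    def __init__(self):
        self.records = []
        self.sent_begin = False
        self.sent_end = False

    def dump_record(self, kind):
        # Set the matching flag as a side effect of writing the record.
        if kind == "BEGIN":
            self.sent_begin = True
        elif kind == "END":
            self.sent_end = True
        self.records.append(kind)


def send(stream, kinds):
    """Emit records, then verify that success implies BEGIN and END."""
    for kind in kinds:
        stream.dump_record(kind)
    err = 0
    if err == 0 and not (stream.sent_begin and stream.sent_end):
        raise AssertionError("successful send is missing BEGIN or END")
    return err
```

A stream containing only a BEGIN record now fails loudly at the point of the send instead of surfacing later as a confusing test failure.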