sephe [Fri, 13 Oct 2017 05:09:56 +0000 (05:09 +0000)]
MFC 324489,324516
324489
hyperv/hn: Workaround erroneous hash type observed on WS2016.
Background:
- UDP 4-tuple hash type is unconditionally enabled in Hyper-V on WS2016,
which is _not_ affected by NDIS_OBJTYPE_RSS_PARAMS.
- Non-fragment UDP/IPv4 datagrams' hash type is delivered to VM as
TCP_IPV4.
Currently this erroneous behavior only applies to WS2016/Windows10.
Force l3/l4 protocol check, if the RXed packet's hash type is TCP_IPV4,
and the Hyper-V is running on WS2016/Windows10. If the RXed packet is
UDP datagram, adjust mbuf hash type to UDP_IPV4.
Sponsored by: Microsoft
324516
hyperv/hn: Workaround erroneous hash type observed on WS2016 for VF.
hselasky [Thu, 12 Oct 2017 08:27:57 +0000 (08:27 +0000)]
MFC r324320:
Add support for new cuse(3) error code, CUSE_ERR_NO_DEVICE.
This error code is useful when emulating Linux input event
devices from userspace.
rmacklem [Wed, 11 Oct 2017 13:20:24 +0000 (13:20 +0000)]
MFC: r324074
Fix a memory leak that occurred in the pNFS client.
When a "pnfs" NFSv4.1 mount was unmounted, it didn't free up the layouts
and deviceinfo structures. This leak only affects "pnfs" mounts and only
when the mount is umounted.
Found while testing the pNFS Flexible File layout client code.
hselasky [Wed, 11 Oct 2017 10:12:22 +0000 (10:12 +0000)]
MFC r315405, r323351 and r323364:
Add helper function similar to ip_dev_find() to the LinuxKPI to lookup
a network device by its IPv6 address in the given VNET.
emaste [Wed, 11 Oct 2017 00:31:54 +0000 (00:31 +0000)]
MFC r309151: Use explicit 0x200000 for the amd64 kernel physaddr
(instead of MAXPAGESIZE)
MAXPAGESIZE is not well defined by the GNU ld documentation.
Different linkers, and different versions of the same linker, use
different MAXPAGESIZE values. Current versions of GNU gold and LLVM's
lld use 4K. When set to 4K the kernel panics at boot due to an issue
with x86bios.
Here we want the kernel physaddr to be the amd64 superpage size, so use
that value (2MB) explicitly. With this change GNU gold and LLVM lld can
link a working amd64 kernel.
PR: 214718 (x86bios)
Sponsored by: The FreeBSD Foundation
sephe [Tue, 10 Oct 2017 05:52:28 +0000 (05:52 +0000)]
MFC 323728,323729
323728
hyperv/hn: Fix MTU setting
- Add size of an ethernet header to the value configured to NVS. This
does not seem to have any effects if MTU is 1500, but fix hypervisor
side's setting if MTU > 1500.
- Override the MTU setting according to the view from the hypervisor
side.
Sponsored by: Microsoft
Differential Revision: https://reviews.freebsd.org/D12352
323729
hyperv/hn: Incease max supported MTU
Sponsored by: Microsoft
Differential Revision: https://reviews.freebsd.org/D12365
sephe [Tue, 10 Oct 2017 05:46:57 +0000 (05:46 +0000)]
MFC 323727,324316
323727
hyperv/hn: Apply VF's RSS setting
Since in Azure SYN and SYN|ACK go through the synthetic parts while the
rest of the same TCP flow goes through the VF, apply VF's RSS settings
to synthetic parts to have a consistent hash value/type for the same TCP
flow.
Sponsored by: Microsoft
Differential Revision: https://reviews.freebsd.org/D12333
sephe [Tue, 10 Oct 2017 05:25:46 +0000 (05:25 +0000)]
MFC 323170
if: Add ioctls to get RSS key and hash type/function.
It will be needed by hn(4) to configure its RSS key and hash
type/function in the transparent VF mode in order to match VF's
RSS settings. The description of the transparent VF mode and
the RSS hash value issue are here:
https://svnweb.freebsd.org/base?view=revision&revision=322299
https://svnweb.freebsd.org/base?view=revision&revision=322485
These are generic enough to promise two independent IOCs instead
of abusing SIOCGDRVSPEC.
Setting RSS key and hash type/function is a different story,
which probably requires more discussion.
Comment about UDP_{IPV4,IPV6,IPV6_EX} were only in the patch
in the review request; these hash types are standardized now.
Reviewed by: gallatin
Sponsored by: Microsoft
Differential Revision: https://reviews.freebsd.org/D12174
MFC r324098:
Some mbuf related fixes in icmp_error()
* check mbuf length before doing mtod() and accessing to IP header;
* update oip pointer and all depending pointers after m_pullup();
* remove extra checks and extra parentheses, wrap long lines;
rmacklem [Sun, 8 Oct 2017 21:20:25 +0000 (21:20 +0000)]
MFC: r323978
Change a panic to an error return.
There was a panic() in the NFS server's write operation that didn't
need to be a panic() and could just be an error return.
This patch makes that change.
Found by code inspection during development of the pNFS service.
Relevant vendor changes:
PR #905: Support for Zstandard read and write filters
PR #922: Avoid overflow when reading corrupt cpio archive
Issue #935: heap-based buffer overflow in xml_data (CVE-2017-14166)
OSS-Fuzz 2936: Place a limit on the mtree line length
OSS-Fuzz 2394: Ensure that the ZIP AES extension header is large enough
OSS-Fuzz 573: Read off-by-one error in RAR archives (CVE-2017-14502)
alc [Sun, 8 Oct 2017 17:14:45 +0000 (17:14 +0000)]
MFC r324173
When an I/O error occurs on page out, there is no need to dirty the page,
because it is already dirty. Instead, assert that the page is dirty.
ngie [Sat, 7 Oct 2017 22:59:09 +0000 (22:59 +0000)]
MFC r306743,r317712:
r306743 (by markj):
gmirror: Bump the syncid if broken disks are found during startup.
Consider a mirror with two components, m1 and m2. Suppose a hardware error
results in the removal of m2, with m1's genid bumped. Suppose further that
a replacement mirror component m3 is created and synchronized, after which
the system is shut down uncleanly. During a subsequent bootup, if gmirror
tastes m1 and m2 first, m2 will be removed from the mirror because it is
broken, but the mirror will be started without bumping the syncid on m1
because all elements of the mirror are accounted for. Then m3 will be
added to the already-running mirror with the same syncid as m1, so the
components will not be synchronized despite the unclean shutdown.
Handle this scenario by bumping the syncid of healthy components if any
broken mirrors are discovered during mirror startup.
r317712 (by markj):
Synchronize unclean mirrors before adding them to a running gmirror.
During gmirror startup, if component mirrors are found to be dirty as is
typical after a system crash, the mirrors are synchronized to the mirror
with highest priority. However if a gmirror starts without all of its
mirrors present, for example because of some transient delays during
tasting, the remaining mirrors must be synchronized before they may become
active.
alc [Sat, 7 Oct 2017 21:13:54 +0000 (21:13 +0000)]
MFC r305685
Various changes to pmap_ts_referenced()
Move PMAP_TS_REFERENCED_MAX out of the various pmap implementations and
into vm/pmap.h, and describe what its purpose is. Eliminate the archaic
"XXX" comment about its value. I don't believe that its exact value, e.g.,
5 versus 6, matters.
Update the arm64 and riscv pmap implementations of pmap_ts_referenced()
to opportunistically update the page's dirty field.
On amd64, use the PDE value already cached in a local variable rather than
dereferencing a pointer again and again.
alc [Sat, 7 Oct 2017 20:22:04 +0000 (20:22 +0000)]
MFC r321386,321393
Utilize pmap_enter(..., psind=1) in vm_fault_soft_fast() on amd64. (The
Differential Revision discusses the benefits of this change.)
Add a function, vm_reserv_to_superpage(), that returns the superpage
containing the specified base page.
Previously we'd have an assertion failure in cap_rights_is_set if
sysdecode_cap_rights is called with an invalid cap_rights_t, so test for
validity first.
emaste [Sat, 7 Oct 2017 20:18:20 +0000 (20:18 +0000)]
MFC r323405: newvers.sh: speed up failing git-svn revision search
In the case of running newvers.sh on a git tree w/o git-svn-id notes we
previously piped the entire 'git log' to grep. Add --grep to the log
invocation to avoid processing log entries of no interest.
This saves about 2-3 seconds of newvers.sh run time on my SSD laptop.
Later changes will bring further speedups.
emaste [Sat, 7 Oct 2017 20:17:03 +0000 (20:17 +0000)]
MFC r323394: newvers.sh: accept "git-svn-id:" at the start of a line only
This prevents incorrect subversion revision detection when "git svn" is
not being used to get the sources but git is available. Previously old
subversion revisions included in commit messages were favoured over the
more recent and correct revisions in git notes.
For example cf1f35574722 represents r315395 but was treated as r313908
which is referenced in the commit message. Commits following
r315395/cf1f35574722 but before another commit with a git-svn-id
reference in the commit message would be treated as r313908 as well.
Patch from PR updated to accommodate the initial four space indent in
`git log` ouptut.
emaste [Sat, 7 Oct 2017 20:14:30 +0000 (20:14 +0000)]
MFC r323438: make-memstick.sh: use UFSv2
There's not much practical difference as far as install media is
concerned but newfs creates UFSv2 by default and it is sensible to use
the contemporary UFS version.
I also intend to change makefs to create UFSv2 by default (to match
newfs) so we'll want make-memstick.sh to be explicit, rather than
relying on the host tool's default.
alc [Sat, 7 Oct 2017 18:36:42 +0000 (18:36 +0000)]
MFC r319542,321003,321378
Eliminate duplication of the pmap and pv list unlock operations in
pmap_enter() by implementing a single return path. Otherwise, the
duplication will only increase with the upcoming support for psind == 1.
Extract the innermost loop of pmap_remove() out into its own function,
pmap_remove_ptes(). (This new function will also be used by an upcoming
change to pmap_enter() that adds support for psind == 1 mappings.)
Add support for pmap_enter(..., psind=1) to the amd64 pmap. In other words,
add support for explicitly requesting that pmap_enter() create a 2MB page
mapping. (Essentially, this feature allows the machine-independent layer to
create superpage mappings preemptively, and not wait for automatic promotion
to occur.)
Export pmap_ps_enabled() to the machine-independent layer.
Add a flag to pmap_pv_insert_pde() that specifies whether it should fail or
reclaim a PV entry when one is not available.
Refactor pmap_enter_pde() into two functions, one by the same name, that is
a general-purpose function for creating PDE PG_PS mappings, and another,
pmap_enter_2mpage(), that is used to prefault 2MB read- and/or execute-only
mappings for execve(2), mmap(2), and shmat(2).
alc [Sat, 7 Oct 2017 18:08:37 +0000 (18:08 +0000)]
MFC r320980,321377
Generalize vm_page_ps_is_valid() to support testing other predicates on
the (super)page, renaming the function to vm_page_ps_test().
In vm_page_ps_test(), always check that the base pages within the specified
superpage all belong to the same object. To date, that check has not been
needed, but upcoming changes require it.
alc [Sat, 7 Oct 2017 17:32:39 +0000 (17:32 +0000)]
MFC r321015
Style-only change: Consistently use the variable name "pdpg" throughout
this file. Previously, half of the pointers to a vm_page being used as
a page directory page were named "pdpg" and the rest were named "mpde".
alc [Sat, 7 Oct 2017 17:20:31 +0000 (17:20 +0000)]
MFC r323973,324087
Optimize vm_page_try_to_free(). Specifically, the call to pmap_remove_all()
can be avoided when the page's containing object has a reference count of
zero. (If the object has a reference count of zero, then none of its pages
can possibly be mapped.)
Address nearby style issues in vm_page_try_to_free(), and change its
return type to "bool".
Optimize vm_object_page_remove() by eliminating pointless calls to
pmap_remove_all(). If the object to which a page belongs has no
references, then that page cannot possibly be mapped.
alc [Sat, 7 Oct 2017 16:55:44 +0000 (16:55 +0000)]
MFC r323656
Modify blst_leaf_alloc to take only the cursor argument.
Modify blst_leaf_alloc to find allocations that cross the boundary between
one leaf node and the next when those two leaves descend from the same
meta node.
Update the hint field for leaves so that it represents a bound on how
large an allocation can begin in that leaf, where it currently represents
a bound on how large an allocation can be found within the boundaries of
the leaf.
The first phase of blst_leaf_alloc currently shrinks sequences of
consecutive 1-bits in mask until each has been shrunken by count-1 bits,
so that any bits remaining show where an allocation can begin, or until
all the bits have disappeared, in which case the allocation fails. This
change amends that so that the high-order bit is copied, as if, when the
last block was free, it was followed by an endless stream of free
blocks. It also amends the early stopping condition, so that the shrinking
of 1-sequences stops early when there are none, or there is only one
unbounded one remaining.
The search for the first set bit is unchanged, and the code path
thereafter is mostly unchanged unless the first set bit is in a position
that makes some of those copied sign bits matter. In that case, we look
for a next leaf, and at what blocks it can provide, to see if a
cross-boundary allocation is possible.
The hint is updated on a successful allocation that clears the last bit,
but it not updated on a failed allocation that leaves the last bit
set. So, as long as the last block is free, the hint value for the leaf is
large. As long as the last block is free, and there's a next leaf, a large
allocation can begin here, perhaps. A stricter rule than this would mean
that allocations and frees in one leaf could require hint updates to the
preceding leaf, and this change seeks to leave the freeing code
unmodified.
Define BLIST_BMAP_MASK, and use it for bit masking in blst_leaf_free and
blist_leaf_fill, as well as in blst_leaf_alloc.
avg [Thu, 5 Oct 2017 06:54:25 +0000 (06:54 +0000)]
MFC r323578,r323769: dounmount: do not release the mount point's reference
on the covered vnode
As long as mnt_ref is not zero there can be a consumer that might try
to access mnt_vnodecovered. For this reason the covered vnode must not
be freed until mnt_ref goes to zero.
So, move the release of the covered vnode to vfs_mount_destroy.
trasz [Wed, 4 Oct 2017 12:06:24 +0000 (12:06 +0000)]
MFC r320892:
Make fsck_y_enable default to passing pass -R to fsck_ffs(8) in addition
to -y. To me, fsck_y_enable means "try as hard as possible", and without
-R, it... well, doesn't.
trasz [Wed, 4 Oct 2017 12:04:35 +0000 (12:04 +0000)]
MFC r320803:
Fix "mount -uw /" when the filesystem type doesn't match.
This basically makes "mount -uw /" work when the filesystem
mounted on / is NFS, but the one configured in fstab(5) is UFS,
which can happen when you forget to modify fstab.
Note that the whole special case ("else if (argv[0][0] == '/'")
is probably not needed anyway. I'll take a look at removing it
altogether; for now this is a minimally intrusive fix.
trasz [Wed, 4 Oct 2017 12:01:02 +0000 (12:01 +0000)]
MFC r323225:
Reflect realtime and idle priorities in ps(1) state flags, same like
we do for the usual nice values. It could be argued that they should
use another set of indicators, since the underlying mechanism is
different, but they match the description in the manual page, and so
I think it's ok to not overcomplicate things.
trasz [Wed, 4 Oct 2017 11:59:53 +0000 (11:59 +0000)]
MFC r321422:
Improve the ktrace(1) man page to make it slightly more obvious that there
are _two_ options that control its behaviour wrt child processes; slightly
improve the example[1], and add Xrefs.
trasz [Wed, 4 Oct 2017 11:58:52 +0000 (11:58 +0000)]
MFC r323183:
Make root_mount_rel(9) ignore NULL arguments, like it used to before r313351.
It would be better to fix API consumers to not pass NULL there - most of them,
such as gmirror, already contain the neccessary checks - but this is easier
and much less error-prone.
One known user-visible result is that it fixes panic on a failed "graid label".
This brings the CloudABI code more or less in sync with HEAD.
r321514:
Upgrade to the latest sources generated from the CloudABI specification.
The CloudABI specification has had some minor changes over the last half
year. No substantial features have been added, but some features that
are deemed unnecessary in retrospect have been removed:
- mlock()/munlock():
These calls tend to be used for two different purposes: real-time
support and handling of sensitive (cryptographic) material that
shouldn't end up in swap. The former use case is out of scope for
CloudABI. The latter may also be handled by encrypting swap.
Removing this has the advantage that we no longer need to worry about
having resource limits put in place.
- SOCK_SEQPACKET:
Support for SOCK_SEQPACKET is rather inconsistent across various
operating systems. Some operating systems supported by CloudABI (e.g.,
macOS) don't support it at all. Considering that they are rarely used,
remove support for the time being.
- getsockname(), getpeername(), etc.:
A shortcoming of the sockets API is that it doesn't allow you to
create socket(pair)s, having fake socket addresses associated with
them. This makes it harder to test applications or transparently
forward (proxy) connections to them.
With CloudABI, we're slowly moving networking connectivity into a
separate daemon called Flower. In addition to passing around socket
file descriptors, this daemon provides address information in the form
of arbitrary string labels. There is thus no longer any need for
requesting socket address information from the kernel itself.
This change also updates consumers of the generated code accordingly.
Even though system calls end up getting renumbered, this won't cause any
problems in practice. CloudABI programs always call into the kernel
through a kernel-supplied vDSO that has the numbers updated as well.
Obtained from: https://github.com/NuxiNL/cloudabi
r322885:
Sync CloudABI compatibility against the latest upstream version (v0.13).
With Flower (CloudABI's network connection daemon) becoming more
complete, there is no longer any need for creating any unconnected
sockets. Socket pairs in combination with file descriptor passing is all
that is necessary, as that is what is used by Flower to pass network
connections from the public internet to listening processes.
Remove all of the kernel bits that were used to implement socket(),
listen(), bindat() and connectat(). In principle, accept() and
SO_ACCEPTCONN may also be removed, but there are still some consumers
left.
Obtained from: https://github.com/NuxiNL/cloudabi
r323015:
Complete the CloudABI networking refactoring.
Now that all of the packaged software has been adjusted to either use
Flower (https://github.com/NuxiNL/flower) for making incoming/outgoing
network connections or can have connections injected, there is no longer
need to keep accept() around. It is now a lot easier to write networked
services that are address family independent, dual-stack, testable, etc.
Remove all of the bits related to accept(), but also to
getsockopt(SO_ACCEPTCONN).
r323177:
Merge pipes and socket pairs.
Now that CloudABI's sockets API has been changed to be addressless and
only connected socket instances are used (e.g., socket pairs), they have
become fairly similar to pipes. The only differences on CloudABI is that
socket pairs additionally support shutdown(), send() and recv().
To simplify the ABI, we've therefore decided to remove pipes as a
separate file descriptor type and just let pipe() return a socket pair
of type SOCK_STREAM. S_ISFIFO() and S_ISSOCK() are now defined
identically.
avg [Mon, 2 Oct 2017 12:54:01 +0000 (12:54 +0000)]
MFC r323433,r323793,r323915: MFV r323110: 8558 lwp_create() returns EAGAIN
on system with more than 80K ZFS filesystems, and followups
r323433: MFV r323110: 8558 lwp_create() returns EAGAIN on system with more than 80K ZFS filesystems
r323793: MFV r323792: 8602 remove unused "dp_early_sync_tasks" field from "dsl_pool" structure
r323915: MFV r323914: 8661 remove "zil-cw2" dtrace probe
rmacklem [Sun, 1 Oct 2017 21:45:15 +0000 (21:45 +0000)]
MFC: r323689
Fix bogus FREAD with NFSV4OPEN_ACCESSREAD. No functional change.
The code in nfscl_doflayoutio() bogusly used FREAD instead of
NFSV4OPEN_ACCESSREAD. Since both happen to be defined as "1", this
worked and the patch doesn't result in a functional change.
Found by inspection during development of Flex File Layout support.
eugen [Sun, 1 Oct 2017 19:39:27 +0000 (19:39 +0000)]
MFC r323873, r324081: Unprotected modification of ng_iface(4)
private data leads to kernel panic. Fix a race with per-node
read-mostly lock and refcounting for a hook.
alc [Sun, 1 Oct 2017 16:57:08 +0000 (16:57 +0000)]
MFC r323961
Since the page "frame" doesn't belong to a vm object, it can't be paged
out. Since it can't be paged out, it is never actually enqueued in a
paging queue. Nonetheless, passing PQ_INACTIVE to vm_page_unwire()
creates the appearance that the page "frame" is being enqueued in the
inactive queue. As of r288122, we can avoid this false impression by
passing PQ_NONE.
alc [Sun, 1 Oct 2017 16:29:20 +0000 (16:29 +0000)]
MFC r323981
Modernize the use of vm_page_unwire(). Since r288122, vm_page_unwire()
has returned TRUE when the wire count transitions to zero, eliminating
the need for callers to inspect the page's wire count.
avg [Sun, 1 Oct 2017 15:03:44 +0000 (15:03 +0000)]
MFV r323796: fix memory leak in g_bio zone introduced in r320452
I overlooked the fact that that ZIO_IOCTL_PIPELINE does not include
ZIO_STAGE_VDEV_IO_DONE stage. We do allocate a struct bio for an ioctl
zio (a disk cache flush), but we never freed it.
This change splits bio handling into two groups, one for normal
read/write i/o that passes data around and, thus, needs the abd data
tranform; the other group is for "data-less" i/o such as trim and cache
flush.
avg [Sun, 1 Oct 2017 14:58:43 +0000 (14:58 +0000)]
MFC r323797: add vfs_zfs.abd_chunk_size tunable
It is reported that the default value of 4KB results in a substantial
memory use overhead (at least, on some configurations). Using 1KB seems
to reduce the overhead significantly.
Respect MK_TCSH with build-tools and native-xtools
This helps reduce the WORLDTMP footprint slightly.
Based on a patch I submitted 5 years ago to GNATS.
PR: 174051
Relnotes: yes (anyone who cross-builds with MK_TCSH=yes will run into
build failures if the host doesn't have tcsh(1))
Reminded by: Fabian Keil <fk@fabiankeil.de>
MFC r323391
To analyze the allocation of swap blocks by blist functions, add a method
for analyzing the radix tree structures and reporting on the number, and
sizes, of maximal intervals of free blocks. The report includes the number
of maximal intervals, and also the number of them in each of several size
ranges, from small (size 1, or 3 to 4) to large (28657 to 46367) with size
boundaries defined by Fibonacci numbers. The report is written in the test
tool with the 's' command, or in a running kernel by sysctl.
The analysis of the radix tree frequently computes the position of the lone
bit set in a u_daddr_t, a computation that also appears in leaf allocation.
That computation has been moved into a function of its own, and optimized
for cases where an inlined machine instruction can replace the usual binary
search.
MFC r322459,322897
The *_meta_* functions include a radix parameter, a blk parameter, and
another parameter that identifies a starting point in the memory address
block. Radix is a power of two, blk is a multiple of radix, and the
starting point is in the range [blk, blk+radix), so that blk can always be
computed from the other two. This change drops the blk parameter from the
meta functions and computes it instead. (On amd64, for example, this
change reduces subr_blist.o's text size by 7%.)
It also makes the radix parameters unsigned to address concerns that the
calculation of '-radix' might overflow without the -fwrapv option. (See
https://reviews.freebsd.org/D11819.)
Correct a regression in the previous change, r322459. Specifically, the
removal of the "blk" parameter from blst_meta_alloc() had the unintended
effect of generating an out-of-range allocation when the cursor reaches
the end of the tree if the number of managed blocks in the tree equals
the so-called "radix" (which in the blist code is not the standard notion
of what a radix is but rather the maximum number of leaves in a tree of
the current height.) In other words, only certain swap configurations
were affected, which is why earlier testing did not reveal the problem.
MFC r323868
Modernize calls to vm_page_unwire(). As of r288122, vm_page_unwire()
accepts PQ_NONE as the specified queue and returns a Boolean indicating
whether the page's wire count transitioned to zero. Use these features
in dev/drm2.
MFC r323786
In r288122, we changed vm_page_unwire() so that it returns a Boolean
indicating whether the page's wire count transitioned to zero. Use that
return value in zbuf_page_free() rather than checking the wire count.
MFC r323785
Sync with amd64/arm/arm64/i386/mips pmap change r288256:
Exploit r288122 to address a cosmetic issue. Since PV chunk pages don't
belong to a vm object, they can't be paged out. Since they can't be paged
out, they are never enqueued in a paging queue. Nonetheless, passing
PQ_INACTIVE to vm_page_unwire() creates the appearance that these pages
are being enqueued in the inactive queue. As of r288122, we can avoid
this false impression by passing PQ_NONE.