Jilles Tjoelker [Sun, 24 Mar 2019 22:10:26 +0000 (22:10 +0000)]
MFC r344502: sh: Add set -o pipefail
The pipefail option allows checking the exit status of all commands in a
pipeline more easily, at a limited cost of complexity in sh itself. It works
similarly to the option in bash, ksh93 and mksh.
Like ksh93 and unlike bash and mksh, the state of the option is saved when a
pipeline is started. Therefore, even in the case of commands like
A | B &
a later change of the option does not affect the exit status, the same way
(A | B) &
works.
Since SIGPIPE is not handled specially, more work in the script is required
for a proper exit status for pipelines containing commands such as head that
may terminate successfully without reading all input. This can be something
like
(
cmd1
r=$?
if [ "$r" -gt 128 ] && [ "$(kill -l "$r")" = PIPE ]; then
exit 0
else
exit "$r"
fi
) | head
The hardcoded ident is exactly 20 bytes long but sprintf adds terminating zero,
so there is one byte written out of array bounds.As a fix use strncpy it
appends \0 only if space allows and its behavior matches virtio spec:
When VIRTIO_BLK_T_GET_ID is issued, the device identifier, up to 20 bytes, is
written to the buffer. The identifier should be interpreted as an ascii string.
It is terminated with \0, unless it is exactly 20 bytes long.
[ndis] Fix unregistered use of FPU by NDIS in kernel on amd64
amd64 miniport drivers are allowed to use FPU which triggers "Unregistered use
of FPU in kernel" panic.
Wrap all variants of MSCALL with fpu_kern_enter/fpu_kern_leave. To reduce
amount of allocations/deallocations done via
fpu_kern_alloc_ctx/fpu_kern_free_ctx maintain cache of fpu_kern_ctx elements.
Marcel Moolenaar [Fri, 22 Mar 2019 23:39:16 +0000 (23:39 +0000)]
MFC 344790:
Revert revision 254095
In revision 254095, gpt_entries is not set to match the on-disk
hdr_entries, but rather is computed based on available space.
There are 2 problems with this:
1. The GPT backend respects hdr_entries and only reads and writes
that number of partition entries. On top of that, CRC32 is
computed over the table that has hdr_entries elements. When
the common code works on what is possibly a larger number, the
behaviour becomes inconsistent and problematic. In particular,
it would be possible to add a new partition that on a reboot
isn't there anymore.
2. The calculation of gpt_entries is based on flawed assumptions.
The GPT specification does not dictate that sectors are layed
out in a particular way that the available space can be
determined by looking at LBAs. In practice, implementations
do the same thing, because there's no reason to do it any
other way. Still, GPT allows certain freedoms that can be
exploited in some form or shape if the need arises.
Extend descriptions and comments about the need to create /etc/pf.conf.
FreeBSD removed the default /etc/pf.conf file in previous releases, but
the documentation kept mentioning it like any other file present in the
system. Change pf.conf(5) to mention in the description of the
default ruleset location that this file needs to be created manually. Also,
the default rc.conf file had it's comment extended a bit to let people
know that this file does not exist by default.
Kristof Provost [Thu, 21 Mar 2019 14:17:10 +0000 (14:17 +0000)]
MFC r345366:
pf: Ensure that IP addresses match in ICMP error packets
States in pf(4) let ICMP and ICMP6 packets pass if they have a
packet in their payload that matches an exiting connection. It was
not checked whether the outer ICMP packet has the same destination
IP as the source IP of the inner protocol packet. Enforce that
these addresses match, to prevent ICMP packets that do not make
sense.
Reported by: Nicolas Collignon, Corentin Bayet, Eloi Vanderbeken, Luca Moro at Synacktiv
Obtained from: OpenBSD
Security: CVE-2019-5598
Alan Somers [Thu, 21 Mar 2019 00:17:43 +0000 (00:17 +0000)]
MFC r344559:
ifconfig: eliminate trailing whitespace
Eliminate trailing whitespace on inet, inet6, and groups lines. I think the
"list txpower" command will still show some, but I'm not able to test that.
PR: 153731 Reported-by: Nikolay Denev <ndenev@gmail.com>
Differential Revision: https://reviews.freebsd.org/D19004
Kristof Provost [Tue, 19 Mar 2019 00:27:45 +0000 (00:27 +0000)]
MFC r344794:
tun: VIMAGE fix for if_tun cloner
The if_tun cloner is not virtualised, but if_clone_attach() does use a
virtualised list of cloners.
The result is that we can't find the if_tun cloner when we try to remove
a renamed tun interface. Virtualise the cloner, and move the final
cleanup into a sysuninit so that we're sure this happens after all of
the vnet_sysuninits
Note that we need unit numbers to be system-unique (rather than unique
per vnet, as is done by if_clone_simple()). The unit number is used to
create the corresponding /dev/tunX device node, and this node must match
with the interface.
Switch to if_clone_advanced() so that we have control over the unit
numbers.
Reproduction scenario:
jail -c -n foo persist vnet
jexec test ifconfig tun create
jexec test ifconfig tun0 name wg0
jexec test ifconfig wg0 destroy
Fedor Uporov [Mon, 18 Mar 2019 12:34:13 +0000 (12:34 +0000)]
MFC: r344757:
Fix double free in case of mount error.
Reported by: Christopher Krah, Thomas Barabosch, and Jan-Niclas Hilgert of Fraunhofer FKIE
Reported as: FS-9-EXT3-2: Denial Of Service in nmount-5 (vm_fault_hold)
Reviewed by: pfg
Fedor Uporov [Mon, 18 Mar 2019 12:22:04 +0000 (12:22 +0000)]
MFC: r344756, r345179:
Do not read the on-disk inode in case of vnode allocation.
Reported by: Christopher Krah, Thomas Barabosch, and Jan-Niclas Hilgert of Fraunhofer FKIE
Reported as: FS-6-EXT2-4: Denial Of Service in mkdir-0 (ext2_mkdir/vn_rdwr)
Reviewed by: pfg
Fedor Uporov [Mon, 18 Mar 2019 12:15:58 +0000 (12:15 +0000)]
MFC: r344755:
Fix integer overflow possibility.
Reported by: Christopher Krah, Thomas Barabosch, and Jan-Niclas Hilgert of Fraunhofer FKIE
Reported as: FS-2-EXT2-1: Out-of-Bounds Write in nmount (ext2_vget)
Reviewed by: pfg
Fedor Uporov [Mon, 18 Mar 2019 12:04:43 +0000 (12:04 +0000)]
MFC r344751:
Make superblock reading logic more strict.
Add more on-disk superblock consistency checks to ext2_compute_sb_data() function.
It should decrease the probability of mounting filesystems with corrupted superblock data.
MFC r345004 (with modification):
Add IP_FW_NAT64 to codes that ipfw_chk() can return.
It will be used by upcoming NAT64 changes. We use separate code
to avoid propagating EACCES error code to user level applications
when NAT64 consumes a packet.
MFC r345003:
Add NULL pointer check to nat64_output().
It is possible that a processed packet was originated by local host,
in this case m->m_pkthdr.rcvif is NULL. Check and set it to V_loif to
avoid NULL pointer dereference in IP input code, since it is expected
that packet has valid receiving interface when netisr processes it.
FFS: allow sendfile(2) to work with block sizes greater than the page size
Implement ffs_getpages_async(), which when possible calls the asynchronous
flavor of the generic pager's getpages function. When the underlying
block size is larger than the system page size, however, it will invoke
the (synchronous) buffer cache pager, followed by a call to the client
completion routine. This retains true asynchronous completion in the most
common (block size <= page size) case, which is important for the performance
of the new sendfile(2). The behavior in the larger block size case mirrors
the default implementation of VOP_GETPAGES_ASYNC, which most other
filesystems use anyway as they do not override the getpages_async method.
To easier potential MFC of the AT_BENEATH feature, some vestiges of it were
left in the merged product but commented out.
Due to a lot of conflicts, it was impossible to split the merge and
regeneration of the syscall tables, because I needed to test the result.
It is fine for stable branch to commit the whole change with the
generated diff.
Kristof Provost [Fri, 15 Mar 2019 11:01:49 +0000 (11:01 +0000)]
MFC r344921:
pf: Fix DIOCGETSRCNODES
r343295 broke DIOCGETSRCNODES by failing to reset 'nr' after counting the
number of source tracking nodes.
This meant that we never copied the information to userspace, leading to '? ->
?' output from pfctl.
r344141:
Add AES-CCM encryption, and plumb into OCF.
r344142:
Pasting in a source control line missed the last quote. Fixed.
r344143:
Fix another issue from r344141, having to do with size of a shift amount.
This did not show up in my testing.
r344388:
It turns out that setting the IV length is necessary with CCM in OpenSSL.
This adds that back.
r344547:
Fix another bug introduced during the review process of r344140:
the tag wasn't being computed properly due to chaning a >= comparison
to an == comparison.
Alexander Motin [Thu, 14 Mar 2019 00:58:39 +0000 (00:58 +0000)]
MFC r344903: Improve entropy for ZFS taskqueue selection.
I just found that at least on Skylake CPUs cpu_ticks() never returns odd
values, only even, and possibly has even bigger step (176/2?), that makes
its lower bits very bad entropy source, leaving half of taskqueues unused.
Switch to sbinuptime(), closer to upstreams, mitigates the problem by the
rate conversion working as kind of hash function. In case that is somehow
not enough (timer rate is too low or too divisible) mix in curcpu.
Alexander Motin [Wed, 13 Mar 2019 20:27:48 +0000 (20:27 +0000)]
MFC r344636: Refactor command ordering/blocking mechanism in CTL.
Replace long per-LUN queue of blocked commands, scanned on each command
completion and sometimes even twice, causing up to O(n^^2) processing cost,
by much shorter per-command blocked queues, scanned only when respective
command completes, and check only commands before the previous blocker,
reducing cost to O(n).
While there, unblock aborted commands to make them "complete" ASAP to be
removed from the OOA queue and so not waste time ordering other commands
against them. Aborted commands that were not sent to execution yet should
have no visible side effects, so this is safe and easy optimization now,
comparing to commands already in processing, which are a still pain.
Together those two optimizations should fix quite pathological case, when
due to backend slowness CTL accumulated many thousands of blocked requests,
partially aborted by initiator and so supposedly not even existing, but
still wasting CTL CPU time.
Kristof Provost [Tue, 12 Mar 2019 19:03:47 +0000 (19:03 +0000)]
pf tests: Disable noalias test
Direct commit to stable/12 to disable the noalias test. The noalias feature has
not been merged to stable/12 as it is a (small) behaviour change. This means
this test will fail. Disable it.
Enji Cooper [Mon, 11 Mar 2019 18:17:26 +0000 (18:17 +0000)]
MFC r342952:
Add Linux compatibility support for `SC_NPROCESSORS_{CONF,ONLN}` as `_SC_NPROCESSORS_{CONF,ONLN}`
The goal of this change is to make it easier to use getconf to query
the number of available processors.
Sadly it's unclear per POSIX, which form (with a preceding _ or
lacking it) is correct. I will bring this up on the Austin Group list so
this point is clarified for implementors that might rely on this getconf
variable in future POSIX spec versions.
This is something I noticed when trying to import GoogleTest to FreeBSD
as one of the CI scripts uses this variable on Linux.
Alexander Motin [Mon, 11 Mar 2019 13:55:47 +0000 (13:55 +0000)]
MFC r344743: Reduce CTL threads priority to about PUSER.
Since in most configurations CTL serves as network service, we found
that this change improves local system interactivity under heavy load.
Priority of main threads is set slightly higher then worker taskqueues
to make them quickly sort incoming requests not creating bottlenecks,
while plenty of worker taskqueues should be less sensitive to latency.
Sean Eric Fagan [Mon, 11 Mar 2019 02:42:49 +0000 (02:42 +0000)]
MFC r344402
* Handle SIGPIPE in gssd
We've got some cases where the other end of gssd's AF_LOCAL socket gets
closed, resulting in an error (and SIGPIPE) when it tries to do I/O to it.
Closing without cleaning up means the next time nfsd starts up, it hangs,
unkillably; this allows gssd to handle that particular error.
* Limit the retry cound in gssd_syscall to 5.
The default is INT_MAX, which effectively means forever. And it's an
uninterruptable RPC call, so it will never stop.
Alexander Motin [Mon, 11 Mar 2019 01:44:18 +0000 (01:44 +0000)]
MFC r344489: Free some space in struct ctl_io_hdr for better use.
- Collapse original_sc and serializing_sc fields into one, since they
are never used simultanously, we have only one local I/O and one remote.
- Move remote_sglist and local_sglist fields into CTL_PRIV_BACKEND,
since they are used only on Originating SC in XFER mode, where requests
don't ever reach backends, so we can reuse backend's private storage.
evdev: export event device properties through sysctl interface
A big security advantage of Wayland is not allowing applications to read
input devices all the time. Having /dev/input/* accessible to the user
account subverts this advantage.
libudev-devd was opening the evdev devices to detect their types (mouse,
keyboard, touchpad, etc). This don't work if /dev/input/* is inaccessible.
With the kernel exposing this information as sysctls (kern.evdev.input.*),
we can work w/o /dev/input/* access, preserving the Wayland security model.
Add [initial] functional tests for sendfile(2) as lib/libc/sys/sendfile
These testcases exercise a number of functional requirements for sendfile(2).
The testcases use IPv4 and IPv6 domain sockets with TCP, and were confirmed
functional on UFS and ZFS. UDP address family sockets cannot be used per the
sendfile(2) contract, thus using UDP sockets is outside the scope of
testing the syscall in positive cases. As seen in
`:s_negative_udp_socket_test`, UDP is used to test the sendfile(2) contract
to ensure that EINVAL is returned by sendfile(2).
The testcases added explicitly avoid testing out `SF_SYNC` due to the
complexity of verifying that support. However, this is a good next logical
item to verify.
The `hdtr_positive*` testcases work to a certain degree (the header
testcases pass), but the trailer testcases do not work (it is an expected
failure). In particular, the value received by the mock server doesn't match
the expected value, and instead looks something like the following (using
python array notation):
`trailer[:]message[1:]`
instead of:
`message[:]trailer[:]`
This makes me think there's a buffer overrun issue or problem with the
offset somewhere in the sendfile(2) system call, but I need to do some
other testing first to verify that the code is indeed sane, and my
assumptions/code isn't buggy.
The `sbytes_negative` testcases that check `sbytes` being set to an
invalid value resulting in `EFAULT` fails today as the other change
(which checks `copyout(9)`) has not been committed [1]. Thus, it
should remain an expected failure (see bug 232210 for more details
on this item).
Next steps for testing sendfile(2):
1. Fix the header/trailer testcases so that they pass.
2. Setup if_tap interface and test with it, instead of using "localhost", per
@asomers's suggestion.
3. Handle short recv(2)'s in `server_cat(..)`.
4. Add `SF_SYNC` support.
5. Add some more negative tests outside the scope of the functional contract.
PR: 232210
r343365:
Unbreak the gcc build with sendfile_test after r343362
gcc 8.x is more pedantic than clang 7.x with format strings and the tests
passed `void*` variables while supplying `%s` (which is technically
incorrect).
Make the affected `void*` variables use `char*` storage instead to address
this issue, as the compiler will upcast the values to `char*`.
MFC with: r343362
r343367:
Unbreak the build on architectures where size_t isn't synonymous with uintmax_t
I should have used `%zu` instead of `%ju` with `size_t` types.
MFC with: r343362, r343365
Pointyhat to: ngie
r343368:
Fix up r343367
I should have only changed the format qualifier with the `size_t` value,
`length`, not the other [`off_t`] value, `dest_file_size`.
MFC with: r343362, r343365, r343367
r343461:
Fix reporting errors with `gai_strerror(..)`
The return value (`err`) should be checked; not the `errno` value.
ci.FreeBSD.org does not have access to a DNS resolver/network (unlike my test
VM), so in order for the test to pass on the host, it needs to avoid the DNS
lookup by using the numeric host address representation.
In short, the prior code was far too simplistic when it came to calling recv(2)
and failed intermittently (or in the case of Jenkins, deterministically).
Handle short recv(2)s by checking the return code and incrementing the window
into the buffer by the number of received bytes. If the number of received
bytes <= 0, then bail out of the loop, and test the total number of received
bytes vs the expected number of bytes sent for equality, and base whether or
not the test passes/fails on that fact.
Remove the expected failure, now that the hdtr testcases deterministically pass
on my host after this change [1].
PR: 234809 [1], 235200
Approved by: emaste (mentor, implicit: MFC to ^/stable/11 in r344977)
Differential Revision: https://reviews.freebsd.org/D19523
MFC r344870:
Fix the problem with O_LIMIT states introduced in r344018.
dyn_install_state() uses `rule` pointer when it creates state.
For O_LIMIT states this pointer actually is not struct ip_fw,
it is pointer to O_LIMIT_PARENT state, that keeps actual pointer
to ip_fw parent rule. Thus we need to cache rule id and number
before calling dyn_get_parent_state(), so we can use them later
when the `rule` pointer is overrided.
Kristof Provost [Sun, 10 Mar 2019 00:56:38 +0000 (00:56 +0000)]
pf: Small performance tweak
Because fetching a counter is a rather expansive function we should use
counter_u64_fetch() in pf_state_expires() only when necessary. A "rdr
pass" rule should not cause more effort than separate "rdr" and "pass"
rules. For rules with adaptive timeout values the call of
counter_u64_fetch() should be accepted, but otherwise not.
From the man page:
The adaptive timeout values can be defined both globally and for
each rule. When used on a per-rule basis, the values relate to the
number of states created by the rule, otherwise to the total number
of states.
This handling of adaptive timeouts is done in pf_state_expires(). The
calculation needs three values: start, end and states.
1. Normal rules "pass .." without adaptive setting meaning "start = 0"
runs in the else-section and therefore takes "start" and "end" from
the global default settings and sets "states" to pf_status.states
(= total number of states).
2. Special rules like
"pass .. keep state (adaptive.start 500 adaptive.end 1000)"
have start != 0, run in the if-section and take "start" and "end"
from the rule and set "states" to the number of states created by
their rule using counter_u64_fetch().
Thats all ok, but there is a third case without special handling in the
above code snippet:
3. All "rdr/nat pass .." statements use together the pf_default_rule.
Therefore we have "start != 0" in this case and we run the
if-section but we better should run the else-section in this case and
do not fetch the counter of the pf_default_rule but take the total
number of states.
Submitted by: Andreas Longwitz <longwitz@incore.de>
Kristof Provost [Sat, 9 Mar 2019 10:34:42 +0000 (10:34 +0000)]
MFC r344764
tests: Move common (vnet) test functions into a common file
The netipsec and pf tests have a number of common test functions. These
used to be duplicated, but it makes more sense for them to re-use the
common functions.
Kristof Provost [Sat, 9 Mar 2019 10:28:36 +0000 (10:28 +0000)]
MFC r340073, r341359:
pf: Keep a reference to struct ifnets we're using
Ensure that the struct ifnet we use can't go away until we're done with
it.
pf: Fix panic on overlapping interface names
In rare situations[*] it's possible for two different interfaces to have
the same name. This confuses pf, because kifs are indexed by name (which
is assumed to be unique). As a result we can end up trying to
if_rele(NULL), which panics.
Explicitly checking the ifp pointer before if_rele() prevents the panic.
Note pf will likely behave in unexpected ways on the the overlapping
interfaces.
[*] Insert an interface in a vnet jail. Rename it to an interface which
exists on the host. Remove the jail. There are now two interfaces with
the same name in the host.
John Baldwin [Fri, 8 Mar 2019 19:20:46 +0000 (19:20 +0000)]
MFC 344671: Don't assume all children of a nexus are ports.
Specifically, ccr(4) devices are also children of cxgbe nexus devices.
Rather than making assumptions about the child device's softc, walk
the list of ports from the nexus' softc to determine if a child is a
port in t4_child_location_str(). This fixes a panic when detaching a
ccr device.
John Baldwin [Fri, 8 Mar 2019 19:07:41 +0000 (19:07 +0000)]
MFC 343620: Don't set IFCAP_TXRTLMT during lagg_clone_create().
lagg_capabilities() will set the capability once interfaces supporting
the feature are added to the lagg. Setting it on a lagg without any
interfaces is pointless as the if_snd_tag_alloc call will always fail
in that case.
John Baldwin [Fri, 8 Mar 2019 19:03:28 +0000 (19:03 +0000)]
MFC 343456: Fix a few more places to handle ofld tx queues for RATELIMIT.
- Drain offload transmit queues when RATELIMIT is enabled but
TCP_OFFLOAD is not.
- Expose the per-VI nofldtxq and first_ofld_txq sysctls when
RATELIMIT is enabled but TCP_OFFLOAD is not.
- Clear offload transmit queue stats as part of a 'cxgbetool clearstats'
request when RATELIMIT is enabled but TCP_OFFLOAD is not.
John Baldwin [Fri, 8 Mar 2019 18:59:37 +0000 (18:59 +0000)]
MFC 343056: Reject new sessions if the necessary queues aren't initialized.
ccr reuses the control queue and first rx queue from the first port on
each adapter. The driver cannot send requests until those queues are
initialized. Refuse to create sessions for now if the queues aren't
ready. This is a workaround until cxgbe allocates one or more
dedicated queues for ccr.
John Baldwin [Fri, 8 Mar 2019 18:53:54 +0000 (18:53 +0000)]
MFC 343048:
Update the note about the need for COMPAT_FREEBSD<n> kernel options.
Rather than mentioning the requirement for 4.x binaries but not
explaining why (it was assuming an upgrade from 4.x to 5.0-current),
explain when compat options are needed (for running existing host
binaries) in a more general way while using a more modern example
(COMPAT_FREEBSD11 for 11.x binaries). While here, explicitly mention
that a GENERIC kernel should always work.
Enji Cooper [Fri, 8 Mar 2019 15:48:19 +0000 (15:48 +0000)]
MFC r343845:
Clean up all directories created by `make hier`
The logic I introduced in r322511 unfortunately left chflags schg'ed
directories behind created by `make hier` (in the stock /etc/mtree
files, this is limited to /var/empty).
The proposed change calls `chflags -R 0` and `rm -Rf ...` to clean all
of the directories that could not be removed by `${MAKE} clean`.
`${MAKE} clean` in bsd.obj.mk calls `cleandir`/`cleanobj`, which handles
the first directory tree walk/removal.
The send_packets() function was using ring->cur as index to scan
the transmit ring. This function may also set ring->cur ahead of
ring->head, in case no more slots are available. However, the function
also uses nm_ring_space() which looks at ring->head to check how many
slots are available. If ring->head and ring->cur are different, this
results in pkt-gen advancing ring->cur beyond ring->tail.
This patch fixes send_packets() (and similar source locations) to
use ring->head as a index, rather than using ring->cur.