To check if txsync can be skipped, it is necessary to look for
unseen TX space. However, this means comparing ring->cur
against ring->tail, rather than ring->head against ring->tail
(like nm_ring_empty() does).
This change also adds some more comments to explain the optimization
performed at the beginning of netmap_poll().
MFC after: 3 days
Sponsored by: Sunny Valley Networks
The bug was introduced by r339639, although it is present in the upstream
netmap code since 2015. It is due to resetting the want_rx variable to
POLLIN, rather than resetting it to POLLIN|POLLRDNORM.
It only affects select(), which uses POLLRDNORM. poll() is not affected,
because it uses POLLIN.
Also, it only affects FreeBSD, because Linux skips the optimization
implemented by the piece of code where the bug occurs.
MFC after: 3 days
Sponsored by: Sunny Valley Networks
Eugene Grosbein [Sat, 22 Dec 2018 11:38:54 +0000 (11:38 +0000)]
ifconfig.4, lagg.4: fix documentation bug: -use_flowid needs to be used
to force local hash computation and disable usage of RSS hash
provided by driver.
Bruce Evans [Sat, 22 Dec 2018 09:31:55 +0000 (09:31 +0000)]
Oops, rounddown() for the start was misspelled roundup() in r342295,
so only aligned starts worked. This broke releasing caches in most
cases where the i/o size is smaller than the fs block size.
Kyle Evans [Sat, 22 Dec 2018 06:08:06 +0000 (06:08 +0000)]
config(8): Remove all instances of an option when opting out
Quick follow-up to r342362: options can appear multiple times now, so
clean up all of them as needed. For non-OPTIONS options, this has no effect
since they're already de-duplicated.
Kyle Evans [Sat, 22 Dec 2018 06:02:34 +0000 (06:02 +0000)]
config(8): Allow duplicate options to be specified
config(8)'s option handling has been written to allow duplicate options; if
the value changes, then the latest value is used and an informative message
is printed to stderr like so:
/usr/src/sys/amd64/conf/TEST: option "VERBOSE_SYSINIT" redefined from 0 to 1
Currently, this is only a possibility for cpu types, MAXUSERS, and
MACHINE_ARCH. Anything else duplicated in a config file will use the first
value set and error about duplicated options on subsequent appearances,
which is arguably unfriendly since one could specify:
include GENERIC
nooptions VERBOSE_SYSINIT
options VERBOSE_SYSINIT
Warner Losh [Fri, 21 Dec 2018 23:22:37 +0000 (23:22 +0000)]
Try the first 256 units with nvmecontrol devlist.
The nvmecontrol code that did the devlist assumed that we had a
tightly-packed allocation of units. Since pci writing exists, this
isn't the case. Loop over the first 256 units, which is a reasonable
number of possible units.
Bruce Evans [Fri, 21 Dec 2018 21:17:45 +0000 (21:17 +0000)]
Fix clobbering of the fatchain cache for clustered i/o's when full
clustering is not done. The bug caused extreme slowness for large
files in some cases.
There is no way to tell VOP_BMAP() how many blocks are wanted, so for
all file systems it has to waste time in some cases by searching for
more contiguous blocks than will be accessed. For msdosfs, it also
clobbered the fatchain cache in these cases by advancing the cache to
point to the chain entry for block that won't be read. This makes
the cache useless for the next sequential i/o (or VOP_BMAP()), so the
fat chain is searched from the beginning. The cache only has 1 relevant
entry, so it is similarly useless for random i/o.
Fix this by only advancing the cache to point to the chain entry for
the first block that will be read. Clustering uses results from
VOP_BMAP(), so when more than 1 block is read by clustering, the cache
is not advanced as optimally as before, but it is at most 1 cluster
size behind and searching the chain through the blocks for this cluster
doesn't take too long.
Conrad Meyer [Fri, 21 Dec 2018 20:30:52 +0000 (20:30 +0000)]
mps(4), mpr(4): remove SATA ID command cancellation hack
Add a generic mechanism to override mp?_wait_command's timeout behavior,
which continues to invoke reinit by default. Invokers who set
cm_timeout_handler may avoid automatic reinit and do their own handling.
Adapt mp?sas_get_sata_identify to this mechanism and remove its callout
hack.
Conrad Meyer [Fri, 21 Dec 2018 20:29:16 +0000 (20:29 +0000)]
mps(4), mpr(4): Fix lifetime of command buffer for mp?sas_get_sata_identify
In the event that the ID command timed out, mps(4)/mpr(4) did not free the
command until it could be cancelled. However, it freed the associated
buffer (cm_data). Fix the lifetime issue by freeing the associated buffer
only after Abort Task or controller reset.
Bruce Evans [Fri, 21 Dec 2018 20:12:43 +0000 (20:12 +0000)]
Quick fix for initialization of mnt_iosize_max. (This limit controls
mainly clustering and read-ahead.) Copy the initialization from ffs,
and also copy a couple of lines of ffs's nearby style for initialization
order and whitespace.
A correct fix would de-duplicate the initialization and fix bitrot in it
instead of adding another instance of the duplication. Complications to
use the size preferred by the device have been reduced to hard-coding
slightly pessimal and/or inconsistent defaults, using large code that was
almost needed to support the complications.
For msdosfs, the result was that mnt_iosize_max was DFTLPHYS (64K) but is
now MAXPHYS (128K).
netmap: nmreplay: import various fixes from upstream (2704a51839906)
Changelist:
- General reformatting
- Fix packet duplication in cons(). Whenever cons() reached the
burst limit it would send all pending packets without advancing
head. This caused the last injected packet to be sent again in
the next round.
- Fix full-speed transmissions after first loop.
netmap: move buf_size validation code to its own function
This code validates the netmap buf_size against the interface MTU
and maximum descriptor size, to make sure the values are consistent.
Moving this functionality to its own function is needed because this
function is also called by Linux-specific code.
Bruce Evans [Fri, 21 Dec 2018 08:15:31 +0000 (08:15 +0000)]
Use VOP_ADVISE() with POSIX_FADV_DONTNEED instead of IO_DIRECT to
implement not double-caching for reads from vnode-backed md devices.
Use VOP_ADVISE() similarly instead of !IO_DIRECT unsimilarly for writes.
Add a "cache" option to mdconfig to allow changing the default of not
caching.
This depends on a recent commit to fix VOP_ADVISE(). A previous version
had optimizations for sequential i/o's (merge the i/o's and only uncache
for discontiguous i/o's and for full blocks), but optimizations and
knowledge of block boundaries belong in VOP_ADVISE(). Read-ahead should
also be handled better, by supporting it in md and discarding it in
VOP_ADVISE().
POSIX_FADV_DONTNEED is ignored by zfs, but so is IO_DIRECT.
POSIX_FADV_DONTNEED works better than IO_DIRECT if it is not ignored,
since it only discards from the buffer cache immediately, while
IO_DIRECT also discards from the page cache immediately.
IO_DIRECT was not used for writes since it was claimed to be too slow,
but most of the slowness for writes is from doing them synchronously by
default. Non-synchronous writes still deadlock in many cases.
IO_DIRECT only has a special implementation for ffs reads with DIRECTIO
configured. Otherwise, if it is not ignored than it uses the buffer and
page caches normally except for discarding everything after each i/o,
and then it has much the same overheads as POSIX_FADV_DONTNEED. The
overheads for reading with ffs and DIRECTIO were similar in tests of md.
Bruce Evans [Fri, 21 Dec 2018 04:57:59 +0000 (04:57 +0000)]
Fix rounding in vop_stdadvise() for POSIX_FADV_NOREUSE (really
POSIX_FADV_DONTNEED). The most broken case was for applications that
advise for the whole file and then do block-aligned i/o's 1 block at
a time. Then advice is sent to VOP_ADVISE() 1 block at a time, but
in vop_stdadvise() the 1-block advice was turned into 0-block advice
for the buffer cache part.
The bugs were caused partly by callers representing the region as
(a_start, a_end), where a_end is actually the maximum, and everything
else representing the region as (start, end) where 'end' is actually
the end (1 after the maximum). The maximum a_end must be rounded up,
but was rounded down. Also, rounding to page boundaries was inconsistent.
The bugs and fixes have no effect for zfs and other file systems that
don't use the buffer cache or the page cache. Most or all file systems
currently use the default VOP_FADVISE(), but it finds a null buffer cache
and a null page cache for file systems that don't use normal methods.
Kirk McKusick [Fri, 21 Dec 2018 01:09:25 +0000 (01:09 +0000)]
Some filesystems (like cd9660 and ext3) require that VFS_STATFS()
be called before VFS_ROOT() is called. Move the call for VFS_STATFS()
so that it is done after VFS_MOUNT(), but before VFS_ROOT().
This change actually improves the robustness of the mount system
call because it returns an error rather than failing silently
when VFS_STATFS() returns failure.
Rick Macklem [Thu, 20 Dec 2018 22:21:41 +0000 (22:21 +0000)]
Fix the NFSv4 server to obey vfs.nfsd.nfs_privport.
When the NFSv4 server was coded, I believed that the specification authors
did not want NFSv4 servers to require a client to use a reserved port#.
However, recently it has been noted that the Linux NFSv4 server does support
a check for a reserved port#.
Since both the FreeBSD and Linux NFSv4 clients use a reserved port# by
default, enabling vfs.nfsd.nfs_privport to require a reserved port# for
NFSv4 the same as it does for NFSv2, 3 seems reasonable.
The only case where this could cause a POLA violation is a FreeBSD NFSv4
server with vfs.nfsd.nfs_privport set, but with NFSv4 clients doing mounts
without using a reserved port# (< 1024).
Conrad Meyer [Thu, 20 Dec 2018 20:55:33 +0000 (20:55 +0000)]
tpm(4): Fix GCC build after r342084 (TPM 2.0 driver commit)
Move static variable definition (cdevsw) to a more conventional location
(the C file it is used in), rather than a header.
This fixes the GCC warning, -Wunused-variable ("defined but not used") when
the tpm20.h header is included in files other than tpm20.c (e.g.,
tpm_tis.c).
Rebecca Cran [Thu, 20 Dec 2018 19:39:37 +0000 (19:39 +0000)]
Rework UEFI ESP generation
Currently, the installer uses pre-created 800KB FAT12 filesystems that
it dd's onto the ESP partition.
This changeset improves that by having the installer generate a FAT32
filesystem directly onto the ESP using newfs_msdos and then copying
loader.efi into /EFI/freebsd.
For live installs it then runs efibootmgr to add a FreeBSD boot entry
in the BIOS.
Rebecca Cran [Thu, 20 Dec 2018 19:27:46 +0000 (19:27 +0000)]
Wait a maximum of 300 seconds for network send/recv in libsa
The reason for this change is that currently, a send/recv
takes many hours to time out.
This is suboptimal in the bootloader because it means for example
that NFS will take hours to fail before allowing subsequent access
methods such as gzip to be tried.
Setting MAXWAIT to 300 seconds (5 minutes) still allows slow
connections of 1Mb to be used to download a 30MB kernel file.
Michael Tuexen [Thu, 20 Dec 2018 16:05:30 +0000 (16:05 +0000)]
Fix a regression in the TCP handling of received segments.
When receiving TCP segments the stack protects itself by limiting
the resources allocated for a TCP connections. This patch adds
an exception to these limitations for the TCP segement which is the next
expected in-sequence segment. Without this patch, TCP connections
may stall and finally fail in some cases of packet loss.
Reported by: jhb@
Reviewed by: jtl@, rrs@
MFC after: 3 days
Sponsored by: Netflix, Inc.
Differential Revision: https://reviews.freebsd.org/D18580
Marcin Wojtas [Thu, 20 Dec 2018 01:05:09 +0000 (01:05 +0000)]
Fix obtaining RSP address in TPM CRB for non-amd64 platforms
On amd64 the RSP address can be read in single 8-byte transaction,
which is obviously not possible on 32-bit platforms. Fix that
by performing 2 4-byte read on them.
Warner Losh [Wed, 19 Dec 2018 23:15:49 +0000 (23:15 +0000)]
32-bit mips SMP is unsupported
Per discussions on mips@, 32-bit mips SMP is now unsupported. The
files in the tree will compile for a while longer, but when the
atomic_swap_64 or similar atomic enters into the MI part of the tree,
as currently foreseen sometime next year, these ports will start to no
longer link. The JZ4780 is the only such system we have.
The UP version of this chip is unaffected by this, and will remain
supported.
Warner Losh [Wed, 19 Dec 2018 22:56:31 +0000 (22:56 +0000)]
Fix the date
The first part of the mips pruning has been commited. This part
is uncontested. Fix the date in the UPDATING file to reflect when
I made the commit. The contested parts will be committed (or not)
once those discussions complete.
Warner Losh [Wed, 19 Dec 2018 22:54:34 +0000 (22:54 +0000)]
Remove old config file for SENTRY5
This is an older broadcom part that implements the mips32 ISA. 32-bit
FreeBSD/mips now requires mips32r2, so retire this config. Most of the
broadcom port is shared with newer ports, so what little code may be
unique to this part has not been GC'd at this time.
Discussed on: freebsd-mips@
Differential Revision: https://reviews.freebsd.org/D18543
Warner Losh [Wed, 19 Dec 2018 22:54:23 +0000 (22:54 +0000)]
Remove the GXEMUL support.
gxemul was a nice stop-gap while qemu support for mips was firmed
up. Now MALTA* + qemu is the platform of choice retire gxemul support.
It's unknown when this was last confirmed working.
Discussed on: freebsd-mips@
Differential Revision: https://reviews.freebsd.org/D18543
Warner Losh [Wed, 19 Dec 2018 22:54:03 +0000 (22:54 +0000)]
Remove support for the now very old SiByte MIPS platform. It's not
relevant and is unused. It's also getting in the way of progress in
some admittedly minor ways. Better to retire it to reduce the burden
on the project.
Discussed on: freebsd-mips@
Differential Revision: https://reviews.freebsd.org/D18543
Marcin Wojtas [Wed, 19 Dec 2018 22:47:37 +0000 (22:47 +0000)]
Fix alignment issue in uefisign
The pe_certificate structure has to be aligned to 8 bytes. [1]
Since this is now checked in edk2, any binaries signed with
older version of this tool will fail verification.
Mateusz Guzik [Wed, 19 Dec 2018 22:30:26 +0000 (22:30 +0000)]
mac: reduce pessimization of sdt probe handling
Prior to the change the code would branch on return value and then check
if probes are enabled. Since vast majority of the time they are not, this
is clearly wasteful. Check probes first.
Ed Maste [Wed, 19 Dec 2018 18:16:29 +0000 (18:16 +0000)]
bootpd: validate hardware type
Due to insufficient validation of network-provided data it may have been
possible for a malicious actor to craft a bootp packet which could cause
a stack buffer overflow.
admbugs: 850
Reported by: Reno Robert
Reviewed by: markj
Approved by: so
Security: FreeBSD-SA-18:15.bootpd
Sponsored by: The FreeBSD Foundation
Mark Johnston [Wed, 19 Dec 2018 04:54:32 +0000 (04:54 +0000)]
Remove a use of a negative array index from fxp(4).
This fixes a warning seen when compiling amd64 GENERIC with clang 7.
Also remove the workaround added in r337324. clang 7 and gcc 4.2
generate the same code with or without the code change.
net80211: fix out-of-bounds read in ieee80211_amrr(9).
ieee80211_alloc_node() does not initialize rateset tables; that's not
expected by rate control modules and will result in array access at
index -1 - where ni_essid[] array is located (zeroed at allocation, so
there are no user-visible consequences).
Just delay rate control initialization to the moment, when rateset
tables are initiaziled; nothing will use rates here anyway.
Navdeep Parhar [Wed, 19 Dec 2018 01:37:00 +0000 (01:37 +0000)]
cxgbe/t4_tom: fixes for issues on the passive open side.
- Fix PR 227760 by getting the TOE to respond to the SYN after the call
to toe_syncache_add, not during it. The kernel syncache code calls
syncache_respond just before syncache_insert. If the ACK to the
syncache_respond is processed in another thread it may run before the
syncache_insert and won't find the entry. Note that this affects only
t4_tom because it's the only driver trying to insert and expand
syncache entries from different threads.
- Do not leak resources if an embryonic connection terminates at
SYN_RCVD because of L2 lookup failures.
- Retire lctx->synq and associated code because there is never a need to
walk the list of embryonic connections associated with a listener.
The per-tid state is still called a synq entry in the driver even
though the synq itself is now gone.
Andriy Gapon [Tue, 18 Dec 2018 17:17:53 +0000 (17:17 +0000)]
ichwd: add a few assertions about tco_version
Those should ensure correctness of ichwd_find_ich_lpc_bridge() and
ichwd_find_ich_lpc_bridge() as well as make it easier for both humans
and static analyzers to see the relation between tco_version and ich and
smb variables in ichwd_identify().
Reported by: Coverity
CID: 1396314, 1396317
MFC after: 10 days
Mark Johnston [Mon, 17 Dec 2018 21:48:20 +0000 (21:48 +0000)]
Remove UMS support code from radeonkms.
The code is unreachable since the entries of radeon_ioctls[] are not
associated with any device: we provide only the KMS entry points.
Moreover, r600_cp_dispatch_texture() contains an integer overflow bug
that can be triggered from userspace.[1]
Reported by: Anonymous of the Shellphish Grill Team[1]
Reviewed by: dumbbell
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D18516
Mark Johnston [Mon, 17 Dec 2018 21:13:05 +0000 (21:13 +0000)]
Revert r336326.
In testing on a Dell Latitude 7480, having ig4.ko loaded during a
suspend caused the system to hang. It turns out that ig4iic_intr() was
being called after the device entered D3, and entered an infinite loop
because a read of the I2C status register returned all ones, causing us
to attempt to read a byte from the data buffer until one of the status
bits clears. This occured because ig4iic_pci0 shares an interrupt with
the VGA device on this laptop, so ig4iic_intr() gets called even when
there is no work to do. This is exactly the problem fixed by r342170,
which resolves the hang for me and allows suspend/resume to work with
ig4.ko loaded. So, re-enable autoloading of ig4.ko in the hope that
r342170 resolves the problem universally.
Reviewed by: gonzo
MFC after: 1 month (pending an MFC of r342170)
Differential Revision: https://reviews.freebsd.org/D18587
Alan Somers [Mon, 17 Dec 2018 18:11:06 +0000 (18:11 +0000)]
audit(4) tests: require /etc/rc.d/auditd
These tests should be skipped if /etc/rc.d/auditd is missing, which could be
the case if world was built with WITHOUT_AUDIT set. Also, one test case
requires /etc/rc.d/accounting.
Andriy Gapon [Mon, 17 Dec 2018 17:11:00 +0000 (17:11 +0000)]
add support for marking interrupt handlers as suspended
The goal of this change is to fix a problem with PCI shared interrupts
during suspend and resume.
I have observed a couple of variations of the following scenario.
Devices A and B are on the same PCI bus and share the same interrupt.
Device A's driver is suspended first and the device is powered down.
Device B generates an interrupt. Interrupt handlers of both drivers are
called. Device A's interrupt handler accesses registers of the powered
down device and gets back bogus values (I assume all 0xff). That data is
interpreted as interrupt status bits, etc. So, the interrupt handler
gets confused and may produce some noise or enter an infinite loop, etc.
This change affects only PCI devices. The pci(4) bus driver marks a
child's interrupt handler as suspended after the child's suspend method
is called and before the device is powered down. This is done only for
traditional PCI interrupts, because only they can be shared.
At the moment the change is only for x86.
Notable changes in core subsystems / interfaces:
- BUS_SUSPEND_INTR and BUS_RESUME_INTR methods are added to bus
interface along with convenience functions bus_suspend_intr and
bus_resume_intr;
- rman_set_irq_cookie and rman_get_irq_cookie functions are added to
provide a way to associate an interrupt resource with an interrupt
cookie;
- intr_event_suspend_handler and intr_event_resume_handler functions
are added to the MI interrupt handler interface.
I added two new interrupt handler flags, IH_SUSP and IH_CHANGED, to
implement the new intr_event functions. IH_SUSP marks a suspended
interrupt handler. IH_CHANGED is used to implement a barrier that
ensures that a change to the interrupt handler's state is visible
to future interrupts.
While there, I fixed some whitespace issues in comments and changed a
couple of logically boolean variables to be bool.
Andriy Gapon [Mon, 17 Dec 2018 16:01:37 +0000 (16:01 +0000)]
add a knob that disables detection of write protected disks
It has been reported that on some systems (with real hardware passed
through to a virtual machine) the WP detection causes USB disk probing
failures.
While here, also fix the selection of the next state in the case
of malloc failure in DA_STATE_PROBE_WP. It was DA_STATE_PROBE_RC
unconditionally even when it should have been DA_STATE_PROBE_RC16.
Maxim Sobolev [Mon, 17 Dec 2018 16:00:35 +0000 (16:00 +0000)]
Allow ng_nat to be attached to a ethernet interface directly via ng_ether(4)
or the likes. Add new control message types: setdlt and getdlt to switch
from default DLT_RAW (no encapsulation) to DLT_EN10MB (ethernet).
Toomas Soome [Mon, 17 Dec 2018 07:43:29 +0000 (07:43 +0000)]
loader: zfs reader should not probe partitionless disks (UEFI case)
With r342151 I did fix the BIOS version of zfs_probe_dev() from accessing
the whole disk, but the fix was not complete - we actually did not check
if the device name was really for whole disk. Since UEFI version
is only calling the zfs_probe_dev() with partitions and not with whole
disk, the UEFI loader was not able to find the zfs pools.
This update does correct the issue by calling archsw.arch_getdev() to
translate the device name back to dev_desc, and we have whole disk when both
partition and slice values are -1.
Toomas Soome [Sun, 16 Dec 2018 08:58:14 +0000 (08:58 +0000)]
loader: zfs reader should not probe partitionless disks
First of all, normal setups can not boot such pools as the tools
do not support installing boot programs.
Secondly, for proper pool configuration detection, we need to checks all
four label copies on disk, 2 from front and 2 from the end of the disk,
but zfs label does not contain the size of the disk - so we depend on
firmware to report the correct disk size or use information from the
partition table.
Without partition table, we only can rely on firmware to report and support
disk IO properly.
There is a specific case: 8TB disks are reported by BIOS to have 4294967295
sectors (0x00000000ffffffff), the sectors reported by OS is 15628053168
(0x00000003a3812ab0), so the reported size is less than actual but is hitting
32-bit max. Unfortuantely the real limit must be even lower because probing
this disk in this system will wnd up with hung system.
UEFI boot of this system seems not to be affected.