Andriy Gapon [Thu, 31 Oct 2019 09:14:50 +0000 (09:14 +0000)]
MFC r353176,r353304,r353556,r353559: large_dnode improvements and fixes
r353176: MFV r350898, r351075: 8423 8199 7432 Implement large_dnode pool feature
This updates FreeBSD large_dnode code (that was imported from ZoL) to a
version that was committed to illumos. It has some cleanups,
improvements and fixes comparing to what we have in FreeBSD now.
I think that the most significant update is 8199 multi-threaded
dmu_object_alloc().
r353304: zfs: use atomic_load_64 to read atomic variable in dmu_object_alloc_impl
r353556: MFV r353551: 10452 ZoL: merge in large dnode feature fixes
r353559: MFV r353558: 10572 10579 Fix race in dnode_check_slots_free()
Alan Somers [Wed, 30 Oct 2019 02:33:43 +0000 (02:33 +0000)]
MFC r353439:
MFZol: Fix performance of "zfs recv" with many deletions
This patch fixes 2 issues with the DMU free throttle implemented
in dmu_free_long_range(). The first issue is that get_next_chunk()
was calculating the number of L1 blocks the free would dirty
incorrectly. In some cases involving extremely large files, this
code would greatly overestimate the number of affected L1 blocks,
causing excessive calls to txg_wait_open(). This patch corrects
the calculation.
The second issue is that the free throttle uses the total number
of free'd blocks in all (open, quiescing, and syncing) txgs to
determine whether to throttle. This causes large frees (such as
those created by the first issue) to cause 4 txg syncs before
any further frees were allowed to proceed. This patch ensures
that the accounting is done entirely in a per-txg fashion, so
that frees from a given txg don't affect those that immediately
follow it.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Matthew Ahrens <mahrens@delphix.com> Signed-off-by: Tom Caputi <tcaputi@datto.com>
zfsonlinux/zfs@f4c594da94d856c422512a54e48070f890b2685b
Freeing throttle should account for holes
Deletion throttle currently does not account for holes in a file.
This means that it can activate when it shouldn't.
To fix it we switch the throttle to be based on the number of
L1 blocks we will have to dirty when freeing
Reviewed-by: Tom Caputi <tcaputi@datto.com> Reviewed-by: Matt Ahrens <mahrens@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alek Pinchuk <apinchuk@datto.com>
zfsonlinux/zfs@65282ee9e06b130f1f0169baf5d9bf0dd8fc1ef9
r353117:
ZFS: the hotspare_add_004_neg test needs at least two disks
Sponsored by: Axcient
r353118:
ZFS: fix several of the "zpool create" tests
* Remove zpool_create_013_neg. FreeBSD doesn't have an equivalent of
Solaris's metadevices. GEOM would be the equivalent, but since all geoms
are the same from ZFS's perspective, this test would be redundant with
zpool_create_012_neg
* Remove zpool_create_014_neg. FreeBSD does not support swapping to regular
files.
* Remove zpool_create_016_pos. This test is redundant with literally every
other test that creates a disk-backed pool.
* s:/etc/vfstab:/etc/fstab in zpool_create_011_neg
* Delete the VTOC-related portion of zpool_create_008_pos. FreeBSD doesn't
use VTOC.
* Replace dumpadm with dumpon and swap with swapon in multiple tests.
* In zpool_create_015_neg, don't require "zpool create -n" to fail. It's
reasonable for that variant to succeed, because it doesn't actually open
the zvol.
* Greatly simplify zpool_create_012_neg. Make it safer, too, but not
interfering with the system's regular swap devices.
* Expect zpool_create_011_neg to fail (PR 241070)
* Delete some redundant cleanup steps in various tests
* Remove some unneeeded ATF timeout specifications. The default is fine.
PR: 241070
Sponsored by: Axcient
r353281:
ZFS: fix several zvol_misc tests
* Adapt zvol_misc_001_neg to use dumpon instead of Solaris's dumpadm
* Disable zvol_misc_003_neg, zvol_misc_005_neg, and zvol_misc_006_pos,
because they involve using a zvol as a dump device, which FreeBSD does not
yet support.
Sponsored by: Axcient
r353282:
zfs: fix the slog_012_neg test
This test attempts to corrupt a file-backed vdev by deleting it and then
recreating it with truncate. But that doesn't work, because the pool
already has the vdev open, and it happily hangs on to the open-but-deleted
file. Fix by truncating the file without deleting it.
Sponsored by: Axcient
r353284:
ZFS: fix the zpool_get_002_pos test
ZFS has grown some additional properties that hadn't been added to the
config file yet. While I'm here, improve the error message, and remove a
superfluous command.
Sponsored by: Axcient
r353285:
zfs: fix the zdb_001_neg test
The test needed to be updated for r331701 (MFV illumos 8671400), which added
a "-k" option.
Sponsored by: Axcient
r353286:
zfs: skip the zfsd tests if zfsd is not running
* Replace runwattr with sudo
* Fix a scoping bug with the "dtst" variable
* Cleanup user properties created during tests
* Eliminate the checks for refreservation and send support. They will always
be supported.
* Fix verify_fs_snapshot. It seemed to assume that permissions would not yet
be delegated, but that's not how it's actually used.
* Combine verify_fs_promote with verify_vol_promote
* Remove some useless sleeps
* Fix backwards condition in verify_vol_volsize
* Remove some redundant cleanup steps in the tests. cleanup.ksh will handle
everything.
* Disable some parts of the tests that FreeBSD doesn't support:
* Creating snapshots with mkdir
* devices
* shareisci
* sharenfs
* xattr
* zoned
The sharenfs parts could probably be reenabled with more work to remove the
Solarisms.
r353288:
ZFS: mark hotspare_scrub_002_pos as an expected failure
"zpool scrub" doesn't detect all errors on active spares in raidz arrays
PR: 241069
Sponsored by: Axcient
r353289:
ZFS: fix the redundancy tests
* Fix force_sync_path, which ensures that a file is fully flushed to disk.
Apparently "zpool history"'s performance has improved, but exporting and
importing the pool still works.
* Fix file_dva by using undocumented zdb syntax to clarify that we're
interested in the pool's root file system, not the pool itself. This
should also fix the zpool_clear_001_pos test.
* Remove a redundant cleanup step
r353309:
zfs: fix the zfsd_autoreplace_003_pos test
The test declared that it only needed 5 disks, but actually tried to use 6.
Fix it to use just 5, which is all it really needs.
Sponsored by: Axcient
r353310:
zfs: fix the zfsd_hotspare_007_pos test
It was trying to destroy the pool while zfsd was detaching the spare, and
"zpool destroy" failed. Fix by waiting until the spare has fully detached.
Sponsored by: Axcient
r353360:
ZFS: multiple fixes to the zpool_import tests
* Don't create a UFS mountpoint just to store some temporary files. The
tests should always be executed with a sufficiently large TMPDIR.
Creating the UFS mountpoint is not only unneccessary, but it slowed
zpool_import_missing_002_pos greatly, because that test moves large files
between TMPDIR and the UFS mountpoint. This change also allows many of
the tests to be executed with just a single test disk, instead of two.
* Move zpool_import_missing_002_pos's backup device dir from / to $PWD to
prevent cross-device moves. On my system, these two changes improved that
test's speed by 39x. It should also prevent ENOSPC errors seen in CI.
* If insufficient disks are available, don't try to partition one of them.
Just rely on Kyua to skip the test. Users who care will configure Kyua
with sufficient disks.
Sponsored by: Axcient
r353361:
ZFS: in the tests, don't override PWD
The ZFS test suite was overriding the common $PWD variable with the path to
the pwd command, even though no test wanted to use it that way. Most tests
didn't notice, because ksh93 eventually restored it to its proper meaning.
Sponsored by: Axcient
r353366:
ZFS: fix the zpool_add_010_pos test
The test is necessarily racy, because it depends on being able to complete a
"zpool add" before a previous resilver finishes. But it was racier than it
needed to be. Move the first "zpool add" to before the resilver starts.
Sponsored by: Axcient
r353379:
zfs: multiple improvements to the zpool_add tests
* Don't partition a disk if too few are available. Just rely on Kyua to
ensure that the tests aren't run with insufficient disks.
* Remove redundant cleanup steps
* In zpool_add_003_pos, store the temporary file in $PWD so Kyua will
automatically clean it up.
* Update zpool_add_005_pos to use dumpon instead of dumpadm. This test had
never been ported to FreeBSD.
* In zpool_add_005_pos, don't format the dump disk with UFS. That was
pointless.
Rick Macklem [Wed, 30 Oct 2019 01:57:40 +0000 (01:57 +0000)]
MFC: r352736
Replace all mtx_assert() calls for n_mtx and ncl_iod_mutex with macros.
To be consistent with replacing the mtx_lock()/mtx_unlock() calls on
the NFS node mutex (n_mtx) and ncl_iod_mutex, this patch replaces
all mtx_assert() calls on these mutexes with macros as well.
This will simplify changing these locks to sx locks in a future commit.
However, this change may be delayed indefinitely, since it appears there
is a deadlock when vnode_pager_setsize() is called to shrink the size
and the NFS node lock is held.
There is no semantic change as a result of this commit.
Alan Somers [Wed, 30 Oct 2019 01:35:00 +0000 (01:35 +0000)]
MFC r352404, r352413-r352414
r352404:
fusefs: fix some minor issues with fuse_vnode_setparent
* When unparenting a vnode, actually clear the flag. AFAIK this is basically
a no-op because we only unparent a vnode when reclaiming it or when
unlinking.
* There's no need to call fuse_vnode_setparent during reclaim, because we're
about to free the vnode data anyway.
Reviewed by: emaste
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D21630
r352413:
fusefs: fix some minor Coverity CIDs in the tests
Where open(2) is expected to fail, the tests should assert or expect that
its return value is -1. These tests all accepted too much but happened to
pass anyway.
Rick Macklem [Mon, 28 Oct 2019 22:54:36 +0000 (22:54 +0000)]
MFC: r352664
Replace all mtx_lock()/mtx_unlock() on the iod lock with macros.
Since the NFS node mutex needs to change to an sx lock so it can be held when
vnode_pager_setsize() is called and the iod lock is held when the NFS node lock
is acquired, the iod mutex will need to be changed to an sx lock as well.
To simply the future commit that changes both the NFS node lock and iod lock
to sx locks, this commit replaces all mtx_lock()/mtx_unlock() calls on the
iod lock with macros.
There is no semantic change as a result of this commit.
Rick Macklem [Mon, 28 Oct 2019 01:44:31 +0000 (01:44 +0000)]
MFC: r352636
Replace all mtx_lock()/mtx_unlock() on n_mtx with the macros.
For a long time, some places in the NFS code have locked/unlocked the
NFS node lock with the macros NFSLOCKNODE()/NFSUNLOCKNODE() whereas
others have simply used mtx_lock()/mtx_unlock().
Since the NFS node mutex needs to change to an sx lock so it can be held when
vnode_pager_setsize() is called, replace all occurrences of mtx_lock/mtx_unlock
with the macros to simply making the change to an sx lock in future commit.
There is no semantic change as a result of this commit.
John Baldwin [Fri, 25 Oct 2019 22:17:24 +0000 (22:17 +0000)]
MFC 353023: Fix the EMBEDFS_FORMAT helper variable for riscv64.
It was defined with the wrong MACHINE_ARCH previously. This permits
using an MFS image without defining MD_ROOT_SIZE which has various
benefits (one being that the build is able to treat the MFS image as
a dependency and properly re-link the kernel with the new image when
building with NO_CLEAN).
John Baldwin [Fri, 25 Oct 2019 22:15:20 +0000 (22:15 +0000)]
MFC 353059: Restore description of packets dropped due to full reassembly queue.
r265408 renamed tcps_rcvmemdrop to tcps_rcvreassfull and gave it a more
specific description. r279122 (libxo-ification) reverted that change.
This commit brings it back, but with a small tweak to the description.
John Baldwin [Fri, 25 Oct 2019 22:04:05 +0000 (22:04 +0000)]
MFC 353585,353586: Support hot insertion and removal of PCI devices on EC2.
353585:
Export pci_attach() and pci_detach().
353586:
Support hot insertion and removal of PCI devices on EC2.
Install ACPI notify handlers on PCI devices with an _EJ0 method. This
handler is invoked when devices are added or removed.
- When an ACPI_NOTIFY_DEVICE_CHECK event posts, rescan the parent bus
device. Note that strictly speaking we only need to rescan the
specified device, but BUS_RESCAN is what is available, so we rescan
the entire bus.
- When an ACPI_NOTIFY_EJECT_REQUEST event posts, detach the device
associated with the ACPI handle, invoke the _EJ0 method, and then
delete the device.
Eventually this might be changed to vector notify events to devd in
userspace where devctl can be used instead to permit more complex
actions such as graceful unmounting of filesystems.
John Baldwin [Fri, 25 Oct 2019 21:23:44 +0000 (21:23 +0000)]
MFC 353371: Don't free the cursor boundary tag during vmem_destroy().
The cursor boundary tag is statically allocated in the vmem instead of
from the vmem_bt_zone. Explicitly remove it from the vmem's segment
list in vmem_destroy before freeing all the segments from the vmem.
John Baldwin [Fri, 25 Oct 2019 21:14:43 +0000 (21:14 +0000)]
MFC 353323: Set the FID field in lookaside crypto requests to the rx queue ID.
The PCI block in the adapter requires this field to be set to a valid
queue ID. It is not clear why it did not fail on all machines, but
the effect was that crypto operations reading input data via DMA
failed with an internal PCI read error on machines with 128G or more
of RAM.
Many extern struct pcpu <something>__pcpu declarations were
copied/pasted in sources. The issue is that the definition is MD, but
it cannot be provided by machine/pcpu.h due to actual struct pcpu
defined in sys/pcpu.h later than the inclusion of machine/pcpu.h.
This forced the copying when other code needed direct access to
__pcpu. There is no way around it, due to machine/pcpu.h supplying
part of struct pcpu fields.
To work around the problem, add a new machine/pcpu_aux.h header, which
should fill any needed MD definitions after struct pcpu definition is
completed. This allows to remove copies of __pcpu spread around the
source. Also on x86 it makes it possible to remove work arounds like
OFFSETOF_CURTHREAD or clang specific warnings supressions.
Michael Tuexen [Fri, 25 Oct 2019 18:17:56 +0000 (18:17 +0000)]
MFC r354044:
Ensure that the flags indicating IPv4/IPv6 are not changed by failing
bind() calls. This would lead to inconsistent state resulting in a panic.
A fix for stable/11 was committed in
https://svnweb.freebsd.org/base?view=revision&revision=338986
Reported by: syzbot+2609a378d89264ff5a42@syzkaller.appspotmail.com
Obtained from: jtl@
Sponsored by: Netflix, Inc.
Alexander Motin [Fri, 25 Oct 2019 15:02:50 +0000 (15:02 +0000)]
MFC r353454: Allocate device softc from the device domain.
Since we are trying to bind device interrupt threads to the device domain,
it should have sense to make memory often accessed by them local. If domain
is not known, fall back to round-robin.
Alexander Motin [Fri, 25 Oct 2019 14:55:37 +0000 (14:55 +0000)]
MFC r352630: Make nvme(4) driver some more NUMA aware.
- For each queue pair precalculate CPU and domain it is bound to.
If queue pairs are not per-CPU, then use the domain of the device.
- Allocate most of queue pair memory from the domain it is bound to.
- Bind callouts to the same CPUs as queue pair to avoid migrations.
- Do not assign queue pairs to each SMT thread. It just wasted
resources and increased lock congestions.
- Remove fixed multiplier of CPUs per queue pair, spread them even.
This allows to use more queue pairs in some hardware configurations.
- If queue pair serves multiple CPUs, bind different NVMe devices to
different CPUs.
Alexander Motin [Fri, 25 Oct 2019 14:51:21 +0000 (14:51 +0000)]
MFC r342771 (by cem): Expose threads-per-core and physical core count information
With new sysctls (to the best of our ability do detect them). Restructured
smp.4 slightly for clarity (keep relevant stuff closer to the top) while
documenting.
Note: A little more than just the tun/tap merge has been MFC'd to ease
auditing correctness/differences, as some later commits were cherry-picked
back to tun+tap.
r347241:
tun/tap: merge and rename to `tuntap`
tun(4) and tap(4) share the same general management interface and have a lot
in common. Bugs exist in tap(4) that have been fixed in tun(4), and
vice-versa. Let's reduce the maintenance requirements by merging them
together and using flags to differentiate between the three interface types
(tun, tap, vmnet).
This fixes a couple of tap(4)/vmnet(4) issues right out of the gate:
- tap devices may no longer be destroyed while they're open [0]
- VIMAGE issues already addressed in tun by kp
[0] emaste had removed an easy-panic-button in r240938 due to devdrn
blocking. A naive glance over this leads me to believe that this isn't quite
complete -- destroy_devl will only block while executing d_* functions, but
doesn't block the device from being destroyed while a process has it open.
The latter is the intent of the condvar in tun, so this is "fixed" (for
certain definitions of the word -- it wasn't really broken in tap, it just
wasn't quite ideal).
ifconfig(8) also grew the ability to map an interface name to a kld, so
that `ifconfig {tun,tap}0` can continue to autoload the correct module, and
`ifconfig vmnet0 create` will now autoload the correct module. This is a
low overhead addition.
r347394:
tuntap: Properly detach tap ifp
r347404:
tuntap: Don't down tap interfaces if LINK0 is set
r347483:
tuntap: Improve style
No functional change.
tun_flags of the tuntap_driver was renamed to ident_flags to reflect the
fact that it's a subset of the tun_flags that identifies a tuntap device.
This maps more easily (visually) to the TUN_DRIVER_IDENT_MASK that masks off
the bits of tun_flags that are applicable to tuntap driver ident. This is a
purely cosmetic change.
r351220:
if_tuntap: minor improvements
Rewrite a loop to avoid duplicating the exit condition.
Simplify mask processing in tunpoll().
Fix minor typos.
r351229:
tuntap: belatedly add MODULE_VERSION for if_tun and if_tap
When tun/tap were merged, appropriate MODULE_VERSION should have been added
for things like modfind(2) to continue to do the right thing with the old
names.
r352148:
Remove empty tap/tun modules directories after r347241
r353056:
if_tuntap: add a busy/unbusy mechanism, replace destroy OPEN check
A future commit will create device aliases when a tuntap device is renamed
so that it's still easily found in /dev after the rename. Said mechanism
will want to keep the tun alive long enough to either realize that it's
about to go away or complete the alias creation, even if the alias is about
to get destroyed.
While we're introducing it, using it to prevent open devices from going away
makes plenty of sense and keeps the logic on waking up tun_destroy clean, so
we don't have multiple places trying to cv_broadcast unless it's still in
use elsewhere.
r353057:
if_tuntap: create /dev aliases when a tuntap device gets renamed
Currently, if you do:
$ ifconfig tun0 create
$ ifconfig tun0 name wg0
$ ls -l /dev | egrep 'wg|tun'
You will see tun0, but no wg0. In fact, it's slightly more annoying to make
the association between the new name and the old name in order to open the
device (if it hadn't been opened during the rename).
Register an eventhandler for ifnet_arrival_events and catch interface
renames. We can determine if the ifnet is a tun easily enough from the
if_dname, which matches the cevsw.d_name from the associated tuntap_driver.
Some locking dance is required because renames don't require the device to
be opened, so it could go away in the middle of handling the ioctl, but as
soon as we've verified this isn't the case we can attempt to busy the tun
and either bail out if the tun device is dying, or we can proceed with the
rename.
We only create these aliases on a best-effort basis. Renaming a tun device
to "usbctl", which doesn't exist as an ifnet but does as a /dev, is clearly
not that disastrous, but we can't and won't create a /dev for that.
r353781:
tuntap(4): Drop TUN_IASET
This flag appears to have been effectively unused since introduction to
if_tun(4) -- drop it now.
r353782:
tuntap(4): break out after setting TUN_DSTADDR
This is now the only flag we set in this loop, terminate early.
r353785:
tuntap(4): Use make_dev_s to avoid si_drv1 race
This allows us to avoid some dance in tunopen for dealing with the
possibility of dev->si_drv1 being NULL as it's set prior to the devfs node
being created in all cases.
There's still the possibility that the tun device hasn't been fully
initialized, since that's done after the devfs node was created. Alleviate
this by returning ENXIO if we're not to that point of tuncreate yet.
This work is what sparked r353128, full initialization of cloned devices
w/ specified make_dev_args.
r353786:
tuntap(4): use cdevpriv w/ dtor for last close instead of d_close
cdevpriv dtors will be called when the reference count on the associated
struct file drops to 0, while d_close can be unreliable for cleaning up
state at "last close" for a number of reasons. As far as tunclose/tundtor is
concerned the difference is minimal, so make the switch.
r353877:
tuntap(4): properly declare if_tun and if_tap modules
Simply adding MODULE_VERSION does not do the trick, because the modules
haven't been declared. This should actually fix modfind/kldstat, which
r351229 aimed and failed to do.
This should make vm-bhyve do the right thing again when using the ports
version, rather than the latest version not in ports.
Kyle Evans [Fri, 25 Oct 2019 00:47:37 +0000 (00:47 +0000)]
MFC r353872-r353873: lualoader color handling fixes
r353872:
lualoader: don't botch disabling of color
When colors are disabled, color.escape{fg,bg} would return the passed in
color rather than the proper ANSI sequence for the color.
color.escape{fg,bg} would be wrong.
Instead return '', as the associated reset* functions will also return ''.
This should get rid of the funky '2' and '4' in the kernel selector if
you're booting serial.
r353873:
lualoader: fix setting of loader_color=NO in loader.conf(5)
Previously color.disabled would be calculated at color module load time,
then never touched again. We can detect serial boots beyond just what we're
told by loader.conf(5) so this works out in many cases, but we must
re-evaluate the situation after the config is loaded to make sure we're not
supposed to be forcing it enabled/disabled.
John Baldwin [Fri, 25 Oct 2019 00:16:57 +0000 (00:16 +0000)]
MFC 350662:
Detect invalid PCI devices more correctly in PCI interrupt router drivers.
- Check for an invalid device (vendor is invalid) before reading the
header type register when examining function 0 of a possible device.
- When iterating over functions of a device, reject any device whose
16-bit vendor is invalid rather than requiring the full 32-bit
vendor+device to be all 1's. In practice the latter check is
probably fine, but checking the vendor is what the PCI spec
recommends.
Kyle Evans [Thu, 24 Oct 2019 21:43:01 +0000 (21:43 +0000)]
MFC r353788: picobsd: add deprecation notices
Notices appear both in picobsd(8) (near the top for easy notice) and are
also printed to stderr on every invocation of picobsd for visibility.
The tentative date for removal is October 31st, as no volunteers have
stepped forward at all from postings to -arch@ at least.
picobsd: add deprecation notices
Move pcpu KVA out of .bss into dynamically allocated VA at
pmap_bootstrap(). This avoids demoting superpage mapping .data/.bss.
Also it makes possible to use pmap_qenter() for installation of
domain-local pcpu page on NUMA configs.
Refactor pcpu and IST initialization by moving it to helper functions.
John Baldwin [Thu, 24 Oct 2019 21:02:24 +0000 (21:02 +0000)]
MFC 351434: Fix universe to include arm LINT kernel configs.
Strip comments from the NOTES.armv[57] files as is done for other
NOTES files when building the corresponding LINT configs. Without
this, the LINT configs contained the NO_UNIVERSE comment from the
NOTES.armv[57] files.
John Baldwin [Thu, 24 Oct 2019 20:48:30 +0000 (20:48 +0000)]
MFC 350549: Set ISOPEN in namei flags when opening executable interpreters.
These vnodes are explicitly opened via VOP_OPEN via
exec_check_permissions identical to the main exectuable image.
Setting ISOPEN allows filesystems to perform suitable checks in
VOP_LOOKUP (e.g. close-to-open consistency in the NFS client).
Doing some tests with very high interrupt rates I've noticed that one of
conditions I added in r232207 to make interrupt threads in most cases
run on local CPU never worked as expected (worked only if previous time
it was executed on some other CPU, that is quite opposite). It caused
additional CPU usage to run full CPU search and could schedule interrupt
threads to some other CPU.
This patch removes that code and instead reuses existing non-interrupt
code path with some tweaks for interrupt case:
- On SMT systems, if current thread is idle, don't look on other threads.
Even if they are busy, it may take more time to do fill search and bounce
the interrupt thread to other core then execute it locally, even sharing
CPU resources. It is other threads should migrate, not bound interrupts.
- Try hard to keep interrupt threads within LLC of their original CPU.
This improves scheduling cost and supposedly cache and memory locality.
On a test system with 72 threads doing 2.2M IOPS to NVMe this saves few
percents of CPU time while adding few percents to IOPS.
John Baldwin [Thu, 24 Oct 2019 19:07:52 +0000 (19:07 +0000)]
MFC 350178: Improve the precision of bhyve's vPIT.
Use 'struct bintime' instead of 'sbintime_t' to manage times in vPIT
to postpone rounding to final results rather than intermediate
results. In tests performed by Joyent, this reduced the error measured
by Linux guests by 59 ppm.
Marius Strobl [Thu, 24 Oct 2019 14:18:06 +0000 (14:18 +0000)]
MFC: r353778
- In em_intr(), just call em_handle_link() instead of duplicating it.
- In em_msix_link(), properly handle IGB-class devices after the iflib(4)
conversion again by only setting EM_MSIX_LINK for the EM-class 82574
and by re-arming link interrupts unconditionally, i. e. not only in
case of spurious interrupts. This fixes the interface link state change
detection for the IGB-class. [1]
- In em_if_update_admin_status(), only re-arm the link state change
interrupt for 82574 and also only if such a device uses MSI-X, i. e.
takes advantage of autoclearing. In case of INTx and MSI as well as
for LEM- and IGB-class devices, re-arming isn't appropriate here and
setting EM_MSIX_LINK isn't either.
While at it, consistently take advantage of the hw variable.
Kyle Evans [Thu, 24 Oct 2019 04:08:24 +0000 (04:08 +0000)]
MFC r349928: Allow efi loader to get network params from uboot
Summary:
efi loader does not work with static network parameters. It always uses
BOOTP/DHCP and also uses RARP as a fallback. Problems with DHCP servers can
cause the loader to fail to populate network parameters.
Kyle Evans [Thu, 24 Oct 2019 04:04:53 +0000 (04:04 +0000)]
MFC r349471, r351166: Tweak EFI_STAGING_SIZE
r349471:
Increase EFI_STAGING_SIZE to 100MB on x64
To avoid failures when the large 18MB nvidia.ko module is being loaded,
increase EFI_STAGING_SIZE from 64MB to 100MB on x64 systems.
Leave the other platforms at 64MB.
r351166:
Reduce size of EFI_STAGING_SIZE to 32 on arm
Reduce the size of the EFI_STAGING area we allocate on arm to 32. On arm SBC
such as the NanoPi-NEOLTS the staging area allocation will fail on the 256MB
model with a staging size of 64.
Add support for an HTTP "network filesystem" using the UEFI's HTTP
stack.
This also supports HTTPS, but TianoCore EDK2 implementations currently
crash while fetching loader files.
Only IPv4 is supported at the moment. IPv6 support is planned for a
follow-up changeset.
Note that we include some headers from the TianoCore EDK II project in
stand/efi/include/Protocol verbatim, including links to the license instead
of including the full text because that's their preferred way of
communicating it, despite not being normal FreeBSD project practice.
r349395:
Disconnect EFI HTTP support
The EFI HTTP code has been causing boot failures for people, so disable it
while a fix is being worked on.
r349404:
Re-enable loader efi http boot and fix dv_open bug if dv_init failed
The code in efihttp.c was assuming that dv_open wouldn't be called if
dv_init failed. But the dv_init return value is currently ignored.
Add a new variable, `efihttp_init_done` and only proceed in dv_open if
it's true. This fixes the loader on systems without efi http support.
r349564:
Clean efihttp pointer-sign warnings
The Http protocol structure is using unsigned char strings, Use type casts
where needed.
r349565:
efihttp: comparison of integers of different signs
message.HeaderCount is UINTN (unsigned int), so should be i.
r349566:
efihttp: mark unused arguments with __unused
we do have __unused, lets use it.
r349613:
efihttp: mac and err can be used uninitialized
While there, also check if mac != NULL, and use pointer compare for ipv4
and dns.
r350444:
Fix EFI loader build when LOADER_NET_SUPPORT=no.
Add map-vdisk and unmap-vdisk commands to create virtual disk interface on
top of file. This will allow to use disk image from file system to load and
start the kernel.
By mapping file, we create vdiskX device, the device will be listed by lsdev
[-v] and can be accessed directly as ls vdisk0p1:/path or can be used as
value for currdev variable.
vdisk strategy function does not use bcache as we have bcache used with
backing file. vdisk can be unmapped when all consumers have closed the open
files.
In first iteration we do not support the zfs images because zfs pools do
keep the device open (there is no "zpool export" mechanism). Adding zfs
support is relatively simple, we just need to run zfs disk probe after
mapping is done.
Kyle Evans [Thu, 24 Oct 2019 03:52:32 +0000 (03:52 +0000)]
MFC r345477, r346675, r346984, r348748
r345477:
Distinguish between "no partition" and "choose best partition" with a
constant.
The values of the d_slice and d_partition fields of a disk_devdesc have a
few values with special meanings in the disk_open() routine. Through various
evolutions of the loader code over time, a d_partition value of -1 has
meant both "use the first ufs partition found in the bsd label" and "don't
open a bsd partition at all, open the raw slice."
This defines a new special value of -2 to mean open the raw slice, and it
gives symbolic names to all the special values used in d_slice and
d_partition, and adjusts all existing uses of those fields to use the new
constants.
The phab review for this timed out without being accepted, but I'm still
citing it below because there is useful commentary there.
r346675:
Restore the ability to open a raw disk or partition in loader(8).
The disk_open() function searches for "the best partition" when slice and
partition information is not provided as part of the device name. As of
r345477 the slice and partition fields of a disk_devdesc are initialized to
D_SLICEWILD and D_PARTWILD; in the past they were initialized to -1, which
was sometimes interpreted as meaning 'wildcard' and sometimes as 'open the
raw partition' depending on the context. So as an unintended side effect of
r345477 it became basically impossible to ever open a disk or partition
without doing the 'best partition' search. One visible effect of that was
the inability to open the raw disk to read the partition table correctly in
zfs_probe_dev(), leading to failures to find the zfs pool unless it was on
the first partition.
Now instead of always initializing slice and partition to wildcards, the
disk_parsedev() function initializes them based on the presence of a
path/file name following the device. If there is any path or filename
following the ':' that ends the device name, then slice and partition are
initialized to D_SLICEWILD and D_PARTWILD. If there is nothing after the
':' then it is considered to be a request to open the raw device or
partition itself (not a file stored within it), and the fields are
initialized to D_SLICENONE and D_PARTNONE.
With this change in place, all the tests in src/tools/boot are succesful
again, including the recently-added cases of booting from a zfs pool on
a partition other than slice 1 of the device.
r346984:
Use D_PARTISGPT rather than bare 255
These three cases dovetail with other places in the code where we use
or set D_PARTISGPT when we mean that the partitioning scheme is
GPT. Use this #define to make the code easier to undertand.
r348748:
loader: disk_open() should honor D_PARTNONE
The D_PARTNONE is documented to make it possible to open raw MBR
partition, but the current disk_open() does not really implement this
statement.
The current code is checking partition against -1 (D_PARTNONE) but does
attempt to open partition table in case we do have FreeBSD MBR partition
type.
Instead, we should check -2 (D_PARTWILD).
In case we do have MBR + BSD label, this code is only working because
by default, the first BSD partiton is created starting with relative sector
0, and we can still access the BSD table from that MBR slice.
Kyle Evans [Thu, 24 Oct 2019 03:48:28 +0000 (03:48 +0000)]
MFC r344560, r344718
r344560:
stand: Remove unused i386 EFI MD bits
r328169 removed the copy of bootinfo that would've made this somewhat
functional. However, this is irrelevant- earlier work in r292338 was done to
exit boot services in the MI bi_load() rather than having N copies of the
GetMemoryMap/ExitBootServices dance.
i386 never quite caught up to that; ldr_enter was still being called but
the prereq for that, ldr_bootinfo, was no longer. As a consequence, this
ExitBootServices() was being called with a mapkey=0, clearly bogus, and
reportedly breaking the boot in some instances.
r344718:
EFI: don't call printf after ExitBootServices, since it uses Boot Services
ExitBootServices terminates all boot services including console access.
Attempting to call printf afterwards can result in a crash, depending on the
implementation.
Move any printf statements to before we call bi_load, and remove any that
depend on calling bi_load first.
Kyle Evans [Thu, 24 Oct 2019 03:40:20 +0000 (03:40 +0000)]
MFC (proactively; not required yet) r339673: Fix stand/ build after r339671.
ffs_subr.c requires calculate_crc32c() from libkern. Unfortunately we
cannot just add libkern/crc32.c to libstand because crc32.o is already
compiled from contrib/zlib/crc32.c. Use the include trick to rename
the source.
Note that libstand also provides crc32.c which seems to be unused.
Kyle Evans [Thu, 24 Oct 2019 03:37:17 +0000 (03:37 +0000)]
MFC r340834: Disable build-id in i386 binary boot components
A user may enable build-id for all builds by adding
LDFLAGS=-Wl,--build-id=sha1 to /etc/make.conf. In this case the build-id
note ends added up to mbr and pmbr's .text, which makes it too large (it
ends up being 532 bytes). To avoid this explicitly turn off build-id for
these components.
Kyle Evans [Thu, 24 Oct 2019 03:20:27 +0000 (03:20 +0000)]
MFC r349201: efinet: Defer exclusively opening the network handles
Don't commit to exclusive access to the network device handle by
efinet until the loader has decided to load something through the
network. This allows for the possibility of other users of the
network device.
Kyle Evans [Thu, 24 Oct 2019 03:19:45 +0000 (03:19 +0000)]
MFC r349217: Tell loader to ignore newer features enabled on the root pool.
There are many new features in ZoF. Most, if not all, do not effect read
only usage.
Encryption in particular is enabled at the pool level but used at the
dataset level.
The loader obviously will not be able to boot if the boot dataset is
encrypted, but
should not care if some other dataset in the root pool is encrypted.
This is like efi_devpath_match, but allows differing device media
paths. Those just specify the partition information.
r348659:
Use newly minted efi_devpath_same_disk() instead of
efi_devpath_match(). This fixes a regression in r347193.
r348674:
Don't shadow a global zfsmount variable.
r348675:
ufs_module.c can't currently be compiled with -Wcast-align, but the
code is safe enough. Turn off the warning for now until I can find the
right construct to silence it in the code.
r348678:
Eliminate unused uuid parameters from gptread and gptread_table. We
only need it for the gptfind() function, where it's used.
r348760:
Use simple malloc/free instead of dropping down to the UEFI
BootServices AllocatePool/FreePool calls. They are simpler to use and
result in the same thing happening.
r348766:
Remove left-over status variables
r348768:
Rework the reporting of the priority.
Simplify the code a bit and rework how we report the results
of the probing.
r348811:
Break out the disk selection protocol from the rest of boot1.
Segregate the disk probing and selection protocol from the rest of the
boot loader.
r348812:
Create gptboot.efi
This is a primary boot loader that is intended to implement the
gptboot partition selection algorithm just like we did for BIOS
booting. While the preferred method for UEFI is to use the UEFI Boot
Manager protocol, there are situations where that can't be done: some
BIOS makers interfere with the protocol in unhelpful ways, there's a
new standard for a zero variable write from the client OS, and finally
for USB drives that might be mobile between systems with multiple
partitions there needs to be a media stable way to select.
r348814:
Add stuff to disable warning for %S
Add the customary warnings to disable format checking on armv7. Code
move to new files, and the unconditional setting of WARNS to 6
provoked it on tinerbox...
r348194:
loader: Add pnp functions for autoloading modules based on linker.hints
This adds some new commands to loader :
- pnpmatch
This takes a pnpinfo string as argument and tries to find a kernel module
associated with it. -v and -d option are available and are the same as in
devmatch (v is verbose, d dumps the hints).
- pnpload
This takes a pnpinfo string as argument and tries to load a kernel module
associated with it.
- pnpautoload
This will attempt to load every kernel module for each buses. Each buses are
probed, the probe function will generate pnpinfo string and load kernel module
associated with it if it exists.
Only simplebus for FDT system is implemented for now.
Since we need the dtb and overlays to be applied before searching the tree
fdt_devmatch_next will load and apply the dtb + overlays.
All the pnp parsing code comes from devmatch and is the same at 99%.
r348196:
loader: Remove unused variable
r348204:
Remove yet another unused variable.
r348207:
Initialize a variable to fix build with GCC.
Some of these files using <FOO>_DEBUG defined a DEBUG() macro to serve as a
debug-printf. -DDEBUG is useful to enable some debugging output across
multiple ELF/common parts, so switch the DEBUG-as-printf macros over to
something more like DPRINTF that is more commonly used for this kind of
thing and less likely to conflict.
userboot/elf64_freebsd debugging also assumed %llx for uint64; use PRIx64
instead.
r347219:
loader: use safer DPRINTF body for non-debug case
r347220:
loader: bcache code does not need to check argument for free()
The current console code is printing out 8 spaces for tab, calculate
the amount of spaces based on tab stops.
r347389:
loader: ptable_print() needs two tabs sometimes
Since the partition/slice names do vary in length, check the length
of the fixed part of the line against 3 * 8, if the lenth is less than
3 tab stops, print out extra tab.
use snprintf() instead of sprintf.
r347391:
loader: no-TERM_EMU is broken now
If TERM_EMU is not defined, we do not have curx variable. Use conout mode
for efi and expose get_pos() for i386.
r347393:
loader: use DPRINTF in biosdisk.c and define safe DPRINTF
r345066 did miss biosdisk.c.
Also define DPRINTF as ((void)0) for case we do not want debug printouts.
r347553:
loader: fix memory handling errors in module.c
file_loadraw():
check for file_alloc() and strdup() results.
we leak 'name'.
mod_load() does leak 'filename'.
mod_loadkld() does not need to check fp, file_discard() does check.
r348040:
stand: TARGET_ARCH is spelled MACHINE_ARCH in Makefiles
Add a wrapper around efi_delenv akin to efi_freebsd_getenv and
efi_getenv.
r346703:
Move initialization of the block device handles earlier (we're just
snagging them from UEFI BIOS). Call the device type init routines
earlier as well, as they don't depend on how the console is
setup. This will allow us to read files earlier in boot, so any rare
error messages that this might move only to the EFI console will be an
acceptable price to pay. Also tweak the order of has_kbd so it resides
next to the rest of the console code. It needs to be after we initialize
the buffer cache.
r346704:
Add the proper range of years for Netflix's copyright on this
file. Note that I wrote it.
r346879:
Read in and parse /efi/freebsd/loader.env from the boot device's
partition as if it were on the command line.
Fetch FreeBSD-LoaderEnv UEFI enviornment variable. If set, read in
loader environment variables from it. Otherwise read in
/efi/freebsd/loader.env. Both are read relative to the device
loader.efi loaded from (they aren't full UEFI device paths)
Next fetch FreeBSD-NextLoaderEnv UEFI environment variable. If
present, read the file it points to in as above and delete the UEFI
environment variable so it only happens once.
This lets one set environment variables in the bootloader.
Unfortunately, we don't have all the mechanisms in place to parse the
file, nor do we have the magic pattern matching in place that
loader.conf has. Variables are of the form foo=bar. No quotes are
supported, so spaces aren't allowed, for example. Also, variables like
foo_load=yes are intercepted when we parse the loader.conf file and
things are done based on that. Since those aren't done here, variables
that cause an action to happen won't work.
r346880:
Implement uefi_rootdev
If uefi_rootdev is set in the environment, then treat it like a device
path. Convert the string to a device path and see if we can find a
device that matches. If so, use that device at our root dev no matter
what. If it's bad in any way, the boot will fail.
When set, we ignore all the hints that the UEFI boot manager has set
for us. We also always fail back to the OK prompt when we can't find
the right thing to boot rather than failing back to the UEFI boot
manager. This has the side effect of also expanding the cases where we
fail back to the OK prompt to include when we're booted under UEFI,
but UEFI::BootCurrent isn't set in the environment and we can't find a
proper place to boot from.
r347023:
stand: correct mis-merge from r346879
Small mis-merge from multiple WIP resulted in block io media handles getting
double-initialized. This resulted in some installations oddly landing at the
mountroot prompt.
r347059:
Remove stray '*'
We're storing an EFI_HANDLE, not an pointer to a handle. Since
EFI_HANDLE is a void * anyway, this has little practical effect since
the conversion to / from void * and void ** is silent.
r347060:
When we can't get memory, trying again right away is going to
fail. Rather than print N failure messages, bail on the first one.
r347061:
Substitute boot1 with ${BOOT1}
Allow for other names to be built, so parameterize this makefile to
avoid hard coding boot1.
r347062:
Use SRC+= rather than SRC=
To allow boot1/Makefile to be included, use SRC+= rathern than SRC=
so the including Makefile can add additional sources to the build.
r347193:
Reach over and pull in devpath.c from libefi
This allows us to remove three nearly identical functions because the
differences don't matter, and the size difference is trivial.
r347194:
We only ever need one devinfo per handle. So allocate it outside of
looping over the filesystem modules rather than doing a malloc + free
each time through the loop. In addition, nothing changes from loop to
loop, so setup the new devinfo outside the loop as well.
r347201:
Simplify boot1 allocation of handles.
There's no need to pre-malloc the number of handles. Instead call
LocateHandles twice, once to get the size, and once to get the
data.
Kyle Evans [Thu, 24 Oct 2019 02:46:36 +0000 (02:46 +0000)]
MFC r346701: loader: fdt: Add fdt_is_setup function
When efi_autoload is called it will call fdt_setup_fdtp which setup the
dtb and overlays. If a user already loaded at dtb or overlays or just
printed the efi provided dtb, this will re-setup everything and also
re-applying the overlays.
Test that everything is setup before doing it again.
Add an interface to remove / delete UEFI variables.
r346353:
Minor tweak to the debug
Make it clear we're loading from UFS.
r346407:
Add define for CONST.
Newer interfaces take CONST parameters, so define CONST to minimize
differences between our headers and the standards docs.
r346408:
Add UEFI definitions related to converting string to DEVICE_PATH
Add definitions from UEFI 2.7 Errata B standards doc for converting a
text string to a device path. Added clearly missing 'e' at the end of
Device to resolve mismatch in that document in
EFI_DEVICE_PATH_FROM_TEXT_PROTOCOL element names.
r346409:
Add wrapper functions to convert strings to EFI_DEVICE_PATH
In anticipation of new functionality, create routines to convert char *
and a CHAR16 * to a EFI_DEVICE_PATH
EFI_DEVICE_PATH *efi_name_to_devpath(const char *path);
EFI_DEVICE_PATH *efi_name_to_devpath16(CHAR16 *path);
void efi_devpath_free(EFI_DEVICE_PATH *dp);
The first two return an EFI_DEVICE_PATH for the passed in paths. The
third frees up the storage the first two return when the caller is
done with it.
r346430:
Start to reduce the number of #ifdef EFI_ZFS_BOOT
There's a number of EFI_ZFS_BOOT #ifdefs that aren't needed, or can be
eliminated with some trivial #defines. Remove the EFI_ZFS_BOOT ifdefs
that aren't needed. Replace libzfs.h include which is not safe to
include without EFI_ZFS_BOOT with efizfs.h which is and now
conditionally included libzfs.h. Define efizfs_set_preferred away
and define efi_zfs_probe to NULL when ZFS is compiled out.
r346573:
Move setting of console earlier in boot.
There's no reason we can't setup the console first thing after the
arch flags are setup. We set it undconditionally to efi. This is a
good default, and will get us error messages to at least the efi
console no matter what. This will also prime the pump so that as other
variables are set, they will take effect and the console will be
correct as soon as those env vars are set. Also remove the redundant
setting of the console to efi when we know the console is efi.
r346575:
Create boot_img as a global variable
Get the information from the image that we're booting and store it in
a global variable. Prefer using this to passing it around. Remove the
special case for zfs that set the preferred boot handle by having it
uses this global variable diretly.
Kyle Evans [Thu, 24 Oct 2019 02:36:42 +0000 (02:36 +0000)]
MFC r345998-r346002, r346007-r346008: various loader improvements
r345998:
loader: malloc+bzero is calloc
Replace malloc+bzero in module.c with calloc.
r345999:
loader: file_addmodule should check for memory allocation
strdup() can return NULL.
r346000:
loader: remove pointer checks before free() in module.c
free() does check for NULL argument, remove duplicate checks.
r346001:
loader: file_addmetadata() should check for memory allocation
malloc() can return NULL.
r346002:
loader: mod_loadkld() error: we previously assumed 'last_file' could be null
The last_file variable is used to reset the loadaddr variable back to
original
value; however, it is possible the last_file is NULL, so we can not blindly
trust it. But then again, we can just save the original loadaddr and use
the saved value for recovery.
r346007:
loader: add file_remove() function to undo file_insert_tail().
346002 did miss the fact that we do not only undo the loadaddr, but also
we need to remove the inserted module. Implement file_remove() to do the
job.
r346008:
loader: command_lsefi: ret can be used uninitialized