manu [Thu, 25 Apr 2019 20:08:43 +0000 (20:08 +0000)]
loader: fdt: Add fdt_is_setup function
When efi_autoload is called it will call fdt_setup_fdtp, which sets up the
dtb and overlays. If a user already loaded a dtb or overlays, or just
printed the efi-provided dtb, this will set up everything again and also
re-apply the overlays.
Test whether everything is already set up before doing it again.
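A minimal sketch of the "set up once" guard described above. Only the name `fdt_is_setup()` comes from the commit; the `fdt_ready` flag, the `overlay_applications` counter, `setup_fdt_stub()` and `autoload_fdt()` are hypothetical stand-ins, not the loader's code.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical state: set once the DTB and overlays have been applied. */
static bool fdt_ready = false;
/* Hypothetical counter so the example can show overlays are applied once. */
static int overlay_applications = 0;

/* Report whether the FDT has already been set up. */
static bool
fdt_is_setup(void)
{
	return (fdt_ready);
}

/* Hypothetical stand-in for the real setup: load the DTB, apply overlays. */
static int
setup_fdt_stub(void)
{
	overlay_applications++;
	fdt_ready = true;
	return (0);
}

/* Autoload path: skip setup if a user command already did it. */
static int
autoload_fdt(void)
{
	if (fdt_is_setup())
		return (0);
	return (setup_fdt_stub());
}
```

Calling `autoload_fdt()` twice applies the overlays only once, which is the behavior the commit restores.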
Rob's patch in D18564 cemented the SHLIBDIR because bsd.own.mk (included by
src.opts.mk) sets it to /usr/lib. r346546 somehow did not apply this part of
the patch, leaving it to get installed to the wrong place and subsequently
removed via ObsoleteFiles.
Reported by: jkim
MFC after: 3 days
X-MFC-With: r346546
Contrary to the comments, it was never used by core dumps or
debuggers. Instead, it used to hold the signal code of a pending
signal, but that was replaced by the 'ksi_code' member of ksiginfo_t
when signal information was reworked in 7.0.
Restore doing nothing for calls to VGLEnd() after the first. I broke this
in r346631. VGLEnd() clears some state variables as it restores state,
but not all of them, so it still needs to clear a single state variable
to indicate that it has completed. Put this clearing back where it was
(at the start instead of the end) to avoid moving bugs in the signal
handling.
Drivers can now pass up NUMA domain information via the
mbuf NUMA domain field. This information is then used
by TCP syncache_socket() to associate that information
with the inpcb. The domain information is then fed back
into transmitted mbufs in ip{6}_output(). This mechanism
is nearly identical to what is done to track RSS hash values
in the inp_flowid.
Follow-on changes will use this information for LACP egress
port selection, binding TCP pacers to the appropriate NUMA
domain, etc.
ian [Thu, 25 Apr 2019 15:09:21 +0000 (15:09 +0000)]
Restore the ability to open a raw disk or partition in loader(8).
The disk_open() function searches for "the best partition" when slice and
partition information is not provided as part of the device name. As of
r345477 the slice and partition fields of a disk_devdesc are initialized to
D_SLICEWILD and D_PARTWILD; in the past they were initialized to -1, which
was sometimes interpreted as meaning 'wildcard' and sometimes as 'open the
raw partition' depending on the context. So as an unintended side effect of
r345477 it became basically impossible to ever open a disk or partition
without doing the 'best partition' search. One visible effect of that was
the inability to open the raw disk to read the partition table correctly in
zfs_probe_dev(), leading to failures to find the zfs pool unless it was on
the first partition.
Now instead of always initializing slice and partition to wildcards, the
disk_parsedev() function initializes them based on the presence of a
path/file name following the device. If there is any path or filename
following the ':' that ends the device name, then slice and partition are
initialized to D_SLICEWILD and D_PARTWILD. If there is nothing after the
':' then it is considered to be a request to open the raw device or
partition itself (not a file stored within it), and the fields are
initialized to D_SLICENONE and D_PARTNONE.
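The defaulting rule above can be sketched as follows. The constant names come from the commit, but their numeric values, `struct devdesc_sketch` and `parse_defaults()` are hypothetical; the real disk_parsedev() also parses explicit slice and partition numbers, which this sketch omits.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical sentinel values; the real loader defines its own. */
#define D_SLICENONE	(-1)
#define D_SLICEWILD	(-2)
#define D_PARTNONE	(-1)
#define D_PARTWILD	(-2)

struct devdesc_sketch {
	int slice;
	int partition;
};

/*
 * Choose slice/partition defaults from what follows the ':' that ends the
 * device name: a path means "search for the best partition", nothing
 * means "open the raw device or partition itself".
 */
static int
parse_defaults(const char *spec, struct devdesc_sketch *dev)
{
	const char *colon = strchr(spec, ':');

	if (colon == NULL)
		return (-1);		/* not a device specification */
	if (colon[1] != '\0') {
		dev->slice = D_SLICEWILD;
		dev->partition = D_PARTWILD;
	} else {
		dev->slice = D_SLICENONE;
		dev->partition = D_PARTNONE;
	}
	return (0);
}
```

With this rule `disk0:` opens the raw device, while `disk0:/boot/loader.conf` triggers the best-partition search.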
With this change in place, all the tests in src/tools/boot are successful
again, including the recently-added cases of booting from a zfs pool on
a partition other than slice 1 of the device.
Previously, a pid check was used to prevent opening the tun(4); this works,
but may not make the most sense as we don't prevent the owner process from
opening the tun device multiple times.
The potential race described near tun_pid should not be an issue: if a
tun(4) is to be handed off, its fd has to have been sent via control message
or some other mechanism that duplicates the fd to the receiving process so
that it may set the pid. Otherwise, the pid gets cleared when the original
process closes it and you have no effective handoff mechanism.
Close up another potential issue with handing a tun(4) off by not clobbering
state if the closer isn't the controller anymore. If we want some state to
be cleared, we should do that a little more surgically.
Additionally, nothing prevents a dying tun(4) from being "reopened" in the
middle of tun_destroy as soon as the mutex is unlocked, quickly leading to a
bad time. Return EBUSY if we're marked for destruction, as well, and the
consumer will need to deal with it. The associated character device will be
destroyed in short order.
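A minimal sketch of the open-time check described above. `TUN_OPEN` is the flag name used elsewhere in this log, but its value here, the `TUN_DYING` bit, `struct tun_sketch` and `tun_open_sketch()` are hypothetical; in the driver the test would be made with the softc mutex held.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical flag bits standing in for the driver's softc flags. */
#define TUN_OPEN	0x0001
#define TUN_DYING	0x0002

struct tun_sketch {
	int flags;
};

/* Refuse a second open, and refuse any open once destruction has begun. */
static int
tun_open_sketch(struct tun_sketch *tp)
{
	/* The real check would run under the softc mutex. */
	if (tp->flags & (TUN_OPEN | TUN_DYING))
		return (EBUSY);
	tp->flags |= TUN_OPEN;
	return (0);
}
```

Checking a flag rather than a pid also rejects the owning process when it tries to open the device a second time.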
It seems that there should be a better way to handle this, but this seems to
be the more common approach and it should likely get replaced in all of the
places it happens... Basically, thread 1 is in the process of destroying the
tun/tap while thread 2 is executing one of the ioctls that requires the
tun/tap mutex and the mutex is destroyed before the ioctl handler can
acquire it.
This is only one of the races described/found in PR 233955.
"The mcbin (and likely others) have a nonstandard uart clock. This means
that the earlycon programming will incorrectly set the baud rate if it is
specified. The way around this is to tell the kernel to continue using the
preprogrammed baud rate. This is done by setting the baud to 0."
Our drivers (uart_dev_ns8250) do respect zero, but SPCR would error. Let's
not error.
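A sketch of decoding the SPCR baud-rate byte so that 0 is passed through instead of rejected. The code values (0 = keep current setting, 3/4/6/7 = 9600/19200/57600/115200) follow the SPCR table definition; the function name `spcr_baud_sketch()` is hypothetical and not the FreeBSD implementation.

```c
#include <assert.h>
#include <errno.h>

/*
 * Translate the SPCR "Baud Rate" byte into a rate for the UART driver.
 * Code 0 maps to 0, which uart_dev_ns8250 treats as "keep the
 * preprogrammed rate".
 */
static int
spcr_baud_sketch(unsigned char code, unsigned int *baud)
{
	switch (code) {
	case 0:
		*baud = 0;	/* leave the firmware's setting alone */
		return (0);
	case 3:
		*baud = 9600;
		return (0);
	case 4:
		*baud = 19200;
		return (0);
	case 6:
		*baud = 57600;
		return (0);
	case 7:
		*baud = 115200;
		return (0);
	default:
		return (EINVAL);
	}
}
```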
ian [Thu, 25 Apr 2019 00:08:15 +0000 (00:08 +0000)]
For the geli-gpt-zfs test images, both bios and uefi flavors, add a dummy
ufs partition as p2, and put the zfs partition at p3, to test the ability
of the zfs probe code to find a zfs pool on something other than the first
partition.
Parse MIPS relocations to unbreak kldxref on MIPS.
Parse the R_MIPS_32 and R_MIPS_64 relocations. Both Elf_Rel and
Elf_Rela relocations are handled since O32 MIPS uses Elf_Rel while N64
uses Elf_Rela. Note that R_MIPS_32 is only handled for 32-bit mips
and R_MIPS_64 for 64-bit. N32 is untested.
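Both relocation types compute S + A; the difference between Elf_Rel and Elf_Rela is only where the addend comes from. A sketch of that computation follows. The function `mips_reloc_value()` is hypothetical, and it assumes host and target byte order match, which real kldxref cannot assume.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define R_MIPS_32	2	/* word32: S + A */
#define R_MIPS_64	18	/* word64: S + A */

/*
 * Compute the relocated value for the word at `where`. For Elf_Rel (O32)
 * the addend is implicit and read from `where`; for Elf_Rela (N64) it is
 * explicit. Assumes host byte order equals target byte order.
 * Returns false for relocation types this sketch does not handle.
 */
static bool
mips_reloc_value(int type, uint64_t symval, const void *where,
    bool is_rela, int64_t rela_addend, uint64_t *out)
{
	switch (type) {
	case R_MIPS_32: {
		uint32_t addend32;

		if (is_rela)
			addend32 = (uint32_t)rela_addend;
		else
			memcpy(&addend32, where, sizeof(addend32));
		*out = (uint32_t)(symval + addend32);	/* truncate to 32 bits */
		return (true);
	}
	case R_MIPS_64: {
		uint64_t addend64;

		if (is_rela)
			addend64 = (uint64_t)rela_addend;
		else
			memcpy(&addend64, where, sizeof(addend64));
		*out = symval + addend64;
		return (true);
	}
	default:
		return (false);
	}
}
```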
This is fairly similar to the AES-GCM support in ccr(4) in that it will
fall back to software for certain cases (requests with only AAD and
requests that are too large).
A request to encrypt an empty payload without any AAD is unusual, but
it is defined behavior. Removing this assertion removes a panic and
instead returns the correct tag for an empty buffer.
Fix requests for "plain" SHA digests of an empty buffer.
To work around limitations in the crypto engine, empty buffers are
handled by manually constructing the final length block as the payload
passed to the crypto engine and disabling the normal "final" handling.
For HMAC this length block should hold the length of a single block
since the hash is actually the hash of the IPAD digest, but for
"plain" SHA the length should be zero instead.
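The difference between the two cases is only the 64-bit length field in the padding block. A sketch for hashes with a 64-byte block (SHA-1, SHA-224, SHA-256) follows; `empty_final_block()` is a hypothetical helper, not the driver's code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define SHA_BLOCK_LEN	64	/* SHA-1 / SHA-224 / SHA-256 block size */

/*
 * Build the final padding block hashed when the payload is empty.
 * For plain SHA the encoded message length is 0 bits; for HMAC the inner
 * hash has already consumed one block (key XOR ipad), so the length is
 * one block, expressed in bits.
 */
static void
empty_final_block(uint8_t block[SHA_BLOCK_LEN], bool hmac)
{
	uint64_t bits = hmac ? (uint64_t)SHA_BLOCK_LEN * 8 : 0;

	memset(block, 0, SHA_BLOCK_LEN);
	block[0] = 0x80;			/* the mandatory '1' padding bit */
	for (int i = 0; i < 8; i++)		/* 64-bit big-endian length */
		block[SHA_BLOCK_LEN - 1 - i] = (uint8_t)(bits >> (8 * i));
}
```

For HMAC the length field encodes 512 bits (bytes `0x02 0x00` at the end); for plain SHA it is all zeros.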
tools/boot/install-boot.sh was assuming that if a device was passed in,
it should operate on the current system and run efibootmgr etc. to
update the boot manager. However, rootgen.sh passes a md(4) device and
not a fixed disk.
Add a -u option to install-boot.sh to tell it to update the system
in-place and run efibootmgr etc.
Also, source install-boot.sh in rootgen.sh to allow it to find and
call make_esp_file etc. And pass the loader file to make_esp_file instead
of a directory name.
Reported by: ian
Reviewed by: ian,imp,tsoome
Differential Revision: https://reviews.freebsd.org/D19992
destroy_dev_sched_cb() is excessively asynchronous, and during a media change
retaste the new provider may appear before the device of the previous one gets
destroyed.
cem [Wed, 24 Apr 2019 18:24:22 +0000 (18:24 +0000)]
x86: Halt non-BSP CPUs on panic IPI_STOP
We may need the BSP to reboot, but we don't need any AP CPU that isn't the
panic thread. Any CPU landing in this routine during panic isn't the panic
thread, so we can just detect !BSP && panic and shut down the logical core.
The savings can be demonstrated in a bhyve guest with multiple cores; before
this change, N guest threads would spin at 100% CPU. After this change,
only one or two threads spin (depending on whether the panicking CPU was the BSP
or not).
Konstantin points out that this may break any future patches which allow
switching ddb(4) CPUs after panic and examining CPU-local state that cannot
be inspected remotely. In the event that such a mechanism is incorporated,
this behavior could be made configurable by tunable/sysctl.
VGLMouseFreeze() now only defers mouse signals and leaves it to higher
levels to hide and unhide the mouse cursor if necessary. (It is never
necessary, but is done to simplify the implementation. It is slow and
flashes the cursor. It is still done for copying bitmaps and clearing.)
VGLMouseUnFreeze() now only undoes 1 level of freezing. Its old
optimization to reduce mouse redrawing is too hard to do with unhiding
in higher levels, and its undoing of multiple levels was a historical
mistake.
VGLMouseOverlap() determines if a region overlaps the (full) mouse region.
VGLMouseFreezeXY() is the freezing and a precise overlap check combined
for the special case of writing a single pixel. This is the single-pixel
case of the old VGLMouseFreeze() with cleanups.
Fixes:
- check in more cases that the application didn't pass an invalid VIDBUF
- check for errors from copying a bitmap to the shadow buffer
- freeze the mouse before writing to the shadow buffer in all cases. This
was not done for the case of writing a single pixel (there was a race)
- don't spell the #defined values for VGLMouseShown as 0, 1 or boolean.
As with mlx5en, the idea is to drop unwanted traffic as early
in receive as possible, before mbufs are allocated and anything
is passed up the stack. This can save considerable CPU time
when a machine is under a flooding style DOS attack.
The major change here is to remove the unneeded abstraction where
callers of rxd_frag_to_sd() get back a pointer to the mbuf ring, and
are responsible for NULL'ing that mbuf themselves. Now this happens
directly in rxd_frag_to_sd(), and it returns an mbuf. This allows us
to use the decision (and potentially mbuf) returned by the pfil
hooks. The driver can now recycle mbufs to avoid re-allocation when
packets are dropped.
Reviewed by: marius (shurd and erj also provided feedback)
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D19645
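A minimal sketch of the reworked helper contract: the function takes the mbuf out of its ring slot and clears the slot itself, and a dropped packet's mbuf can be put back for reuse. `struct mbuf_sketch`, `struct rxq_sketch`, `frag_to_mbuf_sketch()` and `recycle_mbuf_sketch()` are hypothetical stand-ins for the driver's types and rxd_frag_to_sd().

```c
#include <assert.h>
#include <stddef.h>

struct mbuf_sketch {
	int id;				/* stand-in for struct mbuf */
};

struct rxq_sketch {
	struct mbuf_sketch *ring[8];	/* hypothetical receive mbuf ring */
};

/*
 * Take the mbuf out of the ring slot and return it, clearing the slot
 * here instead of leaving that to every caller.
 */
static struct mbuf_sketch *
frag_to_mbuf_sketch(struct rxq_sketch *rxq, int idx)
{
	struct mbuf_sketch *m = rxq->ring[idx];

	rxq->ring[idx] = NULL;
	return (m);
}

/* If a pfil hook dropped the packet, return the mbuf to the ring for reuse. */
static void
recycle_mbuf_sketch(struct rxq_sketch *rxq, int idx, struct mbuf_sketch *m)
{
	rxq->ring[idx] = m;
}
```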
The mouse signal SIGUSR2 was not turned off for normal termination and
in some other cases. Thus mouse signals arriving after the frame
buffer was unmapped always caused fatal traps. The fatal traps occurred
about 1 time in 5 if the mouse was wiggled while vgl is ending.
The screen switch signal SIGUSR1 was turned off after clearing the
flag that it sets. Unlike the mouse signal, this signal is handled
synchronously, but VGLEnd() does screen clearing which does the
synchronous handling. This race is harder to lose. I think it can
get vgl into deadlocked state (waiting in the screen switch handler
with SIGUSR1 to leave that state already turned off).
Turn off the mouse cursor before clearing the screen in VGLEnd().
Otherwise, clearing is careful to not clear the mouse cursor. Undrawing
an active mouse cursor uses a lot of state, so is dangerous for abnormal
termination, but so is clearing. Clearing is slow and is usually not
needed, since the kernel also does it (not quite right).
Add GRE-in-UDP encapsulation support as defined in RFC8086.
This GRE-in-UDP encapsulation allows the UDP source port field to be
used as an entropy field for load-balancing of GRE traffic in transit
networks. Also, most multiqueue network cards are able to distribute
incoming UDP datagrams to different NIC queues, while very few are
able to do this for GRE packets.
When an administrator enables UDP encapsulation with the command
`ifconfig gre0 udpencap`, the driver creates a kernel socket that binds
to the tunnel source address and, after udp_set_kernel_tunneling(), starts
receiving all UDP packets destined to port 4754. Each kernel socket
maintains a list of tunnels with different destination addresses. Thus
when several tunnels use the same source address, they are all handled by a
single socket. The IP[V6]_BINDANY socket option is used to be able to bind the
socket to the source address even if it is not yet available in the system.
This may happen on system boot, when a gre(4) interface is created before the
source address becomes available. The encapsulation and sending of packets
is done directly from gre(4) into ip[6]_output() without using sockets.
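The entropy idea can be sketched by deriving the UDP source port from a hash of the inner flow, so one flow always uses one port while different flows spread across ports. The ephemeral range 49152 to 65535, the modulo mapping and `gre_udp_sport()` are assumptions of this sketch, not a description of how gre(4) picks the port; only port 4754 comes from the text.

```c
#include <assert.h>
#include <stdint.h>

#define GRE_UDP_PORT	4754	/* destination port for GRE-in-UDP */
#define EPHEMERAL_LO	49152	/* assumed source-port range: 49152..65535 */
#define EPHEMERAL_SPAN	16384

/*
 * Map a hash of the inner flow onto the source-port range. Packets of one
 * flow keep the same port; different flows are spread across ports, which
 * transit ECMP and multiqueue NICs can then use for load balancing.
 */
static uint16_t
gre_udp_sport(uint32_t flowhash)
{
	return ((uint16_t)(EPHEMERAL_LO + flowhash % EPHEMERAL_SPAN));
}
```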
Keep two versions of the FreeBSD.conf pkg configuration file; one which
points at the "latest" branch and one which points at the "quarterly"
branch. Install the "latest" version unless overridden via the newly
added PKGCONFBRANCH variable.
This does not change user-visible behaviour (assuming said variable is
not set) but will make it easier to change the defaults in the future --
on stable branches we will want "latest" on x86 but "quarterly" elsewhere.
Discussed with: gjb
MFC after: 3 days
X-MFC: After MFCing this I'll make a direct commit to stable/* to
switch non-x86 architectures to "quarterly".
`xrange` is a Python 2.x-only idiom that does not exist in Python 3. Use `range`
instead. The values being iterated over are sufficiently small that using range
on Python 2.x won't be a noticeable issue.
Reapply whitespace style changes from r346443 after recent changes to tests/sys/opencrypto
From r346443:
"""
Replace hard tabs with four-character indentations, per PEP8.
This is being done to separate stylistic changes to the tests from functional
ones, as I accidentally introduced a bug to the tests when I used four-space
indentation locally.
"""
mtmsr and mtsr require context synchronizing instructions to follow. Without
a CSI, there's a chance for a machine check exception. This reportedly does
occur on an MPC750 (PowerMac G3).
r346307 inadvertently started installing FDT_DTS_FILE along with the kernel.
While this isn't necessarily bad, it was not intended or discussed and it
actively breaks some current setups that don't anticipate any .dtb being
installed when using static FDT. This change could be reconsidered down
the line, but it needs to be done with prior discussion.
Fix it by pushing FDT_DTS_FILE build down into the raw dtb.build.mk bits.
This technically allows modules building DTS to accidentally specify an
FDT_DTS_FILE that gets built but isn't otherwise useful (since it's not
installed), but I suspect this isn't a big deal and would get caught with
any kind of testing -- and perhaps this might end up useful in some other
way, for example by some module wanting to embed fdt in some other way than
our current/normal mechanism.
Reported by: Mori Hiroki <yamori813@yahoo.co.jp>
MFC after: 3 days
X-MFC-With: r346307
Test the AES-CCM test vectors from the NIST Known Answer Tests.
The CCM test vectors use a slightly different file format in that
there are global key-value pairs as well as section key-value pairs
that need to be used in each test. In addition, the sections can set
multiple key-value pairs in the section name. The CCM KAT parser
class is an iterator that returns a dictionary once per test where the
dictionary contains all of the relevant key-value pairs for a given
test (global, section name, section, test-specific).
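The section-name format can be illustrated with a small parser that splits a header such as `[Alen = 0, Plen = 0, Nlen = 7, Tlen = 4]` into key/value pairs. The FreeBSD tests themselves are written in Python; this C sketch, `struct kv`, `MAX_PAIRS` and `parse_section_name()` are hypothetical and cover only the section-name step, not the merging of global, section and per-test pairs.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define MAX_PAIRS	8

struct kv {
	char key[16];
	char val[64];
};

/*
 * Parse a CCM KAT section header such as "[Alen = 0, Nlen = 7]" into
 * key/value pairs. Returns the number of pairs, or -1 on malformed input.
 */
static int
parse_section_name(const char *line, struct kv *out, int max)
{
	const char *p = line;
	int n = 0, used;

	if (*p++ != '[')
		return (-1);
	while (*p != ']' && *p != '\0') {
		if (n == max)
			return (-1);
		/* key up to ' ' or '=', then '=', then value up to ',' or ']' */
		if (sscanf(p, " %15[^ =] = %63[^],]%n", out[n].key,
		    out[n].val, &used) != 2)
			return (-1);
		p += used;
		n++;
		if (*p == ',')
			p++;
	}
	return (*p == ']' ? n : -1);
}
```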
Note that all of the CCM decrypt tests use nonce and tag lengths that
are not supported by OCF (OCF only supports a 12 byte nonce and 16
byte tag), so none of the decryption vectors are actually tested.
Pass in an explicit digest length to the Crypto constructor since it
was assuming only sessions with a MAC key would have a MAC. Passing
an explicit size allows us to test the full digest in HMAC tests as
well.
Since r339624, HEAD does not need backslashes in syscalls.master;
however, to make merging r345471 to stable possible, add backslashes
to syscalls.master.
tun destruction will not continue until TUN_OPEN is cleared. There are brief
moments in tunclose where the mutex is dropped and we've already cleared
TUN_OPEN, so tun_destroy would be able to proceed while we're in the middle
of cleaning up the tun still. tun_destroy should be blocked until these
parts (address/route purges, mostly) are complete.
cem [Tue, 23 Apr 2019 17:18:20 +0000 (17:18 +0000)]
ip6_randomflowlabel: Avoid blocking if random(4) is not available
If kern.random.initial_seeding.bypass_before_seeding is disabled, random(4)
and arc4random(9) will block indefinitely until enough entropy is available
to initially seed Fortuna.
It seems that zero flowids are perfectly valid, so avoid blocking on random
until initial seeding takes place.
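A sketch of the non-blocking choice: if the RNG has not been seeded, return the valid "no label" value 0 instead of waiting for entropy. `random_flowlabel_sketch()`, the `seeded` parameter and `fixed_rng_sketch()` (a deterministic stand-in for arc4random(9)) are hypothetical; only the 20-bit width of the IPv6 flow label is taken from the protocol.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define FLOWLABEL_MASK_SKETCH	0x000fffffu	/* IPv6 flow label is 20 bits */

/* Hypothetical deterministic stand-in for a seeded RNG such as arc4random(9). */
static uint32_t
fixed_rng_sketch(void)
{
	return (0xabcdef12u);
}

/*
 * Return a random flow label, or 0 (unlabeled) if the RNG has not been
 * seeded yet, rather than blocking until it is.
 */
static uint32_t
random_flowlabel_sketch(bool seeded, uint32_t (*rng)(void))
{
	if (!seeded)
		return (0);
	return (rng() & FLOWLABEL_MASK_SKETCH);
}
```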
As mphyp_pte_unset() can also remove PTE entries, and as this can
happen in parallel with PTEs evicted by mphyp_pte_insert(), there
is a (rare) chance the PTE being evicted gets removed before
mphyp_pte_insert() is able to do so. Thus, the KASSERT should
check whether the result is H_SUCCESS or H_NOT_FOUND, to avoid
panics if the situation described above occurs.
More details about this issue can be found in PR 237470.
RFC 4391 specifies that the IB interface GID should be re-used as IPv6
link-local address. Since the code in in6_get_hw_ifid() ignored the
IFT_INFINIBAND case, ibX interfaces ended up with a link-local address
borrowed from some other interface, which is non-compliant.
Use the lowest eight bytes of the GID for filling the link-local address,
same as Linux.
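A sketch of forming the address: the fe80::/64 prefix followed by the lowest (last, in network byte order) eight bytes of the 16-byte GID as the interface identifier. `ib_linklocal_sketch()` is hypothetical, and it copies the bytes verbatim without adjusting any bits, so it should not be read as the exact in6_get_hw_ifid() logic.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/*
 * Build an IPv6 link-local address (fe80::/64) whose interface identifier
 * is the lowest eight bytes (bytes 8..15) of a 16-byte InfiniBand GID.
 */
static void
ib_linklocal_sketch(const uint8_t gid[16], uint8_t addr[16])
{
	memset(addr, 0, 16);
	addr[0] = 0xfe;			/* fe80::/64 link-local prefix */
	addr[1] = 0x80;
	memcpy(addr + 8, gid + 8, 8);	/* interface ID from the GID */
}
```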
In r297225 the initial INP_RLOCK() was replaced by an early
acquisition of an r- or w-lock depending on input variables
possibly extending the write locked area for reasons not entirely
clear but possibly to avoid a later case of unlock and relock
leading to a possible race condition and possibly in order to
allow the route cache to work for connected sockets.
Unfortunately the conditions were not replicated 1:1 (probably
because of the route cache needs). While this would not be a
problem in itself, the legacy IP code, unlike IPv6, has an extra case
when dealing with IP_SENDSRCADDR. In a particular case we were
holding an exclusive inp lock and acquired the shared udbinfo
lock (now epoch).
When then running into an error case, the locking assertions
on release fired as the udbinfo and inp lock levels did not match.
Break up the special case and in that particular case acquire
the udbinfo lock depending on the exclusivity of the inp lock.
Add the ability to report ATA device power mode with the command 'powermode'
to complement the existing ability to set it using the idle, standby and sleep
commands.
[PowerPC64] pseries-llan: increment packet output counters on error and success
Summary: when using the pseries-llan driver, the Opkts and Oerrs counters
(netstat -i) are always zero. This patch adds a small amount of error handling
to increment these counters.
powerpc64/pseries: Fix hypervisor call with extra arguments
Some hypervisor calls, such as H_SEND_LOGICAL_LAN, take more arguments than
are traditionally passed in registers. The HCALL ABI will accept these
arguments in r11 and r12. With ELFv2 ABI, these arguments are 2
double-words lower than ELFv1 ABI, as two double-words in the stack frame
are no longer used, and therefore removed from the frame. Fix the offsets
for loading the registers for the HCALL. This fixes the phyp_llan driver
with ELFv2 kernel.
ar: shuffle symbol offsets during conversion for 32-bit ar archives
During processing we maintain symbol offsets in the 64-bit s_so array,
and when writing the archive convert to 32-bit if no offsets are greater
than 4GB. However, this was somewhat inefficient as we looped over the
array twice: first, converting to big endian and second, writing each
32-bit value one at a time (and incorrectly so on big-endian platforms).
Instead, when writing a 32-bit archive, convert symbol data to
big endian (as required by the ar format) and shuffle it to the beginning
of the allocation at the same time.
Also correct emission of the symbol count on big endian platforms.
Further changes are planned, but this should fix powerpc64.
Reported by: jhibbits, mlinimon
Reviewed by: jhibbits, Gerald Aryeetey (earlier)
Tested by: jhibbits
MFC after: 10 days
MFC with: r346079
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D20007
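The single-pass convert-and-shuffle can be sketched as below. It is safe in place because entry i is read from byte 8*i before bytes 4*i through 4*i+3 are written, and every later read (at 8*j, j > i) lies past that write. `shuffle_to_be32()` is a hypothetical helper, not the ar(1) code, and it assumes the caller has already verified that every offset fits in 32 bits.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/*
 * Convert n 64-bit symbol offsets in place into n big-endian 32-bit
 * values packed at the start of the same buffer, independent of host
 * byte order. The caller must have checked that all offsets fit in 32 bits.
 */
static void
shuffle_to_be32(void *buf, size_t n)
{
	unsigned char *p = buf;

	for (size_t i = 0; i < n; i++) {
		uint64_t v;

		memcpy(&v, p + 8 * i, sizeof(v));	/* read before overwrite */
		p[4 * i + 0] = (unsigned char)(v >> 24);
		p[4 * i + 1] = (unsigned char)(v >> 16);
		p[4 * i + 2] = (unsigned char)(v >> 8);
		p[4 * i + 3] = (unsigned char)v;
	}
}
```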
Fix mouse cursor coloring in depths > 8 (previously, a hack that only
worked right for white interiors and black borders was used). Advertise
this by changing the default colors to a red interior and a white
border (the same as the kernel default). Add undocumented env variables
for changing these colors. Also change to the larger and better-shaped
16x10 cursor sometimes used in the kernel. The kernel choice is
fancier, but libvgl is closer to supporting the larger cursors needed
in newer modes.
The (n)and-or logic for the cursor doesn't work right for more than 2
colors. The (n)and part only masks out all color bits for the pixel
under the cursor when all bits are set in the And mask. With more
complicated logic, the non-masked bits could be used to implement
translucent cursors, but they actually just gave strange colors
(especially in packed and planar modes where the bits are indirect
through 1 or 2 palettes so it is hard to predict the final color).
They also gave a bug for writing pixels under the cursor. The
non-masked bits under the cursor were not combined in this case.
Drop support for combining with bits under the cursor by making any nonzero
value in the And mask mean all bits set.
Convert the Or mask (which is represented as a half-initialized 256-color
bitmap) to a fully initialized bitmap with the correct number of colors.
The 256-color representation must be as in 3:3:2 direct mode iff the final
bitmap has more than 256 colors. The conversion of colors is not very
efficient, so convert at initialization time.
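The two rules above can be sketched as follows: any nonzero And-mask value becomes "all bits set", and a 3:3:2 color byte is expanded to 8 bits per component once at initialization. The RRRGGGBB bit layout, the scaling formula and both function names are assumptions of this sketch, not libvgl's code.

```c
#include <assert.h>
#include <stdint.h>

/* Any nonzero And-mask value masks out every bit of the pixel. */
static uint32_t
and_mask_sketch(uint8_t v)
{
	return (v != 0 ? 0xffffffffu : 0);
}

/*
 * Expand a 3:3:2 color byte (assumed layout RRRGGGBB) into 0xRRGGBB by
 * scaling each component to the full 8-bit range.
 */
static uint32_t
rgb332_to_rgb888(uint8_t c)
{
	uint32_t r = (c >> 5) & 0x7, g = (c >> 2) & 0x7, b = c & 0x3;

	return (((r * 255 / 7) << 16) | ((g * 255 / 7) << 8) | (b * 255 / 3));
}
```

Doing this conversion once at initialization, as the commit describes, keeps the slow per-color arithmetic out of the cursor-drawing path.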
Track device's NUMA domain in ifnet & alloc ifnet from NUMA local memory
This commit adds new if_alloc_domain() and if_alloc_dev() methods to
allocate ifnets. When called with a domain on a NUMA machine,
if_alloc_domain() will record the NUMA domain in the ifnet, and it will
allocate the ifnet struct from memory which is local to that NUMA
node. Similarly, if_alloc_dev() is a wrapper for if_alloc_domain()
which uses a driver-supplied device_t to call if_alloc_domain() with
the appropriate domain.
Note that the new if_numa_domain field fits in an alignment pad in
struct ifnet, and so does not alter the size of the structure.
Build libclang_rt/profile on all clang-supported architectures
There's no reason why a special case needs to be added specifically for amd64,
arm, and i386, as the code is written in machine architecture agnostic C/C++.
This will make it possible for all clang-supported architectures to produce
runtime coverage with `--coverage`.
MFC after: 2 weeks
Reviewed by: dim
Differential Revision: https://reviews.freebsd.org/D20003
r345708 worked for the base system, but unfortunately, caused a lot of
disruption for third-party packages that relied on C++, since bsd.sys.mk is
used by applications outside the base system. The defaults picked didn't match
the compiler's defaults and broke some builds that didn't specify a standard,
as well as some that overrode the value by setting `-std=gnu++14` (for
example) manually.
This change takes a more relaxed approach to appending `-std=${CXXSTD}` to
CXXFLAGS, by only doing so when the value is specified, as opposed to
overriding the standard set by an end-user. This avoids the need to bake
a no-op default into bsd.sys.mk for supported compiler-toolchain
versions.
In order to make this change possible, add CXXSTD to Makefile snippets which
relied on the default value (c++11) added in r345708.
MFC after: 2 weeks
MFC with: r345708, r346574
Reviewed by: emaste
Reported by: jbeich
Differential Revision: https://reviews.freebsd.org/D19895 (as part of a larger change)
Get the information from the image that we're booting and store it in
a global variable. Prefer using this to passing it around. Remove the
special case for zfs that set the preferred boot handle by having it
use this global variable directly.
This change allows the user to once again override the C++ standard, restoring
high-level pre-r345708 behavior.
This also unbreaks building lib/ofed/libibnetdisc/Makefile with a non-C++11
capable compiler, e.g., g++ 4.2.1, as the library supported being built with
older C++ standards.
MFC after: 2 weeks
MFC with: r345708
Reviewed by: emaste
Reported by: jbeich
Differential Revision: https://reviews.freebsd.org/D19895 (as part of a larger change)
There's no reason we can't set up the console first thing after the
arch flags are set up. We set it unconditionally to efi. This is a
good default, and will get us error messages to at least the efi
console no matter what. This will also prime the pump so that as other
variables are set, they will take effect and the console will be
correct as soon as those env vars are set. Also remove the redundant
setting of the console to efi when we know the console is efi.
cxgbe(4): Make sure bundled_fw is always initialized before use.
This fixes a bug that prevented the driver from auto-flashing the
firmware when it didn't see one on the card. This feature was
introduced in r321390 and this bug was introduced in r343269.
cem [Mon, 22 Apr 2019 16:29:34 +0000 (16:29 +0000)]
random.3: Remove obsolete BUGS section
Relative performance to rand(3) is sort of irrelevant; they do different things
and a user with sensitivity to RNG performance won't use libc random(3) anyway.
The historical note about bad seeding is long obsolete, referring to a 1996 or
earlier version of FreeBSD.
Use separate descriptors in bhyve's stdio uart backend.
bhyve was previously using stdin for both reading and writing to the
console, which made it difficult to redirect console output. Use
stdin for reading and stdout for writing. This makes it easier to use
bhyve as a backend for syzkaller.
As a side effect, the change fixes a minor bug which would cause bhyve
to fail with ENOTCAPABLE if configured to use nmdm for com1 and stdio
for com2.
bhyveload already uses separate descriptors, as does the bvmcons driver.
Reviewed by: jhb
MFC after: 1 month
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D19788
libbe(3): allow creation of arbitrary depth boot environments
libbe currently only provides an API to create a recursive boot environment,
without any formal support for intentionally limiting the depth. This
changeset adds an API, be_create_depth, that may be used to arbitrarily
restrict the depth of the new BE.
Disable vm map consistency checking by default on INVARIANTS kernels.
The checks are too expensive for a general-purpose kernel. Enable the
checks when DIAGNOSTIC is defined and provide a sysctl to enable the
checks in a non-DIAGNOSTIC INVARIANTS kernel.
Reviewed by: kib
Discussed with: Doug Moore <dougm@rice.edu>
MFC after: 3 days
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D19999
Fix sys.kern.coredump_phnum_test.coredump_phnum on i386
The zero-padding used when printing the Size field on 32-bit architectures is
5, not 15. Adjust the regular expression to work with both the 32-bit and
64-bit cases.
Initialize `oldlen` to the size of the value, instead of leaving the value
uninitialized. Leaving it uninitialized seems to work by accident on amd64 when
running 64-bit programs, but not on i386.
Fix panic in network stack due to memory use after free in relation to
fragmented packets.
When sending IPv4 and IPv6 fragmented packets and a fragment is lost,
the mbuf making up the fragment will remain in the temporary hashed
fragment list for a while. If the network interface departs before the
so-called slow timeout clears the packet, the fragment causes a panic
when the timeout kicks in due to accessing a freed network interface
structure.
Make sure that when a network device is departing, all hashed IPv4 and
IPv6 fragments belonging to it get freed.
Backtrace:
panic()
icmp6_reflect()
hlim = ND_IFINFO(m->m_pkthdr.rcvif)->chlim;
^^^^ rcvif->if_afdata[AF_INET6] is NULL.
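The fix amounts to walking the queued fragments when an interface departs and dropping every one whose receive interface is that ifnet. A simplified sketch on a singly linked list follows. The real reassembly queues are hashed and hold mbufs; `struct ifnet_sketch`, `struct frag_sketch` and `purge_frags_for_if()` are hypothetical.

```c
#include <assert.h>
#include <stdlib.h>

struct ifnet_sketch {
	int index;
};

struct frag_sketch {
	struct frag_sketch *next;
	struct ifnet_sketch *rcvif;	/* interface the fragment arrived on */
};

/*
 * Unlink and free every queued fragment that refers to the departing
 * interface, so the slow timeout never dereferences a freed ifnet.
 * Returns the number of fragments freed.
 */
static int
purge_frags_for_if(struct frag_sketch **head, const struct ifnet_sketch *ifp)
{
	struct frag_sketch **pp = head;
	int freed = 0;

	while (*pp != NULL) {
		struct frag_sketch *f = *pp;

		if (f->rcvif == ifp) {
			*pp = f->next;		/* unlink before freeing */
			free(f);
			freed++;
		} else
			pp = &f->next;
	}
	return (freed);
}
```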