Sigh, r346279 was also a test version with the reduced size doubled (so
it was actually double the full size in current kernels where the reduction
is null, so overran the mmapped() buffer).
Quick fix for slow clearing and context switches of large frame buffers
with old kernels, by breaking the support for large frame buffers in the
same way as for current kernels.
Large frame buffers may be too large to map into kva, and the kernel
(syscons) only uses the first screen page anyway, so r203535, r205557
and 248799 limit the buffer size in VESA modes to the first screen
page, apparently without noticing that this breaks applications by
using the same limit for user mappings as for kernel mappings. In
vgl, this makes the virtual screen the same as the physical screen.
However, this is almost a feature since clearing and switching large
(usually mostly unused) frame buffers takes too long. E.g., on a 16
year old low-end AGP card it takes about 12 seconds to clear the 128MB
frame buffer in old kernels that map it all and also map it with slow
attributes (e.g., uncacheable). Older PCI cards are even slower, but
usually have less memory. Newer PCIe cards are faster, but may have
many GB of memory. Also, vgl malloc()s a shadow buffer with the same
size as the frame buffer, so large frame buffers are even more wasteful
in applications than in the kernel.
Use the same limit in vgl as in newer kernels.
Virtual screens and panning still work in non-VESA modes that have
more than 1 page. The reduced buffer size in the kernel also breaks
mmap() of the last physical page in modes where the reduced size is
not a multiple of the physical page size. The same reduction in vgl
only reduces the virtual screen size.
Fix a variable name in r346215. Clearing of the right of the screen was
broken, except it worked accidentally in most cases where the virtual
screen is larger than the physical screen.
r176215 corrected readlink(2)'s return type and the type of the last
argument. readlink(2) was introduced in r177788 after being developed
as part of Google Summer of Code 2007; it appears to have inherited the
wrong return type.
Man pages and header files were already ssize_t; update syscalls.master
to match.
manu [Tue, 16 Apr 2019 12:40:49 +0000 (12:40 +0000)]
aw_syscon: Add a new compatible
Since 5.0 DTS the syscon controller have a new compatible as it
exports new subnodes, we currently only use it as a syscon provider
so just add the new compatible.
manu [Tue, 16 Apr 2019 12:39:31 +0000 (12:39 +0000)]
aw_rtc: Register the clocks
Since latest DTS update the rtc is supposed to register two clocks :
- osc32k (the 32k oscillator on the board that the RTC uses directly and
that other peripheral can use)
- iosc (the internal oscillator of the RTC when available which frequency
depend on the SoC revision)
Since we need the RTC before the proper clock control unit (because it uses
those clocks) attach it a BUS_PASS_BUS + MIDDLE and attach the clock control
unit at BUS_PASS_BUS + LAST for the SoC that requires it.
Correct a typo in the RPI-B ethernet config - the RPi-B includes a
SMC LAN9512 USB bridge and Ethernet 10/100 NIC/phy. The phy part of
this is supported by smscphy.
Tested On: RPi1 Model B
Approved by: grog, jhb (mentors)
MFC after: 3 days
Since r324184 the root node compatible for the original Raspberry Pi
is "brcm,bcm2835", add it to the compatible list of bcm2835_cpufreq.
Tested On: RPi1 Model B
Note that the default Das U-Boot FDT does not include a cpus clause
so actually adding a bcm2835_cpufreq device requires adding a FDT
overlay defining the cpu.
Approved by: grog, jhb (mentors)
MFC after: 3 days
tcpdump: disable Capsicum if -E option is provided.
The -E is used to provide a secret for decrypting IPsec.
The secret may be provided through command line or as the file.
The problem is that tcpdump doesn't support yet opening files in capability mode
and the file may contain a list of the files to open.
As a workaround, for now, let's just disable capsicum if the -E
the option is provided.
Check caller thread id before allowing to read the buffer
to make sure that it can only be accessed by the thread that
did the associated write to the TPM.
This allows an endless stream of random data within the given bounds.
It already worked if a seed was provided as the 4th argument but not
if one was left out.
In collaboration with: jhb
MFC after: 2 weeks
Relnotes: yes
cem [Mon, 15 Apr 2019 18:49:04 +0000 (18:49 +0000)]
random.3: Clarify confusing summary
random.3 is only "better" in contrast to rand.3. Both are non-cryptographic
pseudo-random number generators. The opening blurbs of each's DESCRIPTION
section does emphasize this, and correctly directs unfamiliar developers to
arc4random(3). However, the summary (".Nd" or Name description) of random.3
conflicted in tone and message with that warning.
Resolve the conflict by clarifying in the Nd section that random(3) is
non-cryptographic and pseudo-random. Elide the "better" qualifier which
implied a comparison but did not provide a specific object to contrast.
cem [Mon, 15 Apr 2019 18:40:36 +0000 (18:40 +0000)]
random(4): Block read_random(9) on initial seeding
read_random() is/was used, mostly without error checking, in a lot of
very sensitive places in the kernel -- including seeding the widely used
arc4random(9).
Most uses, especially arc4random(9), should block until the device is seeded
rather than proceeding with a bogus or empty seed. I did not spy any
obvious kernel consumers where blocking would be inappropriate (in the
sense that lack of entropy would be ok -- I did not investigate locking
angle thoroughly). In many instances, arc4random_buf(9) or that family
of APIs would be more appropriate anyway; that work was done in r345865.
A minor cleanup was made to the implementation of the READ_RANDOM function:
instead of using a variable-length array on the stack to temporarily store
all full random blocks sufficient to satisfy the requested 'len', only store
a single block on the stack. This has some benefit in terms of reducing
stack usage, reducing memcpy overhead and reducing devrandom output leakage
via the stack. Additionally, the stack block is now safely zeroed if it was
used.
One caveat of this change is that the kern.arandom sysctl no longer returns
zero bytes immediately if the random device is not seeded. This means that
FreeBSD-specific userspace applications which attempted to handle an
unseeded random device may be broken by this change. If such behavior is
needed, it can be replaced by the more portable getrandom(2) GRND_NONBLOCK
option.
On any typical FreeBSD system, entropy is persisted on read/write media and
used to seed the random device very early in boot, and blocking is never a
problem.
This change primarily impacts the behavior of /dev/random on embedded
systems with read-only media that do not configure "nodevice random". We
toggle the default from 'charge on blindly with no entropy' to 'block
indefinitely.' This default is safer, but may cause frustration. Embedded
system designers using FreeBSD have several options. The most obvious is to
plan to have a small writable NVRAM or NAND to persist entropy, like larger
systems. Early entropy can be fed from any loader, or by writing directly
to /dev/random during boot. Some embedded SoCs now provide a fast hardware
entropy source; this would also work for quickly seeding Fortuna. A 3rd
option would be creating an embedded-specific, more simplistic random
module, like that designed by DJB in [1] (this design still requires a small
rewritable media for forward secrecy). Finally, the least preferred option
might be "nodevice random", although I plan to remove this in a subsequent
revision.
To help developers emulate the behavior of these embedded systems on
ordinary workstations, the tunable kern.random.block_seeded_status was
added. When set to 1, it blocks the random device.
I attempted to document this change in random.4 and random.9 and ran into a
bunch of out-of-date or irrelevant or inaccurate content and ended up
rototilling those documents more than I intended to. Sorry. I think
they're in a better state now.
mlx5en: Enable new pfil(9) KPI ethernet filtering hooks
This allows efficient filtering at packet ingress on mlx5en.
Note that the packets are filtered (and potentially dropped) *before*
the driver has committed to (re)allocating an mbuf for the
packet. Dropped packets are treated essentially the same as an
error. Nothing is allocated, and the existing buffer is recycled. This
allows us to drop malicious packets at close to line rate with very
little CPU use.
Add quirk for ignoring SPCR AccessWidth values on the PL011 UART
The SPCR table on the Lenovo HR330A Ampere eMAG server indicates 8-bit
access, but 32-bit access is required for the PL011 to work.
PL011 on SBSA platforms always supports 32-bit access (and that was
hardcoded here before my EC2 fix), let's use 32-bit access for PL011
and 32BIT interface types.
Tested by emaste on Ampere eMAG and Cavium/Marvell ThunderX2.
Fix order of destructors between main binary and libraries.
Since inits for the main binary are run from rtld (for some time), the
rtld_exit atexit(3) handler, which is passed from rtld to the program
entry and installed by csu, is installed after any atexit(3) handlers
installed by main binary constructors. This means that rtld_exit() is
fired before main binary handlers.
Typical C++ static constructors are executed from init (either binary
or libs) but use atexit(3) to ensure that destructors are called in
the right order, independent of the linking order. Also, C++
libraries finalizers call __cxa_finalize(3) to flush library'
atexit(3) entries. Since atexit(3) entry is cleared after being run,
this would be mostly innocent, except that, atexit(rtld_exit) done
after main binary constructors, makes destructors from libraries
executed before destructors for main.
Fix by reordering atexit(rtld_exit) before inits for main binary, same
as it happened when inits were called by csu. Do it using new private
libc symbol with pre-defined ABI.
Reported. tested, and reviewed by: kan
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
r340744 broke the NFSv4 client, because it replaced pfind_locked() with a
call to pfind(), since pfind() acquires the sx lock for the pid hash and
the NFSv4 already holds a mutex when it does the call.
The patch fixes the problem by recreating a pfind_any_locked() and adding the
functions pidhash_slockall() and pidhash_sunlockall to acquire/release
all of the pid hash locks.
These functions are then used by the NFSv4 client instead of acquiring
the allproc_lock and calling pfind().
This causes some increase of the dynamic linker size, but benefits of
avoiding compiling private copy or the linker when debugging is
required. definitely worth it.
The dbg() calls can be compiled out by defining LD_NO_DEBUG symbol.
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
For writing and reading single pixels, avoid some pessimizations for
depths > 8. Add some smaller optimizations for these depths. Use a
more generic method for all depths >= 8, although this gives tiny
pessimizations for these depths.
For clearing the whole frame buffer, avoid the same pessimizations
for depths > 8. Add some larger optimizations for these depths. Use
an even more generic method for all depths >= 8 to give the optimizations
for depths > 8 and a tiny pessimization for depth 8.
The main pessimization was that old versions of bcopy() copy 1 byte at a
time for all trailing bytes. (i386 still does this. amd64 now pessimizzes
large sizes instead of small ones if the CPU supports ERMS. dev/fb gets
this wrong by mostly not using the bcopy() family or the technically correct
bus space functions but by mostly copying 2 bytes at a time using an
unoptimized loop without even volatile declarations to prevent the compiler
rewriting it.)
The sizes here are 1, 2, 3 or 4 bytes, so depths 9-16 were up to twice as
slow as necessary and depths 17-24 were up to 3 times slower than necessary.
Fix this (except depths 17-24 are still up to 2 times slower than necessary)
by using (builtin) memcpy() instead of bcopy() and reorganizing so that the
complier can see the small constant sizes. Reduce special cases while
reorganizing although this is slightly slower than adding special cases.
The compiler inlining (and even -O2 vs -O0) makes little difference compared
with reducing the number of accesses except on modern hardware it gives a
small improvement.
Clearing was also pessimized mainly by the extra accesses. Fix it quite
differently by creating a MEMBUF containing 1 line (in fast memory using
a slow method) and copying this. This is only slightly slower than reducing
everything to efficient memset()s and bcopy()s, but simpler, especially
for the segmented case. This works for planar modes too, but don't use it
then since the old method was actually optimal for planar modes (it works
by moving the slow i/o instructions out of inner loops), while for direct
modes the slow instructions were all in the invisible inner loop in bcopy().
Use htole32() and le32toh() and some type puns instead of unoptimized
functions for converting colors. This optimization is mostly in the noise.
libvgl is only supported on x86, so it could hard-code the assumption that
the byte order is le32, but the old conversion functions didn't hard-code
this.
* Use `MIN` instead of similar hand rolled macro.
* Sort headers.
* Use `errno.h` instead of `sys/errno.h`.
* Wrap the argument to sizeof in parentheses for clarity.
* Remove `__BSD_VISIBLE` and `_XOPEN_SOURCE` #defines to mute warnings about
incompatible snprintf definitions.
This fixes a number of warnings I've been seeing lately in my builds.
Sort makefile variables per style.Makefile(9) (`CFLAGS`/`CWARNFLAG.gcc`) and
bump `WARNS` to 3.
Add support for INET6 addresses to the kernel code that dumps open/lock state.
PR#223036 reported that INET6 callback addresses were not printed by
nfsdumpstate(8). This kernel patch adds INET6 addresses to the dump structure,
so that nfsdumpstate(8) can print them out, post-r346190.
The patch also includes the addition of #ifdef INET, INET6 as requested
by bz@.
Followup to -r344552 in which fsck_ffs checks for a size past the
last allocated block of the file and if that is found, shortens the
file to reference the last allocated block thus avoiding having it
reference a hole at its end.
This update corrects an error where fsck_ffs miscalculated the last
logical block of the file when the file contained a large hole.
When sending IPv4 packets on a SOCK_RAW socket using the IP_HDRINCL option,
ensure that the ip_hl field is valid. Furthermore, ensure that the complete
IPv4 header is contained in the first mbuf. Finally, move the length checks
before relying on them when accessing fields of the IPv4 header.
Reported by: jtl@
Reviewed by: jtl@
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D19181
Move mpr/mps drivers from per-arch NOTES files into the MI notes
file. They are in more arches they they aren't. Add appropriate
nodevice directives in powerpc and arm.
cem [Sat, 13 Apr 2019 04:42:17 +0000 (04:42 +0000)]
sort(1): Memoize MD5 computation to reduce repeated computation
Experimentally, reduces sort -R time of a 148160 line corpus from about
3.15s to about 0.93s on this particular system.
There's probably room for improvement using some digest other than md5, but
I don't want to look at sort(1) anymore. Some discussion of other possible
improvements in the Test Plan section of the Differential.
Summary:
Initial NUMA support:
- associate CPU with domain
- associate memory ranges with domain
- identify domain for devices
- limit device interrupt binding to appropriate domain
- Additionally fixes a bug in the setting of Maxmem which led to
only memory attached to the first socket being enabled for DMA
A pmap variant can opt in to numa support by by calling `numa_mem_regions`
at the end of pmap_bootstrap - registering the corresponding ranges with the
VM.
This yields a ~20% improvement in build times of llvm on dual socket POWER9
over non-NUMA.
powerpc/dtrace: Fix dtrace powerpc asm, and simplify stack walking
Fix some execution bugs in the dtrace powerpc asm. addme pulls in the carry
flag which we don't want, and the result wasn't recorded anyways, so the
following beq to check for exit condition wasn't checking the right
condition.
Simplify the stack walking in dtrace_isa.c, so there's only a single walker
that handles both pc and sp. This should make it easier to follow, and any
bugfix that may be needed for walking only needs to be made in one place
instead of two now.
Forgot to add the changes for DELAY(), which lowers priority during the
delay period. Also, mark the timebase read as volatile so newer GCC does
not optimize it away, as it reportedly does currently.
powerpc: Adjust priority NOPs, and make them functions
PowerISA 2.07 and PowerISA 3.0 both specify special NOPs for priority
adjustments, with "medium" priority being normal. We had been setting
medium-low as our normal priority. Rather than guess each time as to what
we want and the right NOP, wrap them in inline functions, and replace the
occurrances of the NOPs with the functions. Also, make DELAY() drop to very
low priority while waiting, so we don't burn CPU.
Coupled with r346143, this shaves off a modest 5-8% on buildworld times with
-j72. There may be more room for improvement with judicious use of these
NOPs.
powerpc64: Increase the nap level on power9 idling
The POWER9 documentation specifies that levels 0-3 are the 'lightest' sleep
level, meaning lowest latency and with no state loss. However, state 3 is
not implemented, and is instead reserved for future chips. This now
properly configures the PSSCR, specifying state 2 as the lowest level to
enter, but request level 0 for quickest sleep level. If the OCC determines
that the CPU can enter states 1 or 2 it will trigger the transition to those
states on demand.
It was pointed out that manually loading a .dtb to be used rather than
relying on platform-specific method for loading .dtb will result in overlays
not being applied. This was true because overlay loading was hacked into
fdt_platform_load_dtb, rather than done in a way more independent from how
the .dtb is loaded.
Instead, push overlay loading (for now) out into an
fdt_platform_load_overlays. This method easily allows ubldr to pull in any
fdt_overlays specified in the ub env, and omits overlay-checking on
platforms where they're not tested and/or not desired (e.g. powerpc). If we
eventually stop caring about fdt_overlays from ubenv (if we ever cared),
this method should get chopped out in favor of just calling
fdt_load_dtb_overlays() directly.
Reported by: Manuel Stühn (freebsdnewbie freenet de)
Cirrus-CI: pass OVMF env var to test script for upcoming changes
In review D19876 ian@ has some proposed improvements to the
tools/boot/ci-qemu-test.sh script. Start specifying the location of
OVMF.fd fetched by the Cirrus-CI build in advance of those changes.
Reinitialize multicast source filter structures after invalidation.
When leaving a multicast group, a hole may be created in the inpcb's
source filter and group membership arrays. To remove the hole, the
succeeding array elements are copied over by one entry. The multicast
code expects that a newly allocated array element is initialized, but
the code which shifts a tail of the array was leaving stale data
in the final entry. Fix this by explicitly reinitializing the last
entry following such a copy.
Reported by: syzbot+f8c3c564ee21d650475e@syzkaller.appspotmail.com
Reviewed by: ae
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D19872
cem [Thu, 11 Apr 2019 05:08:49 +0000 (05:08 +0000)]
sort(1): Simplify and bound random seeding
Bound input file processing length to avoid the issue reported in [1]. For
simplicity, only allow regular file and character device inputs. For
character devices, only allow /dev/random (and /dev/urandom symblink).
32 bytes of random is perfectly sufficient to seed MD5; we don't need any
more. Users that want to use large files as seeds are encouraged to truncate
those files down to an appropriate input file via tools like sha256(1).
(This does not change the sort algorithm of sort -R.)
Implement CMD53 block mode support for SDHCI and AllWinner-based boards
If a custom block size requested, use it, otherwise revert to the previous logic
of using just a data size if it's less than MMC_BLOCK_SIZE, and MMC_BLOCK_SIZE otherwise.
Add new fields to mmc_data in preparation to SDIO CMD53 block mode support
SDIO command CMD53 (IO_RW_EXTENDED) allows data transfers using blocks of 1-2048 bytes,
with a maximum of 511 blocks per request.
Extend mmc_data structure to properly describe such requests,
and initialize the new fields in kernel and userland consumers.
No actual driver changes happen yet, these will follow in the separate changes.
manu [Wed, 10 Apr 2019 19:18:05 +0000 (19:18 +0000)]
arm: dts: Remove some old DTS
RPI is using the firmware provided DTS since 12.0
Pandaboard works with the Linux DTS
RK* Exynos* and Meson*/Odroid* don't even work with current
source code, if someone wants to make them work again they
better use the Linux DTS.
Fix a small bug in the tcp_log_id where the bucket
was unlocked and yet the bucket-unlock flag was not
changed to false. This can cause a panic if INVARIANTS
is on and we go through the right path (though rare).
This fixes the correct bug :)
libbe(3): use libzfs name validation for datasets/snapshot names
Our home-rolled solution didn't quite capture all of the details, and we
didn't actually validate snapshot names at all. zfs_name_valid captures the
important details, but it doesn't necessarily expose the errors that we're
wanting to see in the be_validate_* functions. Validating lengths
independently, then the names, should make this a non-issue.
Refine r330113 to honor the ProducerConsumer flag most of the time.
While it is true that the ACPI spec says that the flag is only valid
on Extended Address Space Descriptors, examples of other descriptors
in the spec use the ProducerConsumer flag explicitly, and real
hardware uses it as well. In fact, even in the ASL of the Thunder X2
for which r330113 was a workaround, some devices use this flag on
non-Extended Address Space Descriptors correctly. Instead, only
ignore the flag for resources associated with the UART devices on the
Thunder X2 using the "ARMH0011" HID to identify these devices.
This should fix regressions from ignoring this flag in other contexts
such as Hyper-V.
Fix dirty buf exhaustion easily triggered with msdosfs.
If truncate(2) is performed on msdosfs file, which extends the file by
system-depended large amount, fs creates corresponding amount of dirty
delayed-write buffers, which can consume all buffers. Such buffers
cannot be flushed by the bufdaemon because the ftruncate() thread owns
the vnode lock. So the system runs out of free buffers, and even
truncate() thread starves, which means deadlock because it owns the
vnode lock.
Fix this by doing vnode fsync in extendfile() when low memory or low
buffers condition detected, which flushes all dirty buffers belonging
to the file being extended.
Note that the more usual fallback to bawrite() does not work
acceptable in this situation, because it would only allow one buffer
to be recycled. Other filesystems, most important UFS, do not allow
userspace to create arbitrary amount of dirty delayed-write buffers
without feedback, so bawrite() is good enough for them.
Reported and tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Don't pre-reserve resources for CPU devices when they are set.
CPUs can use shared (RF_SHAREABLE) resources for the I/O port used for
entering and exiting C states. If this I/O port is included in an ACPI
system resource device, then this happens to still work, but if the port
wasn't part of a system resource device, only the first CPU could allocate
the I/O port and use C states since resource_list_reserve() was always
allocating the resource from nexus0 without RF_SHAREABLE. By avoiding
the reservation, the flags from the bus_alloc_resource() in the CPU driver
(which include RF_SHAREABLE) are honored.
pci_cfgreg.c: Use io port config access for early boot time.
Some early PCIe chipsets are explicitly listed in the white-list to
enable use of the MMIO config space accesses, perhaps because ACPI
tables were not reliable source of the base MCFG address at that time.
For that chipsets, MCFG base was read from the known chipset MCFGbase
config register.
During very early stage of boot, when access to the PCI config space
is performed (see e.g. pci_early_quirks.c), we cannot map 255MB of
registers because the method used with pre-boot pmap overflows initial
kernel page tables.
Move fallback to read MCFGbase to the attachment method of the
x86/legacy device, which removes code duplication, and results in the
use of io accesses until MCFG is parsed or legacy attach called.
For amd64, pre-initialize cfgmech with CFGMECH_1, right now we
dynamically assign CFGMECH_1 to it anyway, and remove checks for
CFGMECH_NONE.
There is a mention in the Intel documentation for corresponding
chipsets that OS must use either io port or MMIO access method, but we
already break this rule by reading MCFGbase register, so one more
access seems to be innocent.