The change makes the user and kernel address spaces on i386
independent, giving each almost the full 4G of usable virtual addresses
except for one PDE at the top used for the trampoline and per-CPU
trampoline stacks, and system structures that must always be mapped,
namely IDT, GDT, common TSS and LDT, and process-private TSS and LDT if
allocated.
By using 1:1 mapping for the kernel text and data, it became
possible to eliminate the assembler part of locore.S which bootstraps
the initial page table and KPTmap. The code is rewritten in C and moved
into pmap_cold(). The comment in vmparam.h explains the KVA
layout.
There is no PCID mechanism available in protected mode, so each
kernel/user switch back and forth completely flushes the TLB, except
for the trampoline PTD region. The TLB invalidations for userspace
become trivial, because IPI handlers switch page tables. On the other
hand, context switches no longer need to reload %cr3.
copyout(9) was rewritten to use vm_fault_quick_hold(). An issue for
the new copyout(9) is compatibility with wiring user buffers around
sysctl handlers. This explains the two kinds of locks for copyout ptes
and the accounting of the vslock() calls. The vm_fault_quick_hold()
path, AKA the slow path, is only tried after the 'fast path' has failed;
the fast path temporarily changes the mapping to userspace and copies
the data to/from a small per-cpu buffer in the trampoline. If a page
fault occurs during the copy, it is short-circuited by exception.s so
that it does not even reach C code.
The change was motivated by the need to implement the Meltdown
mitigation, but instead of KPTI the full split is done. The i386
architecture already shows sizing problems; in particular, it is
impossible to link clang and lld with debugging. I expect that the
issues due to the virtual address space limits will only get worse,
and the split gives more life to the platform.
Tested by: pho
Discussed with: bde
Sponsored by: The FreeBSD Foundation
MFC after: 1 month
Differential revision: https://reviews.freebsd.org/D14633
By popular demand, pkg now walks through the arguments passed and,
if it finds -y or --yes, accepts them as equivalent to the
ASSUME_ALWAYS_YES environment variable.
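
A minimal sketch of the argument walk described above. The helper name
args_assume_yes() is hypothetical and this is not pkg's actual code; it
only illustrates scanning the argument vector for -y or --yes.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper, not pkg's real code: return true if any argument
 * is exactly "-y" or "--yes".  argv[0] (the program name) is skipped.
 */
bool
args_assume_yes(int argc, char *const argv[])
{
    assert(argc == 0 || argv != NULL);
    for (int i = 1; i < argc; i++) {
        if (strcmp(argv[i], "-y") == 0 ||
            strcmp(argv[i], "--yes") == 0)
            return (true);
    }
    return (false);
}
```

A caller could treat a true result the same way as a set
ASSUME_ALWAYS_YES variable.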
A few glyphs were converted incorrectly:
U+00A6 broken bar - center
U+2022 bullet - center
U+2026 horizontal ellipsis - move to bottom of character cell
When we had both groff and mandoc in base, we decided to keep the roff(7)
manpage from groff. When removing groff, we forgot to install the mandoc version
instead.
ken [Thu, 12 Apr 2018 21:21:18 +0000 (21:21 +0000)]
Handle Programmable Early Warning for control commands in sa(4).
When the tape position is inside the Early Warning area, the tape
drive will return a sense key of NO SENSE, and an ASC/ASCQ of
0x00,0x02, which means "End-of-partition/medium detected". If
this was in response to a control command like WRITE FILEMARKS,
we correctly translate this as informational status and return
0 from saerror().
Programmable Early Warning should be handled the same way, but
we weren't handling it that way. As a result, if a PEW status
(sense key of NO SENSE, ASC/ASCQ of 0x00,0x07, "Programmable early
warning detected") came back in response to a WRITE FILEMARKS,
we returned an error.
The impact of this was that if an application was writing to a
sa(4) device, and a PEW area was set (in the Device Configuration
Extension subpage -- mode page 0x10, subpage 1), and a filemark
needed to be written on close, we could wind up returning an error
to the user on close because of a "failure" to write the filemarks.
It actually isn't a failure, but rather just a status report from
the drive, and shouldn't be treated as a failure.
sys/cam/scsi/scsi_sa.c:
For control commands in saerror(), treat asc/ascq 0x00,0x07
the same as 0x00,{0-5} -- not an error. Return 0, since
the command actually did succeed.
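
The rule above can be sketched as a small predicate. This is an
illustration only, not the code in scsi_sa.c; the function name and the
SKEY_NO_SENSE macro are local stand-ins.

```c
#include <stdbool.h>
#include <stdint.h>

#define SKEY_NO_SENSE 0x00 /* local stand-in for the NO SENSE sense key */

/*
 * Hypothetical predicate: for a control command, decide whether a
 * NO SENSE result with the given ASC/ASCQ is only a status report.
 * ASC 0x00 with ASCQ 0x00-0x05 was already accepted; ASCQ 0x07
 * (programmable early warning) is the newly accepted case.
 */
bool
sa_ctl_status_is_informational(uint8_t sense_key, uint8_t asc, uint8_t ascq)
{
    if (sense_key != SKEY_NO_SENSE || asc != 0x00)
        return (false);
    return (ascq <= 0x05 || ascq == 0x07);
}
```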
Reported by: Dr. Andreas Haakh <andreas@haakh.de>
Tested by: Dr. Andreas Haakh <andreas@haakh.de>
Sponsored by: Spectra Logic
MFC after: 3 days
Use cfg->nomatch_verdict as the return value from the NAT64LSN handler
when the given mbuf is considered not matched.
If the mbuf was consumed or freed during handling, we must return
IP_FW_DENY, since ipfw's pfil handler ipfw_check_packet() expects
IP_FW_DENY when the mbuf pointer is NULL. This fixes KASSERT panics
when NAT64 is used with INVARIANTS. Also remove the unused nomatch_final
field from struct nat64lsn_cfg.
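
A sketch of the return-value rule, under the assumption that a NULL
packet pointer is how "consumed or freed" is signalled. FW_PASS, FW_DENY,
struct pkt and nat64_verdict() are illustrative stand-ins, not the ipfw
definitions.

```c
#include <stddef.h>

/* Stand-in verdict values for illustration only. */
enum { FW_PASS = 0, FW_DENY = 1 };

/* Minimal stand-in for struct mbuf. */
struct pkt {
    int len;
};

/*
 * Hypothetical helper: pick the handler's return value.  A NULL packet
 * pointer means the packet was consumed or freed, so the only safe
 * answer is "deny"; otherwise use the configured no-match verdict.
 */
int
nat64_verdict(const struct pkt *m, int nomatch_verdict)
{
    if (m == NULL)
        return (FW_DENY);
    return (nomatch_verdict);
}
```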
The miscellaneous x86 sysent->sv_setregs() implementations tried to
migrate PSL_T from the previous program to the new executed one, but
they evaluated regs->tf_eflags after the whole regs structure was
bzeroed. Make this functional by saving PSL_T value before zeroing.
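
The ordering fix can be illustrated with a reduced trapframe. The struct
and function below are hypothetical stand-ins for the real
sv_setregs() code; only the PSL_T value (the x86 trap flag, 0x100) is
taken as a known fact.

```c
#include <string.h>

#define PSL_T 0x00000100 /* x86 EFLAGS trap (single-step) flag */

/* Minimal stand-in for the real trapframe; only tf_eflags matters here. */
struct trapframe_sketch {
    unsigned int tf_eip;
    unsigned int tf_esp;
    unsigned int tf_eflags;
};

/*
 * Sketch of the fixed ordering: capture PSL_T first, then zero the
 * frame, then carry the saved bit into the fresh flags.
 */
void
setregs_sketch(struct trapframe_sketch *regs, unsigned int entry,
    unsigned int stack, unsigned int user_flags)
{
    unsigned int saved_t = regs->tf_eflags & PSL_T;

    memset(regs, 0, sizeof(*regs)); /* the kernel code uses bzero() */
    regs->tf_eip = entry;
    regs->tf_esp = stack;
    regs->tf_eflags = user_flags | saved_t;
}
```

Reading tf_eflags after the memset() would always yield zero, which is
the bug the paragraph above describes.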
Note that if the debugger is not attached, executing the first
instruction in the new program with PSL_T set results in SIGTRAP, and
since all intercepted signals are reset to the default disposition on
exec(2), this means that a non-debugged process gets killed immediately
if PSL_T is inherited. In particular, since suid images drop
P_TRACED, an attempt to set PSL_T for execution of such a program would
kill the process.
Another issue with userspace PSL_T handling is that it is reset by
trap(). It is reasonable to clear PSL_T when entering SIGTRAP
handler, to allow the signal to be handled without recursion or
delivery of blocked fault. But it is not reasonable to return back to
the normal flow with PSL_T cleared. This is too late to change, I
think.
Discussed with: bde, Ali Mashtizadeh
Sponsored by: The FreeBSD Foundation
MFC after: 3 weeks
Differential revision: https://reviews.freebsd.org/D14995
This was inadvertently overriding the first found SYSDIR with the last
of /usr/src which could result in the wrong headers being used if not
building from /usr/src.
SYSDIR?= is not used here to avoid evaluating the exists() when unneeded.
Reported by: rgrimes, sjg, Mark Millard
Pointyhat to: bdrewery
Sponsored by: Dell EMC
"Terminus BSD Console" is a derivative of Terminus that is provided
by Mr. Dimitar Zhekov under the 2-clause BSD license for use by the
FreeBSD vt(4) console and other BSDs.
PR: 227409
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
In pti-enabled pmap, the PCID allocation scheme assigns a temporal id
for the kernel page table, and the user page table twin PCID is
calculated by setting the high bit in the kernel PCID. So the kernel AS
is mapped with a per-vmspace PCID, and we must completely shoot down all
mappings in KVA when switching contexts, so that the newly switched
thread sees all changes in KVA that occurred while it was not executing.
After all, KVA is the same between all threads.
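
The twin-PCID calculation can be sketched as follows. The text only says
"high bit"; the 12-bit PCID width is an x86 fact, and the macro and
function names are hypothetical.

```c
#include <stdint.h>

#define PCID_BITS     12                      /* x86 PCIDs are 12 bits wide */
#define PCID_USER_BIT (1u << (PCID_BITS - 1)) /* 0x800, the high bit */

/* Hypothetical helper: derive the user page table's PCID from the kernel's. */
uint32_t
user_pcid(uint32_t kern_pcid)
{
    return (kern_pcid | PCID_USER_BIT);
}
```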
Currently the pti context switch for the user part of the page table
gets its TLB entries flushed too. It is excessive. The same PCID
flushing algorithm that is used for non-pti pmap works correctly for
the UVA mappings. The only shared TLB entries are the pages from KVA
accessed by the kernel entry trampoline. All of them are static
except per-thread TSS and LDT. For TSS and LDT, the lifetime of newly
allocated entries is the whole thread life, so it is fine as well. If
not fine, then explicit shootdowns for the current pmap of the newly
allocated LDT and TSS pages would be enough.
Also restore the constant value for the pm_pcid for the kernel_pmap.
Before, for PTI pmap, pm_pcid was erroneously rolled the same as the
user pmap's pm_pcid, but it was not used.
Some BIOSes have trouble booting from GPT in non-UEFI mode. This is
commonly reported with Lenovo laptops, including my x220. As we do not
currently support booting FreeBSD/i386 via UEFI there's no reason to
prefer GPT.
The "vestigial swap partition" was added in r265017 to work around an
issue with loader's GPT support, so we should not need it when using
MBR.
We may want to make the same change to amd64, although the issue there is
mitigated by such systems booting via UEFI in the common case.
PR: 227422
Reviewed by: gjb
MFC after: 3 weeks
Relnotes: Yes
Sponsored by: The FreeBSD Foundation
Replace MD assembly exect() with a portable version.
Originally, on the VAX, exect() enabled tracing once the new executable
image was loaded. This was possible because tracing was controllable
from user space code by setting the PSL_T flag. The following
instruction was a system call that activated tracing (as all
instructions do) by copying PSL_T to PSL_TP (trace pending). The
first instruction of the new executable image would trigger a trace
fault.
This is not portable to all platforms and the behavior was replaced with
ptrace(PT_TRACE_ME, ...) since FreeBSD forked off of the CSRG repository.
Platforms either incorrectly call execve(), trigger trace faults inside
the original executable, or do contain an implementation of this
function.
The exect() interface is deprecated or removed on NetBSD and OpenBSD.
Add the ability to specify absolute and relative offsets to size partitions.
To create hybrid boot media we want to specify a partition at a known location.
This extends the syntax of size partitions to include an optional offset that
can be absolute or relative. It also introduces validation to make sure that
this hasn't resulted in overlapping partitions. I haven't added this to the
file and process partition specifications yet but the mechanics are designed
such that if someone comes up with a good way of specifying the offset it
will be fairly easy to add in.
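
A sketch of how offset resolution and the overlap check could work. The
commit text does not spell out the syntax or the exact semantics, so the
model below is an assumption: an absolute offset names the start block,
a relative offset is a gap after the previous partition, and all names
are hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

enum off_kind { OFF_NONE, OFF_ABSOLUTE, OFF_RELATIVE };

struct part_sketch {
    uint64_t      size;   /* in blocks */
    enum off_kind kind;
    uint64_t      offset; /* meaning depends on kind */
    uint64_t      start;  /* filled in by layout_parts() */
};

/*
 * Assign start blocks in order.  Returns false if an absolute offset
 * would place a partition on top of the one before it.
 */
bool
layout_parts(struct part_sketch *parts, size_t n, uint64_t first_block)
{
    uint64_t next = first_block; /* first block not yet used */

    for (size_t i = 0; i < n; i++) {
        struct part_sketch *p = &parts[i];

        switch (p->kind) {
        case OFF_ABSOLUTE:
            if (p->offset < next)
                return (false); /* overlaps earlier data */
            p->start = p->offset;
            break;
        case OFF_RELATIVE:
            p->start = next + p->offset;
            break;
        case OFF_NONE:
        default:
            p->start = next;
            break;
        }
        next = p->start + p->size;
    }
    return (true);
}
```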
Reviewed by: imp
Sponsored by: iXsystems, Inc.
Differential Revision: https://reviews.freebsd.org/D14916
Tune xDMA interface slightly:
o Move descriptors allocation to DMA engine driver
o Add generic xdma_request() routine
o Add less-generic scatter-gather application based on xdma interface
Typical operation flow in peripheral device driver is:
1. Get xDMA controller
sc->xdma_tx = xdma_ofw_get(sc->dev, "tx");
Restore r332389 after resolution of locking fixes.
Add one extra lock initialization to iflib_register() that was missed
in the git<->phab conversion.
Split out flag manipulation from general context manipulation in iflib
To avoid blocking on the context lock in the swi thread and risk potential
deadlocks, this change protects lighter weight updates that only need to
be consistent with each other with their own lock.
cron(8): Reload database if an existing job in cron.d changed as well
Directory mtime will only change if a file is added or removed, not
modified. For /var/cron/tabs, this is fine because of how crontab(1) manages
it using temp files so all crontab(1) changes will trigger a reload of the
database.
For /etc/cron.d and /usr/local/etc/cron.d, this is not necessarily the case.
Instead of checking their mtime, we should descend into them and check mtime
on all jobs also.
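
A minimal sketch of "descend and check mtime on all jobs". The helper
newest_mtime() is hypothetical and is not cron's code; it returns the
newest modification time among a directory and its direct entries, so a
change to an existing file becomes visible.

```c
#include <sys/stat.h>
#include <dirent.h>
#include <stdio.h>
#include <string.h>
#include <time.h>

/*
 * Hypothetical helper: newest mtime among a directory and the entries
 * directly inside it, or 0 if the directory cannot be read.
 */
time_t
newest_mtime(const char *dir)
{
    struct stat sb;
    struct dirent *dp;
    char path[1024];
    time_t newest;
    DIR *d;

    if (stat(dir, &sb) != 0 || (d = opendir(dir)) == NULL)
        return (0);
    newest = sb.st_mtime;
    while ((dp = readdir(d)) != NULL) {
        int len;

        if (strcmp(dp->d_name, ".") == 0 ||
            strcmp(dp->d_name, "..") == 0)
            continue;
        len = snprintf(path, sizeof(path), "%s/%s", dir, dp->d_name);
        if (len < 0 || (size_t)len >= sizeof(path))
            continue; /* path too long, skip it */
        if (stat(path, &sb) == 0 && sb.st_mtime > newest)
            newest = sb.st_mtime;
    }
    closedir(d);
    return (newest);
}
```

Comparing this value against the one recorded at the last load is one
way to decide whether the database needs reloading.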
Reported by: des
Reviewed by: bapt
MFC after: 1 week
allow ZFS pool to have temporary name for duration of current import
The change adds a -t <name> option to zpool create and a -t option to
zpool import in its form with an old name and a new name. This allows
importing (or creating) a pool under a name that's different from its
real, permanent name without affecting that name. This is useful when
working with VM images or images of other physical systems if they
happen to have a ZFS pool with the same name as the host system.
The changes come from ZoL with some small tweaks.
The porting has been done by julian.
The change is being submitted to OpenZFS:
https://github.com/openzfs/openzfs/pull/600
netmap: align codebase to the current upstream (commit id 3fb001303718146)
Changelist:
- Turn tx_rings and rx_rings arrays into arrays of pointers to kring
structs. This patch includes fixes for ixv, ixl, ix, re, cxgbe, iflib,
vtnet and ptnet drivers to cope with the change.
- Generalize the nm_config() callback to accept a struct containing many
parameters.
- Introduce NKR_FAKERING to support buffers sharing (used for netmap
pipes)
- Improved API for external VALE modules.
- Various bug fixes and improvements to the netmap memory allocator,
including support for externally (userspace) allocated memory.
- Refactoring of netmap pipes: now linked rings share the same netmap
buffers, with a separate set of kring pointers (rhead, rcur, rtail).
Buffer swapping does not need to happen anymore.
- Large refactoring of the control API towards an extensible solution;
the goal is to allow the addition of more commands and extension of
existing ones (with new options) without the need of hacks or the
risk of running out of configuration space.
A new NIOCCTRL ioctl has been added to handle all the requests of the
new control API, which cover all the functionalities so far supported.
The netmap API bumps from 11 to 12 with this patch. Full backward
compatibility is provided for the old control command (NIOCREGIF), by
means of a new netmap_legacy module. Many parts of the old netmap.h
header have now been moved to netmap_legacy.h (included by netmap.h).
vt: add three more cp437 mappings for vga textmode
In UTF-8 locales mandoc uses a number of characters outside of the Basic
Latin group, e.g. from general punctuation or miscellaneous mathematical
symbols, and these were rendered as ? in text mode.
This change adds (char, replacement, code point, description):
– - U+2013 En Dash
⟨ < U+27E8 Mathematical Left Angle Bracket
⟩ > U+27E9 Mathematical Right Angle Bracket
This change addresses some common cases; there are others that still
need to be added after a more thorough review.
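
The three mappings listed above can be sketched as a lookup. This is an
illustration, not the table in the vt(4) sources; textmode_fallback() is
a hypothetical name and the ASCII pass-through is an assumption made to
keep the example complete.

```c
#include <stdint.h>

/*
 * Hypothetical lookup: map the three code points named above to their
 * replacement character, or '?' for anything else outside ASCII.
 */
char
textmode_fallback(uint32_t cp)
{
    if (cp < 0x80)
        return ((char)cp); /* ASCII passes through */
    switch (cp) {
    case 0x2013: /* En Dash */
        return ('-');
    case 0x27E8: /* Mathematical Left Angle Bracket */
        return ('<');
    case 0x27E9: /* Mathematical Right Angle Bracket */
        return ('>');
    default:
        return ('?');
    }
}
```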
PR: 227409
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
Also, since ifc_nhwrxqs is only used in one place, remove it from the struct.
This was preventing iflib_dma_free() from being called via
iflib_device_detach().
Refactor the currdev setting to find the device we booted from. When
that does not already give us a reasonable currdev, limit searching to
the same device only. Search a little harder for ZFS volumes as that's
needed for loader.efi to live on an ESP.
pf: limit ioctl to a reasonable and tuneable number of elements
pf ioctls frequently take a variable number of elements as argument. This can
potentially allow users to request very large allocations. These will fail,
but even a failing M_NOWAIT might tie up resources and result in concurrent
M_WAITOK allocations entering vm_wait and inducing reclamation of caches.
Limit these ioctls to what should be a reasonable value, but allow users to
tune it should they need to.
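
A userspace sketch of the idea: check the requested element count
against a tunable before allocating. The tunable name, its default and
the helper are hypothetical; the commit text does not give them.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical tunable; the name and default are assumptions. */
static size_t ioctl_maxcount = 65535;

/*
 * Sketch: refuse empty or oversized requests before touching the
 * allocator.  Returns 0 and stores the buffer in *out, or an errno
 * value on failure.
 */
int
alloc_ioctl_elems(size_t count, size_t elem_size, void **out)
{
    if (count == 0 || count > ioctl_maxcount)
        return (EINVAL); /* outside the tunable limit */
    if (elem_size == 0 || count > SIZE_MAX / elem_size)
        return (EINVAL); /* count * elem_size would be 0 or overflow */
    *out = calloc(count, elem_size);
    return (*out == NULL ? ENOMEM : 0);
}
```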
locks: extend speculative spin waiting for readers to drain
Now that 10 years have passed since the original limit of 10000 was
committed, bump it a little bit.
Spinning waiting for writers is semi-informed in the sense that we always
know if the owner is running and base the decision to spin on that.
However, no such information is provided for read-locking. In particular
this means that it is possible for a write-spinner to completely waste cpu
time waiting for the lock to be released, while the reader holding it was
preempted and is now waiting for the spinner to go off cpu.
Nonetheless, in majority of cases it is an improvement to spin instead of
instantly giving up and going to sleep.
The current approach is pretty simple: snatch the number of current readers
and perform that many pauses before checking again. The total number of
pauses to execute is limited to 10k. If the lock is still not free by
that time, go to sleep.
Given the previously noted problem of not knowing whether spinning makes
any sense to begin with, the new limit has to remain rather conservative.
But at the very least it should also be related to the machine. Waiting
for writers uses parameters selected based on the number of activated
hardware threads. The upper limit of pause instructions to be executed
in-between re-reads of the lock is typically 16384 or 32768. It was
selected as the limit of total spins. The lower bound is set to
already present 10000 as to not change it for smaller machines.
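
A simulation of the scheme described above, not the kernel code. It
assumes the reader count can be read from a single word and models each
pause as one unit of budget; the function names are hypothetical.

```c
#include <stdint.h>

#define RETRIES_FLOOR 10000u /* the original fixed limit */

/* Pick the total-pause budget: the machine-derived cap, never below 10000. */
uint32_t
reader_spin_budget(uint32_t writer_pause_max)
{
    return (writer_pause_max > RETRIES_FLOOR ? writer_pause_max :
        RETRIES_FLOOR);
}

/*
 * Simulated wait loop.  *readers holds the current reader count (0 once
 * the lock is free).  Returns 1 if the readers drained within the
 * budget, 0 if the caller should go to sleep instead.
 */
int
spin_for_readers(const volatile uint32_t *readers, uint32_t budget)
{
    uint32_t spent = 0;

    while (spent < budget) {
        uint32_t n = *readers; /* snatch the reader count */

        if (n == 0)
            return (1);
        spent += n; /* stands for executing n pause instructions */
    }
    return (0);
}
```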
Bumping the limit reduces system time by a few percent during benchmarks like
buildworld, buildkernel and others. Tested on 2 and 4 socket machines
(Broadwell, Skylake).
Figuring out how to make a more informed decision while not pessimizing
the fast path is left as an exercise for the reader.
Add a -R option to setfacl to operate recursively on directories, along
with the accompanying flags -H, -L, and -P (whose behaviour mimics
chmod).
A patch was submitted with PR 155163, but this is a new implementation
based on comments raised in the Phabricator review for that patch
(review D9096).
ian [Tue, 10 Apr 2018 22:57:56 +0000 (22:57 +0000)]
Use explicit_bzero() when cleaning values out of the kernel environment.
Sometimes the values contain geli passphrases being communicated from
loader(8) to the kernel, and some day the compiler may decide to start
eliding calls to memset() for a pointer which is not dereferenced again
before being passed to free().
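
The concern can be illustrated in portable C. scrub() below is a
stand-in for explicit_bzero(), written with a volatile pointer so the
stores cannot be dropped as dead; both function names are hypothetical.

```c
#include <stddef.h>
#include <stdlib.h>

/*
 * Portable stand-in for explicit_bzero(): writing through a volatile
 * pointer keeps the compiler from eliding the stores.
 */
void
scrub(void *buf, size_t len)
{
    volatile unsigned char *p = buf;

    while (len-- > 0)
        *p++ = 0;
}

/* Hypothetical helper: wipe a secret value, then release it. */
void
scrub_and_free(char *value, size_t len)
{
    if (value == NULL)
        return;
    scrub(value, len);
    free(value);
}
```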
Reenter KDB on fault on powerpc, instead of panicking
Most other architectures already re-enter KDB on faults; powerpc and mips
are the only outliers. Correct this for powerpc, so that bad addresses
can now be handled gracefully instead of panicking.
[pi] Do not attach bcm2835_pwm if DTB node is not enabled
Switch to standard FDT-based driver behavior and don't attach
if the node's "status" property value in DTS is not set to "okay".
On RPi, PWM is disabled by default. To enable it, pwm.dtbo
from the official repo[1] should be copied to the overlays directory
on the SD card FAT partition and a "dtoverlay=pwm" line added to
config.txt. For more details see the pwm overlay documentation[2].
The sysutils/rpi-firmware port now includes overlays, so they
can be installed as a part of the release image build.
Split out flag manipulation from general context manipulation in iflib
To avoid blocking on the context lock in the swi thread and risk potential
deadlocks, this change protects lighter weight updates that only need to
be consistent with each other with their own lock.
GEOM ELI may ask for the password twice during boot: once at loader
time, and once at init time.
This happens due to a module loading bug. By default GEOM ELI caches the
password in the kernel, but without the MODULE_VERSION annotation, the
kernel loads the kernel module over it, even if GEOM ELI was compiled
into the kernel. In this case, the newly loaded module
purges/invalidates/overwrites GEOM ELI's password cache, which causes
the double asking.
MFC Note: There's a pc98 component to the original submission that is
omitted here due to pc98 removal in head. This part will need to be revived
upon MFC.
hyperv/storvsc: storvsc_io_done(): do not use CAM_SEL_TIMEOUT
CAM_SEL_TIMEOUT was introduced in
https://reviews.freebsd.org/D7521 (r304251), which claimed:
"VM shall response to CAM layer with CAM_SEL_TIMEOUT to filter those
invalid LUNs. Never use CAM_DEV_NOT_THERE which will block LUN scan
for LUN number higher than 7."
But it turns out this is not correct:
I think what really filters the invalid LUNs in r304251 is that:
before r304251, we could set the CAM_REQ_CMP without checking
vm_srb->srb_status at all:
ccb->ccb_h.status |= CAM_REQ_CMP.
r304251 checks vm_srb->srb_status and sets ccb->ccb_h.status properly,
so the invalid LUNs are filtered.
I changed my code version to r304251 but replaced the CAM_SEL_TIMEOUT
with CAM_DEV_NOT_THERE, and I confirmed the invalid LUNs can also be
filtered, and I successfully hot-added and hot-removed 8 disks to/from
the VM without any issue.
CAM_SEL_TIMEOUT has an unwanted side effect -- see cam_periph_error():
For a selection timeout, we consider all of the LUNs on
the target to be gone. If the status is CAM_DEV_NOT_THERE,
then we only get rid of the device(s) specified by the
path in the original CCB.
This means: for a VM with a valid LUN on 3:0:0:0, when the VM inquires
3:0:0:1 and the host reports 3:0:0:1 doesn't exist and storvsc returns
CAM_SEL_TIMEOUT to the CAM layer, CAM will detach 3:0:0:0 as well: this
is the bug I reported recently:
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=226583
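
A toy model of the side effect quoted from cam_periph_error() above. It
is not CAM code; the enum, struct and function are local stand-ins that
only show why one status removes a single LUN and the other removes the
whole target.

```c
#include <stdbool.h>
#include <stddef.h>

/* Local stand-ins for the two CAM status codes discussed above. */
enum status_sketch { ST_DEV_NOT_THERE, ST_SEL_TIMEOUT };

#define NLUNS 8

/* Toy model of one target: present[i] says whether LUN i is attached. */
struct target_sketch {
    bool present[NLUNS];
};

/*
 * A selection timeout drops every LUN on the target, while "device not
 * there" drops only the LUN the request was addressed to.
 */
void
handle_missing_lun(struct target_sketch *t, size_t lun,
    enum status_sketch st)
{
    if (st == ST_SEL_TIMEOUT) {
        for (size_t i = 0; i < NLUNS; i++)
            t->present[i] = false;
    } else if (lun < NLUNS) {
        t->present[lun] = false;
    }
}
```

In the 3:0:0:0 example, probing LUN 1 with the selection-timeout status
also clears LUN 0, which matches the reported bug.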
Following r331292, many of the files (such as the LICENSE file)
have moved from the u-boot-rpi3 share directory to the default
rpi-firmware share directory. Remove the files from UBOOT_FILES
and append the DTB file to a DTB_FILES list so the correct path
is used, fixing a build failure.
Call through powerpc_interrupt for all Book-E interrupts
Make int_external_input, int_decrementer, and int_performance_counter all
now use trap_common, just like on AIM. The effects of this are:
* All traps are now properly displayed in ddb. Previously traps from
external input, decrementer, and performance counters, would display as
just basic stack traces. Now the frame is displayed.
* External interrupts are now handled with interrupts enabled, so handling
can be preempted. This seems to fix a hang found post-r329882.
Modify the net.inet.tcp.function_ids sysctl introduced in r331347.
Export additional information which may be helpful to userspace
consumers and rename the sysctl to net.inet.tcp.function_info.
Provide long options --bytes and --lines to match -c and -n respectively.
This improves head(1)'s compatibility with its GNU counterpart in a sensible
way.
Add --blocks, --bytes, and --lines long options for -b, -c, and -n
respectively. This improves tail(1)'s compatibility with its GNU counterpart
in a straightforward way.
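
A sketch of how long options can alias short ones with getopt_long(3).
This is not tail's real parser; struct tail_opts and parse_tail_opts()
are hypothetical, and only plain decimal counts are handled here.

```c
#include <getopt.h>
#include <stdlib.h>

struct tail_opts {
    long blocks, bytes, lines; /* -1 when not given */
};

/*
 * Each long option returns the same value as its short form, so one
 * switch handles both.  Returns 0 on success, -1 on an unknown option.
 */
int
parse_tail_opts(int argc, char *argv[], struct tail_opts *o)
{
    static const struct option longopts[] = {
        { "blocks", required_argument, NULL, 'b' },
        { "bytes",  required_argument, NULL, 'c' },
        { "lines",  required_argument, NULL, 'n' },
        { NULL,     0,                 NULL, 0 }
    };
    int ch;

    o->blocks = o->bytes = o->lines = -1;
    optind = 1; /* restart scanning (BSD also wants optreset = 1) */
    while ((ch = getopt_long(argc, argv, "b:c:n:", longopts, NULL)) != -1) {
        switch (ch) {
        case 'b':
            o->blocks = strtol(optarg, NULL, 10);
            break;
        case 'c':
            o->bytes = strtol(optarg, NULL, 10);
            break;
        case 'n':
            o->lines = strtol(optarg, NULL, 10);
            break;
        default:
            return (-1);
        }
    }
    return (0);
}
```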
Reviewed by: eadler (earlier version)
MFC after: 3 days
Page daemon output is now regulated by a PID controller with a setpoint
of v_free_target. Moreover, the page daemon now wakes up regularly
rather than waiting for a wakeup from another thread. This means that
the free page count is unlikely to drop below the old
zfs_arc_free_target value, and as a result the ARC was not readily
freeing pages under memory pressure. Address the immediate problem by
updating zfs_arc_free_target to match the page daemon's new behaviour.
Reported and tested by: truckman
Discussed with: jeff
X-MFC with: r329882
Differential Revision: https://reviews.freebsd.org/D14994
Some devices cannot rely on the switch MDIO address passed in the DTB
for specifying single/multi-chip addressing mode. Introduce a new property
"single-chip-addressing" which, when added to the DTS, forces single-chip mode.
Don't show the number of currently established SCTP associations,
since this is not monotonically increasing. Its number can be
derived from the other counters shown.