Negate the logic of XCHAN_CAP_NOBUFS macro and rename it to
XCHAN_CAP_BOUNCE.
The only application that uses bounce buffering for now is the Government
Furnished Equipment (GFE) P2's dma core (AXIDMA) with its own dedicated
cacheless bounce buffer.
There was an issue in pseries llan driver, that resulted in the first 2 bytes
of the MAC address getting stripped, and the last 2 being always 0.
In most cases the network interface still worked, despite the MAC being
different of what was specified to QEMU, but when some other host or DHCP
server expected a specific MAC, this would fail.
This change fixes this by shifting right by 2 the local-mac-address read from
device tree, if its length is 6 instead of 8, as observed in QEMU DT, that
always presents a 6 bytes value for this property.
iwm: Drain callouts after stopping the device during detach.
Otherwise there is a window where they may be rescheduled. This
typically manifested as a page fault shortly after unloading if_iwm.ko.
Close the race by draining callouts after calling iwm_stop_device(),
which is also what Dragonfly does.
Change whitespace to reduce gratuitous diffs with Dragonfly.
Reported and tested by: seanc
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Save the last callout function executed on each CPU
Save the last callout function pointer (and its argument) executed
on each CPU for inspection by a debugger. Add a ddb `show callout_last`
command to show these pointers. Add a kernel module that I used
for testing that command.
Relocate `ce_migration_cpu` to reduce padding and therefore preserve
the size of `struct callout_cpu` (320 bytes on amd64) despite the
added members.
This should help diagnose reference-after-free bugs where the
callout's mutex has already been freed when `softclock_call_cc`
tries to unlock it.
You might hope that the pointer would still be available, but it
isn't. The argument to that function is on the stack (because
`softclock_call_cc` uses it later), and that might be enough in
some cases, but even then, it's very laborious. A pointer to the
callout is saved right before these newly added fields, but that
callout might have been freed. We still have the pointer to its
associated mutex, and the name within might be enough, but it might
also have been freed.
Cache the next queue element when traversing a page queue.
When QUEUE_MACRO_DEBUG_TRASH is configured, removing a queue element
invalidates its queue linkage pointers. vm_pageout_collect_batch()
was relying on these pointers remaining valid after a removal, so
modify it to fetch the next queued page before dequeuing the current
page.
Use unmapped (M_NOMAP) mbufs for zero-copy AIO writes via TOE.
Previously the TOE code used its own custom unmapped mbufs via
EXT_FLAG_VENDOR1. The old version always wired the entire AIO request
buffer first for the duration of the AIO operation and constructed
multiple mbufs which used the wired buffer as an external buffer.
The new version determines how much room is available in the socket
buffer and only wires the pages needed for the available room building
chains of M_NOMAP mbufs. This means that a large AIO write will now
limit the amount of wired memory it uses to the size of the socket
buffer.
Remove dead code added after r348743 in the LinuxKPI. The
LINUXKPI_VERSION macro is not defined for any compiled LinuxKPI code
which basically means __GFP_NOTWIRED is never checked when allocating
pages. This should work fine with the existing external DRM code as
long as the page wiring and unwiring is balanced.
MFC after: 3 days
Sponsored by: Mellanox Technologies
This was added for emulation of Linux's CDROMSUBCHNL, but allows
users with read access to a cd(4) device to overwrite kernel memory
provided that the driver detects some media present.
Reimplement CDROMSUBCHNL by bouncing the data from CDIOCREADSUBCHANNEL
through the linux_cdrom_subchnl structure passed from userspace.
admbugs: 768
Reported by: Alex Fortune
Security: CVE-2019-5602
Security: FreeBSD-SA-19:11.cd_ioctl
Invoke ext_free function when freeing an unmapped mbuf.
Fix a mis-merge when extracting the unmapped mbuf changes from
Netflix's in-kernel TLS changes where the call to the function that
freed the backing pages from an unmapped mbuf was missed.
Only free pages to the cache when they were allocated from that cache.
This mitigates rapid fragmentation of physical memory seen during
poudriere's dependency calculation phase. In particular, pages
belonging to broken reservations are no longer freed to the per-CPU
cache, so they get a chance to coalesce with freed pages during the
break. Otherwise, the optimized CoW handler may create object
chains in which multiple objects contain pages from the same
reservation, and the order in which we do object termination means
that the reservation is broken before all of those pages are freed,
so some of them end up in the per-CPU cache and thus permanently
fragment physical memory.
The flag may also be useful for eliding calls to vm_reserv_free_page(),
thus avoiding memory accesses for data that is likely not present
in the CPU caches.
Reviewed by: alc
Discussed with: jeff
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D20763
Control implicit PROT_MAX() using procctl(2) and the FreeBSD note
feature bit.
In particular, allocate the bit to opt-out the image from implicit
PROTMAX enablement. Provide procctl(2) verbs to set and query
implicit PROTMAX handling. The knobs mimic the same per-image flag
and per-process controls for ASLR.
Reviewed by: emaste, markj (previous version)
Discussed with: brooks
Sponsored by: The FreeBSD Foundation
Differential revision: https://reviews.freebsd.org/D20795
Fix handling of errors from sblock() in soreceive_stream().
Previously we would attempt to unlock the socket buffer despite having
failed to lock it. Simply return an error instead: no resources need
to be released at this point, and doing so is consistent with
soreceive_generic().
Extend simple_mfd driver to expose a syscon interface if
that node is also compatible with syscon. For instance,
Rockchip RK3399's GRF (General Register Files) is compatible
with simple-mfd as well as syscon and has devices like
usb2-phy, emmc-phy and pcie-phy etc. under it.
This patch is the driver for NTB hardware in AMD SoCs (ported from Linux)
and enables the NTB infrastructure like Doorbells, Scratchpads and Memory
window in AMD SoC. This driver has been validated using ntb_transport and
if_ntb driver already available in FreeBSD.
Add a new "untrusted" option to the mount command. Its purpose
is to notify the kernel that the file system is untrusted and it
should use more extensive checks on the file-system's metadata
before using it. This option is intended to be used when mounting
file systems from untrusted media such as USB memory sticks or other
externally-provided media.
It will initially be used by the UFS/FFS file system, but should
likely be expanded to be used by other file systems that may appear
on external media like msdosfs, exfat, and ext2fs.
g_mirror_taste: avoid deadlock, always clear tasting flag
If g_mirror_taste encountered an error at g_mirror_add_disk, it might
try to g_mirror_destroy the device with the G_MIRROR_DEVICE_FLAG_TASTING
flag still set. This would wait on a worker to complete the destruction
with g_mirror_try_destroy, but that function bails out if the tasting
flag is set, resulting in a deadlock. Clear the tasting flag before
trying to destroy the device.
Test Plan:
sysctl debug.fail_point.mnowait="1%return"
kyua test -k /usr/tests/sys/geom/class/mirror/Kyuafile
Tidy up pmap_copy(). Notably, deindent the innermost loop by making a
simple change to the control flow. Replace an unnecessary test by a
KASSERT. Add a comment explaining an obscure test.
manu [Mon, 1 Jul 2019 21:50:53 +0000 (21:50 +0000)]
Since r349571 we need all the accessor to be present for set or get
otherwise we panic.
dwmmc don't handle VCCQ (voltage for the IO line of the SD/eMMC) or
TIMING.
Add the needed accessor in the {read,write}_ivar functions.
Using dominance vs a set membership check is indistinguishable from a
compile time perspective, and the two queries return equivelent
results. Simplify code by using the existing function.
Pull in r360978 from upstream llvm trunk (by Philip Reames):
[LFTR] Strengthen assertions in genLoopLimit [NFCI]
Pull in r362292 from upstream llvm trunk (by Nikita Popov):
[IndVarSimplify] Fixup nowrap flags during LFTR (PR31181)
Fix for https://bugs.llvm.org/show_bug.cgi?id=31181 and partial fix
for LFTR poison handling issues in general.
When LFTR moves a condition from pre-inc to post-inc, it may now
depend on value that is poison due to nowrap flags. To avoid this, we
clear any nowrap flag that SCEV cannot prove for the post-inc addrec.
Additionally, LFTR may switch to a different IV that is dynamically
dead and as such may be arbitrarily poison. This patch will correct
nowrap flags in some but not all cases where this happens. This is
related to the adoption of IR nowrap flags for the pre-inc addrec.
(See some of the switch_to_different_iv tests, where flags are not
dropped or insufficiently dropped.)
Finally, there are likely similar issues with the handling of GEP
inbounds, but we don't have a test case for this yet.
Pull in r362971 from upstream llvm trunk (by Philip Reames):
Prepare for multi-exit LFTR [NFC]
This change does the plumbing to wire an ExitingBB parameter through
the LFTR implementation, and reorganizes the code to work in terms of
a set of individual loop exits. Most of it is fairly obvious, but
there's one key complexity which makes it worthy of consideration.
The actual multi-exit LFTR patch is in D62625 for context.
Specifically, it turns out the existing code uses the backedge taken
count from before a IV is widened. Oddly, we can end up with a
different (more expensive, but semantically equivelent) BE count for
the loop when requerying after widening. For the nestedIV example
from elim-extend, we end up with the following BE counts:
BEFORE: (-2 + (-1 * %innercount) + %limit)
AFTER: (-1 + (sext i32 (-1 + %limit) to i64) + (-1 * (sext i32 %innercount to i64))<nsw>)
This is the only test in tree which seems sensitive to this
difference. The actual result of using the wider BETC on this example
is that we actually produce slightly better code. :)
In review, we decided to accept that test change. This patch is
structured to preserve the old behavior, but a separate change will
immediate follow with the behavior change. (I wanted it separate for
problem attribution purposes.)
Pull in r362975 from upstream llvm trunk (by Philip Reames):
[LFTR] Use recomputed BE count
This was discussed as part of D62880. The basic thought is that
computing BE taken count after widening should produce (on average)
an equally good backedge taken count as the one before widening.
Since there's only one test in the suite which is impacted by this
change, and it's essentially equivelent codegen, that seems to be a
reasonable assertion. This change was separated from r362971 so that
if this turns out to be problematic, the triggering piece is obvious
and easily revertable.
For the nestedIV example from elim-extend.ll, we end up with the
following BE counts:
BEFORE: (-2 + (-1 * %innercount) + %limit)
AFTER: (-1 + (sext i32 (-1 + %limit) to i64) + (-1 * (sext i32 %innercount to i64))<nsw>)
Note that before is an i32 type, and the after is an i64. Truncating
the i64 produces the i32.
Pull in r362980 from upstream llvm trunk (by Philip Reames):
Factor out a helper function for readability and reuse in a future
patch [NFC]
Pull in r363613 from upstream llvm trunk (by Philip Reames):
Fix a bug w/inbounds invalidation in LFTR (recommit)
Recommit r363289 with a bug fix for crash identified in pr42279.
Issue was that a loop exit test does not have to be an icmp, leading
to a null dereference crash when new logic was exercised for that
case. Test case previously committed in r363601.
Original commit comment follows:
This contains fixes for two cases where we might invalidate inbounds
and leave it stale in the IR (a miscompile). Case 1 is when switching
to an IV with no dynamically live uses, and case 2 is when doing
pre-to-post conversion on the same pointer type IV.
The basic scheme used is to prove that using the given IV (pre or
post increment forms) would have to already trigger UB on the path to
the test we're modifying. As such, our potential UB triggering use
does not change the semantics of the original program.
As was pointed out in the review thread by Nikita, this is defending
against a separate issue from the hasConcreteDef case. This is about
poison, that's about undef. Unfortunately, the two are different, see
Nikita's comment for a fuller explanation, he explains it well.
(Note: I'm going to address Nikita's last style comment in a separate
commit just to minimize chance of subtle bugs being introduced due to
typos.)
Pull in r363875 from upstream llvm trunk (by Philip Reames):
[LFTR] Rename variable to minimize confusion [NFC]
(Recommit of r363293 which was reverted when a dependent patch was.)
As pointed out by Nikita in D62625, BackedgeTakenCount is generally
used to refer to the backedge taken count of the loop. A conditional
backedge taken count - one which only applies if a particular exit is
taken - is called a ExitCount in SCEV code, so be consistent here.
Pull in r363877 from upstream llvm trunk (by Philip Reames):
[LFTR] Stylistic cleanup as suggested in last review comment of
D62939 [NFC]
(Resumbit of r363292 which was reverted along w/an earlier patch)
Pull in r364346 from upstream llvm trunk (by Philip Reames):
[LFTR] Adjust debug output to include extensions (if any)
Pull in r364693 from upstream llvm trunk (by Philip Reames):
[IndVars] Remove a bit of manual constant folding [NFC]
SCEV is more than capable of folding (add x, trunc(0)) to x.
Pull in r364709 from upstream llvm trunk (by Nikita Popov):
[LFTR] Fix post-inc pointer IV with truncated exit count (PR41998)
Fixes https://bugs.llvm.org/show_bug.cgi?id=41998. Usually when we
have a truncated exit count we'll truncate the IV when comparing
against the limit, in which case exit count overflow in post-inc form
doesn't matter. However, for pointer IVs we don't do that, so we have
to be careful about incrementing the IV in the wide type.
I'm fixing this by removing the IVCount variable (which was ExitCount
or ExitCount+1) and replacing it with a UsePostInc flag, and then
moving the actual limit adjustment to the individual cases (which
are: pointer IV where we add to the wide type, integer IV where we
add to the narrow type, and constant integer IV where we add to the
wide type).
Factor out the code that does a VOP_SETATTR(size) from vn_truncate().
This patch factors the code in vn_truncate() that does the actual
VOP_SETATTR() of size into a separate function called vn_truncate_locked().
This will allow the NFS server and the patch that adds a
copy_file_range(2) syscall to call this function instead of duplicating
the code and carrying over changes, such as the recent r347151.
This patch fixes 2 panics. The first one is due to the current VNET not
being set in the emulated adapter transmission path. The second one
is caused by the M_PKTHDR flag not being set when preallocated mbufs
are recycled in the transmit path.
The goal of this driver is consolidate information about SuperIO chips
and to provide for peaceful coexistence of drivers that need to access
SuperIO configuration registers.
While SuperIO chips can host various functions most of them are
discoverable and accessible without any knowledge of the SuperIO.
Examples are: keyboard and mouse controllers, UARTs, floppy disk
controllers. SuperIO-s also provide non-standard functions such as
GPIO, watchdog timers and hardware monitoring. Such functions do
require drivers with a knowledge of a specific SuperIO.
At this time the driver supports a number of ITE and Nuvoton (fka
Winbond) SuperIO chips.
There is a single driver for all devices. So, I have not done the usual
split between the hardware driver and the bus functionality. Although,
superio does act as a bus for devices that represent known non-standard
functions of a SuperIO chip. The bus provides enumeration of child
devices based on the hardcoded knowledge of such functions. The
knowledge as extracted from datasheets and other drivers.
As there is a single driver, I have not defined a kobj interface for it.
So, its interface is currently made of simple functions.
I think that we can the flexibility (and complications) when we actually
need it.
I am planning to convert nctgpio and wbwd to superio bus very soon.
Also, I am working on itwd driver (watchdog in ITE SuperIO-s).
Additionally, there is ithwm driver based on the reverted sensors
import, but I am not sure how to integrate it given that we still lack
any sensors interface.
nctgpio: change default pin names to those used by the datasheet(s)
That is, instead of the current GPIO00 - GPIO15 the names will be GPIO00
- GPIO07, GPIO10 - GPIO17. The first digit is a GPIO "bank" / group
number and the second one is a pin number within the bank. Alternative
view is that the pin names are changed from decimal numbering scheme to
octal one (as there are 8 pins per bank).
mhorne [Sun, 30 Jun 2019 19:43:13 +0000 (19:43 +0000)]
elftoolchain: fix an incorrect e_flags description
r349482 introduced the definitions and descriptions of the RISC-V
specific e_flags values to elftoolchain. However, the description for
the EF_RISCV_RVE flag was incorrectly duplicated from EF_RISCV_RVC. Fix
this by providing the proper description for this flag.
arichardson [Sun, 30 Jun 2019 11:49:58 +0000 (11:49 +0000)]
Reduce size of rtld by 22% by pulling in less code from libc
Currently RTLD is linked against libc_nossp_pic which means that any libc
symbol used in rtld can pull in a lot of depedencies. This was causing
symbol such as __libc_interposing and all the pthread stubs to be included
in RTLD even though they are not required. It turns out most of these
dependencies can easily be avoided by providing overrides inside of rtld.
This change is motivated by CHERI, where we have an experimental ABI that
requires additional relocation processing to allow the use of function
pointers inside of rtld. Instead of adding this self-relocation code to
RTLD I attempted to remove most function pointers from RTLD and discovered
that most of them came from the libc dependencies instead of being actually
used inside rtld.
A nice side-effect of this change is that rtld is now 22% smaller on amd64.
text data bss dec hex filename
0x21eb6 0xce0 0xe60 145910 239f6 /home/alr48/ld-elf-x86.before.so.1
0x1a6ed 0x728 0xdd8 113645 1bbed /home/alr48/ld-elf-x86.after.so.1
The number of R_X86_64_RELATIVE relocations that need to be processed on
startup has also gone down from 368 to 187 (almost 50% less).
tijl [Sat, 29 Jun 2019 17:01:56 +0000 (17:01 +0000)]
Build lib32 libl. The library is built from usr.bin/lex/lib. It would be
better to move this directory to lib/libl, but this requires more extensive
changes to Makefile.inc1. This simple fix can be MFCed quickly.
markj [Sat, 29 Jun 2019 16:11:09 +0000 (16:11 +0000)]
Use a consistent snapshot of the fd's rights in fget_mmap().
fget_mmap() translates rights on the descriptor to a VM protection
mask. It was doing so without holding any locks on the descriptor
table, so a writer could simultaneously be modifying those rights.
Such a situation would be detected using a sequence counter, but
not before an inconsistency could trigger assertion failures in
the capability code.
Fix the problem by copying the fd's rights to a structure on the stack,
and perform the translation only once we know that that snapshot is
consistent.
Reported by: syzbot+ae359438769fda1840f8@syzkaller.appspotmail.com
Reviewed by: brooks, mjg
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D20800
markj [Sat, 29 Jun 2019 16:05:52 +0000 (16:05 +0000)]
Fix mutual exclusion in pipe_direct_write().
We use PIPE_DIRECTW as a semaphore for direct writes to a pipe, where
the reader copies data directly from pages mapped into the writer.
However, when a reader finishes such a copy, it previously cleared
PIPE_DIRECTW, allowing multiple writers to race and corrupt the state
used to track wired pages belonging to the writer.
Fix this by having the writer clear PIPE_DIRECTW and instead use the
count of unread bytes to determine whether a write is finished.
lwhsu [Sat, 29 Jun 2019 14:55:53 +0000 (14:55 +0000)]
Fix VOP_PUTPAGES(9) in regards to the use of VM_PAGER_CLUSTER_OK
Submitted by: Ka Ho Ng <khng300 at gmail.com>
Reviewed by: mckusick
MFC after: 3 days
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D20695
jhb [Sat, 29 Jun 2019 00:52:21 +0000 (00:52 +0000)]
Add support for IFCAP_NOMAP to cxgbe(4).
Since cxgbe(4) uses sglist instead of bus_dma, this required updates
to the code that generates scatter/gather lists for packets. Also,
unmapped mbufs are always sent via DMA and never as immediate data in
the payload of a work request.
jhb [Sat, 29 Jun 2019 00:50:25 +0000 (00:50 +0000)]
Compress pending socket buffer data once it is marked ready.
Apply similar logic from sbcompress to pending data in the socket
buffer once it is marked ready via sbready. Normally sbcompress
merges small mbufs to reduce the length of mbuf chains in the socket
buffer. However, sbcompress cannot do this for mbufs marked
M_NOTREADY. sbcompress_ready is now called from sbready when mbufs
are marked ready to merge small mbuf chains once the data is available
to copy.
jhb [Sat, 29 Jun 2019 00:48:33 +0000 (00:48 +0000)]
Add an external mbuf buffer type that holds multiple unmapped pages.
Unmapped mbufs allow sendfile to carry multiple pages of data in a
single mbuf, without mapping those pages. It is a requirement for
Netflix's in-kernel TLS, and provides a 5-10% CPU savings on heavy web
serving workloads when used by sendfile, due to effectively
compressing socket buffers by an order of magnitude, and hence
reducing cache misses.
For this new external mbuf buffer type (EXT_PGS), the ext_buf pointer
now points to a struct mbuf_ext_pgs structure instead of a data
buffer. This structure contains an array of physical addresses (this
reduces cache misses compared to an earlier version that stored an
array of vm_page_t pointers). It also stores additional fields needed
for in-kernel TLS such as the TLS header and trailer data that are
currently unused. To more easily detect these mbufs, the M_NOMAP flag
is set in m_flags in addition to M_EXT.
Various functions like m_copydata() have been updated to safely access
packet contents (using uiomove_fromphys()), to make things like BPF
safe.
NIC drivers advertise support for unmapped mbufs on transmit via a new
IFCAP_NOMAP capability. This capability can be toggled via the new
'nomap' and '-nomap' ifconfig(8) commands. For NIC drivers that only
transmit packet contents via DMA and use bus_dma, adding the
capability to if_capabilities and if_capenable should be all that is
required.
If a NIC does not support unmapped mbufs, they are converted to a
chain of mapped mbufs (using sf_bufs to provide the mapping) in
ip_output or ip6_output. If an unmapped mbuf requires software
checksums, it is also converted to a chain of mapped mbufs before
computing the checksum.
alc [Fri, 28 Jun 2019 22:40:34 +0000 (22:40 +0000)]
When we protect PTEs (as opposed to PDEs), we only call vm_page_dirty()
when, in fact, we are write protecting the page and the PTE has PG_M set.
However, pmap_protect_pde() was always calling vm_page_dirty() when the PDE
has PG_M set. So, adding PG_NX to a writeable PDE could result in
unnecessary (but harmless) calls to vm_page_dirty().
Simplify the loop calling vm_page_dirty() in pmap_protect_pde().
hselasky [Fri, 28 Jun 2019 22:28:51 +0000 (22:28 +0000)]
Need to apply the PCIM_BAR_MEM_BASE mask to the physical memory
address before returning it to the user. Some of the least significant
bits have special meaning and should be masked away.