Warner Losh [Sun, 19 Aug 2018 18:18:19 +0000 (18:18 +0000)]
Turn back the clock just a little: make userboot.so always be 4th
Turns out there was a hidden dependency we hadn't counted upon. The
host loads /boot/userboot.so to boot the VMs it runs. This means that
the change to lua suddenly meant that nobody could run their older
VMs because LUA wasn't in 10.0, last month's HardenedBSD, 11.2 or
whatever. Even more than for the /boot/loader* binaries, we need a
good coexistence strategy for this. While that's being designed and
implemented, drop back to always 4th for userboot.so. This will fail
safe in all but the most extreme environments (but lua-only hacks
to .lua files won't be processed in VMs until we fix it).
John Baldwin [Sun, 19 Aug 2018 17:36:50 +0000 (17:36 +0000)]
Fix the MPTable probe code after the 4:4 changes on i386.
The MPTable probe code was using PMAP_MAP_LOW as the PA -> VA offset
when searching for the table signature but still using KERNBASE once
it had found the table. As a result, the mpfps table pointed into a
random part of the kernel text instead of the actual MP Table.
Rather than adding more #ifdef's, use BIOS_PADDRTOVADDR from
<machine/pc/bios.h> which already uses PMAP_MAP_LOW on i386 and KERNBASE
on amd64.
Kirk McKusick [Sun, 19 Aug 2018 17:19:20 +0000 (17:19 +0000)]
For traditional disks, the filesystem attempts to allocate the
blocks of a file as contiguously as possible. Since the filesystem
does not know how large a file will grow when it is first being
written, it initially places the file in a set of blocks in which
it currently fits. As it grows, it is relocated to areas with
larger contiguous blocks. In this way it saves its large contiguous
sets of blocks for the files that need them and thus avoids
unnecessarily fragmenting its disk space.
We used to skip reallocating the blocks of a file into a contiguous
sequence if the underlying flash device requested BIO_DELETE
notifications, because devices that benefit from BIO_DELETE also
benefit from not moving the data. However, in the algorithm described
above that reallocates the blocks, the destination for the data is
usually moved before the data is written to the initially allocated
location. So we rarely suffer the penalty of extra writes. With
the addition of the consolidation of contiguous blocks into single
BIO_DELETE operations, having fewer but larger contiguous blocks
reduces the number of (slow and expensive) BIO_DELETE operations.
So when doing BIO_DELETE consolidation, we do block reallocation.
Reviewed by: kib
Tested by: Peter Holm
Sponsored by: Netflix
Kirk McKusick [Sun, 19 Aug 2018 16:56:42 +0000 (16:56 +0000)]
Add consolidation of TRIM / BIO_DELETE commands to the UFS/FFS filesystem.
When deleting files on filesystems that are stored on flash-memory
(solid-state) disk drives, the filesystem notifies the underlying
disk of the blocks that it is no longer using. The notification
allows the drive to avoid saving these blocks when it needs to
flash (zero out) one of its flash pages. These notifications of
no-longer-being-used blocks are referred to as TRIM notifications.
In FreeBSD these TRIM notifications are sent from the filesystem
to the drive using the BIO_DELETE command.
Until now, the filesystem would send a separate message to the drive
for each block of the file that was deleted. Each Gigabyte of file
size resulted in over 3000 TRIM messages being sent to the drive.
This burst of messages can overwhelm the drive's task queue causing
multiple second delays for read and write requests.
This implementation collects runs of contiguous blocks in the file
and then consolidates them into a single BIO_DELETE command to the
drive. The BIO_DELETE command describes the run of blocks as a
single large block being deleted. Each Gigabyte of file size can
result in as few as two BIO_DELETE commands and is typically less
than ten. Though these larger BIO_DELETE commands take longer to
run, they do not clog the drive task queue, so read and write
commands can intersperse effectively with them.
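
The run-collection step described above can be sketched roughly as follows. This is a minimal illustration only, not the FFS code: the struct trim_run type and the coalesce_runs() name are hypothetical, and it assumes the freed block numbers arrive as a sorted array.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical description of one consolidated delete request. */
struct trim_run {
	uint64_t start;		/* first block number in the run */
	uint64_t count;		/* number of contiguous blocks */
};

/*
 * Coalesce a sorted array of freed block numbers into contiguous runs.
 * 'out' must have room for 'nblks' entries (the worst case, when no two
 * blocks are adjacent). Returns the number of runs written.
 */
static size_t
coalesce_runs(const uint64_t *blks, size_t nblks, struct trim_run *out)
{
	size_t nruns = 0;

	for (size_t i = 0; i < nblks; i++) {
		if (nruns > 0 &&
		    out[nruns - 1].start + out[nruns - 1].count == blks[i]) {
			/* Block directly follows the current run: extend it. */
			out[nruns - 1].count++;
		} else {
			/* Gap found: start a new run. */
			out[nruns].start = blks[i];
			out[nruns].count = 1;
			nruns++;
		}
	}
	return (nruns);
}
```

Each resulting run would then map to one delete request instead of one request per block, which is where the reduction in message count comes from.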
Though this new feature has been thoroughly reviewed and tested, it
is being added disabled by default so as to minimize the possibility
of disrupting the upcoming 12.0 release. It can be enabled by running
``sysctl vfs.ffs.dotrimcons=1''. Users are encouraged to test it.
If no problems arise, we will consider requesting that it be enabled
by default for 12.0.
Reviewed by: kib
Tested by: Peter Holm
Sponsored by: Netflix
John Baldwin [Sun, 19 Aug 2018 16:14:59 +0000 (16:14 +0000)]
Remove some vestiges of IPI_LAZYPMAP on i386.
The support for lazy pmap invalidations on i386 was removed in r281707.
This removes the constant for the IPI and stops accounting for it when
sizing the interrupt count arrays.
Michael Tuexen [Sun, 19 Aug 2018 14:56:10 +0000 (14:56 +0000)]
Don't expose the uptime via the TCP timestamps.
The TCP client side or the TCP server side when not using SYN-cookies
used the uptime as the TCP timestamp value. This patch uses an offset
in all cases; the offset is the result of a keyed hash function taking
the source and destination addresses and port numbers into account.
The keyed hash function is the same as used for the initial TSN.
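
A rough sketch of the idea, under loud assumptions: the struct, the function names and the 16-byte secret are all hypothetical, and FNV-1a is used only as a stand-in mixing function. It is not the keyed hash the kernel uses and is not suitable as one.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical IPv4 connection identifier; field names are illustrative. */
struct conn_tuple {
	uint32_t src_addr;
	uint32_t dst_addr;
	uint16_t src_port;
	uint16_t dst_port;
};

/* Stand-in mixing step (FNV-1a). NOT the kernel's keyed hash. */
static uint32_t
mix(uint32_t h, const void *data, size_t len)
{
	const unsigned char *p = data;

	for (size_t i = 0; i < len; i++) {
		h ^= p[i];
		h *= 16777619u;
	}
	return (h);
}

/* Derive a per-connection offset from a secret and the 4-tuple. */
static uint32_t
ts_offset(const unsigned char secret[16], const struct conn_tuple *t)
{
	uint32_t h = 2166136261u;

	h = mix(h, secret, 16);
	h = mix(h, &t->src_addr, sizeof(t->src_addr));
	h = mix(h, &t->dst_addr, sizeof(t->dst_addr));
	h = mix(h, &t->src_port, sizeof(t->src_port));
	h = mix(h, &t->dst_port, sizeof(t->dst_port));
	return (h);
}

/* Timestamp placed on the wire: local clock plus the offset (mod 2^32). */
static uint32_t
ts_value(uint32_t clock, uint32_t offset)
{
	return (clock + offset);
}
```

Because the offset is stable for a given connection, timestamps stay monotonic within that connection, while a peer that does not know the secret cannot subtract the offset to recover the uptime.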
Kyle Evans [Sun, 19 Aug 2018 14:26:33 +0000 (14:26 +0000)]
stand: Flip the default interpreter to Lua
After years in the making, lualoader is ready to make its debut. Both
flavors of loader are still built by default, and may be installed as
/boot/loader or /boot/loader.efi as appropriate either by manually creating
hard links or using LOADER_DEFAULT_INTERP as documented in build(7).
Cy Schubert [Sun, 19 Aug 2018 13:45:03 +0000 (13:45 +0000)]
The bucket index is subtracted by one at lines 2304 and 2314. When 0 it
becomes -1, except these are unsigned integers, so they become very large
numbers. Thus they are always larger than the maximum bucket; the hash table
insertion fails causing NAT to fail.
This commit ensures that if the index is already zero it is not reduced
prior to insertion into the hash table.
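
The wraparound and the guard can be illustrated with a small sketch. The function names are hypothetical and the index is assumed to be a 32-bit unsigned value; the actual ipfilter code differs.

```c
#include <stdint.h>

/* Unguarded form: wraps to UINT32_MAX when idx is 0. */
static uint32_t
bucket_dec_unchecked(uint32_t idx)
{
	return (idx - 1);
}

/* Guarded form: leaves a zero index alone. */
static uint32_t
bucket_dec_guarded(uint32_t idx)
{
	return (idx > 0 ? idx - 1 : idx);
}

/* Range check of the kind that rejects the wrapped value. */
static int
bucket_in_range(uint32_t idx, uint32_t nbuckets)
{
	return (idx < nbuckets);
}
```

With the unguarded form, an index of 0 becomes 4294967295, which fails any range check against the table size; the guarded form keeps it at 0.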
Cy Schubert [Sun, 19 Aug 2018 13:44:59 +0000 (13:44 +0000)]
Add handy DTrace probes useful in diagnosing NAT issues. DTrace probes
are situated next to error counters and/or in one instance prior to the
-1 return from various functions. This was useful in diagnosis of
PR/208566 and will be handy in the future diagnosing NAT failures.
Cy Schubert [Sun, 19 Aug 2018 13:44:56 +0000 (13:44 +0000)]
Expose np (nat_t - an entry in the nat table structure) in the DTrace
probe when nat fails (label badnat). This is useful in diagnosing
failed NAT issues and was used in PR/208566.
Kyle Evans [Sun, 19 Aug 2018 04:15:38 +0000 (04:15 +0000)]
diff(1): Refactor -B a little bit
Instead of doing a second pass to skip empty lines if we've specified -I, go
ahead and check both at once. Ignore criteria has been split out into its own
function to try and keep the logic cleaner.
Kyle Evans [Sun, 19 Aug 2018 03:57:20 +0000 (03:57 +0000)]
diff(1): Implement -B/--ignore-blank-lines
As noted by cem in r338035, coccinelle invokes diff(1) with the -B flag.
This was not previously implemented here, so one was forced to create a link
for GNU diff to /usr/local/bin/diff
Implement the -B flag and add some primitive tests for it. It is implemented
in the same fashion that -I is implemented; each chunk's lines are scanned,
and if a non-blank line is encountered then the chunk will be output.
Otherwise, it's skipped.
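
The per-chunk scan can be sketched as below. This is a simplified illustration with hypothetical function names; it assumes a line counts as blank when it is empty or holds only a newline, which may be narrower or wider than what diff(1) actually checks.

```c
#include <stdbool.h>
#include <stddef.h>

/* Assumption: "blank" means empty, or a lone newline character. */
static bool
line_is_blank(const char *line)
{
	return (line[0] == '\0' || (line[0] == '\n' && line[1] == '\0'));
}

/*
 * Scan a chunk's lines; the chunk is output as soon as one non-blank
 * line is found, and skipped if every line is blank.
 */
static bool
chunk_should_output(const char *const *lines, size_t nlines)
{
	for (size_t i = 0; i < nlines; i++) {
		if (!line_is_blank(lines[i]))
			return (true);
	}
	return (false);
}
```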
Conrad Meyer [Sun, 19 Aug 2018 00:22:21 +0000 (00:22 +0000)]
Remove unused and easy to misuse PNP macro parameter
Inspired by r338025, just remove the element size parameter to the
MODULE_PNP_INFO macro entirely. The 'table' parameter is now required to
have correct pointer (or array) type. Since all invocations of the macro
already had this property and the emitted PNP data continues to include the
element size, there is no functional change.
(Note that coccinelle invokes diff(1) via a PATH search and expects diff to
tolerate the -B flag, which BSD diff does not. So I had to link gdiff into
PATH as diff to use spatch.)
Kirk McKusick [Sat, 18 Aug 2018 22:21:59 +0000 (22:21 +0000)]
Replace the TRIM consolidation framework originally added in -r337396
driven by problems found with the algorithms being tested for TRIM
consolidation.
Kyle Evans [Sat, 18 Aug 2018 20:55:20 +0000 (20:55 +0000)]
ls(1): Support other aliases for --color arguments used by GNU ls(1)
These aliases are supported and documented in the man page. For now, they
will not be mentioned in the error when an invalid argument is encountered,
instead keeping that list to the shorter 'preferred' names of each argument.
Dimitry Andric [Sat, 18 Aug 2018 20:41:43 +0000 (20:41 +0000)]
Use the size of one bge_devs element for the MODULE_PNP_INFO macro,
instead of the size of the whole bge_devs array.
This should stop kldxref searching beyond the end of .rodata when it
processes relocations, and emitting "unhandled relocation type" errors,
at least on i386.
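
The difference between the two sizes is easy to see in isolation. The table below is a made-up stand-in (struct dev_id, the ID values and the ELEM_* macro names are all hypothetical), not the bge(4) device table or the real MODULE_PNP_INFO definition.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical device ID entry with placeholder values. */
struct dev_id {
	uint16_t vendor;
	uint16_t device;
};

static const struct dev_id devs[] = {
	{ 0x1111, 0x0001 },
	{ 0x1111, 0x0002 },
	{ 0x1111, 0x0003 },
};

/* Size of one element, derived from the table's own type. */
#define	ELEM_SIZE(table)	(sizeof((table)[0]))
/* Number of elements; only valid for true arrays, not pointers. */
#define	ELEM_COUNT(table)	(sizeof(table) / sizeof((table)[0]))
```

Here sizeof(devs) is three times ELEM_SIZE(devs), so passing the whole-array size where a per-element size is expected makes a consumer step far past the end of the table.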
This is very primitive code to inspect the PCI error state and AER
error state, dump the log and clear errors, from ddb.
pci_print_faulted_dev() is made external to allow calling it from
other places. It was called from NMI handler but this chunk is not
included.
Also there is a tunable-controlled code to clear AER on device attach,
disabled by default.
All this code was useful to me when I debugged ACPI_DMAR failures (not
faults) long time ago.
Reviewed by: cem, imp (previous version)
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
Differential revision: https://reviews.freebsd.org/D7813
John Baldwin [Sat, 18 Aug 2018 20:32:08 +0000 (20:32 +0000)]
Make 'device crypto' lines more consistent.
- In configurations with a pseudo devices section, move 'device crypto'
into that section.
- Use a consistent comment. Note that other things common in kernel
configs such as GELI also require 'device crypto', not just IPSEC.
John Baldwin [Sat, 18 Aug 2018 20:28:25 +0000 (20:28 +0000)]
Fix casts between 64-bit physical addresses and pointers in EFI.
Compiling FreeBSD/i386 with modern GCC triggers warnings for various
places that convert 64-bit EFI_ADDRs to pointers and vice versa.
- Cast pointers to uintptr_t rather than to uint64_t when assigning
to a 64-bit integer.
- Cast 64-bit integers to uintptr_t before a cast to a pointer.
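
Both directions can be sketched with a stand-in type. Here phys_addr_t and the helper names are hypothetical; the real code works with the EFI address types.

```c
#include <stdint.h>

/* Stand-in for a 64-bit EFI physical address type. */
typedef uint64_t phys_addr_t;

/* Pointer -> 64-bit integer: go through uintptr_t, then widen. */
static phys_addr_t
ptr_to_addr(const void *p)
{
	return ((phys_addr_t)(uintptr_t)p);
}

/* 64-bit integer -> pointer: narrow to uintptr_t first, then cast. */
static void *
addr_to_ptr(phys_addr_t a)
{
	return ((void *)(uintptr_t)a);
}
```

On a 32-bit target, uintptr_t is 32 bits wide, so the intermediate cast makes the size change explicit and avoids the compiler's pointer/integer size mismatch warnings.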
Kyle Evans [Sat, 18 Aug 2018 19:45:56 +0000 (19:45 +0000)]
res_find: Fix fallback logic
The fallback logic was broken if hints were found in multiple environments.
If we found a hint in either the loader environment or the static
environment, fallback would be incremented excessively when we returned to
the environment-selection bits. These checks should have also been guarded
by the fbacklvl checks. As a result, fbacklvl could quickly get to a point
where we skip either the static environment and/or the static hints
depending on which environments contained valid hints.
The impact of this bug is minimal, mostly affecting mips boards that use
static hints and may have hints in either the loader environment or the
static environment.
There may be better ways to express the searchable environments and
describing their characteristics (immutable, already searched, etc.) but
this may be revisited after 12 branches.
Reported by: Dan Nelson <dnelson_1901@yahoo.com>
Triaged by: Dan Nelson <dnelson_1901@yahoo.com>
MFC after: 3 days
Rick Macklem [Sat, 18 Aug 2018 19:14:06 +0000 (19:14 +0000)]
Fix LORs between vn_start_write() and vn_lock() in nfsrv_copymr().
When coding the pNFS server, I added vn_start_write() calls in nfsrv_copymr()
done while the vnodes were locked, not realizing I had introduced LORs and
possible deadlock when an exported file system on the MDS is suspended.
This patch fixes the LORs by moving the vn_start_write() calls up to before
where the vnodes are locked. For "tvp", the vn_start_write() probably isn't
necessary, because NFS mounts can't be suspended. However, I think doing
so is harmless.
Thanks go to kib@ for letting me know that I had introduced these LORs.
This patch only affects the behaviour of the pNFS server when pnfsdscopymr(8)
is used to recover a mirrored DS.
Alan Cox [Sat, 18 Aug 2018 18:33:50 +0000 (18:33 +0000)]
Eliminate the arena parameter to kmem_malloc_domain(). It is redundant.
The domain and flags parameters suffice. In fact, the related functions
kmem_alloc_{attr,contig}_domain() don't have an arena parameter.
Eugene Grosbein [Sat, 18 Aug 2018 10:58:44 +0000 (10:58 +0000)]
bsnmpd(8): fix and optimize interface description processing
* correctly prepare a buffer to obtain interface description from a kernel and
truncate long description instead of dropping it altogether and
spamming logs;
* skip calling strlen() for each description and each SNMP request
for MIB-II/ifXTable's ifAlias.
* teach bsnmpd to allocate memory dynamically for interface descriptions
to decrease memory usage for common case and not to break
if long description occurs;
PR: 217763
Reviewed by: harti and others
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D16459
Kyle Evans [Sat, 18 Aug 2018 03:20:59 +0000 (03:20 +0000)]
libbe(3): Move build goop back out of cddl/
Some background: in the GSoC project, libbe/Makefile lived in lib/libbe. I
created projects/bectl branch, maintained the above for all of five
minutes before I misread Makefile.inc1 and decided that it couldn't possibly
build outside of cddl/, so I kicked the Makefile out into the cddl/ build
and all was good. The misreading was of the bit where .WAIT is added to
SUBDIR after lib, libexec but prior to building bin and cddl *only during
the install targets*, which is the critical part.
Fast forward- buildworld was still broken in my branch unbeknownst to me
because I didn't nuke my OBJDIR. Combing through Makefile.inc1 eventually
revealed the necessary magic to make sure that libbe's dependencies are
specified well enough, and it becomes clear what needs done to make a
non-cddl/ build work. This is an interesting prospect, because the build
split is kind of annoying to work with.
IGNORE_PRAGMA is added to avoid dropping WARNS by one more. This was
previously pulled in via cddl/Makefile.inc.
Kyle Evans [Sat, 18 Aug 2018 01:12:44 +0000 (01:12 +0000)]
bectl(8): Allow running a custom command in the 'jail' subcommand
Instead of always running /bin/sh, allow the user to specify the command
to run. The jail is not removed when the command finishes, meaning
`bectl unjail` will still need to be run.
Pedro F. Giffuni [Sat, 18 Aug 2018 01:05:38 +0000 (01:05 +0000)]
POSIX compliance improvements in the pthread(3) functions.
This basically makes use of the C99 restrict keyword, and also
adds some 'const's to four threading functions: pthread_mutexattr_gettype(),
pthread_mutexattr_getprioceiling(), pthread_mutexattr_getprotocol(), and
pthread_mutex_getprioceiling(). The changes are in accordance with POSIX/SUSv4-2018.
Bjoern A. Zeeb [Fri, 17 Aug 2018 21:19:18 +0000 (21:19 +0000)]
METALOG, unless manually overwritten, is defined as ${DESTDIR}/${DISTDIR}/METALOG
In the create-world-packages target we manually piece this together (unless
it is undefined), without the DISTDIR. Normally DISTDIR is empty (unset) and
no one notices. Now DISTDIR is a well known long-standing PORTS environment
variable and if that is set in the local environment the path to METALOG
is wrong as it no longer is ${DESTDIR}/METALOG.
Long-term we should start to avoid "publicly well known" names for global
variables, for now just piece ${DISTDIR} in as well. This allows
create-world-packages to continue if DISTDIR is set in the env.
Rick Macklem [Fri, 17 Aug 2018 21:12:16 +0000 (21:12 +0000)]
Fix LORs between vn_start_write() and vn_lock() in the pNFS server.
When coding the pNFS server, I added several vn_start_write() calls done
while the vnode was locked, not realizing I had introduced LORs and
possible deadlock when an exported file system on the MDS is suspended.
This patch fixes this by removing the added vn_start_write() calls and
modifying the code so that the extant vn_start_write() call before the
NFS RPC/operation is done when needed by the pNFS server.
Flags are changed so that LayoutCommit and LayoutReturn now get a
vn_start_write() done for them.
When the pNFS server is enabled, the code now also changes the flags for
Getattr, so that the vn_start_write() is done for Getattr, since it may
need to do a vn_set_extattr(). The nfs_writerpc flag array was made global
to the NFS server and renamed nfsrv_writerpc, which is consistent naming
for globals in the NFS server.
Thanks go to kib@ for reporting that doing vn_start_write() while the vnode is
locked results in a LOR.
This patch only affects the behaviour of the pNFS server.
Alan Somers [Fri, 17 Aug 2018 18:37:22 +0000 (18:37 +0000)]
Fix sys/netipsec/tunnel tests after r337736
Originally, these tests accidentally used broadcast addresses when they
should've used unicast addresses. That the tests passed prior to r337736
was accidental.
Brooks Davis [Fri, 17 Aug 2018 16:19:47 +0000 (16:19 +0000)]
Rework rtld's TLS Variant I implementation to match r326794
The above commit fixed handling overaligned TLS segments in libc's
TLS Variant I implementation, but rtld provides its own implementation
for dynamically-linked executables which lacks these fixes. Thus,
port these changes to rtld.
Mark Johnston [Fri, 17 Aug 2018 15:41:01 +0000 (15:41 +0000)]
Add INVARIANTS-only fences around lockless vnode refcount updates.
Some internal KASSERTs access the v_iflag field without the vnode
interlock held after such a refcount update. The fences are needed for
the assertions to be correct in the face of store reordering.
Reported and tested by: jhibbits
Reviewed by: kib, mjg
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D16756
Relax allocation throttling for ditto blocks. Due to random imbalances
in allocation it tends to push block copies to one vdev, that looks
slightly better at the moment. A slightly less strict policy allows us to
improve both data security and, surprisingly, write performance, since we don't
need to touch extra metaslabs on each vdev to respect the min distance.
Alexander Motin [Fri, 17 Aug 2018 15:00:41 +0000 (15:00 +0000)]
9738 Fix third block copy allocations, broken at 9112.
Use METASLAB_WEIGHT_CLAIM weight to allocate tertiary blocks.
Previous use of METASLAB_WEIGHT_SECONDARY for that caused errors
later on metaslab_activate_allocator() call, leading to massive
load of unneeded metaslabs and write freezes.
Kristof Provost [Fri, 17 Aug 2018 15:00:10 +0000 (15:00 +0000)]
pf: Limit the maximum number of fragments per packet
Similar to the network stack issue fixed in r337782 pf did not limit the number
of fragments per packet, which could be exploited to generate high CPU loads
with a crafted series of packets.
Limit each packet to no more than 64 fragments. This should be sufficient on
typical networks to allow maximum-sized IP frames.
This addresses the issue for both IPv4 and IPv6.
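
As a back-of-the-envelope check of the 64-fragment figure, the sketch below counts the fragments a maximum-sized IPv4 packet needs at a given MTU. The MAX_FRAGS and frags_needed() names are hypothetical, and a fixed 20-byte header with no options is assumed.

```c
/* Limit stated in the commit message; the macro name is hypothetical. */
#define	MAX_FRAGS	64

/*
 * Number of fragments needed to carry a packet of 'total_len' bytes
 * (header included) over a link with the given MTU. Every fragment
 * carries its own header, and fragment payloads are multiples of 8.
 */
static unsigned
frags_needed(unsigned total_len, unsigned hdr_len, unsigned mtu)
{
	unsigned payload = total_len - hdr_len;
	unsigned per_frag = ((mtu - hdr_len) / 8) * 8;

	return ((payload + per_frag - 1) / per_frag);
}
```

Under these assumptions a 65535-byte packet needs 45 fragments at an MTU of 1500, comfortably under the limit, but 119 at an MTU of 576, which is consistent with the "typical networks" qualifier.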
MFC after: 3 days
Security: CVE-2018-5391
Sponsored by: Klara Systems
Notable fixes:
- Overlays may now be generated properly without -@
- /__local_fixups__ were not including unit address in their structure
- The error reporting a magic token was misleading, reporting
"Bad magic token in header. Got d00dfeed expected 0xd00dfeed"
if the token was missing. This has been split out into a separate message.
Rick Macklem [Fri, 17 Aug 2018 12:32:38 +0000 (12:32 +0000)]
Don't set a file's size for the MDS file of a pNFS service.
When a pNFS service is running, the size of the files created on the MDS
are normally 0, since the data is written to the data files on the DS(s).
However, without this patch, if a Setattr with a non-zero size was done by
a client, the MDS file was set to that size. This was thought to be benign,
but it turns out that files with a non-zero size plus extended attributes
can cause a "ffs_truncate3" panic in UFS. Although the exact cause of this
panic() has not been isolated, this patch avoids the panic() and leaves
the MDS files in a consistent state of always having a size == 0.
Note that these MDS files never store data. The patch also includes an
unnecessary initialization of savsize in case some compiler or static
analyser complains it might not be initialized.
This patch only affects the NFS server when pNFS is enabled via the "-p"
command line option on nfsd.
Roger Pau Monné [Fri, 17 Aug 2018 07:27:15 +0000 (07:27 +0000)]
build: skip the database check when generating install media
There are several scripts and targets solely used to generate install
media, make sure DB_FROM_SRC is used in that case in order to prevent
checking the host database, which is irrelevant when generating
install binaries.
Conrad Meyer [Fri, 17 Aug 2018 04:40:01 +0000 (04:40 +0000)]
cryptosoft: Reduce generality of supported algorithm composition
Fix a regression introduced in r336439.
Rather than allowing any linked list of algorithms, allow at most two
(typically, some combination of encrypt and/or MAC). Removes a WAITOK
malloc in an unsleepable context (classic LOR) by placing both software
algorithm contexts within the OCF-managed session object.
Tested with 'cryptocheck -a all -d cryptosoft0', which includes some
encrypt-and-MAC modes.
Kyle Evans [Fri, 17 Aug 2018 04:15:51 +0000 (04:15 +0000)]
ls(1): Add --color=when
--color may be set to one of: 'auto', 'always', and 'never'.
'auto' is the default behavior- output colors only if -G or COLORTERM are
set, and only if stdout is a tty.
'always' is a new behavior- output colors always. termcap(5) will be
consulted unless TERM is unset or not a recognized terminal, in which case
ls(1) will fall back to explicitly outputting ANSI escape sequences.
'never' turns off colors regardless of any environment variable or -G usage.
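
The three modes reduce to a small decision function. The sketch below uses hypothetical names and ignores the termcap(5) fallback; it only models whether colors are emitted, following the description above.

```c
#include <stdbool.h>
#include <string.h>

enum color_when { COLOR_AUTO, COLOR_ALWAYS, COLOR_NEVER };

/* Map the --color argument to a mode; returns false if unrecognized. */
static bool
parse_when(const char *arg, enum color_when *out)
{
	if (strcmp(arg, "auto") == 0)
		*out = COLOR_AUTO;
	else if (strcmp(arg, "always") == 0)
		*out = COLOR_ALWAYS;
	else if (strcmp(arg, "never") == 0)
		*out = COLOR_NEVER;
	else
		return (false);
	return (true);
}

/* Decide whether to emit colors for the given mode and environment. */
static bool
use_color(enum color_when when, bool g_flag, bool colorterm_set,
    bool stdout_is_tty)
{
	switch (when) {
	case COLOR_ALWAYS:
		return (true);
	case COLOR_NEVER:
		return (false);
	case COLOR_AUTO:
	default:
		/* -G or COLORTERM must be set, and stdout must be a tty. */
		return ((g_flag || colorterm_set) && stdout_is_tty);
	}
}
```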
Summary:
PowerISA 3.0 adds a 'darn' instruction to "deliver a random number". This
driver was modeled after (rather, copied and gutted of) the Ivy Bridge
rdrand driver.
This uses the "Conditional Random Number" behavior to remove input bias.
From the ISA reference the 'darn' instruction, and the random number
generator backing it, conforms to the NIST SP800-90B and SP800-90C
standards, compliant to the extent possible at the time the hardware was
designed, and guarantees a minimum 0.5 bits of entropy per bit returned.
Kyle Evans [Fri, 17 Aug 2018 03:42:57 +0000 (03:42 +0000)]
subr_prf: Don't write kern.boot_tag if it's empty
This change allows one to set kern.boot_tag="" and not get a blank line
preceding other boot messages. While this isn't super critical- blank lines
are easy to filter out both mentally and in processing dmesg later- it
allows for a mode of operation that matches previous behavior.
I intend to MFC this whole series to stable/11 by the end of the month with
boot_tag empty by default to make this effectively a nop in the stable
branch.
Conrad Meyer [Fri, 17 Aug 2018 00:30:04 +0000 (00:30 +0000)]
Add xform-conforming auth_hash wrapper for Poly-1305
The wrapper is a thin shim around libsodium's Poly-1305 implementation. For
now, we just use the C algorithm and do not attempt to build the
SSE-optimized variant for x86 processors.
The algorithm support has not yet been plumbed through cryptodev, or added
to cryptosoft.
libsodium is derived from Daniel J. Bernstein et al.'s 2011 NaCl
("Networking and Cryptography Library," pronounced "salt") software library.
At the risk of oversimplifying, libsodium primarily exists to make it easier
to use NaCl. NaCl and libsodium provide high quality implementations of a
number of useful cryptographic concepts (as well as the underlying
primitives) seeing some adoption in newer network protocols.
I considered but dismissed cleaning up the directory hierarchy and
discarding artifacts of other build systems in favor of remaining close to
upstream (and easing future updates).
Nothing is integrated into the build system yet, so in that sense, no
functional change.
Alan Somers [Thu, 16 Aug 2018 22:04:00 +0000 (22:04 +0000)]
Revert r337929
FreeBSD's mkstemp sets the temporary file's permissions to 600, and has ever
since mkstemp was added in 1987. Coverity's warning is still relevant for
portable programs since OpenGroup does not require that behavior, and POSIX
didn't until 2008. But none of these programs are portable.
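
The permission claim can be checked with a short probe. This is a sketch with a hypothetical helper name and an assumed writable /tmp; it reports the mode bits of a freshly created temporary file and does not test anything about the reverted programs.

```c
#define	_POSIX_C_SOURCE	200809L

#include <stdlib.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Create a temporary file with mkstemp(3) and return its permission
 * bits, or -1 on error. The file is removed before returning.
 */
static int
mkstemp_mode(void)
{
	char path[] = "/tmp/mkstemp-demo.XXXXXX";
	struct stat sb;
	int fd, rc;

	fd = mkstemp(path);
	if (fd == -1)
		return (-1);
	rc = fstat(fd, &sb);
	(void)unlink(path);
	(void)close(fd);
	if (rc == -1)
		return (-1);
	return ((int)(sb.st_mode & 0777));
}
```

On a system that follows the behavior described above, and with an ordinary umask, the function returns 0600 (owner read/write only).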