Xin LI [Mon, 12 Jun 2017 09:11:31 +0000 (09:11 +0000)]
Fix buffer lengths.
After r319369, the RPC code validates caller supplied buffer length in
taddr2uaddr. When no -h is specified, the sizeof(ai_addr) is used,
which is always smaller than the required size and therefore uaddr
would be NULL, causing the kernel to copyin() from userland NULL
and fail with EFAULT.
Dmitry Chagin [Mon, 12 Jun 2017 07:35:59 +0000 (07:35 +0000)]
Since r318735 (ino64 project) the size of the native struct dirent is
equal or greater than the size of Linux struct dirent or struct dirent64.
So, remove LINUX_RECLEN_RATIO magic as useless.
Enji Cooper [Mon, 12 Jun 2017 02:38:37 +0000 (02:38 +0000)]
Remove stdlib.h #include added in r319844
A previous iteration of the tests I added in r319844 involved free(3), but
that attempt didn't pan out, so I switched to stack allocated buffers instead
of heap allocated ones, making the #include unnecessary.
Enji Cooper [Mon, 12 Jun 2017 00:43:14 +0000 (00:43 +0000)]
getbsize(3): clarify that underflow/overflow warnings in regard to $BLOCKSIZE
gets output via warnx(3)
This helps set expectations for how one might deal with those messages, i.e.,
mute output from /dev/stderr today, since that's where vwarn(3) outputs messages
to today.
Enji Cooper [Mon, 12 Jun 2017 00:21:55 +0000 (00:21 +0000)]
Add initial tests for stat(1)
Testcases for -H, -L, and -f haven't been implemented yet, in part due
to additional complexity needed to validate the features:
* -H and -f will require an external "helper" program to display/modify
the state/permissions for a given path.
* -L is being covered partially via the -n testcase today.
Jilles Tjoelker [Sun, 11 Jun 2017 19:06:07 +0000 (19:06 +0000)]
rc.subr: Optimize repeated sourcing.
When /etc/rc runs all /etc/rc.d scripts, it has already loaded /etc/rc.subr
but each /etc/rc.d script sources it again (since /etc/rc.d scripts must
also work when started stand-alone).
Therefore, if rc.subr is already loaded, return so sh need not parse the
rest of the file.
A second effect is that there is no longer a compound command around most of
rc.subr. This reduces memory usage while sh is loading rc.subr for the first
time (but this memory is free()d once rc.subr is loaded).
For purposes of porting this to other systems, I do not recommend porting
this to systems with shells that do not have the change to the return
special builtin like in r255215 (before FreeBSD 10.0-RELEASE). This change
ensures that return in the top level of a dot script returns from the dot
script, even if the dot script was sourced from a function.
A comparison of CPU time on an amd64 bhyve virtual machine from a times
command added near the end of /etc/rc, all four values summed:
x orig1
+ quickreturn
+--------------------------------------------------------------------------+
| + + + x x x|
||______M__A_________| |______M___A__________| |
+--------------------------------------------------------------------------+
N Min Max Median Avg Stddev
x 3 1.704 1.802 1.726 1.744 0.051419841
+ 3 1.467 1.559 1.487 1.5043333 0.048387326
Difference at 95.0% confidence
-0.239667 +/- 0.113163
-13.7424% +/- 6.48873%
(Student's t, pooled s = 0.0499266)
Ian Lepore [Sun, 11 Jun 2017 00:16:21 +0000 (00:16 +0000)]
Add some utility functions to help a PHY driver on an FDT-configured
system retrieve its config data from the fdt data.
The properties that are common to all phys are decoded and returned in a
structure. The fdt node handles for the mac and phy devices are also
returned in the config data struct, so a driver can easily obtain additional
hardware-specific config values from the fdt data.
Ian Lepore [Sat, 10 Jun 2017 23:55:13 +0000 (23:55 +0000)]
Add a set of constants describing the ways a MAC and PHY can be connected.
While the initial need for this is to help support phy drivers which are
configured with FDT data, there is nothing devicetree-specific about the
concept or the names, so they are available for use even on non-FDT systems.
The initial list of connection types comes from the current devicetree
bindings documentation, but values not documented there can be added to
the list in the future as needed, the values could be sorted into a
different order without perturbing FDT code, etc. The only invariant
is that MII_CONTYPE_UNKNOWN should be first (so it has a value of zero,
so that a con-type variable in a softc, for example, is initialized to
MII_CONTYPE_UNKNOWN by default).
Ian Lepore [Sat, 10 Jun 2017 23:26:25 +0000 (23:26 +0000)]
if_ffec bugfixes related to harvesting of hardware-maintained statistics...
After harvesting the hardware statistics counters and summing them into the
interface stats, properly clear the hardware counters back to zero. On imx5
and earlier hardware it is necessary to disable collection of stats while
writing zeroes to all the registers. On imx6 and newer it turns out it's
not even possible to write zeroes, instead you have to toggle a special
"zero everything" control bit in a register.
Count incoming packets with a bad start frame delim as input errors, and
incoming packets dropped due to no fifo space as input drops.
Remove all code related to harvesting the hardware stats less often than
once per second. It turns out the 32-bit stats registers are backed by
16-bit counters under the hood, and they can easily roll over if you only
harvest them once every 3 seconds like the old code was doing. Now we just
read all the regs once a second.
The combination of not properly zeroing the stats registers and 16-bit
counters sometimes wrapping between harvest calls resulted in basically
unusable statistics before these changes.
Enji Cooper [Sat, 10 Jun 2017 20:56:31 +0000 (20:56 +0000)]
Improve handling with system state
- Always unlink $cmd after exit via END block.
- The tests don't function well if kern.geom.debugflags != 0. Save debugflags,
then restore them at the end of the test.
Switch the example name for variables controlling loading memory images
in /boot/defaults/loader.conf to something that's actually commonly used,
"mdroot". It's arbitrary, but it's easier to find this way.
Eric Joyner [Sat, 10 Jun 2017 18:56:30 +0000 (18:56 +0000)]
ixl(4)/ixlv(4): Fix some busdma tags and improper map NULL.
Description from Brett:
"The busdma tags used to create mappings for the tx and rx rings did not have
the device's tag as parents, meaning that they did not respect the device's
busdma properties. The other tags used in the driver had their parents set
appropriately.
Also, the dma maps for each buffer in ixl_txeof() were being NULLed after
being unloaded, which is an error because those maps are then reused without
being recreated (I believe this also leaked resources since the maps were not
destroyed). Simply removing the line that sets the maps to NULL gives the
desired behavior. There does not seem to be a similar problem with ixl_rxeof().
Functions to free the tx and rx rings also NULL out the dma maps for each
buffer, but this seems okay because the maps are destroyed and not reused in
this case.
With these fixes, my ixl card seems to be working with the IOMMU enabled."
Cy Schubert [Sat, 10 Jun 2017 17:05:14 +0000 (17:05 +0000)]
Disable the -O (output fields) option in poollist() (ippool -l) for
now. The option does not presently work. However, similar functions in
ipfstat (for state) and ipnat (for nat) do work and provide outputs that
can be easily parsed by shell scripts or subsequently loaded into CSV
files. The intention here is to return to this option to make it work.
I suspect the problem is in printpoolfields.c.
Alan Cox [Sat, 10 Jun 2017 16:11:39 +0000 (16:11 +0000)]
Remove an unnecessary field from struct blist. (The comment describing
what this field represented was also inaccurate.) Suggested by: kib
In r178792, blist_create() grew a malloc flag, allowing M_NOWAIT to be
specified. However, blist_create() was not modified to handle the
possibility that a malloc() call failed. Address this omission.
Increase the width of the local variable "radix" to 64 bits. (This
matches the width of the corresponding field in struct blist.)
John Baldwin [Sat, 10 Jun 2017 01:20:08 +0000 (01:20 +0000)]
Improve decoding of RB_AUTOBOOT in the 'howto' argument to reboot().
The reboot() system call accepts a mode (RB_AUTOBOOT, RB_HALT, RB_POWEROFF,
or RB_REROOT) as well as zero or more optional flags in 'howto'.
However, RB_AUTOBOOT was only displayed if 'howto' was exactly 0.
Combinations like 'RB_AUTOBOOT | RB_DUMP' were decoded as 'RB_DUMP'.
Instead, imply that RB_AUTOBOOT was specified if none of the other "mode"
flags were specified.
Justin Hibbits [Fri, 9 Jun 2017 20:26:42 +0000 (20:26 +0000)]
Follow up r313841 on powerpc
Close a potential race in reading the CPU dtrace flags, where a thread can
start on one CPU, and partway through retrieving the flags be swapped out,
while another thread traps and sets the CPU_DTRACE_NOFAULT. This could
cause the first thread to return without handling the fault.
Mark Johnston [Fri, 9 Jun 2017 19:41:12 +0000 (19:41 +0000)]
Augment wait queue support in the LinuxKPI.
In particular:
- Don't evaluate event conditions with a sleepqueue lock held, since such
code may attempt to acquire arbitrary locks.
- Fix the return value for wait_event_interruptible() in the case that the
wait is interrupted by a signal.
- Implement wait_on_bit_timeout() and wait_on_atomic_t().
- Implement some functions used to test for pending signals.
- Implement a number of wait_event_*() variants and unify the existing
implementations.
- Unify the mechanism used by wait_event_*() and schedule() to put the
calling thread to sleep.
This is required to support updated DRM drivers. Thanks to hselasky for
finding and fixing a number of bugs in the original revision.
Alan Cox [Fri, 9 Jun 2017 16:19:24 +0000 (16:19 +0000)]
blist_fill()'s return type is too narrow. blist_fill() accepts a 64-bit
quantity as the size of the range to fill, but returns a 32-bit quantity
as the number of blocks that were allocated to fill that range. This
revision corrects that mismatch. Currently, swaponsomething() limits
the size of a swap area to prevent arithmetic arithmetic overflow in
other parts of the blist allocator. That limit has also prevented this
type mismatch from causing problems.
https://www.illumos.org/issues/8168
If we manage to export the pool on which we are creating a dataset (filesystem
or zvol) between entering libzfs`zfs_create() and libzfs`zpool_open() call (for
which we never check the return value) we end up dereferencing a NULL pointer
in libzfs`zpool_close().
This was discovered on ZFS on Linux. The same issue can be reproduced on
Illumos running in parallel:
while :; do zpool import -d /tmp testpool ; zpool export testpool ; done
while :; do zfs create testpool/fs; zfs destroy testpool/fs ; done
Eventually this will result in several core dumps like this one:
[root@52-54-00-d3-7a-01 /cores]# mdb core.zfs.4244
Loading modules: [ libumem.so.1 libc.so.1 libtopo.so.1 libavl.so.1
libnvpair.so.1 ld.so.1 ]
> ::stack
libzfs.so.1`zpool_close+0x17(0, 0, 0, 8047450)
libzfs.so.1`zfs_create+0x1bb(8090548, 8047e6f, 1, 808cba8)
zfs_do_create+0x545(2, 8047d74, 80778a0, 801, 0, 3)
main+0x22c(8047d2c, fef5c6e8, 8047d64, 8055a17, 3, 8047d70)
_start+0x83(3, 8047e64, 8047e68, 8047e6f, 0, 8047e7b)
>
Fix and reproducer (systemtap): https://github.com/zfsonlinux/zfs/pull/6096
Reviewed by: Matt Ahrens <mahrens@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Author: loli10K <ezomori.nozomu@gmail.com>
MFC after: 2 weeks
https://www.illumos.org/issues/8156
dbuf_evict_notify() holds the dbuf_evict_lock while checking if it should do
the eviction itself (because the evict thread is not able to keep up).
This can result in massive lock contention.
It isn't necessary to hold the lock, because if we make the wrong choice
occasionally, nothing bad will happen.
Reviewed by: Dan Kimmel <dan.kimmel@delphix.com>
Reviewed by: Paul Dagnelie <pcd@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Author: Matthew Ahrens <mahrens@delphix.com>
MFC after: 1 week
https://www.illumos.org/issues/8005
RAID-Z requires that space be allocated in multiples of P+1 sectors,
because this is the minimum size block that can have the required amount
of parity. Thus blocks on RAIDZ1 must be allocated in a multiple of 2
sectors; on RAIDZ2 multiple of 3; and on RAIDZ3 multiple of 4. A sector
is a unit of 2^ashift bytes, typically 512B or 4KB.
To satisfy this constraint, the allocation size is rounded up to the
proper multiple, resulting in up to 3 "pad sectors" at the end of some
blocks. The contents of these pad sectors are not used, so we do not
need to read or write these sectors. However, some storage hardware
performs much worse (around 1/2 as fast) on mostly-contiguous writes
when there are small gaps of non-overwritten data between the writes.
Therefore, ZFS creates "optional" zio's when writing RAID-Z blocks that
include pad sectors. If writing a pad sector will fill the gap between
two (required) writes, we will issue the optional zio, thus doubling
performance. The gap-filling performance improvement was introduced in
July 2009.
Writing the optional zio is done by the io aggregation code in
vdev_queue.c. The problem is that it is also subject to the limit on
the size of aggregate writes, zfs_vdev_aggregation_limit, which is by
default 128KB. For a given block, if the amount of data plus padding
written to a leaf device exceeds zfs_vdev_aggregation_limit, the
optional zio will not be written, resulting in a ~2x performance
degradation.
The problem occurs only for certain values of ashift, compressed block
size, and RAID-Z configuration (number of parity and data disks). It
cannot occur with the default recordsize=128KB. If compression is
enabled, all configurations with recordsize=1MB or larger will be
impacted to some degree.
The problem notably occurs with recordsize=1MB, compression=off, with 10
disks in a RAIDZ2 or RAIDZ3 group (with 512B or 4KB sectors). Therefore
Reviewed by: Saso Kiselkov <saso.kiselkov@nexenta.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Author: Matthew Ahrens <mahrens@delphix.com>
MFC after: 10 days
https://www.illumos.org/issues/8155
When writing pre-compressed buffers, arc_write() requires that the compression
algorithm used to compress the buffer matches the compression algorithm
requested by the zio_prop_t, which is set by dmu_write_policy().
This makes dmu_write_policy() and its callers a bit more complicated.
We can simplify this by making arc_write() trust the caller to supply the type
of pre-compressed buffer that it wants to write, and override the compression
setting in the zio_prop_t.
Reviewed by: Dan Kimmel <dan.kimmel@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Author: Matthew Ahrens <mahrens@delphix.com>
MFC after: 10 days
https://www.illumos.org/issues/8269
It seems that currently normalization of stddev aggregation is done
incorrectly.
We divide both the sum of values and the sum of their squares by the
normalization factor. But we should divide the sum of squares by the
normalization factor squared to scale the original values properly.
FreeBSD note: the actual change was committed in r316853, this commit
adds the test files and record merge information.
https://www.illumos.org/issues/8269
It seems that currently normalization of stddev aggregation is done
incorrectly.
We divide both the sum of values and the sum of their squares by the
normalization factor. But we should divide the sum of squares by the
normalization factor squared to scale the original values properly.
https://www.illumos.org/issues/8269
It seems that currently normalization of stddev aggregation is done
incorrectly.
We divide both the sum of values and the sum of their squares by the
normalization factor. But we should divide the sum of squares by the
normalization factor squared to scale the original values properly.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Toomas Soome <tsoome@me.com>
Reviewed by: Igor Kozhukhov <igor@dilos.org>
Reviewed by: Andrew Stormont <andyjstormont@gmail.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Author: Yuri Pankov <yuri.pankov@nexenta.com>
https://www.illumos.org/issues/8056
The send size estimate for a zvol can be too low, if the size of the record
headers (dmu_replay_record_t's) is a significant portion of the size.
This is typically the case when the data is highly compressible, especially
with embedded blocks.
The problem is that dmu_adjust_send_estimate_for_indirects() assumes that
blocks are the size of the "recordsize" property (128KB).
However, for zvols, the blocks are the size of the "volblocksize" property
(8KB). Therefore, we estimate that there will be 16x less record headers than
there really will be.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Author: Paul Dagnelie <pcd@delphix.com>
https://www.illumos.org/issues/8156
dbuf_evict_notify() holds the dbuf_evict_lock while checking if it should do
the eviction itself (because the evict thread is not able to keep up).
This can result in massive lock contention.
It isn't necessary to hold the lock, because if we make the wrong choice
occasionally, nothing bad will happen.
Reviewed by: Dan Kimmel <dan.kimmel@delphix.com>
Reviewed by: Paul Dagnelie <pcd@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Author: Matthew Ahrens <mahrens@delphix.com>
https://www.illumos.org/issues/8168
If we manage to export the pool on which we are creating a dataset (filesystem
or zvol) between entering libzfs`zfs_create() and libzfs`zpool_open() call (for
which we never check the return value) we end up dereferencing a NULL pointer
in libzfs`zpool_close().
This was discovered on ZFS on Linux. The same issue can be reproduced on
Illumos running in parallel:
while :; do zpool import -d /tmp testpool ; zpool export testpool ; done
while :; do zfs create testpool/fs; zfs destroy testpool/fs ; done
Eventually this will result in several core dumps like this one:
[root@52-54-00-d3-7a-01 /cores]# mdb core.zfs.4244
Loading modules: [ libumem.so.1 libc.so.1 libtopo.so.1 libavl.so.1
libnvpair.so.1 ld.so.1 ]
> ::stack
libzfs.so.1`zpool_close+0x17(0, 0, 0, 8047450)
libzfs.so.1`zfs_create+0x1bb(8090548, 8047e6f, 1, 808cba8)
zfs_do_create+0x545(2, 8047d74, 80778a0, 801, 0, 3)
main+0x22c(8047d2c, fef5c6e8, 8047d64, 8055a17, 3, 8047d70)
_start+0x83(3, 8047e64, 8047e68, 8047e6f, 0, 8047e7b)
>
Fix and reproducer (systemtap): https://github.com/zfsonlinux/zfs/pull/6096
Reviewed by: Matt Ahrens <mahrens@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Author: loli10K <ezomori.nozomu@gmail.com>
https://www.illumos.org/issues/8005
RAID-Z requires that space be allocated in multiples of P+1 sectors,
because this is the minimum size block that can have the required amount
of parity. Thus blocks on RAIDZ1 must be allocated in a multiple of 2
sectors; on RAIDZ2 multiple of 3; and on RAIDZ3 multiple of 4. A sector
is a unit of 2^ashift bytes, typically 512B or 4KB.
To satisfy this constraint, the allocation size is rounded up to the
proper multiple, resulting in up to 3 "pad sectors" at the end of some
blocks. The contents of these pad sectors are not used, so we do not
need to read or write these sectors. However, some storage hardware
performs much worse (around 1/2 as fast) on mostly-contiguous writes
when there are small gaps of non-overwritten data between the writes.
Therefore, ZFS creates "optional" zio's when writing RAID-Z blocks that
include pad sectors. If writing a pad sector will fill the gap between
two (required) writes, we will issue the optional zio, thus doubling
performance. The gap-filling performance improvement was introduced in
July 2009.
Writing the optional zio is done by the io aggregation code in
vdev_queue.c. The problem is that it is also subject to the limit on
the size of aggregate writes, zfs_vdev_aggregation_limit, which is by
default 128KB. For a given block, if the amount of data plus padding
written to a leaf device exceeds zfs_vdev_aggregation_limit, the
optional zio will not be written, resulting in a ~2x performance
degradation.
The problem occurs only for certain values of ashift, compressed block
size, and RAID-Z configuration (number of parity and data disks). It
cannot occur with the default recordsize=128KB. If compression is
enabled, all configurations with recordsize=1MB or larger will be
impacted to some degree.
The problem notably occurs with recordsize=1MB, compression=off, with 10
disks in a RAIDZ2 or RAIDZ3 group (with 512B or 4KB sectors). Therefore
Reviewed by: Saso Kiselkov <saso.kiselkov@nexenta.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Author: Matthew Ahrens <mahrens@delphix.com>
https://www.illumos.org/issues/8155
When writing pre-compressed buffers, arc_write() requires that the compression
algorithm used to compress the buffer matches the compression algorithm
requested by the zio_prop_t, which is set by dmu_write_policy().
This makes dmu_write_policy() and its callers a bit more complicated.
We can simplify this by making arc_write() trust the caller to supply the type
of pre-compressed buffer that it wants to write, and override the compression
setting in the zio_prop_t.
Reviewed by: Dan Kimmel <dan.kimmel@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Author: Matthew Ahrens <mahrens@delphix.com>
https://www.illumos.org/issues/6939
Originally created https://smartos.org/bugview/OS-4489
sysevents should be fired in the kernel from ZFS whenever a command
is run that is logged in zpool history.
Example output
Terminal 1
root - gz sunos ~ # zfs create zones/foobar
root - gz sunos ~ # zfs set quota=10g zones/foobar
root - gz sunos ~ # zfs destroy zones/foobar
Terminal 2
root - gz sunos ~ # sysevent EC_zfs
nvlist version: 0
date = 2016-04-28T14:50:08.964Z
vendor = SUNW
publisher = zfs
class = EC_zfs
subclass = ESC_ZFS_history_event
pid = 0
data = (embedded nvlist)
nvlist version: 0
pool_name = zones
pool_guid = 0x40c964e8f9a7a694
history_record = (embedded nvlist)
nvlist version: 0
dsname = zones/foobar
dsid = 0x1525
history internal str =
internal_name = create
history txg = 0x4c4ef3
Reviewed by: Patrick Mooney <patrick.mooney@joyent.com>
Reviewed by: Joshua M. Clulow <jmc@joyent.com>
Reviewed by: Josh Wilsdon <jwilsdon@joyent.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Richard Elling <Richard.Elling@RichardElling.com>
Reviewed by: Alan Somers <asomers@gmail.com>
Reviewed by: Andrew Stormont <andyjstormont@gmail.com>
Approved by: Matthew Ahrens <mahrens@delphix.com>
Author: Dave Eddy <dave@daveeddy.com>
https://www.illumos.org/issues/6396
LVM = SVM = Solaris Volume Manager
dead code and not using with ZFS based platform.
Reviewed by: Igor Kozhukhov <ikozhukhov@gmail.com>
Reviewed by: Toomas Soome <tsoome@me.com>
Approved by: Hans Rosenfeld <rosenfeld@grumpf.hope-2000.org>
Author: Yuri Pankov <yuri.pankov@nexenta.com>
Its purpose was to translate the values for msdosfs inode numbers,
which is calculated from the msdosfs structures describing the file,
into the range representable by 32bit ino_t. The translation acted
for filesystems larger than 128Gb, it reserved the range 0xf0000000
(FILENO_FIRST_DYN) to UINT32_MAX and remembered some arbitrary
translation of ino >= FILENO_FIRST_DYN into this range. It consumed
memory that could be only freed by unmount, and the translation was
not stable across remounts.
With ino_t type extended to 64 bit, there is no such issue and values
can be returned without compaction to 32bit. That is, for the native
environments, the translation layer is not necessary and adds
significant undeserved code complexity. For compat ABIs which use
32bit ino_t, the vfs.ino64_trunc_error sysctl provides some measures
to soften the failure mode when inode numbers truncation is not safe.
Discussed with: bde
Sponsored by: The FreeBSD Foundation
Provide a new mode "2" which returns a special overflow indicator in
the non-representable field instead of the silent truncation (mode
"0") or EOVERFLOW (mode "1").
In particular, the typical use of st_ino to detect hard links with
mode "2" reports false positives, which might be more suitable for
some uses.
Discussed with: bde
Sponsored by: The FreeBSD Foundation
Gleb Smirnoff [Thu, 8 Jun 2017 21:33:19 +0000 (21:33 +0000)]
When we are in UMA_STARTUP use startup_alloc() for any zone, not for
internal zones only. This allows to create new zones at early stages
of boot, without need to mark them as internal to UMA, which isn't
always true.
Gleb Smirnoff [Thu, 8 Jun 2017 21:30:34 +0000 (21:30 +0000)]
Listening sockets improvements.
o Separate fields of struct socket that belong to listening from
fields that belong to normal dataflow, and unionize them. This
shrinks the structure a bit.
- Take out selinfo's from the socket buffers into the socket. The
first reason is to support braindamaged scenario when a socket is
added to kevent(2) and then listen(2) is cast on it. The second
reason is that there is future plan to make socket buffers pluggable,
so that for a dataflow socket a socket buffer can be changed, and
in this case we also want to keep same selinfos through the lifetime
of a socket.
- Remove struct struct so_accf. Since now listening stuff no longer
affects struct socket size, just move its fields into listening part
of the union.
- Provide sol_upcall field and enforce that so_upcall_set() may be called
only on a dataflow socket, which has buffers, and for listening sockets
provide solisten_upcall_set().
o Remove ACCEPT_LOCK() global.
- Add a mutex to socket, to be used instead of socket buffer lock to lock
fields of struct socket that don't belong to a socket buffer.
- Allow to acquire two socket locks, but the first one must belong to a
listening socket.
- Make soref()/sorele() to use atomic(9). This allows in some situations
to do soref() without owning socket lock. There is place for improvement
here, it is possible to make sorele() also to lock optionally.
- Most protocols aren't touched by this change, except UNIX local sockets.
See below for more information.
o Reduce copy-and-paste in kernel modules that accept connections from
listening sockets: provide function solisten_dequeue(), and use it in
the following modules: ctl(4), iscsi(4), ng_btsocket(4), ng_ksocket(4),
infiniband, rpc.
o UNIX local sockets.
- Removal of ACCEPT_LOCK() global uncovered several races in the UNIX
local sockets. Most races exist around spawning a new socket, when we
are connecting to a local listening socket. To cover them, we need to
hold locks on both PCBs when spawning a third one. This means holding
them across sonewconn(). This creates a LOR between pcb locks and
unp_list_lock.
- To fix the new LOR, abandon the global unp_list_lock in favor of global
unp_link_lock. Indeed, separating these two locks didn't provide us any
extra parralelism in the UNIX sockets.
- Now call into uipc_attach() may happen with unp_link_lock hold if, we
are accepting, or without unp_link_lock in case if we are just creating
a socket.
- Another problem in UNIX sockets is that uipc_close() basicly did nothing
for a listening socket. The vnode remained opened for connections. This
is fixed by removing vnode in uipc_close(). Maybe the right way would be
to do it for all sockets (not only listening), simply move the vnode
teardown from uipc_detach() to uipc_close()?
John Baldwin [Thu, 8 Jun 2017 21:06:18 +0000 (21:06 +0000)]
Add explicit handling for requests with an empty payload.
- For HMAC requests, construct a special input buffer to request an empty
hash result.
- For plain cipher requests and requests that chain an AES cipher with an
HMAC, fail with EINVAL if there is no cipher payload. If needed in
the future, chained requests that only contain AAD could be serviced as
HMAC-only requests.
- For GCM requests, the hardware does not support generating the tag for
an AAD-only request. Instead, complete these requests synchronously
in software on the assumption that such requests are rare.
With EARLY_AP_STARTUP enabled, we are seeing crashes in softclock_call_cc()
during bootup. Debugging information shows that softclock_call_cc() is
trying to execute the vt_consdev.vd_timer callout, and the callout
structure contains a NULL c_func.
This appears to be due to a race between vt_upgrade() running
callout_reset() and vt_resume_flush_timer() calling callout_schedule().
Fix the race by ensuring that vd_timer_armed is always set before
attempting to (re)schedule the callout.
Add the infrastructure to support loading multiple versions of TCP
stack modules.
It adds support for mangling symbols exported by a module by prepending
a string to them. (This avoids overlapping symbols in the kernel linker.)
It allows the use of a macro as the module name in the DECLARE_MACRO()
and MACRO_VERSION() macros.
It allows the code to register stack aliases (e.g. both a generic name
["default"] and version-specific name ["default_10_3p1"]).
With these changes, it is trivial to compile TCP stack modules with
the name defined in the Makefile and to load multiple versions of the
same stack simultaneously. This functionality can be used to enable
side-by-side testing of an old and new version of the same TCP stack.
It also could support upgrading the TCP stack without a reboot.
Ed Maste [Thu, 8 Jun 2017 20:06:09 +0000 (20:06 +0000)]
arm64: add ".arch armv8-a+crc" to allow use of crc instructions
With Clang 5.0 the .arch directive is required, otherwise Clang
complains "error: instruction requires: crc".
This was reported in D10499 but not added initially, because clang 3.8
available on a ref machine reported unknown directive. Clang 4.0 allows
but does not require the directive.
Submitted by: andrew
MFC after: 1 week
Sponsored by: The FreeBSD Foundation