jilles [Sun, 25 Aug 2013 15:00:34 +0000 (15:00 +0000)]
MFC r250412: posix_spawn_file_actions_addopen(3): Correct error for bad file
descriptor.
As per POSIX.1-2008, posix_spawn_file_actions_add* return [EBADF] if a file
descriptor is negative, not [EINVAL]. The bug was only in the manual page;
the code is correct.
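A minimal userland sketch of the documented behavior, assuming a
POSIX.1-2008 conforming libc. The helper name addopen_error() is made up
for illustration and is not part of the commit:

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <spawn.h>

/*
 * Try to register an open action for the given descriptor number and
 * return the error number reported by the library (0 on success).
 */
static int
addopen_error(int fd)
{
    posix_spawn_file_actions_t fa;
    int error;

    error = posix_spawn_file_actions_init(&fa);
    assert(error == 0);
    /* The add* functions return the error number; errno is not used. */
    error = posix_spawn_file_actions_addopen(&fa, fd, "/dev/null",
        O_RDONLY, 0);
    (void)posix_spawn_file_actions_destroy(&fa);
    return (error);
}
```

With a negative descriptor such as -1, the call is expected to fail with
EBADF rather than EINVAL; the path is only recorded, not opened.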
ae [Thu, 22 Aug 2013 06:24:02 +0000 (06:24 +0000)]
MFC r254095:
gpt_entries is used as limit for the number of partition entries in
the GEOM_PART. Instead of just using number of entries from the GPT
header, calculate this limit based on the reserved space between
GPT header and first available LBA.
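The calculation could look roughly like the sketch below. It is not the
GEOM_PART code: the function and parameter names are hypothetical, and it
assumes the entry array begins at table_lba and may occupy every sector up
to the first usable LBA.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: number of partition entries that fit in the
 * sectors reserved between the start of the entry array and the first
 * usable LBA.
 */
static uint32_t
gpt_entry_limit(uint64_t table_lba, uint64_t first_usable_lba,
    uint32_t sector_size, uint32_t entry_size)
{
    uint64_t bytes;

    assert(entry_size != 0 && first_usable_lba >= table_lba);
    bytes = (first_usable_lba - table_lba) * sector_size;
    return ((uint32_t)(bytes / entry_size));
}
```

For a common layout (entry array at LBA 2, first usable LBA 34, 512-byte
sectors, 128-byte entries) this yields 128 entries.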
bryanv [Tue, 20 Aug 2013 19:17:01 +0000 (19:17 +0000)]
MFC r254457
Do not use potentially stale thread in kthread_add()
When an existing process is provided, the thread selected to use
to initialize the new thread could have exited and be reaped.
Acquire the proc lock earlier to ensure the thread remains valid.
tuexen [Thu, 15 Aug 2013 04:35:25 +0000 (04:35 +0000)]
MFC r254338:
Don't send uninitialized memory (two instances of 4 bytes) in
every cookie on the wire. This bug was reported in
https://bugzilla.mozilla.org/show_bug.cgi?id=905080
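A generic sketch of this class of fix, not the SCTP code itself: clear
the whole structure before filling in the known fields, so that unset
members and padding never carry stale memory onto the wire. The structure
and function below are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical wire structure; not the real SCTP state cookie layout. */
struct demo_cookie {
    uint32_t tag;
    uint32_t reserved;  /* never assigned by the fill routine */
    uint16_t port;
    /* the compiler will usually add trailing padding here */
};

static void
demo_cookie_fill(struct demo_cookie *c, uint32_t tag, uint16_t port)
{
    assert(c != NULL);
    /* Zero everything first, including reserved fields and padding. */
    memset(c, 0, sizeof(*c));
    c->tag = tag;
    c->port = port;
}
```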
gshapiro [Thu, 15 Aug 2013 01:40:55 +0000 (01:40 +0000)]
MFC: Temporarily revert sendmail 8.14.7 change to getipnodebyname() flags
to prevent problems between the resolver and Microsoft DNS servers with
AAAA lookups. The upstream open source project will work on a more
permanent fix for the next release. Issue noted by Pavel Timofeev.
mav [Thu, 1 Aug 2013 09:48:12 +0000 (09:48 +0000)]
MFC r253754:
Partially close the race between calls of the orphan() method from GEOM
and the close() method from ZFS core, which reliably causes a
use-after-free panic if an SSD vdev is detached during initial erase.
MFC r253404:
o TxD ring requires 8-byte alignment to work so change alignment
constraint to 8. Previously it may have triggered watchdog
timeouts.
o Check whether interrupt is ours or not.
o Enable interrupts before attempting to transmit queued packets.
This will slightly improve TX performance.
o No need to clear IFF_DRV_OACTIVE in a loop. AE_FLAG_TXAVAIL is
used to know whether there is enough available TxD ring space.
o Added missing bus_dmamap_sync(9) in ae_rx_intr() and rearranged
code to avoid unnecessary register access.
o Make sure to clear TxD, TxS, RxD rings in driver initialization.
Otherwise some data in these rings could be interpreted as
'updated' which in turn will advance internally maintained
pointers and can trigger watchdog timeouts.
MFC 252576:
Don't perform the acpi_DeviceIsPresent() check for PCI-PCI bridges. If
we are probing a PCI-PCI bridge it is because we found one by enumerating
the devices on a PCI bus, so the bridge is definitely present. A few
BIOSes report incorrect status (_STA) for some bridges that claimed they
were not present when in fact they were.
While here, move this check earlier for Host-PCI bridges so attach fails
before doing any work that needs to be torn down.
MFC: r252673
A problem with the old NFS client where large writes to large files
would sometimes result in a corrupted file was reported via email.
This problem appears to have been caused by r251719 (reverting
r251719 fixed the problem). Although I have not been able to
reproduce this problem, I suspect it is caused by another thread
increasing np->n_size after the mtx_unlock(&np->n_mtx) but before
the vnode_pager_setsize() call. Since the np->n_mtx mutex serializes
updates to np->n_size, doing the vnode_pager_setsize() with the
mutex locked appears to avoid the problem.
Unfortunately, vnode_pager_setsize() where the new size is smaller,
cannot be called with a mutex held.
This patch returns the semantics to be close to pre-r251719 such that the
call to the vnode_pager_setsize() is only delayed until after the mutex is
unlocked when np->n_size is shrinking. Since the file is growing
when being written, I believe this will fix the corruption.
MFC r245926, r245931
- Improve some comments.
- Make bge_lookup_{rev,vendor}() static.
- Factor out chip identification rather than duplicating the code.
- Sanitize bge_probe() a bit (don't hardcode buffer sizes, allow
bge_lookup_vendor() to return NULL so the excessive panic() can
be removed there, etc.) and return BUS_PROBE_DEFAULT rather than
hardcoding 0.
- According to the Linux tg3 driver, BCM57791 and BCM57795 aren't
capable of Gigabit Ethernet.
- Check the return value of taskqueue_start_threads().
- Mention NetLink controllers in the fallback description, too.
MFC r252402:
Fix triggering false watchdog timeout when controller is in PAUSE
state. Previously it used to check if controller has sent a
PAUSE frame to the remote peer.
- Morocco:
announced that the year's Ramadan daylight-savings transitions
would be 2013-07-07 and 2013-08-10.
- Israel:
As of 2013, DST starts at 02:00 on the Friday before the last
Sunday in March. DST ends at 02:00 on the first Sunday after
October 1, unless it occurs on the second day of the Jewish Rosh
Hashana holiday, in which case DST ends a day later (i.e. at 02:00
the first Monday after October 2). [Rosh Hashana holidays are
factored in until 2100.]
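The start rule above can be sketched as follows. This is only an
illustration of the date arithmetic, not how the tz database encodes the
rule; both function names are made up.

```c
#include <assert.h>

/* Day of week (0 = Sunday) for a Gregorian date, Sakamoto's method. */
static int
demo_dow(int y, int m, int d)
{
    static const int t[] = { 0, 3, 2, 5, 0, 3, 5, 1, 4, 6, 2, 4 };

    assert(m >= 1 && m <= 12);
    if (m < 3)
        y -= 1;
    return ((y + y / 4 - y / 100 + y / 400 + t[m - 1] + d) % 7);
}

/* Day of March on which DST starts: the Friday before the last Sunday. */
static int
demo_israel_dst_start_day(int year)
{
    int last_sunday = 31 - demo_dow(year, 3, 31);

    return (last_sunday - 2);
}
```

The last Sunday in March is never earlier than the 25th, so the Friday
before it always falls in March as well.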
MFC r252779:
Fix a bug where only 2048 streams were usable even though more than
2048 streams were negotiated on the wire. While there, remove the
hard coded limit of 2048 streams.
MFC r252718:
When processing an incoming ABORT, SHUTDOWN_COMPLETE or ERROR (NAT related)
chunk, always take the T-bit into account when checking the verification
tag.
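A sketch of the check as RFC 4960 (section 8.5.1) describes it for ABORT
and SHUTDOWN COMPLETE: without the T bit the packet must carry the
receiver's own tag, with the T bit it must carry the peer's tag. The names
below are hypothetical and this is not the FreeBSD SCTP code; the NAT
related ERROR case is assumed to follow the same pattern.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define DEMO_CHUNK_FLAG_T 0x01  /* T bit: low-order bit of chunk flags */

/*
 * Return true if the verification tag of the packet is acceptable for
 * a chunk with the given flags.
 */
static bool
demo_vtag_ok(uint8_t chunk_flags, uint32_t pkt_vtag, uint32_t my_vtag,
    uint32_t peer_vtag)
{
    if (chunk_flags & DEMO_CHUNK_FLAG_T)
        return (pkt_vtag == peer_vtag); /* reflected tag */
    return (pkt_vtag == my_vtag);
}
```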
MFC r250466:
Honor the net.inet6.ip6.v6only sysctl variable and the IPV6_V6ONLY
socket option for SCTP sockets in the same way as for UDP or TCP
sockets.
Import an implementation of the CAIA Delay-Gradient (CDG) congestion control
algorithm, which is based on the 2011 v0.1 patch release and described in the
paper "Revisiting TCP Congestion Control using Delay Gradients" by David Hayes
and Grenville Armitage. It is implemented as a kernel module compatible with the
modular congestion control framework.
CDG is a hybrid congestion control algorithm which reacts to both packet loss
and inferred queuing delay. It attempts to operate as a delay-based algorithm
where possible, but utilises heuristics to detect loss-based TCP cross traffic
and will compete effectively as required. CDG is therefore incrementally
deployable and suitable for use on shared networks.
In collaboration with: David Hayes <david.hayes at ieee.org> and
Grenville Armitage <garmitage at swin edu au>
Sponsored by: Cisco University Research Program and FreeBSD Foundation
MFC r252325:
The dtmalloc provider uses the short description of a malloc type as the
function name of its corresponding DTrace probes. These descriptions may
contain whitespace, but probe names cannot, so just replace any whitespace
with underscores when creating probes.
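The replacement could be sketched like this in userland C. The function
name and buffer handling are assumptions for illustration; the kernel
code may differ.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/*
 * Copy a malloc type description into buf, replacing every whitespace
 * character with an underscore. The result is always NUL-terminated and
 * truncated to fit buflen bytes.
 */
static void
demo_probe_name(const char *desc, char *buf, size_t buflen)
{
    size_t i;

    assert(buflen > 0);
    for (i = 0; i + 1 < buflen && desc[i] != '\0'; i++)
        buf[i] = isspace((unsigned char)desc[i]) ? '_' : desc[i];
    buf[i] = '\0';
}
```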
MFC r251238:
SDT probes can directly pass up to five arguments as arguments to
dtrace_probe(). Arguments beyond these five must be obtained in an
architecture-specific way; this can be done through the getargval provider
method, and through dtrace_getarg() if getargval isn't overridden.
This change fixes two off-by-one bugs in the way these arguments are fetched
in FreeBSD's DTrace implementation. First, the SDT provider must set the
aframes parameter to 1 when creating a probe. The aframes parameter controls
the number of frames that dtrace_getarg() will step over in order to find
the frame containing the extra arguments. On FreeBSD, dtrace_getarg() is
called in SDT probe context via
so aframes must be 3 since the arguments are in dtrace_probe()'s frame; it
was previously being called with a value of 2 instead. illumos uses a
different aframes value for SDT probes, but this is because illumos SDT
probes fire by triggering the #UD fault handler rather than calling
dtrace_probe() directly.
The second bug has to do with the way arguments are grabbed out of
dtrace_probe()'s frame on amd64. The code currently jumps over the first
stack argument and retrieves the rest of them using a pointer into the
stack. This works on i386 because all of dtrace_probe()'s arguments will be
on the stack and the first argument is the probe ID, which should be
ignored. However, it is incorrect to ignore the first stack argument on
amd64, so we correct the pointer used to access the arguments.
Poor ZFS send / receive performance due to snapshot
hold / release processing (by smh@)
Illumos ZFS issues:
3740 Poor ZFS send / receive performance due to snapshot
hold / release processing
MFV r252215:
Restore the behavior from before r251646: when destroying ZFS
snapshots, the ioctl returns ENOENT if any snapshot in the errlist
failed with it (the new behavior was to return ENOENT only when
all of them returned an error).
Illumos ZFS issues:
3829 fix for 3740 changed behavior of zfs destroy/hold/release ioctl
MFC r251636: illumos #3749 zfs event processing should work on R/O root
filesystems
This log is a modified version of the original one written by gibbs@,
to account for changes made during the illumos RTI process.
Allow ZFS asynchronous event handling to proceed even if the root file
system is mounted read-only. This restriction appears to have been put
in place to avoid errors with updating the configuration cache file.
However:
o The majority of asynchronous event handling does not involve
configuration cache file updates.
o The configuration cache file need not be on the root file system,
so the check was not complete.
o Other classes of errors (e.g. file system full) can also prevent
a successful update yet do not prevent asynchronous event processing.
o Configurations such as NanoBSD never have a read-write root,
so ZFS event processing is permanently disabled in these systems.
o Failure to handle asynchronous events promptly can extend the
window of time that a pool is in a critical state.
At worst, a missed configuration cache update will force the operator to
perform a manual "zpool import" (note -f is not required) to inform the
system about a newly created pool. To minimize the likelihood of this
rare occurrence, configuration cache write failures now emit FMA events
(via devctl) so the operator can take corrective action, and the write
is retried every 5 minutes. The retry interval, in seconds, is tunable
via the sysctl "vfs.zfs.ccw_retry_interval".
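The retry throttling can be pictured with the sketch below. It is a
simplified model, not the spa.c code: the structure and function are
hypothetical, and times are plain seconds rather than hrtime_t values.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-pool state; times are in seconds for simplicity. */
struct demo_pool {
    uint64_t last_write_failure;    /* 0 means no failure recorded */
};

/*
 * Decide whether a configuration cache write may be attempted now.
 * A write is always allowed if no failure has been recorded; otherwise
 * at least retry_interval seconds must have passed since the failure.
 */
static bool
demo_config_write_allowed(const struct demo_pool *p, uint64_t now,
    uint64_t retry_interval)
{
    assert(now >= p->last_write_failure);
    if (p->last_write_failure == 0)
        return (true);
    return (now - p->last_write_failure >= retry_interval);
}
```

With the default of 5 minutes (300 seconds), a failure recorded at time
1000 blocks further attempts until time 1300.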
As a side effect of reporting configuration cache events, other sysevents,
such as re-silver start/stop, are now also reported via devctl.
sys/cddl/contrib/opensolaris/uts/common/fs/zfs/spa.c:
o As is done in zfs_fm.c, provide a manual declaration for
devctl_notify(). Both declarations could be combined
into spa_impl.h, but the declaration is fault management
related, not spa specific. sys/fm/fs/zfs.h would be ideal
if it weren't so public and reserved for FMA string
definitions. I'm open to suggestions on how to improve
this nit while minimizing our divergence from Solaris.
o Use devctl_notify() to implement sysevent support in
spa_event_notify(). The subsystem is EC_ZFS so that
these events can never collide with those emitted in
zfs_fm.c.
o Add the sysctl "vfs.zfs.ccw_retry_interval". The value
defaults to 5 minutes and is used to rate limit, on a
per-pool basis, configuration cache file write attempts.
o Modify spa_async_dispatch to honor configuration cache
write limiting. If other events are pending, a configuration
cache write will be attempted at the same time, so the
rate limiting only applies when the asynchronous dispatch
system is otherwise idle. Async events should be rare
(e.g. device arrival/departure) and configuration cache
writes rarer, so a more complicated system to strictly
honor the retry limit seems unwarranted.
o Remove check in spa_async_dispatch() for the root file
system being read-write.
sys/cddl/contrib/opensolaris/uts/common/fs/zfs/spa_config.c:
Instead of silently ignoring configuration cache write
failures, report them via a new FMA event as well as
to the console. The current zfs_ereport_post() doesn't
allow arbitrary name=value pairs to be appended to the
report, so the configuration cache file name is only
available on the console output. This limitation should
be addressed in a future update.
Note: This error report is only posted once per incident,
to avoid spamming.
sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/spa_impl.h:
Add a hrtime_t to the spa data structure to track the
time (via gethrtime()) of the last configuration cache file
write failure. This is referenced in spa_async_dispatch()
to effect the rate limiting.
sys/cddl/contrib/opensolaris/uts/common/sys/fm/fs/zfs.h:
Add FM_EREPORT_ZFS_CONFIG_CACHE_WRITE as an ereport class.
Submitted by: gibbs
Reviewed by: Matthew Ahrens <mahrens@delphix.com>,
Eric Schrock <eric.schrock@delphix.com>,
Christopher Siden <christopher.siden@delphix.com>
Sponsored by: Spectra Logic
MFC r251635: illumos #3747 txg commit callbacks don't work
sys/cddl/contrib/opensolaris/uts/common/fs/zfs/txg.c:
Fix commit callbacks by moving them to the task's list.
Previously, list_move_tail() returned without doing anything because
the task list was passed as the source rather than destination.
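Why swapped arguments turn the move into a silent no-op can be shown with
a toy list. This is a simplified stand-in for the illumos list API, with
hypothetical names, assuming move-tail returns early when the source list
is empty:

```c
#include <assert.h>
#include <stddef.h>

struct demo_node {
    struct demo_node *next;
    int value;
};

struct demo_list {
    struct demo_node *head, *tail;
    size_t len;
};

static void
demo_list_append(struct demo_list *l, struct demo_node *n)
{
    n->next = NULL;
    if (l->tail != NULL)
        l->tail->next = n;
    else
        l->head = n;
    l->tail = n;
    l->len++;
}

/* Move every node of src to the tail of dst; src is left empty. */
static void
demo_list_move_tail(struct demo_list *dst, struct demo_list *src)
{
    if (src->head == NULL)
        return;     /* nothing to move */
    if (dst->tail != NULL)
        dst->tail->next = src->head;
    else
        dst->head = src->head;
    dst->tail = src->tail;
    dst->len += src->len;
    src->head = src->tail = NULL;
    src->len = 0;
}
```

Calling demo_list_move_tail(&callbacks, &task) with an empty task list
leaves both lists untouched, while demo_list_move_tail(&task, &callbacks)
transfers the callbacks as intended.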
cddl/contrib/opensolaris/cmd/ztest/ztest.c:
Check the commit callback threshold correctly.
Submitted by: will
Reviewed by: Matthew Ahrens <mahrens@delphix.com>,
Christopher Siden <christopher.siden@delphix.com>
Sponsored by: Spectra Logic
MFC r251634: illumos #3745 zpool create should treat -O mountpoint and -m the same
cddl/contrib/opensolaris/cmd/zpool/zpool_main.c: (change 644608)
This allows specifying a mountpoint using the latter form and having
its value checked and used as it would be using the former form.
As a consequence of this change:
1. The mountpoint property is set in the fsprops nvlist prior
to creating the pool, rather than being set after creating
the pool. To me, this is the proper approach, since it
avoids creating the pool if the mountpoint setting would
cause the command to fail.
2. The mountpoint property, unlike all others, can be specified
more than once. Only the last setting takes effect. This
is to avoid breaking potential existing users that specify
-m more than once.
Submitted by: will
cddl/contrib/opensolaris/lib/libzfs/common/libzfs_pool.c
Fix "zpool create -R <whatever> -m <whatever>". Ever since
change 644608, this has been broken. The problem is that some
old code in libzfs_pool.c would force a pool's mountpoint to
"/" when creating a pool with an altroot. That probably
implemented some old policy decision regarding altroots, but it
conflicts with the current manpage. It also had no effect
until 644608, because the zpool command would _always_ change
the pool's mountpoint after creating it. The solution is to
delete the old code from libzfs_pool.c.
Submitted by: asomers
Reviewed by: Matthew Ahrens <mahrens@delphix.com>,
Christopher Siden <christopher.siden@delphix.com>
Sponsored by: Spectra Logic
MFC r251632: illumos #3743 zfs needs a refcount audit
Audit zap cursor usage and correct missing calls to zap_cursor_fini().
sys/cddl/contrib/opensolaris/uts/common/fs/zfs/spa_errlog.c:
sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zap.c:
Correct early exit handling of several functions that
previously failed to close a cursor prior to returning.
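The general shape of such a fix is a single exit path that always
finalizes the cursor. The sketch below uses a made-up cursor type and
counter in place of the real zap_cursor_t API:

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical cursor; stands in for a resource that must be closed. */
struct demo_cursor {
    int open;
};

static int demo_open_cursors;   /* cursors not yet finalized */

static void
demo_cursor_init(struct demo_cursor *c)
{
    c->open = 1;
    demo_open_cursors++;
}

static void
demo_cursor_fini(struct demo_cursor *c)
{
    assert(c->open);
    c->open = 0;
    demo_open_cursors--;
}

/* Sum the values; fail with EINVAL on a negative entry. */
static int
demo_scan(const int *vals, int n, int *sum)
{
    struct demo_cursor c;
    int error = 0, i;

    demo_cursor_init(&c);
    *sum = 0;
    for (i = 0; i < n; i++) {
        if (vals[i] < 0) {
            error = EINVAL;
            goto out;   /* a bare return here would leak c */
        }
        *sum += vals[i];
    }
out:
    demo_cursor_fini(&c);
    return (error);
}
```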
Submitted by: gibbs
Audit holders of dmu_bufs and correct missing calls to dmu_buf_rele().
sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dsl_dataset.c:
sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zap.c:
Correct early exit handling of several functions that
previously failed to release a dmu_buf prior to returning.
Submitted by: will
Reviewed by: Matthew Ahrens <mahrens@delphix.com>,
Eric Schrock <eric.schrock@delphix.com>,
George Wilson <george.wilson@delphix.com>,
Christopher Siden <christopher.siden@delphix.com>
Sponsored by: Spectra Logic
MFC r251631: illumos #3742 zfs comments need cleaner, more consistent style
- Make more of ZFS's comments use a natural English writing flow.
- Break up long paragraphs, fix various typos and spelling errors.
- Don't prefix a function description with its name when the function
definition immediately follows.
- Remove useless comments.
- Add extra whitespace where it makes the comments more readable.
New comments were separated from this change and added in r251629.
Submitted by: asomers, gibbs, will
Reviewed by: Matthew Ahrens <mahrens@delphix.com>,
George Wilson <george.wilson@delphix.com>,
Eric Schrock <eric.schrock@delphix.com>,
Christopher Siden <christopher.siden@delphix.com>
Sponsored by: Spectra Logic
Embellish the comments in various components of ZFS. Move some comments
around closer to what they describe. Specifically, answer the questions:
- What are some of the edge cases of the dbuf state machine?
- What does a txg quiesce do?
- When does the DMU notify threads waiting on txg's that they may
proceed?
- How do the calculations for RAIDZ map allocations work?
- What process do the RAIDZ I/O start and done callbacks follow?
While here, adjust the function prototype of dmu_zfetch.c:dmu_zfetch_colinear()
to match its comment which describes its return as a boolean.
Submitted by: asomers, gibbs, will
Reviewed by: Matthew Ahrens <mahrens@delphix.com>,
Eric Schrock <eric.schrock@delphix.com>,
Christopher Siden <christopher.siden@delphix.com>
Sponsored by: Spectra Logic
For ATA_PASSTHROUGH commands, pretend isci(4) supports multiword DMA
by treating it as UDMA.
This fixes a problem introduced in r249933/r249939, where CAM sends
ATA_DSM_TRIM to SATA devices using ATA_PASSTHROUGH_16. scsi_ata_trim()
sets protocol as DMA (not UDMA) which is for multi-word DMA, even
though no such mode is selected for the device. isci(4) would fail
these commands which is the correct behavior but not consistent with
other HBAs, namely LSI's.
smh@ did some further testing on an LSI controller, which rejected
ATA_PASSTHROUGH_16 commands with mode=UDMA_OUT, even though only
a UDMA mode was selected on the device. So this precludes adding
any kind of mode detection in CAM to determine which mode to use on
a per-device basis.