kib [Sat, 23 Mar 2013 22:23:15 +0000 (22:23 +0000)]
Do not call malloc(M_WAITOK) while bodev->fence_lock mutex is
held. The ttm_buffer_object_transfer() does not need the mutex locked
at all, except for the call to the driver sync_obj_ref() method.
Reported and tested by: dumbbell
MFC after: 2 weeks
mm [Sat, 23 Mar 2013 21:34:10 +0000 (21:34 +0000)]
Merge bugfix from vendor master branch:
Limit write requests to at most INT_MAX.
This prevents a certain common programming error (passing -1 to write)
from leading to other problems deeper in the library.
ian [Sat, 23 Mar 2013 17:17:06 +0000 (17:17 +0000)]
Don't check and warn about pmap mismatch on every call to busdma sync.
With some recent busdma refactoring, sometimes it happens that a sync
op gets called when bus_dmamap_load() never got called, which results
in a spurious warning about a map mismatch when no sync operations will
actually happen anyway. Now the check is done only if a sync operation
is actually performed, and the result of the check is a panic, not just
a printf.
Reviewed by: cognet (who prevented me from donning a point hat)
will [Sat, 23 Mar 2013 16:34:56 +0000 (16:34 +0000)]
ZFS: Fix a panic while unmounting a busy filesystem.
This particular scenario was easily reproduced using a NFS export. When the
first 'zfs unmount' occurred, it returned EBUSY via this path, while
vflush() had flushed references on the filesystem's root vnode, which in
turn caused its v_interlock to be destroyed. The next time 'zfs unmount'
was called, vflush() tried to obtain this lock, which caused this panic.
Since vflush() on FreeBSD is a definitive call, there is no need to check
vfsp->vfs_count after it completes. Simply #ifdef sun this check.
will [Sat, 23 Mar 2013 15:11:53 +0000 (15:11 +0000)]
Extend taskqueue(9) to enable per-taskqueue callbacks.
The scope of these callbacks is primarily to support actions that affect the
taskqueue's thread environments. They are entirely optional, and
consequently are introduced as a new API: taskqueue_set_callback().
This interface allows the caller to specify that a taskqueue requires a
callback and optional context pointer for a given callback type.
The callback types included in this commit can be used to register a
constructor and destructor for thread-local storage using osd(9). This
allows a particular taskqueue to define that its threads require a specific
type of TLS, without the need for a specially-orchestrated task-based
mechanism for startup and shutdown in order to accomplish it.
Two callback types are supported at this point:
- TASKQUEUE_CALLBACK_TYPE_INIT, called by every thread when it starts, prior
to processing any tasks.
- TASKQUEUE_CALLBACK_TYPE_SHUTDOWN, called by every thread when it exits,
after it has processed its last task but before the taskqueue is
reclaimed.
While I'm here:
- Add two new macros, TQ_ASSERT_LOCKED and TQ_ASSERT_UNLOCKED, and use them
in appropriate locations.
- Fix taskqueue.9 to mention taskqueue_start_threads(), which is a required
interface for all consumers of taskqueue(9).
mav [Sat, 23 Mar 2013 13:11:54 +0000 (13:11 +0000)]
Make `systat -vmstat` to use suffixes to display big floating point numbers
that are not fitting into the specified field width, same as done for ints.
In particular that allows to properly display disk tps above 100k, that are
reachable with modern SSDs.
avg [Sat, 23 Mar 2013 08:48:44 +0000 (08:48 +0000)]
fbt_typoff_init: fix an off by one in determining required memory size
This issue would be silent most of the time, but if the requested memory
is a multiple of a page size, then accessing one element beyond the end
would lead to a kernel page fault.
Otherwise, the unlucky last type would just be inaccessible.
Reported by: glebius
Tested by: glebius
MFC after: 6 days
mckusick [Fri, 22 Mar 2013 21:50:43 +0000 (21:50 +0000)]
Speed up fsck by caching the cylinder group maps in pass1 so
that they do not need to be read again in pass5. As this nearly
doubles the memory requirement for fsck, the cache is thrown away
if other memory needs in fsck would otherwise fail. Thus, the
memory footprint of fsck remains unchanged in memory constrained
environments.
This work was inspired by a paper presented at Usenix's FAST '13:
www.usenix.org/conference/fast13/ffsck-fast-file-system-checker
Details of this implementation appears in the April 2013 of ;login:
www.usenix.org/publications/login/april-2013-volume-38-number-2.
A copy of the April 2013 ;login: paper can also be downloaded
from: www.mckusick.com/publications/faster_fsck.pdf.
cognet [Fri, 22 Mar 2013 21:50:32 +0000 (21:50 +0000)]
As it's done for libstdc++, use SJLJ-based exceptions on arm when we're not
using EABI, and use unwind-arm.h instead of unwind-generic.h when using EABI.
mckusick [Fri, 22 Mar 2013 21:45:28 +0000 (21:45 +0000)]
The purpose of this change to the FFS layout policy is to reduce the
running time for a full fsck. It also reduces the random access time
for large files and speeds the traversal time for directory tree walks.
The key idea is to reserve a small area in each cylinder group
immediately following the inode blocks for the use of metadata,
specifically indirect blocks and directory contents. The new policy
is to preferentially place metadata in the metadata area and
everything else in the blocks that follow the metadata area.
The size of this area can be set when creating a filesystem using
newfs(8) or changed in an existing filesystem using tunefs(8).
Both utilities use the `-k held-for-metadata-blocks' option to
specify the amount of space to be held for metadata blocks in each
cylinder group. By default, newfs(8) sets this area to half of
minfree (typically 4% of the data area).
This work was inspired by a paper presented at Usenix's FAST '13:
www.usenix.org/conference/fast13/ffsck-fast-file-system-checker
Details of this implementation appears in the April 2013 of ;login:
www.usenix.org/publications/login/april-2013-volume-38-number-2.
A copy of the April 2013 ;login: paper can also be downloaded
from: www.mckusick.com/publications/faster_fsck.pdf.
jilles [Fri, 22 Mar 2013 20:12:25 +0000 (20:12 +0000)]
rc.d/sysctl: Fix error messages about unknown OIDs.
There are three situations where the sysctl script is called:
1. "start", very early
2. "lastload", near the end of rc
3. "reload", at admin request while the system is booted
Ignore unknown OIDs in situation 1 because kernel modules may not be loaded
yet and complain about them in situations 2 and 3.
mm [Fri, 22 Mar 2013 13:36:03 +0000 (13:36 +0000)]
MFV r248590,248594:
Update libarchive to 3.1.2
Some of new features:
- support for lrzip and grzip compression
- support for writing tar v7 format
- b64encode and uuencode filters
- support for __MACOSX directory in Zip archives
- support for lzop compresion (external utility)
pjd [Thu, 21 Mar 2013 22:44:33 +0000 (22:44 +0000)]
- Make 'flags' argument to chflags(2), fchflags(2) and lchflags(2) of type
u_long. Before this change it was of type int for syscalls, but prototypes
in sys/stat.h and documentation for chflags(2) and fchflags(2) (but not
for lchflags(2)) stated that it was u_long. Now some related functions
use u_long type for flags (strtofflags(3), fflagstostr(3)).
- Make path argument of type 'const char *' for consistency.
Discussed on: arch
Sponsored by: The FreeBSD Foundation
attilio [Thu, 21 Mar 2013 19:58:25 +0000 (19:58 +0000)]
Fix a bug in UMTX_PROFILING:
UMTX_PROFILING should really analyze the distribution of locks as they
index entries in the umtxq_chains hash-table.
However, the current implementation does add/dec the length counters
for *every* thread insert/removal, measuring at all really userland
contention and not the hash distribution.
Fix this by correctly add/dec the length counters in the points where
it is really needed.
Please note that this bug brought us questioning in the past the quality
of the umtx hash table distribution.
To date with all the benchmarks I could try I was not able to reproduce
any issue about the hash distribution on umtx.
mav [Thu, 21 Mar 2013 15:42:41 +0000 (15:42 +0000)]
Minimal timer period of 100us introduced in r244758 is overkill. While
original 2us are indeed not enough, 3us are working quite well on my tests.
To be more safe set minimal period to 5us and to be even more safe replicate
here from HPET mechanism of rereading counter after programming comparator.
This change allows to handle 30K of short nanosleep() calls per second on
Raspberry Pi instead of just 8K before.
jhb [Thu, 21 Mar 2013 14:06:27 +0000 (14:06 +0000)]
Another NFS SIGSTOP related fix: Ignore thread suspend requests due to
SIGSTOP if stop signals are currently deferred. This can occur if a
process is stopped via SIGSTOP while a thread is running or runnable
but before it has set TDF_SBDRY.
kib [Thu, 21 Mar 2013 13:06:28 +0000 (13:06 +0000)]
Fix twa(4) after the r246713. The driver copies data around to
satisfy some alignment restrictions. Do not set TW_OSLI_REQ_FLAGS_CCB
flag for mapped data, pass the csio->data_ptr in the req->data.
Do not put the ccb pointer into req->data ever, ccb is stored in
req->orig_req already.
smh [Thu, 21 Mar 2013 11:02:08 +0000 (11:02 +0000)]
Optimisation of TRIM processing.
Previously TRIM processing was very bursty. This was made worse by the fact
that TRIM requests on SSD's are typically much slower than reads or writes.
This often resulted in stalls while large numbers of TRIM's where processed.
In addition due to the way the TRIM thread was only woken by writes, deletes
could stall in the queue for extensive periods of time.
This patch adds a number of controls to how often the TRIM thread for each
SPA processes its outstanding delete requests.
vfs.zfs.trim.timeout: Delay TRIMs by up to this many seconds
vfs.zfs.trim.txg_delay: Delay TRIMs by up to this many TXGs (reduced to 32)
vfs.zfs.vdev.trim_max_bytes: Maximum pending TRIM bytes for a vdev
vfs.zfs.vdev.trim_max_pending: Maximum pending TRIM segments for a vdev
vfs.zfs.trim.max_interval: Maximum interval between TRIM queue processing
(seconds)
Given the most common TRIM implementation is ATA TRIM the current defaults
are targeted at that.
smh [Thu, 21 Mar 2013 10:29:05 +0000 (10:29 +0000)]
TRIM cache devices based on time instead of TXGs.
Currently, the trim module uses the same algorithm for data and cache
devices when deciding to issue TRIM requests, based on how far in the
past the TXG is.
Unfortunately, this is not ideal for cache devices, because the L2ARC
doesn't use the concept of TXGs at all. In fact, when using a pool for
reading only, the L2ARC is written but the TXG counter doesn't
increase, and so no new TRIM requests are issued to the cache device.
This patch fixes the issue by using time instead of the TXG number as
the criteria for trimming on cache devices. The basic delay principle
stays the same, but parameters are expressed in seconds instead of
TXGs. The new parameters are named trim_l2arc_limit and
trim_l2arc_batch, and both default to 30 second.
smh [Thu, 21 Mar 2013 10:16:10 +0000 (10:16 +0000)]
Improve TXG handling in the TRIM module.
This patch adds some improvements to the way the trim module considers
TXGs:
- Free ZIOs are registered with the TXG from the ZIO itself, not the
current SPA syncing TXG (which may be out of date);
- L2ARC are registered with a zero TXG number, as L2ARC has no concept
of TXGs;
- The TXG limit for issuing TRIMs is now computed from the last synced
TXG, not the currently syncing TXG. Indeed, under extremely unlikely
race conditions, there is a risk we could trim blocks which have been
freed in a TXG that has not finished syncing, resulting in potential
data corruption in case of a crash.
smh [Thu, 21 Mar 2013 10:02:32 +0000 (10:02 +0000)]
Don't register repair writes in the trim map.
The trim map inflight writes tree assumes non-conflicting writes, i.e.
that there will never be two simultaneous write I/Os to the same range
on the same vdev. This seemed like a sane assumption; however, in
actual testing, it appears that repair I/Os can very well conflict
with "normal" writes.
I'm not quite sure if these conflicting writes are supposed to happen
or not, but in the mean time, let's ignore repair writes for now. This
should be safe considering that, by definition, we never repair blocks
that are freed.
smh [Thu, 21 Mar 2013 09:34:41 +0000 (09:34 +0000)]
Add TRIM support for L2ARC.
This adds TRIM support to cache vdevs. When ARC buffers are removed
from the L2ARC in arc_hdr_destroy(), arc_release() or l2arc_evict(),
the size previously occupied by the buffer gets scheduled for TRIMming.
As always, actual TRIMs are only issued to the L2ARC after
txg_trim_limit.
mm [Thu, 21 Mar 2013 08:38:03 +0000 (08:38 +0000)]
Merge libzfs_core branch:
includes MFV 238590, 238592, 247580
MFV 238590, 238592:
In the first zfs ioctl restructuring phase, the libzfs_core library was
introduced. It is a new thin library that wraps around kernel ioctl's.
The idea is to provide a forward-compatible way of dealing with new
features. Arguments are passed in nvlists and not random zfs_cmd fields,
new-style ioctls are logged to pool history using a new method of
history logging.
MFV 247580 [1]:
To address issues of several deadlocks and race conditions the locking
code around dsl_dataset was rewritten and the interface to synctasks
was changed.
User-Visible Changes:
"zfs snapshot" can create more arbitrary snapshots at once (atomically)
"zfs destroy" destroys multiple snapshots at once
"zfs recv" has improved performance
Backward Compatibility:
I have extended the compatibility layer to support full backward
compatibility by remapping or rewriting the responsible ioctl arguments.
Old utilities are fully supported by the new kernel module.
Forward Compatibility:
New utilities work with old kernels with the following restrictions:
- creating, destroying, holding and releasing of multiple snapshots
at once is not supported, this includes recursive (-r) commands
Illumos ZFS issues:
2882 implement libzfs_core
2900 "zfs snapshot" should be able to create multiple,
arbitrary snapshots at once
3464 zfs synctask code needs restructuring
kib [Thu, 21 Mar 2013 07:28:15 +0000 (07:28 +0000)]
Only size and create the bio_transient_map when unmapped buffers are
enabled. Now, disabling the unmapped buffers should result in the
kernel memory map identical to pre-r248550.
kib [Wed, 20 Mar 2013 21:08:00 +0000 (21:08 +0000)]
In bufwrite(), a dirty buffer is moved to the clean queue before the
bufobj counter of the writes in progress is incremented. Other thread
inspecting the bufobj would consider it clean.
For the regular vnodes, the vnode lock is typically held both by the
thread performing the bufwrite() and an other thread doing syncing,
which prevents the situation. On the other hand, writes to the VCHR
vnodes are done without holding vnode lock.
Increment the write ref counter for the buffer object before calling
bundirty().
Sponsored by: The FreeBSD Foundation
Tested by: pho
MFC after: 2 weeks
kib [Wed, 20 Mar 2013 21:07:49 +0000 (21:07 +0000)]
When the journaled FFS volume is suspended due to the journal space
becoming too low, the softdep flush thread processes the workitems,
which frees the space in journal, and then unsuspends the fs. The
softdep_flush() and other workitem processing functions busy the
filesystem before iterating over the worklist, to prevent the parallel
unmount from freeing the mount data. The vfs_busy() is called with
MBF_NOWAIT flag.
Now, if the unmount is already started and the filesystem is suspended
due to low journal space, the journal is never flushed and filesystem
is never unsuspended, because vfs_busy(MBF_NOWAIT) call cannot succeed
for the unmounting fs, and softdep_flush() does not process the
workitems. Unmount needs to write metadata, where it hangs in the
"suspfs" state.
Move the vn_start_write() call in the dounmount() before setting the
MNTK_UNMOUNT flag. This practically ensures that softdep_flush()
processed the pending journal writes by making dounmount() wait for
the lift of the suspension.
Sponsored by: The FreeBSD Foundation
Reported and tested by: pho
MFC after: 2 weeks
mckusick [Wed, 20 Mar 2013 17:57:00 +0000 (17:57 +0000)]
When renaming a directory from one parent directory to another,
we need to call ufs_checkpath() to walk from our new location to
the root of the filesystem to ensure that we do not encounter
ourselves along the way. Until now, we accomplished this by reading
the ".." entries of each directory in our path until we reached
the root (or encountered an error). This change tries to avoid the
I/O of reading the ".." entries by first looking them up in the
name cache and only doing the I/O when the name cache lookup fails.
melifaro [Wed, 20 Mar 2013 10:35:33 +0000 (10:35 +0000)]
Add ipfw support for setting/matching DiffServ codepoints (DSCP).
Setting DSCP support is done via O_SETDSCP which works for both
IPv4 and IPv6 packets. Fast checksum recalculation (RFC 1624) is done for IPv4.
Dscp can be specified by name (AFXY, CSX, BE, EF), by value
(0..63) or via tablearg.
Matching DSCP is done via another opcode (O_DSCP) which accepts several
classes at once (af11,af22,be). Classes are stored in bitmask (2 u32 words).
Many people made their variants of this patch, the ones I'm aware of are
(in alphabetic order):
yongari [Wed, 20 Mar 2013 05:31:34 +0000 (05:31 +0000)]
For RTL8211B or later PHYs, enable crossover detection and
auto-correction. This change makes re(4) establish a link with
a system using non-crossover UTP cable.
Tested by: Michael BlackHeart < amdmiek <> gmail dot com >
jilles [Tue, 19 Mar 2013 20:58:17 +0000 (20:58 +0000)]
Implement SOCK_CLOEXEC, SOCK_NONBLOCK and MSG_CMSG_CLOEXEC.
This change allows creating file descriptors with close-on-exec set in some
situations. SOCK_CLOEXEC and SOCK_NONBLOCK can be OR'ed in socket() and
socketpair()'s type parameter, and MSG_CMSG_CLOEXEC to recvmsg() makes file
descriptors (SCM_RIGHTS) atomically close-on-exec.
The numerical values for SOCK_CLOEXEC and SOCK_NONBLOCK are as in NetBSD.
MSG_CMSG_CLOEXEC is the first free bit for MSG_*.
The SOCK_* flags are not passed to MAC because this may cause incorrect
failures and can be done later via fcntl() anyway. On the other hand, audit
is expected to cope with the new flags.
For MSG_CMSG_CLOEXEC, unp_externalize() is extended to take a flags
argument.
adrian [Tue, 19 Mar 2013 19:32:28 +0000 (19:32 +0000)]
Break out the RX completion path into "FIFO check / refill" and
"complete RX frames."
The 128 entry RX FIFO is really easy to fill up and miss refilling
when it's done in the ath taskq - as that gets blocked up doing
RX completion, TX completion and other random things.
So the 128 entry RX FIFO now gets emptied and refilled in the ath_intr()
task (and it grabs / releases locks, so now ath_intr() can't just be
a FAST handler yet!) but the locks aren't held for very long. The
completion part is done in the ath taskqueue context.
Details:
* Create a new completed frame list - sc->sc_rx_rxlist;
* Split the EDMA RX process queue into two halves - one that
processes the RX FIFO and refills it with new frames; another
that completes the completed frame list;
* When tearing down the driver, flush whatever is in the deferred
queue as well as what's in the FIFO;
* Create two new RX methods - one that processes all RX queues,
one that processes the given RX queue. When MSI is implemented,
we get told which RX queue the interrupt came in on so we can
specifically schedule that. (And I can do that with the non-MSI
path too; I'll figure that out later.)
* Convert the legacy code over to use these new RX methods;
* Replace all the instances of the RX taskqueue enqueue with a call
to a relevant RX method to enqueue one or all RX queues.
kib [Tue, 19 Mar 2013 15:05:21 +0000 (15:05 +0000)]
Commit the removal of a whitespace to record the proper commit message
for the r248519:
For the cam-attached HBAs, allow the driver to specify that it accepts
the unmapped bio by the PIM_UNMAPPED flag. The CAM passes the
CAM_DATA_BIO data transfer type request for the unmapped bio, and the
driver could use the bus_dmamap_load_ccb() as a helper to
transparently handle the ccb.
Sponsored by: The FreeBSD Foundation
Reviewed by: scottl
Tested by: pho, scottl
kib [Tue, 19 Mar 2013 15:01:50 +0000 (15:01 +0000)]
Support unmapped i/o for the md(4).
The vnode-backed md(4) has to map the unmapped bio because VOP_READ()
and VOP_WRITE() interfaces do not allow to pass unmapped requests to
the filesystem. Vnode-backed md(4) uses pbufs instead of relying on
the bio_transient_map, to avoid usual md deadlock.
Sponsored by: The FreeBSD Foundation
Tested by: pho, scottl
kib [Tue, 19 Mar 2013 14:53:23 +0000 (14:53 +0000)]
Support unmapped i/o for the md(4).
The vnode-backed md(4) has to map the unmapped bio because VOP_READ()
and VOP_WRITE() interfaces do not allow to pass unmapped requests to
the filesystem. Vnode-backed md(4) uses pbufs instead of relying on
the bio_transient_map, to avoid usual md deadlock.
Sponsored by: The FreeBSD Foundation
Tested by: pho, scottl