mm [Fri, 26 Apr 2013 07:00:49 +0000 (07:00 +0000)]
MFC r249787:
The zfs synctask code restructuring introduced a new bug that makes it
impossible to set quota and reservation on pools lower than version 22.
Problem has been reported and a solution discussed with vendor.
Illumos ZFS issues:
3739 cannot set zfs quota or reservation on pool version < 22
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reported by: Steve Wills <swills@FreeBSD.org>
MFC r248690, r248706, r248708, r248752:
Dtrace: merge new functions from Illumos.
This covers illumos issues:
1455 DTrace tracemem() should take an optional size argument
1451 DTrace needs toupper()/tolower() subroutines
1457 lltostr() D subroutine should take an optional base
1694 Add type-aware print() action
3511 dtrace.c erroneously checks for memory alignment on amd64
MFC r249346:
Create controller-level DMA tag, handling range of supported addresses.
That simplifies logic for channels and gives the bus information about what
device actually allocated the tag.
MFC r248704:
Read Asynchronous Notification statuses only if a Port Multiplier or ATAPI
device is connected. ATA disks do not use ANs, and the extra register
read operation is quite expensive.
MFC r248698:
Depending on combination of running commands (NCQ/non-NCQ) try to avoid
extra read from PxCI/PxSACT registers. If only NCQ commands are running, we
don't really need PxCI. If only non-NCQ commands are running we don't need
PxSACT. Mixed set may happen only on controllers with FIS-based switching
when port multiplier is attached, and then we have to read both registers.
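The register selection described above can be sketched as follows. This is
only an illustration of the decision, not the driver's code: the names
reg_reads and choose_reg_reads() are hypothetical, and the behaviour with no
commands running (neither register read) is an assumption of the sketch.

```c
#include <assert.h>
#include <stdbool.h>

/* Which status registers must be read to find completed commands. */
struct reg_reads {
	bool need_ci;	/* read PxCI (needed for non-NCQ commands) */
	bool need_sact;	/* read PxSACT (needed for NCQ commands) */
};

/*
 * Hypothetical helper: takes the number of running NCQ and non-NCQ
 * commands and returns the set of registers that have to be read.
 */
struct reg_reads
choose_reg_reads(int ncq_running, int non_ncq_running)
{
	struct reg_reads r;

	assert(ncq_running >= 0 && non_ncq_running >= 0);
	r.need_sact = (ncq_running > 0);	/* only NCQ needs PxSACT */
	r.need_ci = (non_ncq_running > 0);	/* only non-NCQ needs PxCI */
	return (r);
}
```

Only the mixed case pays for both reads, which matches the FIS-based
switching case mentioned above.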
Update error messages when processing the INDEX file to display the given
path rather than a static string. This makes the error messages consistent
with the rest of the functions which already do the same thing (assumed to
be an oversight of r47055, 13+ years ago). A direct commit to stable/9.
MFC of 247212:
When running with the -d option, instrument fsck_ffs to track the number,
data type, and running time of its I/O operations.
No functional changes.
MFC of 247234:
Catch up with internal API changes for initbarea() and getdatablk()
of fsck_ffs introduced with r247212.
Submitted by: David Wolfskill <david@catwhisker.org>
MFC of 248625:
Speed up fsck by caching the cylinder group maps in pass1 so
that they do not need to be read again in pass5. As this nearly
doubles the memory requirement for fsck, the cache is thrown away
if other memory needs in fsck would otherwise fail. Thus, the
memory footprint of fsck remains unchanged in memory constrained
environments.
This work was inspired by a paper presented at Usenix's FAST '13:
www.usenix.org/conference/fast13/ffsck-fast-file-system-checker
Details of this implementation appear in the April 2013 issue of ;login:
www.usenix.org/publications/login/april-2013-volume-38-number-2.
A copy of the April 2013 ;login: paper can also be downloaded
from: www.mckusick.com/publications/faster_fsck.pdf.
Reviewed by: kib
Tested by: Peter Holm
MFC of 248639:
Fix the build after addition of cylinder group caching (r248625)
Reported by: Glen Barber (gjb@)
Pointy hat to: Kirk McKusick (mckusick@)
MFC of 248673:
Minor formatting fix for printf() to fix clang builds.
Submitted by: db
Reviewed by: gjb
MFC of 248680:
Resolve clang compile errors on amd64/i386 for certain by casting.
compile tested with clang on i386, amd64
compile tested with gcc on i386, amd64, sparc64
Submitted by: delphij
MFC of 248691:
Note that output is in seconds, not msec.
KNF indentation.
No functional change.
No change to printf strings.
No change to casting of printf arguments.
The purpose of this change to the FFS layout policy is to reduce the
running time for a full fsck. It also reduces the random access time
for large files and speeds the traversal time for directory tree walks.
The key idea is to reserve a small area in each cylinder group
immediately following the inode blocks for the use of metadata,
specifically indirect blocks and directory contents. The new policy
is to preferentially place metadata in the metadata area and
everything else in the blocks that follow the metadata area.
The size of this area can be set when creating a filesystem using
newfs(8) or changed in an existing filesystem using tunefs(8).
Both utilities use the `-k held-for-metadata-blocks' option to
specify the amount of space to be held for metadata blocks in each
cylinder group. By default, newfs(8) sets this area to half of
minfree (typically 4% of the data area).
This work was inspired by a paper presented at Usenix's FAST '13:
www.usenix.org/conference/fast13/ffsck-fast-file-system-checker
Details of this implementation appear in the April 2013 issue of ;login:
www.usenix.org/publications/login/april-2013-volume-38-number-2.
A copy of the April 2013 ;login: paper can also be downloaded
from: www.mckusick.com/publications/faster_fsck.pdf.
When a file is first being written, the dynamic block reallocation
(implemented by ffs_reallocblks_ufs[12]) relocates the file's blocks
so as to cluster them together into a contiguous set of blocks on
the disk.
When the cluster crosses the boundary into the first indirect block,
the first indirect block is initially allocated in a position
immediately following the last direct block. Block reallocation
would usually destroy locality by moving the indirect block out of
the way to keep the data blocks contiguous. This change compensates
for this problem by noting that the first indirect block should be
left immediately following the last direct block. It then tries
to start a new cluster of contiguous blocks (referenced by the
indirect block) immediately following the indirect block.
We should also do this for other indirect block boundaries, but it
is only important for the first one.
Suggested by: Bruce Evans
MFC of 248623:
The purpose of this change to the FFS layout policy is to reduce the
running time for a full fsck. It also reduces the random access time
for large files and speeds the traversal time for directory tree walks.
The key idea is to reserve a small area in each cylinder group
immediately following the inode blocks for the use of metadata,
specifically indirect blocks and directory contents. The new policy
is to preferentially place metadata in the metadata area and
everything else in the blocks that follow the metadata area.
The size of this area can be set when creating a filesystem using
newfs(8) or changed in an existing filesystem using tunefs(8).
Both utilities use the `-k held-for-metadata-blocks' option to
specify the amount of space to be held for metadata blocks in each
cylinder group. By default, newfs(8) sets this area to half of
minfree (typically 4% of the data area).
This work was inspired by a paper presented at Usenix's FAST '13:
www.usenix.org/conference/fast13/ffsck-fast-file-system-checker
Details of this implementation appear in the April 2013 issue of ;login:
www.usenix.org/publications/login/april-2013-volume-38-number-2.
A copy of the April 2013 ;login: paper can also be downloaded
from: www.mckusick.com/publications/faster_fsck.pdf.
Use 4-byte padding for core dump notes on both 32 and 64bit archs.
Although native word padding (i.e. 8-byte on 64bit arch) looks to be
in agreement with standards, other parts of our code and other OSes
use 4-byte alignment.
This is not expected to change alignment for currently generated core
dump notes, as the notes look to consist of structures with sizes
that are multiples of 8 on 64-bit archs. But there are plans to add
additional notes, where 4-byte vs 8-byte alignment makes a difference.
Discussed with: kib
Reviewed by: kib
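A minimal sketch of why the padding choice is invisible for existing notes
but matters for future ones. The helper note_roundup() is hypothetical and
is not taken from the core dump code; it only shows the rounding arithmetic.

```c
#include <assert.h>
#include <stddef.h>

/* Round size up to the next multiple of align (a power of two). */
size_t
note_roundup(size_t size, size_t align)
{
	assert(align != 0 && (align & (align - 1)) == 0);
	return ((size + align - 1) & ~(align - 1));
}
```

A 120-byte descriptor rounds to 120 bytes under both 4-byte and 8-byte
padding, so notes whose sizes are multiples of 8 are laid out identically.
A 20-byte descriptor would occupy 20 bytes with 4-byte padding but 24 bytes
with 8-byte padding.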
r249239:
Fill p_flags and p_align fields of the core dump note segment.
MFC r249476:
Ensure that PCI bus BUS_GET_DMA_TAG() method sees the actual PCI device
which makes the request for dma tag, instead of some descendant of
the PCI device, by creating a pass-through trampoline for vga_pci and
ata_pci buses.
- antarctica: AusAQ and ATAQ have been removed.
- Antarctica/Macquarie has been moved to australasia file and AU.
- Asia/Hebron, Palestine updated for 2013.
- Paraguay stays with DST for the whole year.
mm [Sat, 20 Apr 2013 09:25:25 +0000 (09:25 +0000)]
MFC r249047 (avg):
spa_open_common: fix argument to zvol_create_minors
Prior to r248571 spa_open was always called with a bare pool name,
but now it is called with a dataset name instead (spa_lookup handles
that).
So, when a ZFS root is mounted spa_open is called with a name of a root
dataset, which can very well be different from the pool name.
But zvol_create_minors should be called with the pool name, because it
performs a recursive traversal of all datasets under the name to find
all those that are volumes.
Add a callback to the ada(4) driver so that it knows when GEOM has released
references to it.
This is the functional equivalent to change r237518, which added this
functionality to the cd(4) and da(4) drivers.
This fix prevents a panic caused by GEOM calling adaopen() while the device
is going away. We now keep the device around until GEOM has finished
cleaning up its state.
ata_da.c: In adaregister(), add a d_gone callback to the GEOM disk
structure registered for the ada driver. Increment the
peripheral reference count for GEOM.
Add a new callback, adadiskgonecb(), that GEOM calls when
it is done with its resources. This callback releases the
reference acquired in adaregister().
Merge libzfs_core and other ZFS bugfixes and improvements.
MFC r248571:
MFV 238590, 238592:
In the first zfs ioctl restructuring phase, the libzfs_core library was
introduced. It is a new thin library that wraps around kernel ioctl's.
The idea is to provide a forward-compatible way of dealing with new
features. Arguments are passed in nvlists and not random zfs_cmd fields,
new-style ioctls are logged to pool history using a new method of
history logging.
MFV 247580 [1]:
To address issues of several deadlocks and race conditions the locking
code around dsl_dataset was rewritten and the interface to synctasks
was changed.
User-Visible Changes:
"zfs snapshot" can create multiple arbitrary snapshots at once (atomically)
"zfs destroy" destroys multiple snapshots at once
"zfs recv" has improved performance
Backward Compatibility:
I have extended the compatibility layer to support full backward
compatibility by remapping or rewriting the responsible ioctl arguments.
Old utilities are fully supported by the new kernel module.
Forward Compatibility:
New utilities work with old kernels with the following restrictions:
- creating, destroying, holding and releasing of multiple snapshots
at once is not supported, this includes recursive (-r) commands
Illumos ZFS issues:
2882 implement libzfs_core
2900 "zfs snapshot" should be able to create multiple,
arbitrary snapshots at once
3464 zfs synctask code needs restructuring
MFC r248976:
Call dmu_snapshot_list_next() in zvol.c with dsl_pool_config lock held
MFC r249004:
Do not check against uninitialized rc and comment out vendor code
MFC r249042:
Fix possible pool hold leak in dmu_send_impl()
Illumos ZFS issues:
3645 dmu_send_impl: possibilty of pool hold leak
MFC r249188:
Import vendor change to reduce diff, no effect on FreeBSD.
Illumos ZFS issues:
3517 importing pool with autoreplace=on and "hole" vdevs crashes
syseventd
MFC r249195:
Merge change from vendor to reduce diff only.
ZFS dtrace probes are not supported on FreeBSD yet.
Illumos ZFS issues:
3598 want to dtrace when errors are generated in zfs
MFC r249196:
Provide a fix for kernel panic if receiving recursive deduplicated
streams. Problem reported to vendor.
Illumos ZFS issues:
3692 Panic on zfs receive of a recursive deduplicated stream
MFC r249206:
Merge vendor change - modify time processing in deadman thread.
Illumos ZFS issues:
3618 ::zio dcmd does not show timestamp data
MFC r249207:
Allow zdb to output a histogram of compressed block sizes.
Illumos ZFS issues:
3641 want a histogram of compressed block sizes
MFC r249319:
ZFS expects a copyout of zfs_cmd_t on an ioctl error. Our sys_ioctl()
doesn't copyout in this case.
To solve this a new struct zfs_iocparm_t is introduced consisting of:
- zfs_ioctl_version (future backwards compatibility purposes)
- user space pointer to zfs_cmd_t (copyin and copyout)
- size of zfs_cmd_t (verification purposes)
The copyin and copyout of zfs_cmd_t is now done the illumos (vendor) way,
which makes porting of new changes easier and ensures correct behavior when
returning an error.
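The wrapper structure can be pictured roughly as below. This is a user-space
sketch under stated assumptions: the field types, the stand-in command
structure zfs_cmd_sketch_t, the version constant and iocparm_check() are all
hypothetical and are not copied from the kernel sources. Only the three
field roles come from the description above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in for the real, much larger zfs_cmd_t (assumption). */
typedef struct zfs_cmd_sketch {
	char		zc_name[64];
	uint64_t	zc_cookie;
} zfs_cmd_sketch_t;

/* Small wrapper passed through ioctl(2) instead of the command itself. */
typedef struct zfs_iocparm_sketch {
	uint32_t	zfs_ioctl_version;	/* future compatibility */
	uint64_t	zfs_cmd;		/* user-space address of command */
	uint64_t	zfs_cmd_size;		/* size of command, for checking */
} zfs_iocparm_sketch_t;

#define	ZFS_IOCVER_SKETCH	1	/* hypothetical version number */

/* Return 0 if the wrapper looks sane, -1 otherwise. */
int
iocparm_check(const zfs_iocparm_sketch_t *p)
{
	assert(p != NULL);
	if (p->zfs_ioctl_version != ZFS_IOCVER_SKETCH)
		return (-1);
	if (p->zfs_cmd == 0 || p->zfs_cmd_size != sizeof(zfs_cmd_sketch_t))
		return (-1);
	return (0);
}
```

Because only the small wrapper travels through the generic ioctl path, the
ZFS code can copy the full command in and out itself, including on error.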
MFC r249326:
Cast (void *)(uintptr_t) on copyout and copyin of zfs_iocparm_t.zfs_cmd
MFC r249356:
Merge bugfixes accepted and integrated by vendor. Underlying problems
have been reported by us and fixed in r240942 and r249196.
Illumos ZFS issues:
3645 dmu_send_impl: possibilty of pool hold leak
3692 Panic on zfs receive of a recursive deduplicated stream
MFC r249357:
Fix libzfs to report error instead of returning zero if trying to hold or
release a non-existing snapshot of an existing dataset. In recursive case
error is reported if no snapshots with the requested name have been found.
Illumos ZFS issues:
3699 zfs hold or release of a non-existent snapshot does not output
error
Notify CAM on a state change to a logical volume, not on a status change. This resolves
the issues reported regarding camcontrol devlist not showing the rebuild
states of volumes unless an explicit camcontrol rescan was executed.
Fix a time calculation error in ctlstat_standard().
ctlstat.c: When converting a timeval to a floating point
number in ctlstat_standard(), cast the nanoseconds
calculation to a long double, so we don't lose
precision. Without the cast, we wind up with a
time in whole seconds only.
Fix bugs in the elapsed time calculation in ctlstat_standard()
pointed out by bde:
- Casting to long double isn't needed.
- The division isn't needed, multiplication can be used.
"When 1 nanosecond is in a floating point literal, the whole
expression is automatically promoted correctly."
- non-KNF indentation (1 tab) for the newly split line
- different non-KNF indentation (5 spaces) for the previously split
line
- excessive parentheses around the division operation
- bogus blank line which splits up the etime initialization
- general verboseness from the above.
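The point about the floating point literal can be illustrated with a short
sketch. It assumes the interval is held in a struct timespec (seconds plus
nanoseconds); the function name elapsed_seconds() is hypothetical and this
is not the ctlstat code itself.

```c
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* Elapsed time between two timestamps, in (fractional) seconds. */
long double
elapsed_seconds(const struct timespec *start, const struct timespec *end)
{
	long double etime;

	assert(start != NULL && end != NULL);
	etime = end->tv_sec - start->tv_sec;
	/*
	 * Multiplying by the floating point literal 1e-9 promotes the
	 * whole expression, so no cast and no division are needed.
	 * An integer division by 1000000000 here would yield 0 and the
	 * result would be in whole seconds only.
	 */
	etime += (end->tv_nsec - start->tv_nsec) * 1e-9;
	return (etime);
}
```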
MFC r248694:
In GEOM DISK:
- Replace single done mutex with per-disk ones. On system with several
disks on several HBAs that removes small, but measurable lock congestion.
- Modify disk destruction process to not destroy the mutex prematurely.
- Remove some extra pointer dereferences.
MFC r238171, r248679:
Fix long known deadlock between geom dev destruction and d_close() call.
Use destroy_dev_sched_cb() to not wait for device destruction while holding
GEOM topology lock (that actually caused deadlock). Use request counting
protected by mutex to properly wait for outstanding requests completion in
cases of device closing and geom destruction. Unlike r227009, this code
does not block taskqueue thread for indefinite time, waiting for completion.
MFC r248674:
Make g_wither_washer() to not loop by itself, but only when there was some
more topology change done that may require its attention. Add few missing
g_do_wither() calls in respective places to signal it.
This fixes potential infinite loop here when some provider is withered, but
still opened or connected for some reason and so can not be destroyed. For
example, see r227009 and r227510.
MFC r249108:
- Unify device to target insertion inside xpt_alloc_device() instead of
duplicating it three times.
- Reformat code to reduce indentation.
- Add lock assertions to every point where reference counters are modified.
- When reference counters are reaching zero, add assertions that there are
no children items left.
- Add a bit more locking to the xptpdperiphtraverse().
MFC r249104:
Move CAM_DEBUG_CDB messages from the point of queuing to the point of
sending to SIM. That allows to inspect real requests execution order,
respecting priorities, freezing, etc.
MFC r248872, r249048:
Make pre-shutdown flush and spindown routines to not use xpt_polled_action(),
but execute the commands in the regular way. There is no reason to burn CPU
while the system is still fully operational. After this change polling in
CAM is used only for kernel dumping.
MFC r248868, r248874:
Implement CAM_PERIPH_FOREACH() macro, safely iterating over the list of
driver's periphs, acquiring and releasing periph references while doing it.
Use it to iterate over the lists of ada and da periphs when flushing caches
and putting devices to sleep on shutdown and suspend. Previous code could
panic in theory if some device disappeared in the middle of the process.
When renaming a directory from one parent directory to another,
we need to call ufs_checkpath() to walk from our new location to
the root of the filesystem to ensure that we do not encounter
ourselves along the way. Until now, we accomplished this by reading
the ".." entries of each directory in our path until we reached
the root (or encountered an error). This change tries to avoid the
I/O of reading the ".." entries by first looking them up in the
name cache and only doing the I/O when the name cache lookup fails.
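A simplified model of the walk described above, assuming directories are
array entries that record their parent and whether the ".." entry is in the
name cache. The structure dir_sketch and the function checkpath_sketch() are
hypothetical and much simpler than ufs_checkpath(); the sketch only counts
how many ".." lookups would still need a directory read.

```c
#include <assert.h>
#include <stddef.h>

#define	ROOT_DIR	0

struct dir_sketch {
	int parent;		/* index of the ".." directory */
	int parent_cached;	/* nonzero if ".." is in the name cache */
};

/*
 * Walk from 'target' up to the root. Return 1 if 'source' is met on the
 * way (the rename would create a loop), 0 otherwise. '*io_reads' counts
 * the ".." lookups that missed the cache and needed a directory read.
 */
int
checkpath_sketch(const struct dir_sketch *dirs, int source, int target,
    int *io_reads)
{
	int cur;

	assert(dirs != NULL && io_reads != NULL);
	*io_reads = 0;
	for (cur = target; ; cur = dirs[cur].parent) {
		if (cur == source)
			return (1);
		if (cur == ROOT_DIR)
			return (0);
		if (!dirs[cur].parent_cached)
			(*io_reads)++;	/* fall back to reading ".." */
	}
}
```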
The code in clear_remove() and clear_inodedeps() skips one entry
in the pagedep and inodedep hash tables. An entry in the table is
skipped because 'pagedep_hash' and 'inodedep_hash' hold the size
of the hash tables - 1.
It is extremely unlikely that this would cause any operational failure.
These functions only need to find a single entry and are only called
when there are too many entries. The chance that they would fail because
all the entries are on the single skipped hash chain is remote.
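One plausible shape of this off-by-one, shown as a sketch. It assumes the
table size is a power of two and that only the mask (size - 1) is stored;
the loop forms and function names are illustrative and are not taken from
the soft updates code.

```c
#include <assert.h>
#include <stddef.h>

/* Buckets visited when the mask is mistaken for the table size. */
size_t
buckets_visited_buggy(size_t hash_mask)
{
	size_t i, n = 0;

	assert(((hash_mask + 1) & hash_mask) == 0);
	for (i = 0; i < hash_mask; i++)		/* never reaches the last bucket */
		n++;
	return (n);
}

/* Buckets visited when the bound accounts for mask == size - 1. */
size_t
buckets_visited_fixed(size_t hash_mask)
{
	size_t i, n = 0;

	assert(((hash_mask + 1) & hash_mask) == 0);
	for (i = 0; i <= hash_mask; i++)	/* all hash_mask + 1 buckets */
		n++;
	return (n);
}
```

With an 8-bucket table (mask 7) the first loop visits 7 buckets and the
second visits all 8.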
dim [Tue, 16 Apr 2013 06:51:07 +0000 (06:51 +0000)]
MFC r249316:
Ensure make -j N universe works correctly, by checking for an up-to-date
make before starting the universe targets themselves. Otherwise, all of
the targets would attempt to build make simultaneously, overwriting each
other's copies of the make object files and executable. This could lead
to strange errors, for example when partially-written make executables
are invoked.
Also amend r216620, to make the rest of universe wait properly until the
upgrade_checks target is finished, by adding universe_${target}_prologue
to the .ORDER target. Otherwise, make will be too smart for its own
good, and start building the universe targets simultaneously with the
prologues anyway.
dim [Mon, 15 Apr 2013 18:30:00 +0000 (18:30 +0000)]
Pull in r178636 from upstream llvm trunk:
Second pass at addressing PR15351 by explicitly checking for AVX
support when getting the host processor information. It emits a
.byte sequence on GNUC compilers to work around lack of xgetbv
support with older assemblers, and resolves a comment typo found in
the previous patch.
This should fix crashes due to emitting of AVX instructions on certain
processors, which do not support them, when using -march=native.
This is a direct commit to stable/9, since head has a complete import of
llvm/clang trunk, and there is no single commit to merge.
Don't directly dereference userland pointer; instead use kernel pointer
copied in from userspace. This fixes instant panic when creating CTL LUN
on sparc64. Not a security problem, since the API is root-only.
Add CPU percentage limit enforcement to RCTL. The resource name is "pcpu".
It was implemented by Rudolf Tomori during Google Summer of Code 2012.
MFC r242957:
Don't divide by zero.
MFC r243070:
Fix kassert that's not really valid for %CPU accounting. The problem
here is race between decaying the resource usage in containers, and updating
per-process usage; basically, the former may cause per-container usage
to get smaller than per-process usage.
MFC r243088:
Improve KASSERT messages in racct, to make it clear which resource
caused the problem.
MFC r248298:
Accessing td_state requires thread lock to be held.
MFC r248300:
When throttling a process to enforce RACCT limits, use neither
PBDRY (which simply doesn't make any sense) nor PCATCH (which could
be used by a malicious process to work around the PCPU limit).
MFC r249163:
If filter of the interrupt event is not null, print it, in addition to
the handler address. Add a mark to distinguish between filter and
handler.
MFC r232385 by ru: Remove 3 syscalls from opendir().
Finally removed the stat() and fstat() calls from the opendir() code.
They were made excessive in r205424 by opening with O_DIRECTORY.
Also eliminated the fcntl() call used to set FD_CLOEXEC by opening
with O_CLOEXEC.
(fdopendir() still checks that the passed descriptor is a directory,
and sets FD_CLOEXEC on it.)
The necessary kernel support for O_DIRECTORY and O_CLOEXEC was already in
9.0-RELEASE.
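The effect of the two open(2) flags can be sketched at user level as
follows. The function opendir_sketch() is hypothetical and is not the libc
implementation; in particular it goes through fdopendir(), which, as noted
above, still performs its own checks.

```c
/* Expose POSIX.1-2008 names (O_DIRECTORY, O_CLOEXEC, fdopendir). */
#define	_POSIX_C_SOURCE	200809L

#include <assert.h>
#include <dirent.h>
#include <fcntl.h>
#include <stddef.h>
#include <unistd.h>

DIR *
opendir_sketch(const char *path)
{
	DIR *dirp;
	int fd;

	assert(path != NULL);
	/*
	 * O_DIRECTORY makes open() fail for non-directories, so no
	 * stat()/fstat() is needed; O_CLOEXEC sets close-on-exec at
	 * open time, so no separate fcntl() call is needed.
	 */
	fd = open(path, O_RDONLY | O_DIRECTORY | O_CLOEXEC);
	if (fd == -1)
		return (NULL);
	dirp = fdopendir(fd);
	if (dirp == NULL)
		close(fd);
	return (dirp);
}
```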
Add a conditional sleep 1 in case we add any IPv6 addresses to interfaces.
Do this per jail started, not per address. This will allow DAD to complete
and services to properly start. Before we have seen problems with services
trying to start before the IPv6 address was available to use and thus
erroring and failing to start.
MFC r249062:
Since ATA_CAM mode has no implemented support for serializing access to the
different ATA channels, required for acard and pc98 ATA controllers, block
access to second channels of both, hoping that one working channel is better
than none. I have an idea how that support could be implemented, but I have
no hardware to work on that.
MFC r248800:
On SIM destruction free associated CCBs, preallocated inside xpt_get_ccb().
Before this change they were just leaked. Fortunately USB sticks now use
only one CCB, and so leak was only 2KB per detach, while other bigger SIMs
with much more allocated CCBs are rarely detached.
Update the manual page to reflect reality. With r138509 and r152355,
"nostrictjoliet" option for mount_cd9660(8) was completely replaced with
"brokenjoliet" somehow.
hrStorageSize and hrStorageUsed are 32 bit integers, reporting a fs
size and usage in hrStorageAllocationUnits. If the file system has
more than 2^31 allocations it can not be shown correctly and the
meters are useless.
In such cases follow net-snmp behaviour and increase
hrStorageAllocationUnits so the values fit under INT_MAX.
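The scaling can be sketched as below. This is an illustration under stated
assumptions, not the bsnmpd code: the structure storage_entry and the
function storage_fit_int32() are hypothetical, and doubling the unit while
halving the counts is one simple way to keep the product roughly constant.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>
#include <stdint.h>

struct storage_entry {
	uint64_t alloc_unit;	/* bytes per allocation unit */
	uint64_t size;		/* total size, in allocation units */
	uint64_t used;		/* used space, in allocation units */
};

/* Grow the allocation unit until size and used fit under INT_MAX. */
void
storage_fit_int32(struct storage_entry *e)
{
	assert(e != NULL && e->alloc_unit > 0);
	while (e->size > INT_MAX || e->used > INT_MAX) {
		e->alloc_unit <<= 1;	/* double the unit ... */
		e->size >>= 1;		/* ... and halve both counts */
		e->used >>= 1;
	}
}
```

For example, 2^33 units of 4096 bytes become 2^30 units of 32768 bytes,
which describes the same 2^45 bytes with a count that fits in 32 bits.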
dim [Mon, 8 Apr 2013 07:08:29 +0000 (07:08 +0000)]
MFC r248991:
Follow up to r247960 and rr247960 by also amending ctfmerge. For the
only other case where STT_FILE symbols are used, in symit_next() in
cddl/contrib/opensolaris/tools/ctf/cvt/input.c, save the basename of the
symbol, instead of the full pathname.
MFC r230998,r233792: sh: Use vfork in a few common cases.
This uses vfork() for simple commands and command substitutions containing a
single simple command, invoking an external program under certain conditions
(no redirections or variable assignments, non-interactive shell, no job
control). These restrictions limit the amount of code executed in a vforked
child.
Various incarnations of this patch have been shown to bring performance
improvements:
http://lists.freebsd.org/pipermail/freebsd-hackers/2012-January/037581.html
The use of vfork() can be disabled by setting a variable named
SH_DISABLE_VFORK.
- Add support for 'memsync' mode. This is the fastest replication mode, which
is why it is now the default.
- Bump protocol version to 2 and add backward compatibility for version 1.
- Allow to specify hosts by kern.hostid as well (in addition to hostname and
kern.hostuuid) in configuration file.
------------------------------------------------------------------------
r245228 | ken | 2013-01-09 10:02:08 -0700 (Wed, 09 Jan 2013) | 43 lines
Make CTL work a little better with loading and unloading drivers.
Previously CTL would leave individual LUNs enabled in the target
driver, whether or not the port as a whole was enabled. It would
also leave the wildcard LUN enabled indefinitely.
This change means that CTL will enable and disable any active LUNs,
as well as the wildcard LUN, when enabling and disabling a port.
Also, fix a bug that could crop up due to an uninitialized CCB
type.
ctl.c: Before calling ctl_frontend_online(), run through
the LUN list and enable all active LUNs.
After calling ctl_frontend_offline(), run through
the LUN list and disable all active LUNs.
scsi_ctl.c: Before bringing a port online, allocate the
wildcard peripheral for that bus. And after taking
a port offline, invalidate the wildcard peripheral
for that bus.
Make sure that we hold the SIM lock around all
calls to xpt_action() and other transport layer
interfaces that require it.
Use CAM_SIM_{LOCK|UNLOCK} consistently to acquire
and release the SIM lock.
Update a number of outdated comments. Some of
these should have been fixed long ago.
Actually do LUN disables now. The newer drivers
in the tree work correctly for this as far as I
know.
Initialize the CCB type to CTLFE_CCB_DEFAULT to
avoid a panic due to uninitialized memory.