fix a bug where we access a bread buffer after we have brelse'd it...
The kernel normally didn't unmap/context switch away before we accessed
the buffer most of the time, but under heavy I/O pressure and lots of
mount/unmounting this would cause a fault on nofault panic...
NULL stale pointers (should be a no-op as they should no longer be
used)...
dim [Sun, 29 Sep 2013 20:35:38 +0000 (20:35 +0000)]
MFC r255804:
Pull in r191165 from upstream llvm trunk:
ISelDAG: spot chain cycles involving MachineNodes
Previously, the DAGISel function WalkChainUsers was spotting that it
had entered already-selected territory by whether a node was a
MachineNode (amongst other things). Since it's fairly common practice
to insert MachineNodes during ISelLowering, this was not the correct
check.
Looking around, it seems that other nodes get their NodeId set to -1
upon selection, so this makes sure the same thing happens to all
MachineNodes and uses that characteristic to determine whether we
should stop looking for a loop during selection.
This should fix PR15840.
Specifically, this fixes the long-standing assertion failure when
compiling the multimedia/gstreamer port on i386. Thanks to Tijl
Coosemans for his help in getting upstream to fix it.
ken [Sat, 28 Sep 2013 05:56:37 +0000 (05:56 +0000)]
MFC 244015:
------------------------------------------------------------------------
r244015 | ken | 2012-12-07 21:16:07 -0700 (Fri, 07 Dec 2012) | 17 lines
Fix the CTL OOA queue dumping code so that it does not hold a mutex
while doing a copyout. That can cause a panic, because copyout
can trigger VM faults, and we can't handle VM faults while holding
a mutex.
The solution here is to malloc a separate buffer to hold the OOA
queue entries, so that we don't risk a VM fault while filling up
the buffer and we don't have to drop the lock. The other solution
would be to wire the user's memory while filling their buffer with
copyout, but that would have been a little more complex.
Also fix a debugging parenthesis issue in ctl_abort_task() pointed
out by Chuck Tuffli.
MFC r255844:
Ensure that the ERESTART return from the syscall reloads the registers,
to make the restarted syscall instruction pass the correct arguments.
Align stacks of kernel threads correctly at 16-byte boundaries rather than
making sure they are all misaligned at +8 bytes. This fixes clang builds
of powerpc64 kernels (aside from a required increase in KSTACK_PAGES which
will come later).
marius [Thu, 26 Sep 2013 09:16:57 +0000 (09:16 +0000)]
MFC: r254005
- Fix a bug in the MSI allocation logic so an MSI is also employed if a
controller supports only a single message. I haven't seen such an adapter
out in the wild, though, so this change likely is a NOP.
While at it, further simplify the MSI allocation logic; there's no need
to check the number of available messages on our own as pci_alloc_msi(9)
will just fail if it can't provide us with the single message we want.
- Nuke the unused softc of aacch(4).
MFC 240424,244582:
Improve check coverage about idle threads.
Idle threads are not allowed to acquire any lock but spinlocks.
Deny any attempt to do so by panicing at the locking operation
when INVARIANTS is on. Then, remove the check on blocking on a
turnstile.
The check in sleepqueues is left because they are not allowed to use
tsleep() either which could happen still.
On entering KDB backends, the hijacked thread to run
interrupt context can still be idlethread. At that point, without the
panic condition, it can still happen that idlethread then will try to
acquire some locks to carry on some operations.
Skip the idlethread check on block/sleep lock operations when KDB is
active.
ken [Mon, 23 Sep 2013 21:52:07 +0000 (21:52 +0000)]
MFC r255501
This is slightly modified from the FreeBSD/head version, to include
version checks for the scanning changes for not only FreeBSD/head
(1000039 and higher) but also stable/9 (902502 and higher).
Fix an issue that caused Integrated RAID volumes on LSI mps(4) controllers
to not get scanned on boot.
The problem originated in change 253549. With the change to the mps(4)
driver to scan only targets that it knows it has (as opposed to scanning
the entire bus), scanning RAID volumes on boot was omitted.
So, for versions of FreeBSD that have the scanning changes
(__FreeBSD_version 1000039 and higher), scan RAID volumes that are added
whether or not we're booting.
PR: kern/181784
Reported by: Xiguang Wang <kurapica@gmail.com>
Tested by: Dennis Glatting <dg@pki2.com>
Sponsored by: Spectra Logic
MFC: r255333
Intermittent crashes in the NLM (rpc.lockd) code during system
shutdown was reporetd via email. The crashes occurred because the
client side NLM would attempt to use its socket after it had been
destroyed. Looking at the code, it would soclose() once the reference
count on the socket handling structure went to 0. Unfortunately,
nlm_host_get_rpc() will simply allocate a new socket handling structure
when none exists and use the now soclose()d socket. Since there doesn't
seem to be a safe way to determine when the socket is no longer needed,
this patch modifies the code so that it never soclose()es the socket.
Since there is only one socket ever created, this does not introduce a
leak when the rpc.lockd is stopped/restarted. The patch also disables
unloading of the nfslockd module, since it is not safe to do so (and
has never been safe to do so, from what I can see).
MFC: r255284
It was reported via email that the cu_sent field used by the
krpc client side UDP was observed as way out of range and
caused the rpc.lockd daemon to hang trying to do an RPC.
Inspection of the code found two places where the RPC request
is re-queued, but the value of cu_sent was not incremented.
Since cu_sent is always decremented when the RPC request is
dequeued, I think this could have caused cu_sent to go out of
range. This patch adds lines to increment cu_sent for these
two cases.
MFC: r255216
Crashes have been observed for NFSv4.1 mounts when the system
is being shut down which were caused by the nfscbd_pool being
destroyed before the backchannel is disabled. This patch is
believed to fix the problem, by simply avoiding ever destroying
the nfscbd_pool. Since the NFS client module cannot be unloaded,
this should not cause a memory leak.
MFC r252894:
Add SDT_PROBE_DEFINE0 for consistency with SDT_PROBE0.
MFC r253022:
Also define SDT_PROBE_DEFINE0 for the !KDTRACE_HOOKS case.
MFC r254266:
Add event handlers for module load and unload events. The load handlers are
called after the module has been loaded, and the unload handlers are called
before the module is unloaded. Moreover, the module unload handlers may
return an error to prevent the unload from proceeding.
MFC r254267:
Remove some unused fields from struct linker_file. They were added in
r172862 for use by the DTrace SDT framework but don't seem to have ever
been used.
MFC r254268:
FreeBSD's DTrace implementation has a few problems with respect to handling
probes declared in a kernel module when that module is unloaded. In
particular,
* Unloading a module with active SDT probes will cause a panic. [1]
* A module's (FBT/SDT) probes aren't destroyed when the module is unloaded;
trying to use them after the fact will generally cause a panic.
This change fixes both problems by porting the DTrace module load/unload
handlers from illumos and registering them with the corresponding
EVENTHANDLER(9) handlers. This allows the DTrace framework to destroy all
probes defined in a module when that module is unloaded, and to prevent a
module unload from proceeding if some of its probes are active. The latter
problem has already been fixed for FBT probes by checking lf->nenabled in
kern_kldunload(), but moving the check into the DTrace framework generalizes
it to all kernel providers and also fixes a race in the current
implementation (since a probe may be activated between the check and the
call to linker_file_unload()).
Additionally, the SDT implementation has been reworked to define SDT
providers/probes/argtypes in linker sets rather than using SYSINIT/SYSUNINIT
to create and destroy SDT probes when a module is loaded or unloaded. This
simplifies things quite a bit since it means that pretty much all of the SDT
code can live in sdt.ko, and since it becomes easier to integrate SDT with
the DTrace framework. Furthermore, this allows FreeBSD to be quite flexible
in that SDT providers spanning multiple modules can be created on the fly
when a module is loaded; at the moment it looks like illumos' SDT
implementation requires all SDT probes to be statically defined in a single
kernel table.
MFC r254309:
Use kld_{load,unload} instead of mod_{load,unload} for the linker file load
and unload event handlers added in r254266.
MFC r254350:
Specify SDT probe argument types in the probe definition itself rather than
using SDT_PROBE_ARGTYPE(). This will make it easy to extend the SDT(9) API
to allow probes with dynamically-translated types.
MFC r250953:
The fasttrap provider cleans up probes asynchronously when a process with
USDT probes exits. This was previously done with a callout; however, it is
possible to sleep while holding the DTrace mutexes, so a panic will occur
on INVARIANTS kernels if the callout handler can't immediately acquire one
of these mutexes. This panic will be frequently triggered on systems where
a USDT-enabled program (perl, for instance) is often run.
This revision changes the fasttrap cleanup mechanism so that a dedicated
thread is used instead of a callout. The old behaviour is otherwise
preserved.
MFC r252493:
Be sure to destory the fasttrap cleanup mutex when unloading the fasttrap
module. This should be MFCed with r250953.
MFC r254742:
Hold mfi_io_lock across calls to xpt_rescan() and xpt_alloc_ccb_nowait().
xpt_rescan() expects the SIM lock to be held, and we trip a mtx_assert if
the driver initiates multiple rescans in quick succession.
Long URLs don't always appear even with autosizing and other tricks. So,
add some whitespace to put the URL on a line by itself, maximizing view.
MFC r255341:
Remove unnecessary mediaClose (FTP operations are done with either ftp(1)
or fetch(1), neither of which are stateful, compared to how sysinstall(8)
did FTP operations, maintaining an open session until mediaClose).
MFC r255488:
Don't issue USB resume signalling in USB device mode, if the USB power
mode is ON and suspend is detected. This confuses iPads running in USB
host mode at least.
- Make quirk for reading device descriptor from broken USB devices.
Else they won't enumerate at all:
hw.usb.full_ddesc=1
- Reduce the USB descriptor read timeout from 1000ms to
500ms. Typical value for LOW speed devices is 50-100ms.
- Enumerate USB device a maximum of 3 times when a port
connection change event is detected, before giving up.
Revert parts of r245132 and r245175. We don't need to write to the
IMAN register to clear the pending interrupt status bits. This patch
tries to solve problems seen on the MacBook Air, as reported by
Johannes Lundberg <johannes@brilliantservice.co.jp>
MFC r255144:
Make ELI destruction (including orphanization) less aggressive, making it
always wait for provider close. Old algorithm was reported to cause NULL
dereference panic on attempt to close provider after softc destruction.
If not global workaroung in GEOM, that could even cause destruction with
requests still in flight.
MFC r254275:
Return error when opening read-only volumes (like RAID4/5/...) for writing.
Previously opens succeeded, but actual write operations returned errors.
MFC r253706:
Introduce 3 seconds timeout on `graid stop` command (mostly with -f flag).
Since completion waiting goes in g_event thread, it may cause GEOM deadlock
if consumer on top (for example, ZFS) uses g_event thread for closing.
MFC r255120:
Bring legacy CAM target implementation back into API/KPI-coherent and even
functional state. While CTL is much more superior target from all points,
there is no reason why this code should not work.
MFC r254766:
Add new attribute lunname to report only textual LUN-specific device IDs.
While lunid attribute prefers to report numeric ones, having both may be
useful in some situations.
MFC r253752:
Fix returning incorrect bio_resid value with failed BIO_DELETE requests.
Neither residual length reported for ATA/SCSI command nor one from another
BIO_DELETE request are in any way related to the value to be returned.
MFC r250557:
Suppress error printing for "PREVENT ALLOW MEDIUM REMOVAL" on da open.
Change at r250208 exposed more errors here, hidden before. The same flag
is used in cd driver.
MFC r250208:
Tune support for removable media in da driver:
- remove DA_FLAG_SAW_MEDIA flag, almost opposite to DA_FLAG_PACK_INVALID,
using the last instead.
- allow opening device with no media present, reporting zero media size
and non-zero sector size, as geom/notes suggests. That allow to read
device attributes and potentially do other things, not related to media.
MFC r249981:
Remove ADA_FLAG_PACK_INVALID flag. Since ATA disks have no concept of media
change it only duplicates CAM_PERIPH_INVALID flag, so we can use last one.
MFC r249194 (by trasz):
Make SYNCHRONIZE CACHE work with LUNs backed by device files (as opposed
to regular files, which already worked fine). With this change, it's no
longer neccessary to use "ctladm realsync off" workaround.
MFC r254052:
Improve r253721 by reporting detected lack of BIO_FLUSH support to GEOM.
That prevents more of such requests from coming and errors from logging.
MFC r253803:
Add NO_RC16 quirk to make da driver avoid using READ CAPACITY(16) command
if possible. Use it for Kingston JetFlash USB sticks, that are known to
return garbage in response to that command.
MFC r253724:
Synchronize device cache on close only if there were some write operations.
While these operations are not really needed otherwise, at least for SCSI
they may cause extra errors if some other initiator holds write exclusive
reservation on the LUN (SYNCHRONIZE CACHE handled as "write" operation).
MFC r253721, r253722:
Detect unsupported PREVENT ALLOW MEDIUM REMOVAL and SYNCHRONIZE CACHE(10)
to not spam devices with useless commands and logs with errors.
MFC r253322, r253370:
Improve handling of 0x3F/0x0E "Reported LUNs data has changed" and 0x25/0x00
"Logical unit not supported" errors. First initiates specific target rescan,
second -- destroys specific LUN. That allows to automatically detect changes
in list of device LUNs. This mechanism doesn't work when target is completely
idle, but probably that is all what can be done without active polling.
MFC r253228, r253307 (by scottl):
Refactor the various delete methods out of dastart(). Cleans up a bunch
of style and adds more modularity and clarity.
MFC r252382 (by scottl), r252684 (by jkim):
Introduce accessors for the ccb status word. Convert one (of many more)
modules to use it, will convert the others once the appropriate shed
color is selected by consensus.
MFC r255363:
Micro-optimize cpu_search(), allowing compiler to use more efficient inline
ffsl() implementation, when it is available, instead of homegrown iteration.
On dual-E5645 amd64 system (2x6x2 cores) under heavy I/O load that reduces
time spent inside cpu_search() from 19% to 13%, while IOPS increased by 5%.
MFC r253993:
Block reporting of ZFS features for suspended pools.
Before executing any subcommand, zpool tool fetches pools configuration from
the kernel. Before features support was added, kernel was regenerating that
configuration based on data always present in memory. Unfortunately, pool
features list and activity counters are not such. They are stored in ZAP,
that normally resides in ARC, but under heavy memory pressure may be swapped
out. If pool is suspended at this point, there is no way to recover it back
since any zpool command will stuck.
This change has one predictable flaw: `zpool upgrade` always wish to upgrade
suspended pools, but fortunately it can't do it due to the suspension.
MFC r253991:
Make `zpool clear` to reopen also reconnected cache and spare devices.
Since `zpool status` reports about such kinds of errors, it is strange
that they are not cleared by `zpool clear`.
MFC r253990:
Make ZFS to use separate thread to handle SPA_ASYNC_REMOVE async events.
Existing async thread is running only on successfull spa_sync() completion,
that is impossible in case of pool loosing required (last) disk(s). That
indefinite delay of SPA_ASYNC_REMOVE processing made ZFS to not close the
lost disks, preventing GEOM/CAM from destroying devices and reusing names
on later disk reattach.
In earlier version of the patch I've tried to just run existing thread
immediately, unrelated to spa_sync() completion, but that exposed number
of situations where it could stuck due to locks held by stuck spa_sync(),
that are required for other kinds of async events.
Experiments with OpenIndiana snapshot confirmed that they also have this
issue with lost disks reattach.
MFC r253806:
Allow three IOCTLs to be used on suspended pool, restoring state that
existed before IOCTL code refactoring merged change 4445fffb from illumos
at r248571.
This change allows `zpool clear` to be used again to recover suspended pool.
It seems the only was supposed by the code to restore pool operation after
reconnecting lost disks that were required for data completeness. There
are still cases where `zpool clear` command can just safely stuck due to
deadlocks inside ZFS kernel part, but probably that is better then having
no chances to recover at all.
MFC r253643:
Following r222950, revert unintentional change cls -> class in argument name
in r245264. Aside from non-uniformity, that again confused C++ compilers.
MFC: r254337
Fix several performance related issues in the new NFS server's
DRC for NFS over TCP.
- Increase the size of the hash tables.
- Create a separate mutex for each hash list of the TCP hash table.
- Single thread the code that deletes stale cache entries.
- Add a tunable called vfs.nfsd.tcphighwater, which can be increased
to allow the cache to grow larger, avoiding the overhead of frequent
scans to delete stale cache entries.
(The default value will result in frequent scans to delete stale cache
entries, analagous to what the pre-patched code does.)
- Add a tunable called vfs.nfsd.cachetcp that can be used to disable
DRC caching for NFS over TCP, since the old NFS server didn't DRC cache TCP.
It also adjusts the size of nfsrc_floodlevel dynamically, so that it is
always greater than vfs.nfsd.tcphighwater.
For UDP the algorithm remains the same as the pre-patched code, but the
tunable vfs.nfsd.udphighwater can be used to allow the cache to grow
larger and reduce the overhead caused by frequent scans for stale entries.
UDP also uses a larger hash table size than the pre-patched code.
MFC r255261: watch: Do not mess up the tty modes on early error.
Record the initial state earlier, so it is always safe to restore it.
One way this happens is if watch(8) is started by a user that does not have
access to /dev/snp. The result is "staircase effect" during later commands.
A performance problem was reported in PR kern/181226:
I have 25TB Dell PERC 6 RAID5 array. When it becomes almost
full (10-20GB free), processes which write data to it start
eating 100% CPU and write speed drops below 1MB/sec (normally
to gives 400MB/sec). The revision at which it first became
apparent was http://svnweb.freebsd.org/changeset/base/249782.
The offending change reserved an area in each cylinder group to
store metadata. The new algorithm attempts to save this area for
metadata and allows its use for non-metadata only after all the
data areas have been exhausted. The size of the reserved area
defaults to half of minfree, so the filesystem reports full before
the data area can completely fill. However, in this report, the
filesystem has had minfree reduced to 1% thus forcing the metadata
area to be used for data. As the filesystem approached full, it
had only metadata areas left to allocate. The result was that
every block allocation had to scan summary data for 30,000 cylinder
groups before falling back to searching up to 30,000 metadata areas.
The fix is to give up on saving the metadata areas once the free
space reserve drops below 2%. The effect of this change is to use
the old algorithm of just accepting the first available block that
we find. Since most filesystems use the default 5% minfree, this
will have no effect on their operation. For those that want to push
to the limit, they will get their crappy block placements quickly.
In looking at block layouts as part of fixing filesystem block
allocations under low free-space conditions (-r254995), determine
that old block-preference search order used before -r249782 worked
a bit better. This change reverts to that block-preference search order.
In r243868, the error message buffer errmsg have been changed from
an on-stack array to a pointer and therefore sizeof(errmsg) would
become 4 or 8 bytes depending on the architecture.
des [Tue, 10 Sep 2013 10:07:21 +0000 (10:07 +0000)]
Fix the length calculation for the final block of a sendfile(2)
transmission which could be tricked into rounding up to the nearest
page size, leaking up to a page of kernel memory. [13:11]
In IPv6 and NetATM, stop SIOCSIFADDR, SIOCSIFBRDADDR, SIOCSIFDSTADDR
and SIOCSIFNETMASK at the socket layer rather than pass them on to the
link layer without validation or credential checks. [SA-13:12]
Prevent cross-mount hardlinks between different nullfs mounts of the
same underlying filesystem. [SA-13:13]
Make ipfw nat init/unint work correctly for VIMAGE:
* Do per vnet instance cleanup (previously it was only for vnet0 on
module unload, and led to libalias leaks and possible panics due to
stale pointer dereferences).
* Instead of protecting ipfw hooks registering/deregistering by only
vnet0 lock (which does not prevent pointers access from another
vnets), introduce per vnet ipfw_nat_loaded variable. The variable is
set after hooks are registered and unset before they are deregistered.
* Devirtualize ifaddr_event_tag as we run only one event handler for
all vnets.
* It is supposed that ifaddr_change event handler is called in the
interface vnet context, so add an assertion.
dim [Fri, 6 Sep 2013 15:17:25 +0000 (15:17 +0000)]
MFC r245428:
Add CLOCK_PROCESS_CPUTIME_ID to <time.h>, to synchronize the CLOCK_*
values with those in <sys/time.h>. Otherwise, if a program includes
<time.h> before <sys/time.h>, the CLOCK_PROCESS_CPUTIME_ID macro never
gets defined.
Partial MFC of r248534: Implement SOCK_CLOEXEC and SOCK_NONBLOCK in kernel.
SOCK_CLOEXEC and SOCK_NONBLOCK can be OR'ed in socket() and socketpair()'s
type parameter.
The numerical values for SOCK_CLOEXEC and SOCK_NONBLOCK are as in NetBSD.
The SOCK_* flags are not passed to MAC because this may cause incorrect
failures and can be done later via fcntl() anyway. On the other hand, audit
is expected to cope with the new flags.
This commit does not add the constants to <sys/socket.h> as in r248534
because this would lead to various newly compiled software not working on
old kernels. It only serves to help 10.x binaries and prepare for a possible
full MFC (if third-party software starts to insist on SOCK_CLOEXEC and
SOCK_NONBLOCK).
MSG_CMSG_CLOEXEC from r248534 is not merged here because it changes a
KPI/KBI.
No mergeinfo because the rest of r248534 is missing.
MFC r254018:
Pass variables prefixed with both LD_ and LD_32_ to the run-time linker.
This prevents unintentional execution of programs when running ldd(1) on
32-bit Linux binaries.
Basic support for extents was implemented by Zheng Liu as part
of his Google Summer of Code in 2010. This support is read-only
at this time.
In addition to extents we also support the huge_file extension
for read-only purposes. This works nicely with the additional
support for birthtime/nanosec timestamps and dir_index that
have been added lately.
The implementation may not work for all ext4 filesystems as
it doesn't support some features that are being enabled by
default on recent linux like flex_bg. Nevertheless, the feature
should be very useful for migration or simple access in
filesystems that have been converted from ext2/3 or don't use
incompatible features.
Special thanks to Zheng Liu for his dedication and continued
work to support ext2 in FreeBSD.
Submitted by: Zheng Liu (lz@)
Reviewed by: Mike Ma, Christoph Mallon (previous version)
Sponsored by: Google Inc.
Basic support for extents was implemented by Zheng Liu as part
of his Google Summer of Code in 2010. This support is read-only
at this time.
In addition to extents we also support the huge_file extension
for read-only purposes. This works nicely with the additional
support for birthtime/nanosec timestamps and dir_index that
have been added lately.
The implementation may not work for all ext4 filesystems as
it doesn't support some features that are being enabled by
default on recent linux like flex_bg. Nevertheless, the feature
should be very useful for migration or simple access in
filesystems that have been converted from ext2/3 or don't use
incompatible features.
Special thanks to Zheng Liu for his dedication and continued
work to support ext2 in FreeBSD.
Submitted by: Zheng Liu (lz@)
Reviewed by: Mike Ma, Christoph Mallon (previous version)
Sponsored by: Google Inc.