dim [Sun, 12 Jan 2014 14:37:39 +0000 (14:37 +0000)]
MFC r260494:
Fix a braino with r259730: we cannot currently use CFLAGS.gcc or
CFLAGS.clang in sys/conf/Makefile.arm, since the main kernel build does
not use <bsd.sys.mk>. So revert that particular change for now.
pfg [Sat, 11 Jan 2014 01:44:22 +0000 (01:44 +0000)]
MFC r260361:
gcc: Fix optimization bug.
GCC-PR rtl-optimization/34628
* combine.c (try_combine): Stop and undo after the first combination
if an autoincrement side-effect on the first insn has effectively
been lost.
asomers [Fri, 10 Jan 2014 17:56:23 +0000 (17:56 +0000)]
MFC 259339
sbin/devd/devd.cc
Increase the size of devd's client socket's send buffer from the
default (8k) to 128k. This prevents clients from getting
POLLHUPped during event storms. For example, during zpool creation,
the kernel emits a resource.fs.zfs.statechange event for every vdev
in the pool. A 128k buffer is large enough to hold the statechange
events for a pool with nearly 800 drives.
MFC 259362
sbin/devd/devd.cc
Promoting the SIGINFO handler's log message from LOG_INFO to
LOG_NOTICE, and promoting the "Processing event ..." message from
LOG_DEBUG to LOG_INFO. Setting the logfile to LOG_NOTICE with this
change will have the same result as setting it to LOG_INFO without
this change. Setting it to LOG_INFO with this change will include
the useful "Processing event ..." messages that were previously at
LOG_DEBUG, without including useless messages like "Pushing table".
The intent of this change is that one can log "Processing event ..."
without logging "Pushing table" and related messages that are sent
for every event. The number of lines actually logged is reduced by
about 75% by making this change and setting syslog to LOG_INFO vs
setting syslog to LOG_DEBUG.
etc/syslog.conf
Changing the recommended loglevel to notice instead of info.
asomers [Fri, 10 Jan 2014 16:56:59 +0000 (16:56 +0000)]
MFC 259240
sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev_geom.c
When a da or ada device dissappears, outstanding IOs fail with
ENXIO, not EIO. The check for EIO was probably copied from Illumos,
where that is indeed the correct errno.
Without this change, pulling a busy drive from a zpool would usually
turn it into UNAVAIL, even though pulling an idle drive would turn
it into REMOVED. With this change, it is REMOVED every time.
Also, vdev_geom_io_intr shouldn't do zfs_post_remove, because that
results in devd getting two resource.fs.zfs.removed events. The
comment said that the event had to be sent directly instead of
through the async removal thread because "the DE engine is using
this information to discard prevoius I/O errors". However, the fact
that vdev_geom_io_intr was never actually sending the events until
now, and that vdev_geom_orphan never sent them at all, and that
vdev_geom_orphan usually gets called about 2 seconds after the
actual removal, means that FreeBSD's userland can cope with a late
event just fine.
ae [Fri, 10 Jan 2014 09:45:28 +0000 (09:45 +0000)]
MFC r260151 (by adrian):
Use an RLOCK here instead of an RWLOCK - matching all the other calls
to lla_lookup().
This drastically reduces the very high lock contention when doing parallel
TCP throughput tests (> 1024 sockets) with IPv6.
MFC r260187:
lla_lookup() does modification only when LLE_CREATE is specified.
Thus we can use IF_AFDATA_RLOCK() instead of IF_AFDATA_LOCK() when doing
lla_lookup() without LLE_CREATE flag.
MFC r260217:
Add IF_AFDATA_WLOCK_ASSERT() in case lla_lookup() is called with
LLE_CREATE flag.
ae [Fri, 10 Jan 2014 07:48:36 +0000 (07:48 +0000)]
MFC r259634:
Prevent users from deactivating the last component of a mirror.
MFC r259929:
Add an ability to stop gmirror and clear its metadata in one command.
This fixes the problem, when gmirror starts again just after stop.
The problem occurs when gmirror's component has geom label with equal size.
E.g. gpt and gptid have the same size as partition, diskid has the same
size as entire disk. When gmirror's geom has been destroyed, glabel
creates its providers and this initiate retaste.
Now "gmirror destroy" command is available. It destroys geom and also
erases gmirror's metadata.
ae [Fri, 10 Jan 2014 07:43:40 +0000 (07:43 +0000)]
MFC r258357:
Add "resize" verb to gmirror(8) and such functionality to geom_mirror(4).
Now it is easy to expand the size of the mirror when all its components
are replaced. Also add g_resize method to geom_mirror class. It will write
updated metadata to new last sector, when parent provider is resized.
dim [Thu, 9 Jan 2014 23:08:56 +0000 (23:08 +0000)]
MFC r260334:
Split the last gcc-specific flags off into CFLAGS.gcc. This also
removes the need to use -Qunused-arguments for clang throughout the
tree.
MFC r260369:
Apply band-aid for 32-bit compat libs failures after r260334: put back
-Qunused-arguments for clang for now, until I can figure out a way to
make it unneeded in all scenarios. Sorry about the breakage.
dim [Thu, 9 Jan 2014 22:40:51 +0000 (22:40 +0000)]
MFC r260102:
Similar to r260020, only use -fms-extensions with gcc, for all other
modules which require this flag to compile. Use a GCC_MS_EXTENSIONS
variable, defined in kern.pre.mk, which can be used to easily supply the
flag (or not), depending on the compiler type.
MFC r260322:
In addition to r260102, also define GCC_MS_EXTENSIONS in bsd.sys.mk,
since kernel module builds do not use kern.pre.mk.
loos [Thu, 9 Jan 2014 18:28:58 +0000 (18:28 +0000)]
MFC r257064:
Add an OFW SPI compatible bus. Fix the spibus probe to return
BUS_PROBE_GENERIC and not BUS_PROBE_SPECIFIC (0) so the OFW SPI bus can
attach when enabled. Export the spibus devclass_t and driver_t
declarations.
mav [Thu, 9 Jan 2014 11:11:47 +0000 (11:11 +0000)]
MFC r258220, r258251:
Implement automatic live resize support for GEOM MULTIPATH class.
In "manual" mode just automatically resize provider in any direction.
In "automatic" mode allow growth (with new metadata write); in case of
shrinking check if there is already valid metadata found at the new
location. This should allow easy transparent recovery if first resize
was done by mistake.
While there, unify metadata write code and fix minor memory leak.
mav [Thu, 9 Jan 2014 10:44:27 +0000 (10:44 +0000)]
MFC r259197:
Do not DELAY() for P-state transition unless we want to see the result.
Intel manual says: "If a transition is already in progress, transition to
a new value will subsequently take effect. Reads of IA32_PERF_CTL determine
the last targeted operating point." So seems it should be fine to just
trigger wanted transition and go. Linux does the same.
kib [Thu, 9 Jan 2014 03:32:03 +0000 (03:32 +0000)]
MFC r260205:
Update the description for pmap_remove_pages() to match the modern
times. Assert that the pmap passed to pmap_remove_pages() is only
active on current CPU.
cperciva [Wed, 8 Jan 2014 02:19:39 +0000 (02:19 +0000)]
MFC r258893, r258956:
Add a new sysctl / loader tunable kern.panic_reboot_wait_time which
defaults to PANIC_REBOOT_WAIT_TIME (a long-existing kernel config
setting). Use this now-variable value in place of the defined constant
to control how long the system waits after a panic before rebooting.
tuexen [Tue, 7 Jan 2014 23:51:41 +0000 (23:51 +0000)]
MFC r260257:
Fix several bugs in sctp_bindx():
* Set errno to EAFNOSUPPORT if an address is provided which is neither
AF_INET nor AF_INET6.
* Don't modify the arguments.
* Don't smash the stack when provided with a non-zero port.
* Handle the case correctly where the first address provided is
an IPv6 address.
delphij [Tue, 7 Jan 2014 20:04:41 +0000 (20:04 +0000)]
MFC r260403 (MFV r260399):
Apply vendor commits:
197e0ea Fix for TLS record tampering bug. (CVE-2013-4353). 3462896 For DTLS we might need to retransmit messages from the
previous session so keep a copy of write context in DTLS
retransmission buffers instead of replacing it after
sending CCS. (CVE-2013-6450). ca98926 When deciding whether to use TLS 1.2 PRF and record hash
algorithms use the version number in the corresponding
SSL_METHOD structure instead of the SSL structure. The
SSL structure version is sometimes inaccurate.
Note: OpenSSL 1.0.2 and later effectively do this already.
(CVE-2013-6449).
pjd [Tue, 7 Jan 2014 19:46:17 +0000 (19:46 +0000)]
MFC r260290:
Bring back the old size of the kinfo_file structure to preserve ABI.
Keep only one uint64_t spare for further cap_rights_t expension.
Add a comment clarifying that if the size of this structure changes,
a new sysctl MIB has to be allocate for it and the old structure has
to be returned by the old sysctl MIB.
mjg [Tue, 7 Jan 2014 19:28:10 +0000 (19:28 +0000)]
MFC r260232:
Don't check for fd limits in fdgrowtable_exp.
Callers do that already and additional check races with process
decreasing limits and can result in not growing the table at all, which
is currently not handled.
scottl [Tue, 7 Jan 2014 01:51:48 +0000 (01:51 +0000)]
MFC Alexander Motin's direct dispatch, multi-queue, and finer-grained
locking support for CAM
r256826:
Fix several target mode SIMs to not blindly clear ccb_h.flags field of
ATIO CCBs. Not all CCB flags there belong to them.
r256836:
Remove hard limit on number of BIOs handled with one ATA TRIM request.
r256843:
Merge CAM locking changes from the projects/camlock branch to radically
reduce lock congestion and improve SMP scalability of the SCSI/ATA stack,
preparing the ground for the coming next GEOM direct dispatch support.
r256888:
Unconditionally acquire periph reference on CCB allocation failure.
r256895:
Fix memory and references leak due to unfreed path.
r256960:
Move CAM_UNQUEUED_INDEX setting to the last moment and under the periph lock.
This fixes race condition with cam_periph_ccbwait(), causing use-after-free.
r256975:
Minor (mostly cosmetical) addition to r256960.
r257054:
Some microoptimizations for da and ada drivers:
- Replace ordered_tag_count counter with single flag;
- From da remove outstanding_cmds counter, duplicating pending_ccbs list;
- From da_softc remove unused links field.
r257482:
Fix lock recursion, triggered by `smartctl -a /dev/adaX`.
r257501:
Make getenv_*() functions and respectively TUNABLE_*_FETCH() macros not
allocate memory and so not require sleepable environment. getenv() has
already used on-stack temporary storage, so just use it more rationally.
getenv_string() receives buffer as argument, so don't need another one.
r257914:
Some CAM locks polishing:
- Fix LOR and possible lock recursion when handling high-power commands.
Introduce new lock to protect left power quota and list of frozen devices.
- Correct locking around xpt periph creation.
- Remove seems never used XPT_FLAG_OPEN xpt periph flag.
Again, Netflix assisted with testing the merge, but all of the credit goes
to Alexander and iX Systems.
scottl [Tue, 7 Jan 2014 01:32:23 +0000 (01:32 +0000)]
MFC Alexander Motin's GEOM direct dispatch work:
r256603:
Introduce new function devstat_end_transaction_bio_bt(), adding new argument
to specify present time. Use this function to move binuptime() out of lock,
substantially reducing lock congestion when slow timecounter is used.
r256606:
Move g_io_deliver() out of the lock, as required for direct dispatch.
Move g_destroy_bio() out too to reduce lock scope even more.
r256607:
Fix passing uninitialized bio_resid argument to g_trace().
r256610:
Add unmapped I/O support to GEOM RAID.
r256830:
Restore BIO_UNMAPPED and BIO_TRANSIENT_MAPPING in biodonne() when unmapping
temporary mapped buffer. That fixes double unmap if biodone() called twice
for the same BIO (but with different done methods).
r256880:
Merge GEOM direct dispatch changes from the projects/camlock branch.
When safety requirements are met, it allows to avoid passing I/O requests
to GEOM g_up/g_down thread, executing them directly in the caller context.
That allows to avoid CPU bottlenecks in g_up/g_down threads, plus avoid
several context switches per I/O.
r259247:
Fix bug introduced at r256607. We have to recalculate bp_resid here since
sizes of original and completed requests may differ due to end of media.
Testing of the stable/10 merge was done by Netflix, but all of the credit
goes to Alexander and iX Systems.
mav [Sun, 5 Jan 2014 23:00:38 +0000 (23:00 +0000)]
MFC r256614:
- Take BIO lock in biodone() only when there is no completion callback set
and so we should wake up thread waiting in biowait().
- Remove msleep() timeout from biowait(). It was added 11 years ago, when
there was no locks used, and it should not be needed any more.
mav [Sun, 5 Jan 2014 22:51:09 +0000 (22:51 +0000)]
MFC r258164:
Handle case when ACPI reports HPET device, but does not provide memory
resource for it. In such case take the address range from the HPET table.
This fixes hpet(4) driver attach on Asrock C2750D4I board.
mav [Sun, 5 Jan 2014 22:48:12 +0000 (22:48 +0000)]
MFC r257932:
Use relaxed (write-only) memory barriers when writing some of queue index
registers (for now on ISP2400+). We never read those registers back and
AFAIK their semantics does not require any immediate reaction on write.
mav [Sun, 5 Jan 2014 22:47:12 +0000 (22:47 +0000)]
MFC r257930:
Some more registers access optimizations:
- Process ATIO queue only if interrupt status tells so;
- Do not update queue out pointers after each processed command, do it
only once at the end of the loop.
mav [Sun, 5 Jan 2014 22:45:46 +0000 (22:45 +0000)]
MFC r257916:
Save one more register read per command by not reading rqstoutrp register
every time. The purpose of that register is unlikely output queue overflow
detection, so read it only when its last known (and probably stale now)
value signals overflow.
mav [Sun, 5 Jan 2014 22:38:44 +0000 (22:38 +0000)]
MFC r256705:
Optimize isp(4) to reduce CPU usage, especially in target mode:
- Remove two excessive and slow register reads from isp_intr(). Instead
of rereading value every time, assume that registers contain what we have
written there.
- Avoid sequential search through 4096 array elements when looking for
command tag. Use hash of lists to store active tags separately from free
ones and so greatly speedup the searches.
mav [Sun, 5 Jan 2014 22:14:12 +0000 (22:14 +0000)]
MFC r259168:
Don't even try to read vdev labels from devices smaller then SPA_MINDEVSIZE
(64MB). Even if we would find one somehow, ZFS kernel code rejects such
devices. It is funny to look on attempts to read 4 256K vdev labels from
1.44MB floppy, though it is not very practical and quite slow.
mav [Sun, 5 Jan 2014 22:12:45 +0000 (22:12 +0000)]
MFC r258342:
Reenable vfs.zfs.zio.use_uma for amd64, disabled at r209261.
On machines with seveal CPUs and enough RAM this can easily twice improve
ZFS performance or twice reduce CPU usage. It was disabled three years
ago due to memory and KVA exhaustion reports, but our VM subsystem got
improved a lot since that time, hopefully enough to make another try.
dim [Sun, 5 Jan 2014 15:39:37 +0000 (15:39 +0000)]
Revert MFC of r260102 for now, until I can merge the required fix from
head. This should fix building modules which require -fms-extensions to
compile them with gcc.
glebius [Sun, 5 Jan 2014 13:55:33 +0000 (13:55 +0000)]
Merge r260188 from head:
Fix regression from r249894. Now we pass "gw" as argument to if_output
method, thus for multicast case we need it to point at "dst".
mav [Sat, 4 Jan 2014 23:42:24 +0000 (23:42 +0000)]
MFC r258693:
Make UMA to not blindly force offpage slab header allocation for large
(> PAGE_SIZE) zones. If zone is not multiple to PAGE_SIZE, there may
be enough space for the header at the last page, so we may avoid extra
header memory allocation and hash table update/lookup.
ZFS creates bunch of odd-sized UMA zones (5120, 6144, 7168, 10240, 14336).
This change gives good use to at least some of otherwise lost memory there.
mav [Sat, 4 Jan 2014 23:40:47 +0000 (23:40 +0000)]
MFC r258691:
Don't count bucket allocation failures for UMA zones as their own failures.
There are good reasons for this to happen, such as recursion prevention, etc.
and they are not fatal since buckets are just an optimization mechanism.
Real bucket allocation failures are any way counted by the bucket zones
themselves, and we don't need double accounting there.
mav [Sat, 4 Jan 2014 23:39:39 +0000 (23:39 +0000)]
MFC r258340, r258497:
Implement mechanism to safely but slowly purge UMA per-CPU caches.
This is a last resort for very low memory condition in case other measures
to free memory were ineffective. Sequentially cycle through all CPUs and
extract per-CPU cache buckets into zone cache from where they can be freed.
mav [Sat, 4 Jan 2014 23:38:06 +0000 (23:38 +0000)]
MFC r258338:
Grow UMA zone bucket size also on lock congestion during item free.
Lock congestion is the same, whether it happens on alloc or free, so
handle it equally. Now that we have back pressure, there is no problem
to grow buckets a bit faster. Any way growth is much slower then in 9.x.
mav [Sat, 4 Jan 2014 23:37:01 +0000 (23:37 +0000)]
MFC r258337:
Add two new UMA bucket zones to store 3 and 9 items per bucket.
These new buckets make bucket size self-tuning more soft and precise.
Without them there are buckets for 1, 5, 13, 29, ... items. While at
bigger sizes difference about 2x is fine, at smallest ones it is 5x and
2.6x respectively. New buckets make that line look like 1, 3, 5, 9, 13,
29, reducing jumps between steps, making algorithm work softer, allocating
and freeing memory in better fitting chunks. Otherwise there is quite a
big gap between allocating 128K and 5x128K of RAM at once.
mav [Sat, 4 Jan 2014 23:35:34 +0000 (23:35 +0000)]
MFC r258336:
Implement soft pressure on UMA cache bucket sizes.
Every time system detects low memory condition decrease bucket sizes for
each zone by one item. As result, higher memory pressure will push to
smaller bucket sizes and so smaller per-CPU caches and so more efficient
memory use.
Before this change there was no force to oppose buckets growth as result
of practically inevitable zone lock conflicts, and after some run time
per-CPU caches could consume enough RAM to kill the system.
mav [Sat, 4 Jan 2014 23:31:34 +0000 (23:31 +0000)]
MFC r259232:
Create own free list for each of the first 32 possible allocation sizes.
In case of 4K allocation quantum that means for allocations up to 128K.
With growth of memory fragmentation these lists may grow to quite a large
sizes (tenths and hundreds of thousands items). Having in one list items
of different sizes in worst case may require full linear list traversal,
that may be very expensive. Having lists for items of single size means
that unless user specify some alignment or border requirements (that are
very rare cases) first item found on the list should satisfy the request.
While running SPEC NFS benchmark on top of ZFS on 24-core machine with
84GB RAM this change reduces CPU time spent in vmem_xalloc() from 8%
and lock congestion spinning around it from 20% to invisible levels.
And that all is by the cost of just 26 more pointers per vmem instance.
If at some point our kernel will start to actively use KVA allocations
with odd sizes above 128K, something may need to be done to bigger lists
also.
dim [Sat, 4 Jan 2014 22:00:07 +0000 (22:00 +0000)]
MFC r260095:
For sys/boot/i386 and sys/boot/pc98, separate flags to be passed
directly to the linker (LD_FLAGS) from flags passed indirectly, via the
compiler driver (LDFLAGS).
This is because several Makefiles under sys/boot/i386 and sys/boot/pc98
use ${LD} directly to link, and the normal LDFLAGS value should not be
used in these cases.
glebius [Sat, 4 Jan 2014 19:51:57 +0000 (19:51 +0000)]
Merge r258690 by mav from head:
Fix bug introduced at r252226, when udata argument passed to bucket_alloc()
was used without making sure first that it was really passed for us.
On some of my systems this bug made user argument passed by ZFS code to
uma_zalloc_arg() unexpectedly block UMA per-CPU caches for those zones.
dim [Sat, 4 Jan 2014 18:53:31 +0000 (18:53 +0000)]
MFC r260040:
In sys/dev/mcd/mcd.c, mark the static const COPYRIGHT string as __used,
so it ends up in the object file, and no warnings are emitted about it
being actually unused.
dim [Sat, 4 Jan 2014 17:54:06 +0000 (17:54 +0000)]
MFC r260020:
For sys/dev/drm2/radeon, only use -fms-extensions with gcc. This flag
is only to stop gcc complaining about anonymous unions, which clang does
not do. For clang 3.4 however, -fms-extensions enables the Microsoft
__wchar_t type, which clashes with our own types.h.
MFC r260102:
Similar to r260020, only use -fms-extensions with gcc, for all other
modules which require this flag to compile. Use a GCC_MS_EXTENSIONS
variable, defined in kern.pre.mk, which can be used to easily supply the
flag (or not), depending on the compiler type.
dim [Sat, 4 Jan 2014 17:22:53 +0000 (17:22 +0000)]
MFC r260015:
In libc++'s type_traits header, avoid warnings (activated by our use of
-Wsystem-headers) about potential keyword compatibility problems, by
adding a __libcpp prefix to the applicable identifiers.
Upstream is still debating about this, but we need it now, to be able to
import clang 3.4.
glebius [Fri, 3 Jan 2014 12:28:33 +0000 (12:28 +0000)]
Merge r259681 from head:
Changes:
- Reinit uio_resid and flags before every call to soreceive().
- Set maximum acceptable size of packet to IP_MAXPACKET. As for now the
module doesn't support INET6.
- Properly handle MSG_TRUNC return from soreceive().
pluknet [Thu, 2 Jan 2014 16:37:23 +0000 (16:37 +0000)]
MFC r259872:
The compile time constant limit on number of swap devices was removed in 5.2.
As such, remove the EINVAL error saying so. Currently the vm.nswapdev sysctl
just represents the number of added swap devices.