brooks [Wed, 23 Oct 2013 21:35:39 +0000 (21:35 +0000)]
MFP4:
Change 221534 by rwatson@rwatson_zenith_cl_cam_ac_uk on 2013/01/27 16:05:30
FreeBSD/mips stores page-table entries in a near-identical format
to MIPS TLB entries -- only it overrides certain "reserved" bits
in the MIPS-defined EntryLo register to hold software-defined bits
(swbits) to avoid significantly increasing the page table memory
footprint. On n32 and n64, these bits were (a) colliding with
MIPS64r2 physical memory extensions and (b) being improperly
cleared.
Attempt to fix both of these problems by pushing swbits further
along 64-bit EntryLo registers into the reserved space, and
improving consistency between C-based and assembly-based clearing
of swbits -- in particular, to use the same definition. This
should stop swbits from leaking into TLB entries -- while ignored
by most current MIPS hardware, this would cause a problem with
(much) larger physical memory sizes, and also leads to confusing
hardware-level tracing as physical addresses contain unexpected
(and inconsistent) higher bits.
Discussed with: imp, jmallett
Change 1187301 by brooks@brooks_zenith on 2013/10/23 14:40:10
Loop back the initial commit of 221534 to HEAD. Correct its
implementation for mips32.
grehan [Wed, 23 Oct 2013 18:54:58 +0000 (18:54 +0000)]
Export the block size capability to guests.
- Use #defines for capability bits
- Export the VTBLK_F_BLK_SIZE capability
- Fix bug in calculating capacity: it is in
512-byte units, not the underlying sector size
This allows virtio-blk to have backing devices
with non 512-byte sector sizes e.g. /dev/cd0, and
4K-block harddrives.
nwhitehorn [Wed, 23 Oct 2013 17:24:21 +0000 (17:24 +0000)]
Add two new interfaces to ofw_bus:
- ofw_bus_map_intr()
Maps an (iparent, IRQ) tuple to a system-global interrupt number in some
platform dependent way. This is meant to be implemented as a replacement
for [FDT_]MAP_IRQ() that is an MI interface that knows about the bus
hierarchy.
- ofw_bus_config_intr()
Configures an interrupt (previously mapped) based on firmware sense flags.
This replaces manual interpretation of the sense field in bus drivers and
will, in a follow-up, allow that interpretation to be redirected to the PIC
drivers where it belongs. This will eventually replace the tables in
/sys/dev/fdt/fdt_ARCH.c
The PowerPC/AIM code has been converted to use these globally, with an
implementation in terms of MAP_IRQ() and powerpc_config_intr(), assuming
OpenPIC, at the bus root in nexus(4). The ofw_bus_config_intr() will shortly
be integrated into pic_if.m and bounced through nexus into the PIC tree.
FDT integration will happen significantly later due to larger testing
requirements. This patch in general also lays the groundwork for the removal
of /sys/dev/fdt/fdt_ARCH.c and machine/fdt.h.
nwhitehorn [Wed, 23 Oct 2013 14:28:59 +0000 (14:28 +0000)]
If the device tree directly contains the timebase frequency, use it. This
property is required by ePAPR, but maintain the fallback to bus-frequency
for compatibility.
nwhitehorn [Wed, 23 Oct 2013 14:06:41 +0000 (14:06 +0000)]
Use OF_getencprop() in preference to OF_getprop() for numerical quantities.
Since all supported PowerPC systems are big-endian, this is a no-op, but
this is preparatory work to moving this to /sys/dev/ofw.
bapt [Wed, 23 Oct 2013 14:06:07 +0000 (14:06 +0000)]
Improve SRV records support for the pkg(8) bootstrap:
- order srv records by priorities
- for all entries of the same priority, order randomly respect the weight
- select the port where to fetch from respect the port provided in the SRV record
nwhitehorn [Wed, 23 Oct 2013 14:04:09 +0000 (14:04 +0000)]
Remove OF_instance_to_package() hack for FDT and replace with use of the
generic OF_xref_phandle() API universally. Also replace some related
explicit uses of fdt32_to_cpu() with OF_getencprop() calls.
nwhitehorn [Wed, 23 Oct 2013 13:55:41 +0000 (13:55 +0000)]
Make all Open Firmware internal interfaces endian-safe by using the new
OF_getencprop() API. This removes one explicit endianness conversion in
ofw_iicbus.c.
jhb [Wed, 23 Oct 2013 13:22:50 +0000 (13:22 +0000)]
Finish r254925 and remove the last remaining sysctl name list macro. The
one port that used it has been fixed to use the more portable
getprotoent(3) instead.
mav [Wed, 23 Oct 2013 12:53:05 +0000 (12:53 +0000)]
Move CAM_UNQUEUED_INDEX setting to the last moment and under the periph lock.
This fixes race condition with cam_periph_ccbwait(), causing use-after-free.
smh [Wed, 23 Oct 2013 09:54:58 +0000 (09:54 +0000)]
Improve ZFS N-way mirror read performance by using load and locality
information.
The existing algorithm selects a preferred leaf vdev based on offset of the zio
request modulo the number of members in the mirror. It assumes the devices are
of equal performance and that spreading the requests randomly over both drives
will be sufficient to saturate them. In practice this results in the leaf vdevs
being under utilized.
The new algorithm takes into the following additional factors:
* Load of the vdevs (number outstanding I/O requests)
* The locality of last queued I/O vs the new I/O request.
Within the locality calculation additional knowledge about the underlying vdev
is considered such as; is the device backing the vdev a rotating media device.
This results in performance increases across the board as well as significant
increases for predominantly streaming loads and for configurations which don't
have evenly performing devices.
The following are results from a setup with 3 Way Mirror with 2 x HD's and
1 x SSD from a basic test running multiple parrallel dd's.
With pre-fetch disabled (vfs.zfs.prefetch_disable=1):
== Stripe Balanced (default) ==
Read 15360MB using bs: 1048576, readers: 3, took 161 seconds @ 95 MB/s
== Load Balanced (zfslinux) ==
Read 15360MB using bs: 1048576, readers: 3, took 297 seconds @ 51 MB/s
== Load Balanced (locality freebsd) ==
Read 15360MB using bs: 1048576, readers: 3, took 54 seconds @ 284 MB/s
With pre-fetch enabled (vfs.zfs.prefetch_disable=0):
== Stripe Balanced (default) ==
Read 15360MB using bs: 1048576, readers: 3, took 91 seconds @ 168 MB/s
== Load Balanced (zfslinux) ==
Read 15360MB using bs: 1048576, readers: 3, took 108 seconds @ 142 MB/s
== Load Balanced (locality freebsd) ==
Read 15360MB using bs: 1048576, readers: 3, took 48 seconds @ 320 MB/s
In addition to the performance changes the code was also restructured, with
the help of Justin Gibbs, to provide a more logical flow which also ensures
vdevs loads are only calculated from the set of valid candidates.
The following additional sysctls where added to allow the administrator
to tune the behaviour of the load algorithm:
* vfs.zfs.vdev.mirror.rotating_inc
* vfs.zfs.vdev.mirror.rotating_seek_inc
* vfs.zfs.vdev.mirror.rotating_seek_offset
* vfs.zfs.vdev.mirror.non_rotating_inc
* vfs.zfs.vdev.mirror.non_rotating_seek_inc
These changes where based on work started by the zfsonlinux developers:
https://github.com/zfsonlinux/zfs/pull/1487
Add atsectl, a simple utility to read and update MAC addresses stored in
the default flash location on Altera DE4 boards. Typically used once
when setting up a board so leaving in tools rather than inflicting on
all users.
To build with world add LOCAL_DIRS=tools/tools/atsectl to the make
command line.
brooks [Tue, 22 Oct 2013 22:03:01 +0000 (22:03 +0000)]
MFP4:
Change 221669 by bz@bz_zenith on 2013/02/01 12:26:04
Run the initialization for polling earlier along with INTRs
so that we can put network interface into polling mode by default
if DEVICE_POLLING is compiled in and no interrupts are available.
cognet [Tue, 22 Oct 2013 21:51:07 +0000 (21:51 +0000)]
- Use bus_dmamap_unload(), it is not optional.
- The new allocator won't return coherent memory for any size > PAGE_SIZE,
so don't assume we have coherent memory, and explicitely use
bus_dmamap_sync().
brooks [Tue, 22 Oct 2013 21:27:22 +0000 (21:27 +0000)]
MFP4:
Change 221767 by rwatson@rwatson_zenith_cl_cam_ac_uk on 2013/02/05 14:18:53
When printing out information on a TLB MOD exception for a user
process (e.g., an attempt to write to a read-only page), report
it as a "write" in the console message, rather than "unknown".
Change 221768 by rwatson@rwatson_zenith_cl_cam_ac_uk on 2013/02/05 14:28:00
Fix post-compile but pre-commit typo in last changeset.
nwhitehorn [Tue, 22 Oct 2013 21:20:05 +0000 (21:20 +0000)]
A few other common cases for encode-int decoding: OF_getencprop_alloc()
and OF_searchencprop(). I thought about using the element size parameter
to OF_getprop_alloc() to do endian-switching automatically, but it breaks
use with structs and a *lot* of FDT code (which can hopefully be moved to
these new APIs).
brooks [Tue, 22 Oct 2013 21:16:57 +0000 (21:16 +0000)]
MFP4:
Change 231031 by brooks@brooks_zenith on 2013/07/11 16:22:08
Turn the unused and uncompilable MIPS_DISABLE_L1_CACHE define in
cache.c into an option and when set force I- and D-cache line
sizes to 0 (the latter part might be better as a tunable).
Fix some casts in an #if 0'd bit of code which attempts to
disable L1 cache ops when the cache is coherent.
brooks [Tue, 22 Oct 2013 21:06:27 +0000 (21:06 +0000)]
MFP4:
Change 221534 by rwatson@rwatson_zenith_cl_cam_ac_uk on 2013/01/27 16:05:30
FreeBSD/mips stores page-table entries in a near-identical format
to MIPS TLB entries -- only it overrides certain "reserved" bits
in the MIPS-defined EntryLo register to hold software-defined bits
(swbits) to avoid significantly increasing the page table memory
footprint. On n32 and n64, these bits were (a) colliding with
MIPS64r2 physical memory extensions and (b) being improperly
cleared.
Attempt to fix both of these problems by pushing swbits further
along 64-bit EntryLo registers into the reserved space, and
improving consistency between C-based and assembly-based clearing
of swbits -- in particular, to use the same definition. This
should stop swbits from leaking into TLB entries -- while ignored
by most current MIPS hardware, this would cause a problem with
(much) larger physical memory sizes, and also leads to confusing
hardware-level tracing as physical addresses contain unexpected
(and inconsistent) higher bits.
nwhitehorn [Tue, 22 Oct 2013 20:57:24 +0000 (20:57 +0000)]
Add a new function (OF_getencprop()) that undoes the transformation applied
by encode-int. Specifically, it takes a set of 32-bit cell values and
changes them to host byte order. Most non-string instances of OF_getprop()
should be using this function, which is a no-op on big-endian platforms.
grehan [Tue, 22 Oct 2013 19:55:04 +0000 (19:55 +0000)]
Fix AHCI ATAPI emulation when backed with /dev/cd0
- remove assumption that the backing file/device had
512-byte sectors
- fix incorrect iovec size variable that would result
in a buffer overrun when an o/s issued an i/o request
with more s/g elements than the blockif api
Reviewed by: Zhixiang Yu (zxyu.core@gmail.com)
MFC after: 3 days
cperciva [Tue, 22 Oct 2013 18:36:39 +0000 (18:36 +0000)]
Thou shalt not leak build host state into the system being compiled.
The VERSION variable is encoded into the SUNW_ctf sections of the kernel
and every kernel module when dtrace is enabled; starting with 9.2-RELEASE
(when dtrace was turned on in GENERIC) this means that different host kernels
will result in very different kernel binaries being generated. This tripped
up freebsd-update builds after the build boxes were updated from 9.x to 10.x.
MFC after: 3 days (stable/9)
X-MFC after: 0 days (stable/10)
Security: Rendered two members of so@ temporarily insane
andre [Tue, 22 Oct 2013 18:24:34 +0000 (18:24 +0000)]
The TCP delayed ACK logic isn't aware of LRO passing up large aggregated
segments thinking it received only one segment. This causes it to enable
the delay the ACK for 100ms to wait for another segment which may never
come because all the data was received already.
Doing delayed ACK for LRO segments is bogus for two reasons: a) it pushes
us further away from acking every other packet; b) it introduces additional
delay in responding to the sender. The latter is especially bad because it
is in the nature of LRO to aggregated all segments of a burst with no more
coming until an ACK is sent back.
Change the delayed ACK logic to detect LRO segments by being larger than
the MSS for this connection and issuing an immediate ACK for them to keep
the ACK clock ticking without interruption.
brooks [Tue, 22 Oct 2013 15:53:29 +0000 (15:53 +0000)]
Stop conflating WITHOUT_CLANG with WITHOUT_CLANG_IS_CC. This allows
bootstrapping a copy of clang without building clang for the base system
which is useful for nanobsd and similar setups. It's still probably
wrong to conflate what is installed as /usr/bin/cc with the selection
of a bootstrap compiler under WITH*_CLANG_IS_CC, but that's for another
day.
nwhitehorn [Tue, 22 Oct 2013 15:47:13 +0000 (15:47 +0000)]
Ignore registers on devices where the reg property is malformed. Issue a
warning if this happens under bootverbose. This prevents some
strange-looking entries in dmesg for SMU devices on Apple G5 systems.
Implement a driver for Robert Norton's PIC as an FDT interrupt
controller. Devices whose interrupt-parent property points to a beripic
device will have their interrupt allocation, activation , and setup
operations routed through the IC rather than down the traditional bus
hierarchy.
This driver largely abstracts the underlying CPU away allowing the
PIC to be implemented on CPU's other than BERI. Due to insufficient
abstractions a small amount of MIPS specific code is currently required
in fdt_mips.c and to implement counters.
nwhitehorn [Tue, 22 Oct 2013 14:11:16 +0000 (14:11 +0000)]
Catch up on 6 years of improvements in Open Firmware nexus devices by
importing the sparc64 one. At least 90% of this code is MI and will be
moved into /sys/dev/ofw at some point in the future.
nwhitehorn [Tue, 22 Oct 2013 14:10:00 +0000 (14:10 +0000)]
Set BUS_PROBE_NOWILDCARD on this attachment as a stopgap. Unconditionally
poking at registers in unknown devices is not the best probe mechanism.
This should be reverted and a better solution found later.
nwhitehorn [Tue, 22 Oct 2013 14:07:57 +0000 (14:07 +0000)]
Standards-conformance and code deduplication:
- Use bus reference phandles in place of FDT offsets as IRQ domain keys
- Unify the identical macio/fdt/mambo OpenPIC drivers into one
- Be more forgiving (following ePAPR) about what we need from the device
tree to identify an OpenPIC
- Correctly map all IRQs into an interrupt domain
- Set IRQ_*_CONFORM for interrupts on an unknown PIC type instead of
failing attachment for that device.
smh [Tue, 22 Oct 2013 13:31:36 +0000 (13:31 +0000)]
Use the vdev's ashift to calculate the supported min block size passed to
zio_compress_data(..) when compressing l2arc buffers.
This eliminates l2arc I/O errors, which resulted in very poor performance on
vdev's configured with block size greater than 512b due to compression
assuming a smaller min block size than the vdev supports.
mav [Tue, 22 Oct 2013 12:58:22 +0000 (12:58 +0000)]
Unconditionally acquire periph reference on CCB allocation failure.
cam_periph_acquire() can return error if periph already invalidated, but
that may be unacceptable and cause deadlock if the invalidated periph can't
be destroyed without "executing" the scheduled request.
mav [Tue, 22 Oct 2013 08:22:19 +0000 (08:22 +0000)]
Merge GEOM direct dispatch changes from the projects/camlock branch.
When safety requirements are met, it allows to avoid passing I/O requests
to GEOM g_up/g_down thread, executing them directly in the caller context.
That allows to avoid CPU bottlenecks in g_up/g_down threads, plus avoid
several context switches per I/O.
The defined now safety requirements are:
- caller should not hold any locks and should be reenterable;
- callee should not depend on GEOM dual-threaded concurency semantics;
- on the way down, if request is unmapped while callee doesn't support it,
the context should be sleepable;
- kernel thread stack usage should be below 50%.
To keep compatibility with GEOM classes not meeting above requirements
new provider and consumer flags added:
- G_CF_DIRECT_SEND -- consumer code meets caller requirements (request);
- G_CF_DIRECT_RECEIVE -- consumer code meets callee requirements (done);
- G_PF_DIRECT_SEND -- provider code meets caller requirements (done);
- G_PF_DIRECT_RECEIVE -- provider code meets callee requirements (request).
Capable GEOM class can set them, allowing direct dispatch in cases where
it is safe. If any of requirements are not met, request is queued to
g_up or g_down thread same as before.
Such GEOM classes were reviewed and updated to support direct dispatch:
CONCAT, DEV, DISK, GATE, MD, MIRROR, MULTIPATH, NOP, PART, RAID, STRIPE,
VFS, ZERO, ZFS::VDEV, ZFS::ZVOL, all classes based on g_slice KPI (LABEL,
MAP, FLASHMAP, etc).
To declare direct completion capability disk(9) KPI got new flag equivalent
to G_PF_DIRECT_SEND -- DISKFLAG_DIRECT_COMPLETION. da(4) and ada(4) disk
drivers got it set now thanks to earlier CAM locking work.
This change more then twice increases peak block storage performance on
systems with manu CPUs, together with earlier CAM locking changes reaching
more then 1 million IOPS (512 byte raw reads from 16 SATA SSDs on 4 HBAs to
256 user-level threads).
bz [Tue, 22 Oct 2013 00:50:53 +0000 (00:50 +0000)]
Make netback compile without INET support in the kernel.
This shuld have been a problem since r230587. Not exactly sure why it
was not detected the last weeks with the tinderbox. I would assume
r255744 is what started to cause it.
sbruno [Mon, 21 Oct 2013 22:55:56 +0000 (22:55 +0000)]
Resolve clang warning about a use of syslog. By using a proper enforcement
of string format in a call so syslog
/usr/src/gnu/lib/libssp/../../../contrib/gcclibs/libssp/ssp.c:137:23:
warning: format string is not a string literal (potentially insecure)
[-Wformat-security]
syslog (LOG_CRIT, msg1);
^~~~
brooks [Mon, 21 Oct 2013 22:43:38 +0000 (22:43 +0000)]
Remove the isf(4) driver. It was created by accident and is subset of
the cfi(4) driver. It remained in the tree longer than would be ideal
due to the time required to bring cfi(4) to feature parity.
brooks [Mon, 21 Oct 2013 21:13:01 +0000 (21:13 +0000)]
MFP4: 223121 (FDT infrastructure portion)
Implement support for interrupt-parent nodes in simplebus. The current
implementation requires that device declarations have an interrupt-parent
node and that it point to a device that has registered itself as a
interrupt controller in fdt_ic_list_head and implements the fdt_ic
interface.
kib [Mon, 21 Oct 2013 16:46:12 +0000 (16:46 +0000)]
Add a resource limit for the total number of kqueues available to the
user. Kqueue now saves the ucred of the allocating thread, to
correctly decrement the counter on close.
Under some specific and not real-world use scenario for kqueue, it is
possible for the kqueues to consume memory proportional to the square
of the number of the filedescriptors available to the process. Limit
allows administrator to prevent the abuse.
This is kernel-mode side of the change, with the user-mode enabling
commit following.
Reported and tested by: pho
Discussed with: jmg
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
kib [Mon, 21 Oct 2013 16:44:53 +0000 (16:44 +0000)]
Add a resource limit for the total number of kqueues available to the
user. Kqueue now saves the ucred of the allocating thread, to
correctly decrement the counter on close.
Under some specific and not real-world use scenario for kqueue, it is
possible for the kqueues to consume memory proportional to the square
of the number of the filedescriptors available to the process. Limit
allows administrator to prevent the abuse.
This is kernel-mode side of the change, with the user-mode enabling
commit following.
Reported and tested by: pho
Discussed with: jmg
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
kib [Mon, 21 Oct 2013 16:22:51 +0000 (16:22 +0000)]
Reset function on SandyBridge holds the gt_lock for the whole duration
already. Also, according to the specs, GDRST register is not in the
power well, so the forcewake for reset status read is excessive for
this reason.
Use plain register read for waiting of the reset completion
notification, to avoid gt_lock recursion. Linux upstream did the
similar change, but their code was already restructured.
Reported by: ray
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
mav [Mon, 21 Oct 2013 12:00:26 +0000 (12:00 +0000)]
Merge CAM locking changes from the projects/camlock branch to radically
reduce lock congestion and improve SMP scalability of the SCSI/ATA stack,
preparing the ground for the coming next GEOM direct dispatch support.
Replace big per-SIM locks with bunch of smaller ones:
- per-LUN locks to protect device and peripheral drivers state;
- per-target locks to protect list of LUNs on target;
- per-bus locks to protect reference counting;
- per-send queue locks to protect queue of CCBs to be sent;
- per-done queue locks to protect queue of completed CCBs;
- remaining per-SIM locks now protect only HBA driver internals.
While holding LUN lock it is allowed (while not recommended for performance
reasons) to take SIM lock. The opposite acquisition order is forbidden.
All the other locks are leaf locks, that can be taken anywhere, but should
not be cascaded. Many functions, such as: xpt_action(), xpt_done(),
xpt_async(), xpt_create_path(), etc. are no longer require (but allow) SIM
lock to be held.
To keep compatibility and solve cases where SIM lock can't be dropped, all
xpt_async() calls in addition to xpt_done() calls are queued to completion
threads for async processing in clean environment without SIM lock held.
Instead of single CAM SWI thread, used for commands completion processing
before, use multiple (depending on number of CPUs) threads. Load balanced
between them using "hash" of the device B:T:L address.
HBA drivers that can drop SIM lock during completion processing and have
sufficient number of completion threads to efficiently scale to multiple
CPUs can use new function xpt_done_direct() to avoid extra context switch.
Make ahci(4) driver to use this mechanism depending on hardware setup.
bdrewery [Mon, 21 Oct 2013 10:09:48 +0000 (10:09 +0000)]
Fix 'make delete-old-libs' and 'make check-libs' to delete .debug
files created by WITH_DEBUG_FILES. Also cleanup .symbols files from
the period between r244236 when .symbols were supported and r251512
when they were renamed to .debug.
Only propose to delete a .debug file if the corresponding library
itself was deleted already.
Reported by: des
Reviewed by: emaste (earlier version)
Approved by: bapt
MFC after: 3 days