MFC r236884:
Introduce "feature flags" for ZFS pools (bump SPA version to 5000).
Add first feature "com.delphix:async_destroy" (asynchronous destroy
of ZFS datasets).
Implement features support in ZFS boot code.
Illumos revisions merged:
13700:2889e2596bd6
13701:1949b688d5fb
2619 asynchronous destruction of ZFS file systems
2747 SPA versioning with zfs feature flags
1796 "ZFS HOLD" should not be used when doing "ZFS SEND" from a read-only pool
2871 support for __ZFS_POOL_RESTRICT used by ZFS test suite
2903 zfs destroy -d does not work
2957 zfs destroy -R/r sometimes fails when removing defer-destroyed snapshot
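A minimal sketch of what feature flags mean for code that opens a pool,
such as the boot code. Pools below version 5000 are judged by version
number alone; pools at version 5000 are judged by the features they use.
The helper names, the max_legacy parameter and the supported-feature
table are assumptions made for illustration; this is not the actual
ZFS implementation.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Pool version at which feature flags replace linear version numbers. */
#define	SPA_VERSION_FEATURES	5000ULL

/* Features this hypothetical implementation understands. */
static const char *const supported_features[] = {
	"com.delphix:async_destroy",
};

/* Return true if `name` appears in the supported_features table. */
static bool
feature_is_supported(const char *name)
{
	size_t n = sizeof(supported_features) / sizeof(supported_features[0]);

	for (size_t i = 0; i < n; i++) {
		if (strcmp(supported_features[i], name) == 0)
			return (true);
	}
	return (false);
}

/*
 * Decide whether a pool can be opened.  `max_legacy` is the highest
 * pre-feature-flags version the caller understands (hypothetical
 * parameter); `active` lists the features the pool has in use.
 */
static bool
pool_can_open(uint64_t version, uint64_t max_legacy,
    const char *const *active, size_t nactive)
{
	if (version < SPA_VERSION_FEATURES)
		return (version <= max_legacy);
	if (version != SPA_VERSION_FEATURES)
		return (false);
	for (size_t i = 0; i < nactive; i++) {
		if (!feature_is_supported(active[i]))
			return (false);
	}
	return (true);
}
```

With this model a pool that has only "com.delphix:async_destroy" in use
opens, while a pool using any feature missing from the table is refused.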
MFC r238926:
Partial MFV (illumos-gate 13753:2aba784c276b)
2762 zpool command should have better support for feature flags
References:
https://www.illumos.org/issues/2762
MFC r238950:
Fix reporting of root pool upgrade notice.
MFC r238951:
Fix wrong indent according to style(9)
MFC r239389:
Backport fix for vendor issue #3085
3085 zfs diff panics, then panics in a loop on booting
References:
https://www.illumos.org/issues/3085
MFC r239394:
Update zfs(8) manpage with illumos version of "zfs diff"
Illumos issue:
2399 zfs manual page does not document use of "zfs diff"
References:
https://www.illumos.org/issues/2399
MFC r239620 [2]:
Merge recent vendor changes:
3086 unnecessarily setting DS_FLAG_INCONSISTENT on async destroyed datasets
3090 vdev_reopen() during reguid causes vdev to be treated as corrupt
3102 vdev_uberblock_load() and vdev_validate() may read the wrong label
MFC r240153 (gjb) [3]:
Typo fix and minor word swap.
MFC r240303:
Add assfail() and assfail3() to the opensolaris module.
Remove obsoleted intermediate cddl/compat/opensolaris/sys/debug.h.
MFC r240345 (avg):
zfs: fix sa_modify_attrs handling of variable-sized attributes
- skip length_idx index for a replaced variable-sized attribute
- skip length_idx index for a removed variable-sized attribute
- also re-arranged code to make sure that length_idx is always
incremented for variable-sized attributes
- additionally add an assertion that the number of actually produced
attributes is the same as the expected number of resulting
attributes
Illumos issues covered:
1884 Empty "used" field for zfs *space commands
3006 VERIFY[S,U,P] and ASSERT[S,U,P] frequently check if first argument
is zero
3028 zfs {group,user}space -n prints (null) instead of numeric GID/UID
3048 zfs {user,group}space [-s|-S] is broken
3049 zfs {user,group}space -t doesn't really filter the results
3060 zfs {user,group}space -H output isn't tab-delimited
3061 zfs {user,group}space -o doesn't use specified fields order
3064 usr/src/cmd/zpool/zpool_main.c misspells "successful"
3093 zfs {user,group}space's -i is noop
3098 zfs userspace/groupspace fail without saying why when run as non-root
MFC r240955 (partial):
Merge recent vendor changes in ZFS.
Illumos issues covered:
3139 zdb dies when it tries to determine path of unlinked file
3189 kernel panic in ZFS test suite during hotspare_onoffline_004_neg
3208 moving zpool cross-endian results in incorrect user/group accounting
MFC r241655:
Add missing initialization for do_prefix.
Corrects porting error in r238391
Vendor issue and changeset reference:
2883 changing "canmount" property to "on" should not always remount dataset
https://www.illumos.org/issues/2883
Changeset 13743:95aba6e49b9f
MFC r243014:
Move zpool-features manual page from section 5 to section 7
and fix references
mav [Thu, 29 Nov 2012 18:23:21 +0000 (18:23 +0000)]
MFC r242323, r242328:
Add basic BIO_DELETE support to GEOM RAID class for all RAID levels.
If at least one subdisk in the volume supports it, BIO_DELETE requests
will be propagated down. Unfortunately, for RAID levels with redundancy,
unmapped blocks will be mapped back during the first rebuild/resync process.
melifaro [Tue, 27 Nov 2012 20:16:37 +0000 (20:16 +0000)]
MFC r241406, r241502, r241884.
Do not check if found IPv4 rte is dynamic if net.inet.icmp.drop_redirect is
enabled. This eliminates one mtx_lock() per each routing lookup thus improving
performance in several cases (routing to directly connected interface or routing
to default gateway).
ICMP redirects should not be used to provide routing direction nowadays, even
for end hosts. Routers should not use them either (and this is explicitly restricted
in IPv6, see RFC 4861, clause 8.2).
This commit changes the rnh_matchaddr function to 'stock' rn_match (and back) for every
AF_INET routing table in the given VNET instance on drop_redirect sysctl change.
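The function-pointer swap described above can be modelled roughly as
follows. This is a toy sketch only: struct route, struct route_table,
match_stock(), match_checked() and set_drop_redirect() are hypothetical
names, and a counter stands in for the mtx_lock() that the checking
lookup would take.

```c
#include <stddef.h>

/* Toy route entry; `dynamic` marks a route learned from a redirect. */
struct route {
	int dynamic;
};

/* Counts simulated lock acquisitions so the saving is observable. */
static unsigned long lock_count;

typedef struct route *(*match_fn)(struct route *);

/* "Stock" lookup: returns the entry without inspecting it. */
static struct route *
match_stock(struct route *rt)
{
	return (rt);
}

/* Checking lookup: takes the (simulated) lock to read the dynamic flag. */
static struct route *
match_checked(struct route *rt)
{
	if (rt != NULL) {
		lock_count++;		/* stands in for mtx_lock()/unlock */
		(void)rt->dynamic;	/* the dynamic-route check goes here */
	}
	return (rt);
}

/* Per-table lookup hook, swapped when the knob changes. */
struct route_table {
	match_fn matchaddr;
};

/* Called when the hypothetical drop_redirect knob changes. */
static void
set_drop_redirect(struct route_table *tbl, int enabled)
{
	tbl->matchaddr = enabled ? match_stock : match_checked;
}
```

Once the knob is enabled, lookups through tbl->matchaddr no longer
touch the lock, which is the per-lookup saving the message refers to.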
Eliminate code checking if found IPv6 rte is dynamic. IPv6 redirects
are using (different) ND-based approach described in RFC 4861. This change
is similar to r241406 which conditionally skips the same check in IPv4.
Cleanup documentation: cloning route support has been removed in r186119.
This change is part of bigger patch eliminating rte locking.
mav [Mon, 26 Nov 2012 16:19:27 +0000 (16:19 +0000)]
MFC r238943:
Add several performance optimizations to acpi_cpu_idle().
For C1 and C2 states use cpu_ticks() to measure sleep time instead of the much
slower ACPI timer. We can't do it for C3, as TSC may stop there. But it is
less important there as wake up latency is high anyway.
For C1 and C2 states do not check/clear bus mastering activity status, as
it is important only for C3. As a side effect it can make the CPU enter C2 instead
of C3 if the last BM activity was two sleeps back (unlike one before), but
that may even be good because of collecting more statistics. Premature BM
wakeup from C3, entered because of overestimation, can easily be worse than
entering C2 from both performance and power consumption points of view.
Together, on a dual Xeon E5645 system running a sequential 512-byte read test, this
change makes cpu_idle_acpi() as fast as the simplest cpu_idle_hlt() and only
a few percent slower than cpu_idle_mwait(), while deeper states are still
actively used during idle periods.
To help with diagnostics, add C-state type into dev.cpu.X.cx_supported.
yongari [Mon, 26 Nov 2012 04:40:26 +0000 (04:40 +0000)]
MFC r242426:
TCP/UDP checksum offloading feature for IP fragmented datagram was
removed in r99417. bge(4) controllers can do TCP checksum offload
for IP fragmented datagrams but unlike ti(4), it lacks UDP checksum
offloading for IP fragmented datagrams. The problem was bge(4)
blindly requested TCP/UDP checksum for IP fragmented datagrams such
that it resulted in corrupted UDP datagrams before r99417.
Remove remaining code for TCP checksum offloading for IP fragmented
datagrams which should have been removed in r99417.
yongari [Mon, 26 Nov 2012 04:34:53 +0000 (04:34 +0000)]
MFC r241983-241985:
r241983:
Do not hardcode phy address. Multi-port controllers use different phy
address.
r241984:
Ethernet@WireSpeed is defined for 1000baseT adapter to establish a
link at a lower speed so enabling it for fiber adapters is wrong.
Fix the issue by setting BGE_PHY_NO_WIRESPEED such that brgphy(4)
wouldn't enable the feature.
While I'm here move PHY specific feature/bug configuration to new
location(just before mii attach) for readability.
r241985:
For fast ethernet controllers, Ethernet@WireSpeed is not defined so
explicitly set BGE_PHY_NO_WIRESPEED flag.
yongari [Mon, 26 Nov 2012 04:26:27 +0000 (04:26 +0000)]
MFC r241438:
Add APE firmware support and improve firmware handshake procedure.
This change will enable IPMI access on 5717/5718/5719/5720 and 5761
controllers. Because ASF is not available when APE firmware is
present, bge_allow_asf tunable is ignored when driver detects APE
firmware. Also bge(4) no longer performs two resets(one blind
reset and the other reset with firmware in mind) in device attach.
Now bge(4) performs a reset with enough information in bge_reset().
The APE firmware also needs special handling to make suspend/resume
work but that is not implemented yet.
With this change, bge(4) should work on any 5717/5718/5719/5720
controllers. Special thanks to Mike Hibler at Emulab who set up
remote debugging on Dell R820. Without his help I wouldn't have been
able to address several issues that happened on Dell Rx20 systems. And many
thanks to Broadcom for continuing to support FreeBSD!
Submitted by: davidch (initial version)
H/W donated by: Broadcom
Tested by: many
Tested on: Dell R820/R720/R620/R420/R320 and HP Proliant DL 360 G8
yongari [Mon, 26 Nov 2012 04:20:59 +0000 (04:20 +0000)]
MFC r241437:
For 5717C/5719C/5720C and 57765 PHYs, do not perform any special
handling(jumbo, wire speed etc) in brgphy_reset(). Touching
BRGPHY_MII_AUXCTL register seems to confuse APE firmware such that
it couldn't establish a link.
yongari [Mon, 26 Nov 2012 04:11:12 +0000 (04:11 +0000)]
MFC r241436:
Rework controller reset procedure. Previously driver saved
BGE_PCI_PCISTATE register before issuing global reset. After
issuing reset, it reads BGE_PCI_PCISTATE register again and
compares the saved register value and current value. It was used to
know whether the global reset operation was completed or not.
Unfortunately, this logic caused several issues on recent BCM5717/
5718/5719 and BCM5720 controllers. It seems APE firmware accesses
some registers while global reset is in progress such that reading
BGE_PCI_PCISTATE register after reset does not yield old pre-reset
state value. This resulted in consuming too much time in global
reset and sometimes it couldn't successfully complete reset.
The BGE_MISCCFG_RESET_CORE_CLOCKS of BGE_MISC_CFG register is
self-clearing bit so driver is able to know the reset completion.
But the core-clock reset will disable indirect/flat/standard access
modes such that driver cannot poll BGE_MISCCFG_RESET_CORE_CLOCKS
bit of BGE_MISC_CFG register. So just wait enough time for
core-clock reset to complete.
Data sheet says driver should wait 100us for PCI/PCI-X devices and
100ms for PCIe devices. I chose 1ms for PCI/PCI-X since this value
was used for many years in bge(4). For PCIe devices, use 100ms as
recommended by data sheet.
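The delay choice described above amounts to a small selection by bus
type. A minimal sketch, assuming a hypothetical enum bus_type and
helper name; the real driver's flags and delay calls are not shown here.

```c
/* Bus types the controller may sit on (hypothetical enum). */
enum bus_type { BUS_PCI, BUS_PCIX, BUS_PCIE };

/*
 * Time to wait after a core-clock reset, in microseconds.  The reset
 * bit cannot be polled, so a fixed delay is used instead.
 */
static unsigned int
core_clock_reset_delay_us(enum bus_type bus)
{
	if (bus == BUS_PCIE)
		return (100 * 1000);	/* 100ms, as the data sheet recommends */
	return (1000);			/* 1ms; the data sheet asks for 100us */
}
```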
bge_chipinit() also cleared BGE_MAC_MODE register which shall clear
firmware configured mode information. I think this will result in
losing ASF/IPMI link in device attachment. Let bge_reset() honor
firmware configured BGE_MAC_MODE register and don't announce driver
is UP in bge_reset(). Firmware should have control over driver until
it's fully initialized by driver.
While I'm here, enable workaround for PCI-X BCM5704 A0 in
bge_reset(). This will prevent internal arbitration logic from
switching to the other DMA engine after a retry cycle.
yongari [Mon, 26 Nov 2012 02:42:19 +0000 (02:42 +0000)]
MFC r241388-241393:
r241388:
If the maximum payload size is 256 bytes or more, set the DMA write
water mark to 256 bytes. Otherwise controller will encounter DMA
write under run errors and would result in RX DMA hang. If the
maximum payload size is 128 bytes, the water mark is set to 128
bytes as usual.
While here, set maximum read request size to 2048 for BCM5719/BCM5720.
For other PCIe devices, use 4096. And reprogram the maximum read
request size whenever device reset is performed.
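The two r241388 rules can be sketched as plain selection functions. The
function names and the boolean parameter are assumptions made for
illustration; register programming is left out.

```c
/* DMA write water mark in bytes for a given PCIe maximum payload size. */
static unsigned int
dma_write_watermark(unsigned int max_payload)
{
	return (max_payload >= 256 ? 256 : 128);
}

/* Maximum read request size in bytes for PCIe devices. */
static unsigned int
max_read_request(int is_bcm5719_or_bcm5720)
{
	return (is_bcm5719_or_bcm5720 ? 2048 : 4096);
}
```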
r241389:
On PHY write error use hex number to show the value.
Add more comments.
r241390:
Honor PHY type fiber for BCM5717/BCM5718/BCM5719/BCM5720.
r241391:
Do not force PCIe 1.0a mode in device reset on BCM5717 and newer
controllers. BCM5785 does not require PCIe 1.0a mode during reset
either.
r241392:
Fix a long standing VCPU reset sequence bug on BCM5906.
The VCPU(Virtual CPU) of BCM5906 is used to provide a mechanism to
control the bootcode execution and to pick up configuration data
stored inside the EEPROM.
The bootcode of BCM5906 will check the BGE_VCPU_STATUS_DRV_RESET
bit to decide which booting procedure to choose.
Data sheet indicates the VCPU of BCM5906 should set
BGE_VCPU_STATUS_DRV_RESET bit *before* VCPU reset or global reset.
r241393:
Remove unnecessary delay. I don't see any comments in data sheet
that requires 10ms delay after device reset. Because that code was
there from day 1, I guess it was added to give enough settlement
time after updating BGE_MAC_MODE register.
The recommended delay time for BGE_MAC_MODE after updating is 40us
and it was already done in r241219.
remko [Fri, 23 Nov 2012 21:27:26 +0000 (21:27 +0000)]
MFC r232486
Add an ifconfig carp option that enables users to set
the state of the carp cluster.
This is a direct commit to stable/9 because -HEAD's
code is very different. I discussed this with Gleb
and the reason for this is that since we do not
touch the kernel itself and are not adding very
weird or confusing things, we can commit this to the
stable branch directly.
The options 'master' and 'backup' are now available,
which enables the administrator to force a node into
the backup or master state on the cluster. Of course
preempt has to be disabled, otherwise the master node
will become master again.
One can do that with:
sysctl net.inet.carp.preempt=0
After that one can schedule maintenance on the node
normally running as the master and such.
nyan [Fri, 23 Nov 2012 15:44:04 +0000 (15:44 +0000)]
MFC: r225977, r242867, r242868, r242869
MFi386: r225936
Add some improvements in the idle table callbacks:
- Replace instances of manual assembly instruction "hlt" call
with halt() function calling.
- In cpu_idle_mwait() avoid races in check to sched_runnable() using
the same pattern used in cpu_idle_hlt() with the 'hlt' instruction.
- Add comments explaining the logic behind the pattern used in
cpu_idle_hlt() and other idle callbacks.
MFi386: r211924
Register an interrupt vector for DTrace return probes.
Fix some KASSERTs.
They are missing changes from r208833, r227394 and r227442.
trociny [Mon, 19 Nov 2012 21:11:58 +0000 (21:11 +0000)]
MFC r240997:
Kernel and modules have "set_vnet" linker set, where virtualized
global variables are placed. When a module is loaded by link_elf
linker its variables from "set_vnet" linker set are copied to the
kernel "set_vnet" ("modspace") and all references to these variables
inside the module are relocated accordingly.
The issue is when a module is loaded that has references to global
variables from another, previously loaded module: these references are
not relocated so an invalid address is used when the module tries to
access the variable. The example is V_layer3_chain, defined in ipfw
module and accessed from ipfw_nat.
The same issue is with DPCPU variables, which use "set_pcpu" linker
set.
Fix this by making the link_elf linker on a module load recognize
"external" DPCPU/VNET variables defined in the previously loaded
modules and relocate them accordingly. For this set_pcpu_list and
set_vnet_list are used, where the addresses of modules' "set_pcpu" and
"set_vnet" linker sets are stored.
Note, archs that use link_elf_obj (amd64) were not affected by this
issue.
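The relocation step can be pictured as a range lookup over the recorded
linker sets. The sketch below is a simplified model, not the link_elf
code: struct set_entry and relocate_external() are hypothetical names,
and `base` stands for wherever a module's set was copied.

```c
#include <stddef.h>
#include <stdint.h>

/* One loaded module's "set_vnet" (or "set_pcpu") linker set. */
struct set_entry {
	uintptr_t start;	/* start of the set inside the module image */
	uintptr_t stop;		/* one past the end of the set */
	uintptr_t base;		/* where the set's contents were copied to */
};

/*
 * If `addr` points into a previously loaded module's set, return the
 * address of the copy; otherwise return `addr` unchanged.
 */
static uintptr_t
relocate_external(const struct set_entry *list, size_t n, uintptr_t addr)
{
	for (size_t i = 0; i < n; i++) {
		if (addr >= list[i].start && addr < list[i].stop)
			return (list[i].base + (addr - list[i].start));
	}
	return (addr);
}
```

In this model a reference from ipfw_nat to V_layer3_chain would fall
inside the range recorded for ipfw and be redirected to the copy.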
dim [Sat, 17 Nov 2012 23:39:36 +0000 (23:39 +0000)]
MFC r243037:
Fix a bug in aicasm_gram.y, noted by a newer clang 3.2 snapshot: it
compared an enum scope_type against a yacc-generated define, so the
condition would always be false.
mav [Fri, 16 Nov 2012 03:08:23 +0000 (03:08 +0000)]
MFC r242422:
Only four specific ATA PIO commands transfer several sectors per DRQ block
(interrupt). All other ATA PIO commands transfer one sector or 512 bytes
at one time. Hardcode these exceptions in ata(4) with ATA_CAM option.
This fixes timeout of READ LOG EXT command used by `smartctl -x /dev/adaX`.
mav [Fri, 16 Nov 2012 03:05:27 +0000 (03:05 +0000)]
MFC r242156:
Implement CAM_ATAIO_NEEDRESULT (fetching full set of result registers) for
ata(4) driver in ATA_CAM mode. That slightly improves error reporting and
also should fix `smartctl -l scterc /dev/adaX` operation.
mav [Fri, 16 Nov 2012 03:02:07 +0000 (03:02 +0000)]
MFC r241144, r241160:
Implement SATA revision (speed) control for legacy SATA controller for
both boot (via loader tunables) and run-time (via `camcontrol negotiate`).
Tested to work at least on NVIDIA MCP55 chipset.
mav [Fri, 16 Nov 2012 02:55:03 +0000 (02:55 +0000)]
MFC r232380:
Fix names of some Marvell SATA chips. It looks like chips with proprietary
interface supported by mvs(4) are 88SX, while AHCI-like chips are 88SE.
mav [Thu, 15 Nov 2012 06:04:39 +0000 (06:04 +0000)]
MFC r242417:
ASUS EeePC 1001px has a strange variant of the ALC269 CODEC that mutes the
speaker if the mixer at NID 15, unused in that configuration, is muted. Probably
the CODEC incorrectly reports its internal connections. Hide that muter from the
driver to avoid muting and make the built-in speaker work.
There are several different CODECs sharing this ID and I have not enough
information about them and the bug to implement more universal solution.
mav [Thu, 15 Nov 2012 05:58:37 +0000 (05:58 +0000)]
MFC r242357:
Set all pins initial connection status to unknown (2) and then update it
with the real value in regular way if sensing is supported. This fixes
minor inconsistency when playback redirection appeared in undefined state
on boot if headphones were not connected.
mav [Thu, 15 Nov 2012 05:46:02 +0000 (05:46 +0000)]
MFC r240762:
Restore handling of the third argument (id) of hid_start_parse(), same as
it is done in NetBSD/OpenBSD, and as it was here before r205728.
I personally think this API or its implementation is incorrect, as it is not
correct to filter collections based on report ID, as they are orthogonal
in general case, but I see no harm from supporting this feature.
mav [Thu, 15 Nov 2012 05:34:14 +0000 (05:34 +0000)]
MFC r242314:
Make GEOM RAID more aggressive in marking volumes as clean on shutdown
and move that action from shutdown_pre_sync to shutdown_post_sync stage
to avoid extra flapping.
ZFS tends not to close devices on shutdown, which prevents GEOM RAID
from shutting down gracefully. To handle that, mark the volume as clean as soon
as shutdown time comes and there are no active writes.
hselasky [Tue, 13 Nov 2012 17:11:36 +0000 (17:11 +0000)]
MFC r240750, r241987 and r242126:
Add missing CTLFLAG_TUN flag to tunable sysctls in the USB stack.
Adjust timing parameters of FULL/LOW/HIGH speed USB enumeration
and make these timing parameters tunable. This patch will fix
enumeration with some USB devices.
Fix a typo.
dim [Mon, 12 Nov 2012 07:47:19 +0000 (07:47 +0000)]
MFC r242625:
Remove duplicate const specifiers in many drivers (I hope I got all of
them, please let me know if not). Most of these are of the form:
static const struct bzzt_type {
[...list of members...]
} const bzzt_devs[] = {
[...list of initializers...]
};
The second const is unnecessary, as arrays cannot be modified anyway,
and if the elements are const, the whole thing is const automatically
(e.g. it is placed in .rodata).
I have verified this does not change the binary output of a full kernel
build (except for build timestamps embedded in the object files).
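For comparison, a compilable sketch of the form with a single const.
The bzzt names follow the placeholder used above; the members, the
initializers and bzzt_lookup() are made up for illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* One const on the struct type is enough to make every element const. */
static const struct bzzt_type {
	uint16_t	 id;
	const char	*name;
} bzzt_devs[] = {
	{ 0x0001, "Bzzt model A" },
	{ 0x0002, "Bzzt model B" },
};

/* Look up a device name by id; returns NULL if the id is unknown. */
static const char *
bzzt_lookup(uint16_t id)
{
	size_t n = sizeof(bzzt_devs) / sizeof(bzzt_devs[0]);

	for (size_t i = 0; i < n; i++) {
		if (bzzt_devs[i].id == id)
			return (bzzt_devs[i].name);
	}
	return (NULL);
}
```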
eadler [Fri, 9 Nov 2012 00:35:54 +0000 (00:35 +0000)]
MFC r242462:
10 years too late, add support for "2.88MB 3.5in Extra High Density"
floppies.
It's unlikely that anyone actually uses or cares about these
anymore, but since we support other floppy types and this change doesn't
hurt, just add it.
yongari [Thu, 8 Nov 2012 02:08:42 +0000 (02:08 +0000)]
MFC r242425:
Remove TCP/UDP checksum offloading feature for IP fragmented
datagrams. Traditionally upper stack fragmented packets without
computing TCP/UDP checksum and these datagrams were passed to
driver. But there are chances that other packets slip into the
interface queue in SMP world. If this happens firmware running on
MIPS 4000 processor in the controller would see mixed packets and
it shall send out corrupted packets.
While I'm here simplify checksum offloading setup.
mav [Tue, 6 Nov 2012 02:08:09 +0000 (02:08 +0000)]
MFC r241329:
Make graid command line a bit more friendly by allowing volume name or
provider name to be specified instead of geom name (first argument in all
subcommands except label). In most cases there is only one array used
anyway, so it is not really useful to make the user type ugly geom names like
Intel-f0bdf223 or SiI-732c2b9448cf. Though they can be used in some cases.
yongari [Tue, 6 Nov 2012 01:04:46 +0000 (01:04 +0000)]
MFC r242348:
TSO engine of L1 requires a separate DMA descriptor for TCP
payload. This means driver has to split a TX buffer into two
pieces of TX buffers when the TX buffer contains both
ethernet/IP/TCP header and partial TCP payload. The controller
does not require all header should be in a TX buffer but driver
forced it to compute IP/TCP header size/offset which is required
parameter to configure DMA descriptor for TSO.
While here, slightly reorder DMA descriptor setup to enhance
readability and remove unnecessary code for TSO(upper stack never
requests TSO when the frame length is less than or equal to MTU).
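The buffer split described above can be sketched as follows, assuming
the header length (ethernet + IP + TCP) has already been computed.
struct tx_seg and split_tx_buffer() are hypothetical names; the real
driver works on DMA segments and descriptors rather than plain offsets.

```c
#include <stddef.h>

/* One piece of a TX buffer: offset into the buffer and length. */
struct tx_seg {
	size_t off;
	size_t len;
};

/*
 * Split a TX buffer at the end of the headers so that the TCP payload
 * gets its own segment.  Returns the number of segments written (1 or
 * 2).  A buffer that holds no payload past the headers is left whole.
 */
static int
split_tx_buffer(size_t buf_len, size_t hdr_len, struct tx_seg segs[2])
{
	if (hdr_len == 0 || buf_len <= hdr_len) {
		segs[0].off = 0;
		segs[0].len = buf_len;
		return (1);
	}
	segs[0].off = 0;
	segs[0].len = hdr_len;			/* headers only */
	segs[1].off = hdr_len;
	segs[1].len = buf_len - hdr_len;	/* TCP payload only */
	return (2);
}
```

For example, a 1000-byte buffer that starts with 54 bytes of headers
(14 ethernet, 20 IP, 20 TCP, no options) yields a 54-byte segment and
a 946-byte segment.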
eadler [Tue, 6 Nov 2012 00:55:43 +0000 (00:55 +0000)]
MFC r242514,r242541:
Revert the change that makes less default.
Since I've committed this I've received roughly an equal
amount of email thanking me for making this change
and asking me to revert it.
I've resisted making this change because
new users tend to prefer less over more
and these users are the least likely to know
how to change the PAGER on their own.
Requested by: many
Objected to: just as many
Decision made by: core
====
Change default prompt to show ~ again for the home directory
des [Mon, 5 Nov 2012 10:45:37 +0000 (10:45 +0000)]
MFH r225813, r233648: man page fixes
MFH r234837: avoid busy-loop on slow connections
MFH r234838: don't reuse credentials when redirected to another host
MFH r240496: use libmd if and only if OpenSSL is not available
jamie [Fri, 2 Nov 2012 01:32:22 +0000 (01:32 +0000)]
MFC r225191:
Delay the recursive decrement of pr_uref when jails are made invisible
but not removed; decrement it instead when the child jail actually
goes away. This avoids letting the counter go below zero in the case
where dying (pr_uref==0) jails are "resurrected", and an associated
KASSERT panic.