markj [Tue, 5 Aug 2014 00:25:46 +0000 (00:25 +0000)]
MFC r267706:
Allow creation of SDT probes from a module in which no providers are
defined. This ensures that the sdt:zfs:: probes appear despite the fact
the sdt provider is defined in the kernel rather than in zfs.ko.
markj [Mon, 4 Aug 2014 21:41:00 +0000 (21:41 +0000)]
MFC r256822:
When fetching function arguments out of a frame on amd64, explicitly select
the register based on the argument index rather than relying on the fields
in struct reg to be in the right order. This assumption is incorrect on
FreeBSD and generally led to bogus argument values for the sixth argument
of PID and USDT probes; the first five are passed directly to dtrace_probe()
via the fasttrap trap handler and so were correctly handled.
markj [Mon, 4 Aug 2014 15:36:22 +0000 (15:36 +0000)]
MFC r256571:
Add a function, memstr, which can be used to convert a buffer of
null-separated strings to a single string. This can be used to print the
full arguments of a process using execsnoop (from the DTrace toolkit) or
with the following one-liner:
Note that this relies on the process arguments being cached via the struct
proc, which means that it will not work for argvs longer than
kern.ps_arg_cache_limit. However, the following rather non-portable
script can be used to extract any argv at exec time:
dim [Mon, 4 Aug 2014 14:56:49 +0000 (14:56 +0000)]
MFC r269125:
In r232153, libarchive 3.0.3 was imported, replacing the archive_hash.h
header with archive_crypto_private.h, and its ARCHIVE_HASH_xxx macros
were renamed to ARCHIVE_CRYPTO_xxx.
Rename these macros in lib/libarchive/config_freebsd.h, to re-enable the
hashes for libarchive again. This affects the mtree format writer, and
the xar format reader and writer modules.
This also requires changes in the library order for statically linking
rescue, otherwise ld would complain about redefined symbols. Thanks to
jkim for pointing out the solution.
pfg [Mon, 4 Aug 2014 00:51:57 +0000 (00:51 +0000)]
MFC r268945:
Fix hdestroy() compliance issue.
The hcreate(3) implementation and related functions we inherited
from NetBSD used to free() the key value, something that is not
supported by the standard implementation.
This would cause a segmentation fault when attempting to run
the examples from the opengroup and linux manpages.
There is no need to bump the __FreeBSD_version as we have
always claimed XPG4.2 compliance but if some reference is
required, the bump for r269484 can be used.
peter [Sun, 3 Aug 2014 22:59:47 +0000 (22:59 +0000)]
Insta-MFC r269489: partial revert of r262867 which was MFC'ed as r263820.
Don't ignore sndbuf/rcvbuf limits for SOCK_DGRAM sockets. This appears
to be an edit error or patch fuzz mismatch.
pfg [Sun, 3 Aug 2014 18:39:11 +0000 (18:39 +0000)]
MFC r268066:
regex(3): Add support for \< and \> word delimiters
Solaris and other OSs have support for \< and \> as word
delimiters in utilities like sed(1). These are useful to
have for general compatiblity with Solaris but should be
avoided for portability with other systems, including the
traditional BSDs.
Bump __FreeBSD_version as this is likely to affect some
userland utilities.
pfg [Sun, 3 Aug 2014 18:31:52 +0000 (18:31 +0000)]
MFC r269124:
strftime() xlocale cleanups.
Replace fprintf_l with fputs when output is unformatted.
Use locale_t in _conv() since it was using sprintf (now sprintf_l)
Use locale_t on _yconv() since it calls _conv()
Obtained from: Apple Inc. (Libc 997.90.3)
CR: D482
Reviewed by: theraven
rmacklem [Sun, 3 Aug 2014 00:35:10 +0000 (00:35 +0000)]
MFC: r268273
The new NFSv3 server did not generate directory postop attributes for
the reply to ReaddirPlus when the server failed within the loop
that calls VFS_VGET(). This failure is most likely an error
return from VFS_VGET() caused by a bogus d_fileno that was
truncated to 32bits.
This patch fixes the server so that it will return directory postop
attributes for the failure. It does not fix the underlying issue caused
by d_fileno being uint32_t when a file system like ZFS generates
a fileno that is greater than 32bits.
marcel [Sat, 2 Aug 2014 22:25:24 +0000 (22:25 +0000)]
MFC 259910, 260023, 260028, 260600 & 260701:
o Fix "kptdir is itself virtual" error, caused by having the kptdir in PBVM.
o Allow building a cross libkvm for ia64.
o Add support for virtual cores (aka minidumps).
o We don't have to worry about page sizes when working on virtual cores.
o Handle truncation of the size returned by _kvm_kvatop().
hselasky [Sat, 2 Aug 2014 20:58:46 +0000 (20:58 +0000)]
Partial MFC of r267961, r267973, r267985, r267992, r267993 and r268005:
Backport some macro definitions to make backporting code from FreeBSD
current easier.
mav [Sat, 2 Aug 2014 06:56:00 +0000 (06:56 +0000)]
MFC r269123:
Implement separate I/O dispatch method for ZVOLs in "dev" mode.
Unlike disk devices ZVOLs process all requests synchronously. That makes
impossible sending multiple requests to them from single thread. From the
other side ZVOLs have real d_read/d_write methods, which unlike d_strategy
can handle uio scatter/gather and have no strict I/O size limitations.
So, if ZVOL in "dev" mode is detected, use of d_read/d_write methods instead
of d_strategy allows to avoid pointless splitting of large requests into
MAXPHYS (128K) sized chunks.
delphij [Sat, 2 Aug 2014 04:06:35 +0000 (04:06 +0000)]
MFC r268865: MFV r268852:
Reduce lock contention on the z_teardown_lock under heavily cached
read workload by splitting the single teardown rrw lock into
RRM_NUM_LOCKS (17) of them.
Read acquisitions are randomly distributed among these locks based
on curthread pointer. Write acquisitions are going to all the
locks, which for the usage of this type of lock should be rare.
Illumos issue:
5008 lock contention (rrw_exit) while running a read only load
delphij [Sat, 2 Aug 2014 04:01:44 +0000 (04:01 +0000)]
MFC r268859: MFV r268851:
When a sync task is waiting for a txg to complete, we should hurry it along
by increasing the number of outstanding async writes (i.e. make
vdev_queue_max_async_writes() return a larger number).
Illumos issue:
4753 increase number of outstanding async writes when sync task is waiting
delphij [Sat, 2 Aug 2014 03:59:35 +0000 (03:59 +0000)]
MFC r268858: MFV r268850:
Change the interaction between the DMU and ARC so that when the DMU is
shutting down an objset, we do not evict the data from the ARC. Instead
we simply coordinate the destruction of the DMU's data with the ARC.
The only case where we actually need to explicitly evict from the ARC is
when dbuf_rele_and_unlock() determines that the administrator has requested
that it not be kept in memory, via the primarycache/secondarycache properties.
In this case, we evict the data from the ARC by its blkptr_t, the same way
as when a block is freed we explicitly evict it from the ARC.
Illumos issue:
4631 zvol_get_stats triggering too many reads
delphij [Sat, 2 Aug 2014 03:56:06 +0000 (03:56 +0000)]
MFC r268855: MFV r268848:
Instead of asserting all zio's be properly aligned, only assert
on the logical ones.
Cap uberblocks at 8k, otherwise with ashift=17, there would be
only one uberblock.
This fixes a problem that zdb would trip assert on pools with
ashift >= 0xe (8k).
While there, also change the code so it only attempt to condense
space map unless the uncondensed size consumes greater than
zfs_metaslab_condense_block_threshold blocks.
Illumos issue:
4958 zdb trips assert on pools with ashift >= 0xe
rmacklem [Fri, 1 Aug 2014 21:10:41 +0000 (21:10 +0000)]
MFC: r268115
Merge the NFSv4.1 server code in projects/nfsv4.1-server over
into head. The code is not believed to have any effect
on the semantics of non-NFSv4.1 server behaviour.
It is a rather large merge, but I am hoping that there will
not be any regressions for the NFS server.
jhb [Fri, 1 Aug 2014 21:00:18 +0000 (21:00 +0000)]
MFC 256657,257423,264837,267559:
Sync vmrun.sh with HEAD:
- Add -e option to vmrun.sh passed to bhyveload(8) to set loader
environment variables.
- Stop passing unused -I option to bhyve(8).
- Reformat the usage to fit in 80 colums and other cleanups.
- Add -C option to specify the console device.
- Add -H option to pass a host path to bhyveload(8).
- Support for multiple disk and tap devices.
truckman [Fri, 1 Aug 2014 15:04:46 +0000 (15:04 +0000)]
MFC r268780
Nuke the never-used RF_TIMESHARE feature, reducing the complexity of the
code. The consensus on arch@ is that this feature might have been useful
in the distant past, but is now just unnecessary bloat.
The int_rman_activate_resource() and int_rman_deactivate_resource()
functions become trivial, so manually inline them.
The special deferred handling of RF_ACTIVE is no longer needed in
reserve_resource_bound(), so eliminate the associated code at the
end of the function.
These changes reduce the object file size by more than 500 bytes on i386.
Update the rman.9 man page to reflect the removal of the RF_TIMESHARE
feature.
Add a 'raw' parameter to the 'modinfo' subcommand. This is handy when
trying to figure out why a QSFP+/SFP+ connector or cable wasn't
identified correctly by cxgbe(4). Its output looks like this:
r268971:
Simplify r267600, there's no need to distinguish between allocated and
inlined mbufs.
r269032:
cxgbe(4): Keep track of the clusters that have to be freed by the
custom free routine (rxb_free) in the driver. Fail MOD_UNLOAD with
EBUSY if any such cluster has been handed up to the kernel but hasn't
been freed yet. This prevents a panic later when the cluster finally
needs to be freed but rxb_free is gone from the kernel.
MFC r264434:
DTrace's pid provider works by inserting breakpoint instructions at probe
sites and installing a hook at the kernel's trap handler. The fasttrap code
will emulate the overwritten instruction in some common cases, but otherwise
copies it out into some scratch space in the traced process' address space
and ensures that it's executed after returning from the trap.
In Solaris and illumos, this (per-thread) scratch space comes from some
reserved space in TLS, accessible via the fs segment register. This
approach is somewhat unappealing on FreeBSD since it would require some
modifications to rtld and jemalloc (for static TLS) to ensure that TLS is
executable, and would thus introduce dependencies on their implementation
details. I think it would also be impossible to safely trace static binaries
compiled without these modifications.
This change implements the functionality in a different way, by having
fasttrap map pages into the target process' address space on demand. Each
page is divided into 64-byte chunks for use by individual threads, and
fasttrap's process descriptor struct has been extended to keep track of
any scratch space allocated for the corresponding process.
With this change it's possible to trace all libc functions in a program,
e.g. with
MFC r268808:
Increase maximal number of SCSI ports in CTL from 32 to 128.
After I gave each iSCSI target its own port, the old limit appeared to be
not so big. This change almost proportionally increases per-LUN memory
use, but it is still three times better then it was before r268807.
MFC r268807:
Reduce per-LUN memory usage from 18MB to 1.8MB.
CTL never had use for CA support code since SPI has gone, and there is no
even frontends supporting that. But it still was reserving 256 bytes of
memory per LUN per every possible initiator on every possible port.
Wrap unused code with ifdef's in case somebody ever need it.
MFC r268767:
Add support for VMWare dialect of EXTENDED COPY command, aka VAAI Clone.
This allows to clone VMs and move them between LUNs inside one storage
host without generating extra network traffic to the initiator and back,
and without being limited by network bandwidth.
LUNs participating in copy operation should have UNIQUE NAA or EUI IDs set.
For LUNs without these IDs VMWare will use traditional copy operations.
Beware: the above LUN IDs explicitly set to values non-unique from the VM
cluster point of view may cause data corruption if wrong LUN is addressed!
MFC: r268726
Move the "retry:" label so that the calls to m_pullup() are
not done after the call to m_defrag(). This fixes a problem
where m_pullup() would prepend an mbuf to the list created
by m_defrag() making the chain greater than 32 again.
MFC r263329:
Only invoke fasttrap hooks for traps from user mode, and ensure that they're
called with interrupts enabled. Calling fasttrap_pid_probe() with interrupts
disabled can lead to deadlock if fasttrap writes to the process' address
space.
MFC r262669:
When our linker merges .SUNW_dof sections from multiple files, it simply
concatenates the DOF tables into one section. Previously, the USDT init
code in drti.o would only look at the first table in the DOF section; with
this change, it iterates over all the tables, passing each DOF table to
the kernel.
marius [Tue, 29 Jul 2014 13:11:37 +0000 (13:11 +0000)]
MFC: r269051
Copying pages via temporary mappings in the !DMAP case of pmap_copy_pages()
involves updating the corresponding page tables followed by accesses to the
pages in question. This sequence is subject to the situation exactly described
in the "AMD64 Architecture Programmer's Manual Volume 2: System Programming"
rev. 3.23, "7.3.1 Special Coherency Considerations" [1, p. 171 f.]. Therefore,
issuing the INVLPG right after modifying the PTE bits is crucial (see also
r269050, MFCed to stable/10 in r269235).
For the amd64 PMAP code, the order of instructions was already correct. The
above fact still is worth documenting, though.
marius [Tue, 29 Jul 2014 13:08:37 +0000 (13:08 +0000)]
MFC: r269050
- Copying and zeroing pages via temporary mappings involves updating the
corresponding page tables followed by accesses to the pages in question.
This sequence is subject to the situation exactly described in the "AMD64
Architecture Programmer's Manual Volume 2: System Programming" rev. 3.23,
"7.3.1 Special Coherency Considerations" [1, p. 171 f.]. Therefore, issuing
the INVLPG right after modifying the PTE bits is crucial.
For pmap_copy_page(), this has been broken in r124956 and later on carried
over to pmap_copy_pages() derived from the former, while all other places
in the i386 PMAP code use the correct order of instructions in this regard.
Fixing the latter breakage solves the problem of data corruption seen with
unmapped I/O enabled when running at least bare metal on AMD R-268D APUs.
However, this might also fix similar corruption reported for virtualized
environments.
- In pmap_copy_pages(), correctly set the cache bits on the source page being
copied. This change is thought to be a NOP for the real world, though. [2]
When doing an "extreme rewind" import ("zpool import -XF"), we attempt
to verify all data in the pool, essentially scrubbing the entire pool.
The problem is that spa_load_verify_cb() issues an unbounded number of
concurrent scrub i/os. This can lead to all of memory being used for
these zio's, wedging the system. Like normal scrub, we need to put a
cap on the number of outstanding i/os, and have the traverse thread
block when we reach this cap.
For this purpose the cap can be very large (10,000) to optimize the
elevator algorithm. Three kernel tunables have been added:
The latter two tunables controls whether metadata and/or user data
when doing extreme rewind.
Make 'zpool import -T' imply scrub.
Make zpool import -T <txg> accept hexadecimal values for the txg when
prefixed with 0x.
Skip txg's for which there is no uberblock when doing extreme rewind.
Skip reading all user data twice by skipping prefetches when doing
extreme rewinds as we do not access via the ARC.
Illumos issues:
4970 need controls on i/o issued by zpool import -XF
4971 zpool import -T should accept hex values
4972 zpool import -T implies extreme rewind, and thus a scrub
4973 spa_load_retry retries the same txg
4974 spa_load_verify() reads all data twice
MFC r268236,268264,268524,268646,268802,269021:
This brings VHD support to mkimg(1); both dynamic and fixed file formats.
Dynamic VHD and VMDK file images are now sparsely written, meaning that
"free" sectors do not occupy space.
MFC r268608:
The tmpfs_link() must not dereference the filesystem-specific data for
a vnode until it is verified that the vnode indeed belongs to tmpfs mount.
MFC r268606:
Generalize vn_get_ino() to allow filesystems to use custom vnode
producer. Convert inline copies of vn_get_ino() in msdosfs and cd9660
into the uses of vn_get_ino_gen().
ian [Fri, 25 Jul 2014 23:29:55 +0000 (23:29 +0000)]
MFC r266565, r266651:
Map device memory using PTE_DEVICE attributes, and also ensure that the
shared flag is set on normal-memory mappings made via pmap_kenter() for SMP.
The "shared flag" part of this change isn't obvious from the diff, here's
the deal... by using the array of preformatted page table entry templates
instead of constructing the PTE from scratch, we automatically get the
right attribute bits set for both caching and shared.
ian [Fri, 25 Jul 2014 23:21:36 +0000 (23:21 +0000)]
MFC r263373, r268402
Add a way to apply CFLAGS only when building the given architecture. This
is useful primarily on a system used for cross-building, when you have a
set of flags to apply to the TARGET_ARCH being cross-built but don't want
those settings applied to building the cross-tools or other components that
run on the build host machine.
Support CXXFLAGS.${MACHINE_ARCH} as well as CFLAGS. This allows different
C++ options for toolchain versus target when cross-building.
ian [Fri, 25 Jul 2014 23:12:22 +0000 (23:12 +0000)]
MFC r261530
Set the malloc alignment to 64 bytes on platforms that use the U-Boot API
device drivers. Recent versions of u-boot run with the MMU enabled, and
require DMA-based I/O to be aligned to cache line boundaries.
r268640:
Allow multi-byte reads in the private CHELSIO_T4_GET_I2C ioctl. The
firmware allows up to 48B to be read this way but the driver limits
itself to 8B at a time to remain compatible with old cxgbetool
binaries.
dim [Thu, 24 Jul 2014 20:16:45 +0000 (20:16 +0000)]
MFC r268957:
Run mtree for BSD.tests.dist during make xdev-install, if the tests are
enabled (which they are in the default configuration). Otherwise, it
will fail because ${XDDESTDIR}/usr/include/atf-c does not exist.
MFC r268466:
Calculate the amount of resident pages by looking in the objects chain
backing the region. Add a knob to disable the residency calculation at
all.
MFC r268490:
Unconditionally initialize addr to handle the case of changed map
timestamp while the map is unlocked.
MFC r268711:
Change the calculation of the kinfo_vmentry field kve_private_resident
to reflect its name.
MFC r268712:
Followup to r268466.
- Move the code to calculate resident count into separate function.
It reduces the indent level and makes the operation of
vmmap_skip_res_cnt tunable more clear.
- Optimize the calculation of the resident page count for map entry.
Skip directly to the next lowest available index and page among the
whole shadow chain.
- Restore the use of pmap_incore(9), only to verify that current
mapping is indeed superpage.
- Note the issue with the invalid pages.