adrian [Sat, 12 Sep 2015 23:10:34 +0000 (23:10 +0000)]
if_rsu debug fixes:
* use an ath/iwn style debug bitmap - it's still global rather than per-device,
but it's better than debug levels
* disable bgscan - it just makes things unstable/unpredictable for now.
marius [Sat, 12 Sep 2015 22:49:32 +0000 (22:49 +0000)]
- Factor out the common and generic parts of the sparc64 host-PCI-bridge
drivers into the revived sys/sparc64/pci/ofw_pci.c, previously already
serving a similar purpose. This has been done with sun4v in mind, which
explains a) the otherwise not that obvious scheme employed and b) why
reusing sys/powerpc/ofw/ofw_pci.c was even lesser an option.
- Add a workaround for QEMU once again not emulating real machines, in
this case by not providing the OFW_PCI_CS_MEM64 range. [1]
Cleanup the handling of error causes for ERROR chunks. This fixes
an inconsistency of the padding handling. The final padding is
now considered to be a chunk padding.
In r286570 (MFV of r277426) an unprotected write to b_flags to
set the compression mode was introduced. This would open a race
window where data is partially decompressed, modified, checksummed
and written to the pool, resulting in pool corruption due to the
partial decompression.
How many demand read didn't have to wait for I/O
because of predictive prefetch. (more is better)
zfetch kstats have been similified to hits, misses, and max_streams,
with max_streams representing times when we were not able to create
new stream because we already have the maximum number of sequences
for a file.
The sysctl variable/loader tunable vfs.zfs.zfetch.block_cap have been
replaced by vfs.zfs.zfetch.max_distance, which controls maximum bytes
to prefetch per stream.
The <arch>/mkisoimages.sh script in release knows how to add
extra bits from an "xtra-bits-dir". This feature is unusable
from release/Makefile. Add an XTRADIR setting to use it.
radeon_suspend_kms: don't mess with pci state that's managed by the bus
The pci bus driver handles the power state, it also manages
configuration state saving and restoring for its child devices. Thus a
PCI device driver does not have to worry about those things. In fact, I
observe a hard system hang when trying to suspend a system with active
radeonkms driver where both the bus driver and radeonkms driver try to
do the same thing. I suspect that it could be because of an access to a
PCI configuration register after the device is placed into D3 state.
Close races between device close and request processing.
All requests arriving for processing after OFFLINE flag set are rejected
with BUSY status. Races around OFFLINE flag setting are closed by calling
taskqueue_drain_all().
Ensure that ERROR chunks are always padded by implementing this
in the routine, which queues an ERROR chunk, instead on relyinh
on the callers to do so. Since one caller missed this, this actially
fixes a bug.
Use strlcpy() in favor of strncpy() as it's defined to have a nul character
at the end of string buffer, and the code context do expects this to behave
correctly (e.g. strchr).
Note that we do not believe there is real-world impact for gstat(8)'s usage
because the strings are length checked, and the on-stack buffer belongs to
main() and we can expect to have zeros in them.
Add stack_save_td_running(), a function to trace the kernel stack of a
running thread.
It is currently implemented only on amd64 and i386; on these
architectures, it is implemented by raising an NMI on the CPU on which
the target thread is currently running. Unlike stack_save_td(), it may
fail, for example if the thread is running in user mode.
This change also modifies the kern.proc.kstack sysctl to use this function,
so that stacks of running threads are shown in the output of "procstat -kk".
This is handy for debugging threads that are stuck in a busy loop.
MFV r283513:
5930 fasttrap_pid_enable() panics when prfind() fails in forking process
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Gordon Ross <gordon.ross@nexenta.com>
Approved by: Richard Lowe <richlowe@richlowe.net>
Author: Bryan Cantrill <bryan@joyent.com>
Handle default MAKEOBJDIR for META_MODE.
If MAKEOBJDIRPREFIX is set, use it for default OBJROOT.
If MAKEOBJDIR is empty or not a suitable value (no '/')
set a default that works.
- Avoid accessing window properties directly, instead, use accessors.
This should be no-op for now, but allows the code to work if we
move to NCURSES_OPAQUE.
- Use calloc() instead of malloc+bzero.
Do not hold the process around the vm_fault() call from the trap()s.
The only operation which is prevented by the hold is the kernel stack
swapout for the faulted thread, which should be fine to allow.
Remove useless checks for NULL curproc or curproc->p_vmspace from the
trap_pfault() wrappers on x86 and powerpc.
Reviewed by: alc (previous version)
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
CTL HA functionality was originally implemented by Copan many years ago,
but large part of the sources was never published. This change includes
clean room implementation of the missing code and fixes for many bugs.
This code supports dual-node HA with ALUA in four modes:
- Active/Unavailable without interlink between nodes;
- Active/Standby with second node handling only basic LUN discovery and
reservation, synchronizing with the first node through the interlink;
- Active/Active with both nodes processing commands and accessing the
backing storage, synchronizing with the first node through the interlink;
- Active/Active with second node working as proxy, transfering all
commands to the first node for execution through the interlink.
Unlike original Copan's implementation, depending on specific hardware,
this code uses simple custom TCP-based protocol for interlink. It has
no authentication, so it should never be enabled on public interfaces.
The code may still need some polishing, but generally it is functional.
Zero out a local variable also when PURIFY is not defined.
This silence a warning brought up by valgrind whenever if_nametoindex
is used. This was already discussed in PR 166483, but the code
committed in r234329 guards the initilization with #ifdef PURIFY.
Therefore, valgrind still complains. Since this code is not performance
critical, always zero out the local variable to silence valgrind.
adrian [Thu, 10 Sep 2015 04:05:58 +0000 (04:05 +0000)]
Also make kern.maxfilesperproc a boot time tunable.
Auto-tuning threshold discussions aside, it turns out that if you want
to lower this on say, rather memory-packed machines, you either set maxusers
or kern.maxfiles, or you set it in sysctl. The former is a non-exact
way to tune this; the latter doesn't actually affect anything in the
startup scripts.
This first occured because I wondered why the hell screen would take upwards
of 10 seconds to spawn a new screen. I then found python doing the same
thing during fork/exec of child processes - it calls close() on each FD
up to the current openfiles limit. On a 1TB machine this is like, 26 million
FDs per process. Ugh.
So:
* This allows it to be set early in /boot/loader.conf;
* It can be used to work around the ridiculous situation of
screen, python, etc doing a close() on potentially millions of FDs
even though you only have four open.
Tested:
* 4GB, 32GB, 64GB, 128GB, 384GB, 1TB systems with autotune, ensuring
screen and python forking doesn't result in some pretty hilariously
bad behaviour.
TODO:
* Note that the default login.conf sets openfiles-cur to unlimited,
effectively obeying kern.maxfilesperproc. Perhaps we should fix
this.
* .. and even if we do, we need to also ensure that daemons get
a soft limit of something reasonable and capped - they can request
more FDs themselves.
For open("name", O_DIRECTORY | O_CREAT), do not try to create the
named node, open(2) cannot create directories. But do allow the flag
combination to succeed if the directory already exists.
Declare the open("name", O_DIRECTORY | O_CREAT | O_EXCL) always
invalid for the same reason, since open(2) cannot create directory.
Note that there is an argument that O_DIRECTORY | O_CREAT should be
invalid always, regardless of the target directory existence or
O_EXCL. The current fix is conservative and allows the call to
succeed in the situation where it succeeded before the patch.
Reported by: Tom Ridge <freebsd@tom-ridge.com>
Reviewed by: rwatson
PR: 202892
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
andrew [Wed, 9 Sep 2015 11:51:14 +0000 (11:51 +0000)]
Rework copyinstr to:
* Fail when the length passed in is 0
* Remove an unneeded increment of the count on success
* Return ENAMETOOLONG when the input pointer is too long
Remove a check which caused spurious SIGSEGV on usermode access to the
mapped address without valid pte installed, when parallel wiring of
the entry happen. The entry must be copy on write. If entry is COW
but was already copied, and parallel wiring set
MAP_ENTRY_IN_TRANSITION, vm_fault() would sleep waiting for the
MAP_ENTRY_IN_TRANSITION flag to clear. After that, the fault handler
is restarted and vm_map_lookup() or vm_map_lookup_locked() trip over
the check. Note that this is race, if the address is accessed after
the wiring is done, the entry does not fault at all.
There is no reason in the current kernel to disallow write access to
the COW wired entry if the entry permissions allow it. Initially this
was done in r24666, since that kernel did not supported proper
copy-on-write for wired text, which was fixed in r199869. The r251901
revision re-introduced the r24666 fix for the current VM.
Note that write access must clear MAP_ENTRY_NEEDS_COPY entry flag by
performing COW. In reverse, when MAP_ENTRY_NEEDS_COPY is set in
vmspace_fork(), the MAP_ENTRY_USER_WIRED flag is cleared. Put the
assert stating the invariant, instead of returning the error.
Reported and debugging help by: peter
Reviewed by: alc
Sponsored by: The FreeBSD Foundation
MFC after: 1 week