sem_open: Make sure to fail an O_CREAT|O_EXCL open, even if that semaphore
is already open in this process.
If the named semaphore is already open, sem_open() only increments a
reference count and did not take the flags into account (which otherwise
happens by passing them to open()). Add an extra check for O_CREAT|O_EXCL.
PR: kern/166706
Reviewed by: davidxu
MFC after: 10 days
intpm: reflect the fact that SB800 and later AMD chipsets are not supported
They do not have compatible configuration registers in PCI configuration
space. Instead their configuration resides in AMD "PM I/O" space
(accessed via a pair of I/O space registers).
If a page belonging a reservation is cached, then mark the reservation so
that it will be freed to the cache pool rather than the default pool.
Otherwise, the cached pages within the reservation may be recycled sooner
than necessary.
Merge a local fix to OpenBSM's libauditd to avoid a directory descriptor
leak when iterating over possible audit trail directories. This fix will
be merged upstream in an identical form, but hasn't yet appeared in an
OpenBSM release.
When allocation of labels on files is implicitly disabled due to MAC
policy configuration, avoid leaking resources following failed calls
to get and set MAC labels by file descriptor.
Reported by: Mateusz Guzik <mjguzik at gmail.com> + clang scan-build
MFC after: 3 days
Expand locking around identification of filesystem mount point when
accounting for I/O counts at completion of I/O operation. Also switch
from using global devmtx to vnode mutex to reduce contention.
andrew [Sun, 8 Apr 2012 04:36:27 +0000 (04:36 +0000)]
Unlike other functions __aeabi_read_tp function must preserve r1-r3. The
currently generated code clobbers r3. Fix this by loading ARM_TP_ADDRESS
using inline assembly.
- Add a "real" symbol version map to libasn1. The upstream version
of the version map just exported all the symbols, which caused a
binutils bug being triggered when ld fails to link two objects, one
of which exports a versioned version of the symbol, and another --
unversioned. [1]
- Also add version map for libkafs5.
- Add new ARM kernel option QEMU_WORKAROUNDS which can be
used in the code which needs to implement some specific
behaviour when being run under QEMU.
- Make PXA UART probe code to work under QEMU gumstix, which
doesn't emulate all the ports properly.
Properly resolve the _ctx_start function descriptor (the symbol _ctx_start
is a descriptor, not a code address), which prevents crashes when starting
a context. This fixes QEMU on powerpc64.
tmpfs supports only INT_MAX nodes due to limitations of unit number
allocator.
Replace UINT32_MAX checks with INT_MAX. Keeping more than 2^31 nodes in
memory is not likely to become possible in foreseeable feature and would
require new unit number allocator.
joel [Sat, 7 Apr 2012 09:05:30 +0000 (09:05 +0000)]
mdoc: fix column names, indentation, column separation within each row, and
quotation. Also make sure we have the same amount of columns in each row as
the number of columns we specify in the head arguments.
- Do not reinitialize the card if it is already running.
This fixes bootp on if_smc, as bootp code perform SIOCSIFADDR
ioctl call immediately after sending the request (which causes
if_init being called) which causes the adapter to drop all the
packets received in the meantime.
adrian [Sat, 7 Apr 2012 05:48:26 +0000 (05:48 +0000)]
Break out the legacy duration and protection code into routines,
call these after rate control selection is done.
The duration/protection code wasn't working - it expected the rix to
be valid. Unfortunately after I moved the rate control selection into
late in the process, the rix value isn't valid and thus the protection/
duration code would get things wrong.
HT frames are now correctly protected with an RTS and for the AR5416,
this involves having the aggregate frames be limited to 8K.
TODO:
* Fix up the DMA sync to occur just before the frame is queued to the
hardware. I'm adjusting the duration here but not doing the DMA
flush.
* Doubly/triply ensure that the aggregate frames are being limited to
the correct size, or the AR5416 will get unhappy when TXing RTS-protected
aggregates.
adrian [Sat, 7 Apr 2012 03:22:11 +0000 (03:22 +0000)]
Enforce the RTS aggregation limit if RTS/CTS protection is enabled;
if any subframes in an aggregate have different protection from the
first frame in the formed aggregate, don't add that frame to the
aggregate.
This is likely a suboptimal method (I think we'll mostly be OK marking
frames that have seqno's with the same protection as normal data frames)
but I'll just be cautious for now.
adrian [Sat, 7 Apr 2012 02:51:53 +0000 (02:51 +0000)]
Store away the RTS aggregate limit from the HAL.
This will be used by some upcoming code to ensure that aggregates
are enforced to be a certain size. The AR5416 has a limitation on
RTS protected aggregates (8KiB).
ken [Fri, 6 Apr 2012 22:23:13 +0000 (22:23 +0000)]
Change the SCSI INQUIRY peripheral qualifier that CTL reports for LUNs
that don't exist.
Anecdotal evidence indicates that it is better to return 011b (bad LUN)
than 001b (LUN offline). However, this change also gives the user a
sysctl/tunable, kern.cam.ctl.inquiry_pq_no_lun, to override the change
and return to the previous behavior. (The previous behavior was to
return 001b, or LUN offline.)
ctl.c: Change the default inquiry peripheral qualifier to 011b,
and add a sysctl and tunable to allow the user to change
it back to 001b if needed.
Don't insert a Copan copyright statement in the inquiry
data. The copyright statements on the files are
sufficient.
ctl_private.h: Add sysctl variable context to the CTL softc.
ctl_cmd_table.c,
ctl_frontend_internal.c,
ctl_frontend.c,
ctl_backend.c,
ctl_error.c: Include sys/sysctl.h.
Fix interrupt load balancing regression, introduced in revision
222813, that left all un-pinned interrupts assigned to CPU 0.
sys/x86/x86/intr_machdep.c:
In intr_shuffle_irqs(), remove CPU_SETOF() call that initialized
the "intr_cpus" cpuset to only contain CPU0.
This initialization is too late and nullifies the results of calls
the intr_add_cpu() that occur much earlier in the boot process.
Since "intr_cpus" is statically initialized to the empty set, and
all processors, including the BSP, already add themselves to
"intr_cpus" no special initialization for the BSP is necessary.
Reduce the frequency that the PowerPC/AIM pmaps invalidate instruction
caches, by invalidating kernel icaches only when needed and not flushing
user caches for shared pages.
Give the kernel pmap lock a different name than user pmap locks. It has
(slightly) different semantics and renaming it prevents a (harmless)
WITNESS warning during bootup for 32-bit kernels on 64-bit CPUs.
Properly clear the O_NONBLOCK flag after opening the TTY.
Though we should open the TTY with O_NONBLOCK to prevent rc(8) execution
from potentially stalling, we must not forget to clear the flag later
on, to prevent read(2) calls from failing later on.
This prevented the shell pathname prompt from working properly.
The menu item is now made completely independent with the ACPI item - most
modern systems seem to require ACPI and become even more "unsafe"
without it.
Safe Mode no longer disables APIC for the same reason.
kbdmux is not disabled as this feature has proven itself stable.
New actions:
- SMP is disabled in the Safe Mode now
- eventtimers are forced to periodic mode (some real and virtual systems
seem to have problems otherwise)
- geom extra vigorous integrity checking is disabled, this is to
facilitate migration from previous versions
Possible short term to do:
- make SMP switch a separate menu item
- restore APIC switch as a separate menu item
Longer term to do:
- turn various tweaks into separate menu items in a Safe Mode sub-menu
Please consider adding a safety tweak to Safe Mode when introducing
new major features or changes that may cause instabilities.
Linux and Solaris (at least OpenSolaris) has PF_PACKET socket families to send
raw ethernet frames. The only FreeBSD interface that can be used to send raw frames
is BPF. As a result, many programs like cdpd, lldpd, various dhcp stuff uses
BPF only to send data. This leads us to the situation when software like cdpd,
being run on high-traffic-volume interface significantly reduces overall performance
since we have to acquire additional locks for every packet.
Here we add sysctl that changes BPF behavior in the following way:
If program came and opens BPF socket without explicitly specifyin read filter we
assume it to be write-only and add it to special writer-only per-interface list.
This makes bpf_peers_present() return 0, so no additional overhead is introduced.
After filter is supplied, descriptor is added to original per-interface list permitting
packets to be captured.
Unfortunately, pcap_open_live() sets catch-all filter itself for the purpose of
setting snap length.
Fortunately, most programs explicitly sets (event catch-all) filter after that.
tcpdump(1) is a good example.
So a bit hackis approach is taken: we upgrade description only after second
BIOCSETF is received.
Sysctl is named net.bpf.optimize_writers and is turned off by default.
- While here, document all sysctl variables in bpf.4
Interface locks and descriptor locks are converted from mutex(9) to rwlock(9).
This greately improves performance: in most common case we need to acquire 1
reader lock instead of 2 mutexes.
- Remove filter(descriptor) (reader) lock in bpf_mtap[2]
This was suggested by glebius@. We protect filter by requesting interface
writer lock on filter change.
- Cover struct bpf_if under BPF_INTERNAL define. This permits including bpf.h
without including rwlock stuff. However, this is is temporary solution,
struct bpf_if should be made opaque for any external caller.
Found by: Dmitrij Tejblum <tejblum@yandex-team.ru>
Sponsored by: Yandex LLC
Add new ktrace records for the start and end of VM faults. This gives
a pair of records similar to syscall entry and return that a user can
use to determine how long page faults take. The new ktrace records are
enabled via the 'p' trace type, and are enabled in the default set of
trace points.
In sem_post, the field _has_waiters is no longer used, because some
application destroys semaphore after sem_wait returns. Just enter
kernel to wake up sleeping threads, only update _has_waiters if
it is safe. While here, check if the value exceed SEM_VALUE_MAX and
return EOVERFLOW if this is true.
umtx operation UMTX_OP_MUTEX_WAKE has a side-effect that it accesses
a mutex after a thread has unlocked it, it event writes data to the mutex
memory to clear contention bit, there is a race that other threads
can lock it and unlock it, then destroy it, so it should not write
data to the mutex memory if there isn't any waiter.
The new operation UMTX_OP_MUTEX_WAKE2 try to fix the problem. It
requires thread library to clear the lock word entirely, then
call the WAKE2 operation to check if there is any waiter in kernel,
and try to wake up a thread, if necessary, the contention bit is set again
by the operation. This also mitgates the chance that other threads find
the contention bit and try to enter kernel to compete with each other
to wake up sleeping thread, this is unnecessary. With this change, the
mutex owner is no longer holding the mutex until it reaches a point
where kernel umtx queue is locked, it releases the mutex as soon as
possible.
Performance is improved when the mutex is contensted heavily. On Intel
i3-2310M, the runtime of a benchmark program is reduced from 26.87 seconds
to 2.39 seconds, it even is better than UMTX_OP_MUTEX_WAKE which is
deprecated now. http://people.freebsd.org/~davidxu/bench/mutex_perf.c
adrian [Wed, 4 Apr 2012 23:45:15 +0000 (23:45 +0000)]
Implement BAR TX.
A BAR frame must be transmitted when an frame in an A-MPDU session fails
to transmit - it's retried too often, or it can't be cloned for
re-transmission. The BAR frame tells the remote side to advance the
left edge of the block-ack window (BAW) to a new value.
In order to do this:
* TX for that particular node/TID must be paused;
* The existing frames in the hardware queue needs to be completed, whether
they're TXed successfully or otherwise;
* The new left edge of the BAW is then communicated to the remote side
via a BAR frame;
* Once the BAR frame has been sucessfully TXed, aggregation can resume;
* If the BAR frame can't be successfully TXed, the aggregation session
is torn down.
This is a first pass that implements the above. What needs to be done/
tested:
* What happens during say, a channel reset / stuck beacon _and_ BAR
TX. It _should_ be correctly buffered and retried once the
reset has completed. But if a bgscan occurs (and they shouldn't,
grr) the BAR frame will be forcibly failed and the aggregation session
will be torn down.
Yes, another reason to disable bgscan until I've figured this out.
* There's way too much locking going on here. I'm going to do a couple
of further passes of sanitising and refactoring so the (re) locking
isn't so heavy. Right now I'm going for correctness, not speed.
* The BAR TX can fail if the hardware TX queue is full. Since there's
no "free" space kept for management frames, a full TX queue (from eg
an iperf test) can race with your ability to allocate ath_buf/mbufs
and cause issues. I'll knock this on the head with a subsequent
commit.
* I need to do some _much_ more thorough testing in hostap mode to ensure
that many concurrent traffic streams to different end nodes are correctly
handled. I'll find and squish whichever bugs show up here.
But, this is an important step to being able to flip on 802.11n by default.
The last issue (besides bug fixes, of course) is HT frame protection and
I'll address that in a subsequent commit.
adrian [Wed, 4 Apr 2012 21:49:49 +0000 (21:49 +0000)]
Correctly handle AR_MoreAggr when assembling multi-descriptor final frames.
Linux ath9k doesn't have this issue as it doesn't try queuing multi-
descriptor frames to the hardware.
Before, I was only setting the first and last descriptor in the final
frame correctly - and that was done by accident. The first descriptor in
the last sub-frame was being correctly updated by ath_tx_setds_11n();
the last descriptor in the last sub-frame was being correctly updated
by ath_buf_set_rate(). But both of those are "incorrect".
The correct behaviour is:
* AR_IsAggr is set for all descriptors for all subframes in an aggregate.
* AR_MoreAggr is set for all descriptors for all non-final sub-frames
in an aggregate.
Ie, all descriptors in the last sub-frame of an aggregate must have this
field set to 0.
I still need to do a couple of extra passes to ensure the pad delimiter
field is being correctly handled in all descriptors in the last sub-frame.
marius [Wed, 4 Apr 2012 20:42:45 +0000 (20:42 +0000)]
Refine r233827; as it turns out, controllers with a device ID of 0x0059
can be upgraded to MegaRAID mode, in which case mfi(4) should attach to
these based on the sub-vendor and -device ID instead (not currently done).
Therefore, let mpt_pci_probe() return BUS_PROBE_LOW_PRIORITY.
While it, let mpt_pci_probe() return BUS_PROBE_DEFAULT instead of 0 in
the default case.
Merge from OpenBSD:
revision 1.173
date: 2011/11/09 12:36:03; author: camield; state: Exp; lines: +11 -12
State expire time is a baseline time ("last active") for expiry
calculations, and does _not_ denote the time when to expire. So
it should never be added to (set into the future).
Try to reconstruct it with an educated guess on state import and
just set it to the current time on state updates.
This fixes a problem on pfsync listeners where the expiry time
could be double the expected value and cause a lot more states
to linger.
Since pf 4.5 import pf(4) has a mechanism to defer
forwarding a packet, that creates state, until
pfsync(4) peer acks state addition (or 10 msec
timeout passes).
This is needed for active-active CARP configurations,
which are poorly supported in FreeBSD and arguably
a good idea at all.
Unfortunately by the time of import this feature in
OpenBSD was turned on, and did not have a switch to
turn it off. This leaked to FreeBSD.
This change make it possible to turn this feature
off via ioctl() and turns it off by default.
A file cannot be deallocated until its last name has been removed
and it is no longer referenced by a user process. The inode for a
file whose name has been removed, but is still referenced at the
time of a crash will still be allocated in the filesystem, but will
have no references (e.g., they will have no names referencing them
from any directory).
With traditional soft updates these unreferenced inodes will be
found and reclaimed when the background fsck is run. When using
journaled soft updates, the kernel must keep track of these inodes
so that it can find and reclaim them during the cleanup process.
Their existence cannot be stored in the journal as the journal only
handles short-term events, and they may persist for days. So, they
are tracked by keeping them in a linked list whose head pointer is
stored in the superblock. The journal tracks them only until their
linked list pointers have been commited to disk. Part of the cleanup
process involves traversing the list of unreferenced inodes and
reclaiming them.
This bug was triggered when confusion arose in the commit steps
of keeping the unreferenced-inode linked list coherent on disk.
Notably, a race between the link() system call adding a link-count
to a file and the unlink() system call removing a link-count to
the file. Here if the unlink() ran after link() had looked up
the file but before link() had incremented the link-count of the
file, the file's link-count would drop to zero before the link()
incremented it back up to one. If the file was referenced by a
user process, the first transition through zero made it appear
that it should be added to the unreferenced-inode list when in
fact it should not have been added. If the new name created by
link() was deleted within a few seconds (with the file still
referenced by a user process) it would legitimately be a candidate
for addition to the unreferenced-inode list. The result was that
there were two attempts to add the same inode to the unreferenced-inode
list which scrambled the unreferenced-inode list's pointers leading
to a panic. The fix is to detect and avoid the false attempt at
adding it to the unreferenced-inode list by having the link()
system call check to see if the link count is zero before it
increments it. If it is, the link() fails with ENOENT (showing that
it has failed the link()/unlink() race).
While tracking down this bug, we have added additional assertions
to detect the problem sooner and also simplified some of the code.
Reported by: Kirk Russell
Fix submitted by: Jeff Roberson
Tested by: Peter Holm
PR: kern/159971
MFC (to 9 only): 2 weeks
When process exists, not only the children shall be reparented to
init, but also the orphans shall be removed from the orphan list,
because the list header is destroyed.
Move struct megasas_sge from mfi_ioctl.h to mfivar.h so we can
remove including machine/bus.h. Add some more mfi_ prefixes to
avoid name space pollution.
Further tweak the changes made in r233709. The kernel doesn't permit
sleeping from a swi handler (even though in this case it would be ok), so
switch the refill and scanning SWI handlers to being tasks on a fast
taskqueue. Also, only schedule the refill task for a CMCI as an MC# can
fire at any time, so it should do the minimal amount of work needed and
avoid opportunities to deadlock before it panics (such as scheduling a
task it won't ever need in practice). To handle the case of an MC# only
finding recoverable errors (which should never happen), always try to
refill the event free list when the periodic scan executes.
- Use more natural ip->i_flags instead of vap->va_flags in the final
flags check.
- Add a comment for the immutable/append check done after handling of
the flags.
- Style improvements.
- Write the ISO9660 descriptor after the apm partition entries.
- Fill the needed pmPartStatus flags. At least the OpenBIOS
implementation relies on these flags.
This commit fixes the panic seen on OS-X when inserting a FreeBSD/ppc disc.
Additionally OpenBIOS recognizes the partition where the boot code is located.
This lets us load a FreeBSD/ppc PowerMac kernel inside qemu.