vm_page_cache_remove() should only be used in very little and specific
cases (and marked as static likely) where the callers is going to take
care also of the page flags appropriately, otherwise one can end up
with a corrupted page.
- It is wrong to skip tmpfs_mapped{read, write}() when the page cache
is empty and nothing else is checked. Check that also the resident
page pool is empty before to skip the operation.
- Cleanup the cached page if it is found but the page is not active.
Since pf 4.5 import pf(4) has a mechanism to defer
forwarding a packet, that creates state, until
pfsync(4) peer acks state addition (or 10 msec
timeout passes).
This is needed for active-active CARP configurations,
which are poorly supported in FreeBSD and arguably
a good idea at all.
Unfortunately by the time of import this feature in
OpenBSD was turned on, and did not have a switch to
turn it off. This leaked to FreeBSD.
This change make it possible to turn this feature
off via ioctl() and turns it off by default.
A file cannot be deallocated until its last name has been removed
and it is no longer referenced by a user process. The inode for a
file whose name has been removed, but is still referenced at the
time of a crash will still be allocated in the filesystem, but will
have no references (e.g., they will have no names referencing them
from any directory).
With traditional soft updates these unreferenced inodes will be
found and reclaimed when the background fsck is run. When using
journaled soft updates, the kernel must keep track of these inodes
so that it can find and reclaim them during the cleanup process.
Their existence cannot be stored in the journal as the journal only
handles short-term events, and they may persist for days. So, they
are tracked by keeping them in a linked list whose head pointer is
stored in the superblock. The journal tracks them only until their
linked list pointers have been commited to disk. Part of the cleanup
process involves traversing the list of unreferenced inodes and
reclaiming them.
This bug was triggered when confusion arose in the commit steps
of keeping the unreferenced-inode linked list coherent on disk.
Notably, a race between the link() system call adding a link-count
to a file and the unlink() system call removing a link-count to
the file. Here if the unlink() ran after link() had looked up
the file but before link() had incremented the link-count of the
file, the file's link-count would drop to zero before the link()
incremented it back up to one. If the file was referenced by a
user process, the first transition through zero made it appear
that it should be added to the unreferenced-inode list when in
fact it should not have been added. If the new name created by
link() was deleted within a few seconds (with the file still
referenced by a user process) it would legitimately be a candidate
for addition to the unreferenced-inode list. The result was that
there were two attempts to add the same inode to the unreferenced-inode
list which scrambled the unreferenced-inode list's pointers leading
to a panic. The fix is to detect and avoid the false attempt at
adding it to the unreferenced-inode list by having the link()
system call check to see if the link count is zero before it
increments it. If it is, the link() fails with ENOENT (showing that
it has failed the link()/unlink() race).
While tracking down this bug, we have added additional assertions
to detect the problem sooner and also simplified some of the code.
Reported by: Kirk Russell
Fix submitted by: Jeff Roberson
Tested by: Peter Holm
PR: kern/159971
MFC (to 9 only): 2 weeks
When process exists, not only the children shall be reparented to
init, but also the orphans shall be removed from the orphan list,
because the list header is destroyed.
Move struct megasas_sge from mfi_ioctl.h to mfivar.h so we can
remove including machine/bus.h. Add some more mfi_ prefixes to
avoid name space pollution.
Further tweak the changes made in r233709. The kernel doesn't permit
sleeping from a swi handler (even though in this case it would be ok), so
switch the refill and scanning SWI handlers to being tasks on a fast
taskqueue. Also, only schedule the refill task for a CMCI as an MC# can
fire at any time, so it should do the minimal amount of work needed and
avoid opportunities to deadlock before it panics (such as scheduling a
task it won't ever need in practice). To handle the case of an MC# only
finding recoverable errors (which should never happen), always try to
refill the event free list when the periodic scan executes.
- Use more natural ip->i_flags instead of vap->va_flags in the final
flags check.
- Add a comment for the immutable/append check done after handling of
the flags.
- Style improvements.
- Write the ISO9660 descriptor after the apm partition entries.
- Fill the needed pmPartStatus flags. At least the OpenBIOS
implementation relies on these flags.
This commit fixes the panic seen on OS-X when inserting a FreeBSD/ppc disc.
Additionally OpenBIOS recognizes the partition where the boot code is located.
This lets us load a FreeBSD/ppc PowerMac kernel inside qemu.
Make machine check exception logging more readable. On newer Intel systems,
an uncorrected ECC error tends to fire on all CPUs in a package
simultaneously and the current printf hacks are not sufficient to make
the messages legible. Instead, use the existing mca_lock spinlock to
serialize calls to mca_log() and change the machine check code to panic
directly when an unrecoverable error is encoutered rather than falling
back to a trap_fatal() call in trap() (which adds nearly a screen-full of
logging messages that aren't useful for machine checks).
Historically arp(8) did a route lookup for the entry it is
about to add, and failed if it exist and had invalid data
link type.
Later on, in r201282, this check morphed to other code, but
message "proxy entry exists for non 802 device" still left,
and now it is printed in a case if route prefix found is
equal to current address being added. In other words, when
we are trying to add ARP entry for a network address. The
message is absolutely unrelated and disappointing in this
case.
I don't see anything bad with setting ARP entries for
network addresses. While useless in usual network,
in a /31 RFC3021 it may be necessary. This, remove this code.
Eliminate two cases of unwanted strncpy(). The name is not required
by the current code, and the results would get overwritten anyway
by subsequent memset().
Export some more useful info about shared memory objects to userland
via procstat(1) and fstat(1):
- Change shm file descriptors to track the pathname they are associated
with and add a shm_path() method to copy the path out to a caller-supplied
buffer.
- Use the fo_stat() method of shared memory objects and shm_path() to
export the path, mode, and size of a shared memory object via
struct kinfo_file.
- Add a struct shmstat to the libprocstat(3) interface along with a
procstat_get_shm_info() to export the mode and size of a shared memory
object.
- Change procstat to always print out the path for a given object if it
is valid.
- Teach fstat about shared memory objects and to display their path,
mode, and size.
mav [Sat, 31 Mar 2012 11:23:09 +0000 (11:23 +0000)]
Be more conservative in using READ CAPACITY(16) command. Previous code
checked PROTECT bit in INQUIRY data for all SPC devices, while it is defined
only since SPC-3. But there are some SPC-2 USB devices were reported, that
have PROTECT bit set, return no error for READ CAPACITY(16) command, but
return wrong sector count value in response.
ambrisko [Fri, 30 Mar 2012 23:39:39 +0000 (23:39 +0000)]
MFhead_mfi r233621
Remove the magic mfi_array is 288 bytes and just use the
sizeof the array since it is not 288 bytes.
Change reporting of a "SYSTEM" disk to "JBOD" to match
LSI MegaCli and firmware reporting.
This means that fiutil command to "create jbod" is now a
little confusing since a RAID per drive is not really what
LSI defines JBOD to be. This should be fixed in the future
and support added to really create LSI JBOD and enable that
feature on cards that support it.
ambrisko [Fri, 30 Mar 2012 23:05:48 +0000 (23:05 +0000)]
MFhead_mfi r227068
First cut of new HW support from LSI and merge into FreeBSD.
Supports Drake Skinny and ThunderBolt cards.
MFhead_mfi r227574
Style
MFhead_mfi r227579
Use bus_addr_t instead of uintXX_t.
MFhead_mfi r227580
MSI support
MFhead_mfi r227612
More bus_addr_t and remove "#ifdef __amd64__".
MFhead_mfi r227905
Improved timeout support from Scott.
MFhead_mfi r228108
Make file.
MFhead_mfi r228208
Fixed botched merge of Skinny support and enhanced handling
in call back routine.
MFhead_mfi r228279
Remove superfluous !TAILQ_EMPTY() checks before TAILQ_FOREACH().
MFhead_mfi r228310
Move mfi_decode_evt() to taskqueue.
MFhead_mfi r228320
Implement MFI_DEBUG for 64bit S/G lists.
MFhead_mfi r231988
Restore structure layout by reverting the array header to
use [0] instead of [1].
MFhead_mfi r232412
Put wildcard pattern later in the match table.
MFhead_mfi r232413
Use lower case for hexadecimal numbers to match surrounding
style.
MFhead_mfi r232414
Add more Thunderbolt variants.
MFhead_mfi r232888
Don't act on events prior to boot or when shutting down.
Add hw.mfi.detect_jbod_change to enable or disable acting
on JBOD type of disks being added on insert and removed on
removing. Switch hw.mfi.msi to 1 by default since it works
better on newer cards.
MFhead_mfi r233016
Release driver lock before taking Giant when deleting children.
Use TAILQ_FOREACH_SAFE when items can be deleted. Make code a
little simplier to follow. Fix a couple more style issues.
MFhead_mfi r233620
Update mfi_spare/mfi_array with the actual number of elements
for array_ref and pd. Change these max. #define names to avoid
name space collisions. This will require an update to mfiutil
It avoids mfiutil having to do a magic calculation.
Add a note and #define to state that a "SYSTEM" disk is really
what the firmware calls a "JBOD" drive.
Thanks to the many that helped, LSI for the initial code drop,
mav, delphij, jhb, sbruno that all helped with code and testing.
dim [Fri, 30 Mar 2012 22:52:08 +0000 (22:52 +0000)]
Fix the following compilation warning with clang trunk in isci(4):
sys/dev/isci/isci_task_request.c:198:7: error: case value not in enumerated type 'SCI_TASK_STATUS' (aka 'enum _SCI_TASK_STATUS') [-Werror,-Wswitch]
case SCI_FAILURE_TIMEOUT:
^
This is because the switch is done on a SCI_TASK_STATUS enum type, but
the SCI_FAILURE_TIMEOUT value belongs to SCI_STATUS instead.
Because the list of SCI_TASK_STATUS values cannot be modified at this
time, use the simplest way to get rid of this warning, which is to cast
the switch argument to int. No functional change.
jhb [Fri, 30 Mar 2012 20:17:39 +0000 (20:17 +0000)]
Attempt to make machine check handling a bit more robust:
- Don't malloc() new MCA records for machine checks logged due to a
CMCI or MC# exception. Instead, use a pre-allocated pool of records.
When a CMCI or MC# exception fires, schedule a swi to refill the pool.
The pool is sized to hold at least one record per available machine
bank, and one record per CPU. This should handle the case of all CPUs
triggering a single bank at once as well as the case a single CPU
triggering all of its banks. The periodic scans still use malloc()
since they are run from a safe context.
- Since we have to create an swi to handle refills, make the periodic scan
a second swi for the same thread instead of having a separate taskqueue
thread for the scans.
jhb [Fri, 30 Mar 2012 19:54:48 +0000 (19:54 +0000)]
Fix a few issues with transmit handling in em(4) and igb(4):
- Do not define the foo_start() methods or set if_start in the ifnet if
multiq transmit is enabled. Also, set if_transmit and if_qflush before
ether_ifattach rather than after when multiq transmit is enabled. This
helps to ensure that the drivers never try to mix different transmit
methods.
- Properly restart transmit during resume. igb(4) was not restarting it
at all, and em(4) was restarting even if the link was down and was
calling the wrong method if multiq transmit was enabled.
- Remove all the 'more' handling for transmit completions. Transmit
completion processing does not have a processing limit, so it always
runs to completion and never has more work to do when it returns.
Instead, the previous code was returning 'true' anytime there were
packets in the queue that weren't still in the process of being
transmitted. The effect was that the driver would continuously
reschedule a task to process TX completions in effect running at 100%
CPU polling the hardware until it finished transmitting all of the
packets in the ring. Now it will just wait for the next TX completion
interrupt.
- Restart packet transmission when the link becomes active.
- Fix the MSI-X queue interrupt handlers to restart packet transmission if
there are pending packets in the relevant software queue (IFQ or buf_ring)
after processing TX completions. This is the root cause for the OACTIVE
hangs as if the MSI-X queue handler drained all the pending packets from
the TX ring, nothing would ever restart it. As such, remove some
previously-added workarounds to reschedule a task to poll the TX ring
anytime OACTIVE was set.
jkim [Fri, 30 Mar 2012 16:32:41 +0000 (16:32 +0000)]
Work around Erratum 721 for AMD Family 10h and 12h processors.
"Under a highly specific and detailed set of internal timing conditions,
the processor may incorrectly update the stack pointer after a long series
of push and/or near-call instructions, or a long series of pop and/or
near-return instructions. The processor must be in 64-bit mode for this
erratum to occur."
marius [Fri, 30 Mar 2012 15:08:09 +0000 (15:08 +0000)]
- Remove erroneous trailing semicolon. [1]
- Correctly determine the maximum payload size for setting the TX link
frequent NACK latency and replay timer thresholds.
yongari [Fri, 30 Mar 2012 05:27:05 +0000 (05:27 +0000)]
Do not report current link status if driver is not running.
This change also workarounds dhclient's link state handling bug by
not giving current link status.
Unlike other controllers, ale(4)'s PHY hibernation perfectly works
such that driver does not see a valid link if the controller is not
brought up. If dhclient(8) runs on ale(4) it will blindly waits
until link UP and then gives up after 10 seconds. Because
dhclient(8) still thinks interface got a valid link when IFM_AVALID
is not set for selected media, this change makes dhclient initiate
DHCP without waiting for link UP.
yongari [Fri, 30 Mar 2012 04:46:39 +0000 (04:46 +0000)]
Remove task queue based link state change handler. Driver no longer
needs to defer link state handling.
While I'm here, mark IFF_DRV_RUNNING before changing media. If
link is established without any delay, that link state change
handling could be lost.
dim [Thu, 29 Mar 2012 23:31:48 +0000 (23:31 +0000)]
Fix an issue introduced in sys/x86/include/endian.h with r232721. In
that revision, the bswapXX_const() macros were renamed to bswapXX_gen().
Also, bswap64_gen() was implemented as two calls to bswap32(), and
similarly, bswap32_gen() as two calls to bswap16(). This mainly helps
our base gcc to produce more efficient assembly.
However, the arguments are not properly masked, which results in the
wrong value being calculated in some instances. For example,
bswap32(0x12345678) returns 0x7c563412, and bswap64(0x123456789abcdef0)
returns 0xfcdefc9a7c563412.
Fix this by appropriately masking the arguments to bswap16() in
bswap32_gen(), and to bswap32() in bswap64_gen(). This should also
silence warnings from clang.
dim [Thu, 29 Mar 2012 23:30:17 +0000 (23:30 +0000)]
Revert sys/x86/include/endian.h to what it was before r233419, as that
revision has two problems:
- It can produce worse code with both clang and gcc.
- It doesn't fix the actual issue introduced in r232721, which will be
fixed in the next commit.
trociny [Thu, 29 Mar 2012 20:11:16 +0000 (20:11 +0000)]
If hastd is invoked with "-P pidfile" option always create pidfile
regardless of whether -F (foreground) option is set or not.
Also, if -P option is specified, ignore pidfile setting from configuration
not only on start but on reload too. This fixes the issue when for hastd
run with -P option reload caused the pidfile change.
jkim [Thu, 29 Mar 2012 19:26:39 +0000 (19:26 +0000)]
Revert r233662 and generalize the hack. Writing zero to BAR actually does
not disable it and it is even harmful as hselasky found out. Historically,
this code was originated from (OLDCARD) CardBus driver and later leaked into
PCI driver when CardBus was newbus'ified and refactored with PCI driver.
However, it is not really necessary even for CardBus.
jhb [Thu, 29 Mar 2012 19:03:22 +0000 (19:03 +0000)]
Use a more proper fix for enabling HT MSI mapping windows on Host-PCI
bridges. Rather than blindly enabling the windows on all of them, only
enable the window when an MSI interrupt is enabled for a device behind
the bridge, similar to what already happens for HT PCI-PCI bridges.
To implement this, each x86 Host-PCI bridge driver has to be able to
locate it's actual backing device on bus 0. For ACPI, use the _ADR
method to find the slot and function of the device. For the non-ACPI
case, the legacy(4) driver already scans bus 0 looking for Host-PCI
bridge devices. Now it saves the slot and function of each bridge that
it finds as ivars that the Host-PCI bridge driver can then use in its
pcib_map_msi() method.
This fixes machines where non-MSI interrupts were broken by the previous
round of HT MSI changes.
jhb [Thu, 29 Mar 2012 18:58:02 +0000 (18:58 +0000)]
Restore proper use of bounce buffers for ISA DMA. When locking was
added, the call to pmap_kextract() was moved up, and as a result the
code never updated the physical address to use for DMA if a bounce
buffer was used. Restore the earlier location of pmap_kextract() so
it takes bounce buffers into account.
adrian [Thu, 29 Mar 2012 17:39:18 +0000 (17:39 +0000)]
Defer the rescheduling of TID -> TXQ frames in some instances.
Right now ath_txq_sched() is mainly called from the TX ath_tx_processq()
routine, which is (mostly) done as part of the taskqueue. It shouldn't
be called outside the taskqueue.
But now that I'm about to flip back on BAR TX, I'm going to start
stressing the ath_tx_tid_pause() and ath_tx_tid_resume() paths.
What I don't want to have happen is a reschedule of the TID traffic
_during_ the completion of TX frames.
Ideally I'd like to have a way to flag back up to the processing code
that the current hardware queue should be rechecked for software TID
queue frames. But for now, this should suffice for the BAR TX case.
I may eventually delete this code once I've brought some further
sanity to the general TX queue/completion path.
jhb [Thu, 29 Mar 2012 16:51:22 +0000 (16:51 +0000)]
- Rename VM_MEMATTR_UNCACHED to VM_MEMATTR_WEAK_UNCACHEABLE on x86 to
be less ambiguous and more clearly identify what it means. This
attribute is what Intel refers to as UC-, and it's only difference
relative to normal UC memory is that a WC MTRR will override a UC-
PAT entry causing the memory to be treated as WC, whereas a UC PAT
entry will always override the MTRR.
- Remove the VM_MEMATTR_UNCACHED alias from powerpc.
joel [Thu, 29 Mar 2012 16:02:40 +0000 (16:02 +0000)]
mandoc complains loudly when <TAB>s are misused in columnated lists. Fix
this syntax violation and while I'm here also convert <TAB> to Ta and adjust
quotation marks in order to prevent this problem in the future.
hselasky [Thu, 29 Mar 2012 15:33:44 +0000 (15:33 +0000)]
Fix for boot issue: Don't disable BARs on AGP devices. In general:
Don't disable BARs on any PCI display devices, because doing that can
sometimes cause the main memory bus to stop working, causing all
memory reads to return nothing but 0xFFFFFFFF, even though the memory
location was previously written. After a while a privileged
instruction fault will appear and then nothing more can be debugged.
The reason for this behaviour is unknown.
jmallett [Thu, 29 Mar 2012 03:13:43 +0000 (03:13 +0000)]
Fix 32-bit libgeom consumers run on 64-bit kernels with COMPAT_FREEBSD32.
Kernel pointer values are used as opaque unique identifiers, which are then
used to reconstruct references between various providers, classes, etc., inside
libgeom from the source XML. Unfortunately, they're converted to pointer-width
integers (in the form of pointers) to do this, and 32-bit userland pointers
cannot hold sensible representations (however opaque) of 64-bit kernel pointers
on all systems.
In the case where the leading bits are zero and 32 distinct bits of pointer can
be identified, this will happen to work. On systems where the upper 32-bits of
kernel pointers are non-zero and the same for all kernel pointers, this will
result in double frees and all kinds of bizarre crashes and linkage between
objects inside libgeom.
To mitigate this problem, treat the opaque identifiers in the XML as C strings
instead, and internalize them to give unique and consistent per-object pointer
values in userland for each identifier in the XML. This allows us to keep the
libgeom logic the same with only minor changes to initial setup and parsing.
It might be more sensible for speed reasons to treat the identifiers as numbers
of a large size (uintmax_t, say) rather than strings, but strings seem fine for
now.
(As an added side-effect, this makes it slightly easier to identify unresolved
references, but nothing has been added to inform the user of those.)
jmallett [Thu, 29 Mar 2012 02:54:35 +0000 (02:54 +0000)]
Assume a big-endian default on MIPS and drop the "eb" suffix from MACHINE_ARCH.
This makes our naming scheme more closely match other systems and the
expectations of much third-party software. MIPS builds which are little-endian
should require and exhibit no changes. Big-endian TARGET_ARCHes must be
changed:
From: To:
mipseb mips
mipsn32eb mipsn32
mips64eb mips64
An entry has been added to UPDATING and some foot-shooting protection (complete
with warnings which should become errors in the near future) to the top-level
base system Makefile.
mckusick [Wed, 28 Mar 2012 21:21:19 +0000 (21:21 +0000)]
A refinement of change 232351 to avoid a race with a forcible unmount.
While we have a snapshot vnode unlocked to avoid a deadlock with another
inode in the same inode block being updated, the filesystem containing
it may be forcibly unmounted. When that happens the snapshot vnode is
revoked. We need to check for that condition and fail appropriately.
This change will be included along with 232351 when it is MFC'ed to 9.