sam [Mon, 24 Mar 2008 19:01:29 +0000 (19:01 +0000)]
o add M_PROTO[678]; they'll be needed by net80211 vap code
o sort mbuf flags together and extend values to 32 bits
o write M_COPYFLAGS in terms of M_PROTOFLAGS
o move M_COPYFLAGS and M_PROTOFLAGS up to be together with flag defs
marius [Mon, 24 Mar 2008 17:57:01 +0000 (17:57 +0000)]
- Const'ify the bus_stream_asi and bus_type_asi arrays.
- Replace hard-coded functions names missed in bus_machdep.c rev. 1.44
with __func__.
- Break some long lines.
marius [Mon, 24 Mar 2008 17:49:06 +0000 (17:49 +0000)]
- Take advantage of bus_dmamap_load_mbuf_sg(9).
- Take advantage of m_collapse(9).
- Sync with other NIC drivers and prepend a TX mbuf if the first attempt
to load it fails with an error other than EFBIG and stop trying instead
of freeing it and keeping on trying to enqueue more mbufs. Also ensure
the driver queue isn't empty before trying to enqueue mbufs in order to
reduce locking operations.
- In xl_ifmedia_upd() add a missing XL_UNLOCK(). [1]
- Const'ify the xl_devs array.
- Remove an outdated comment.
marius [Mon, 24 Mar 2008 17:38:24 +0000 (17:38 +0000)]
- Const'ify the dc_devs array.
- Correct the maxsize parameter when creating the mbufs busdma tag to
reflect the actual requirement of dc(4).
- Move the KASSERT in dc_newbuf() to the right spot.
- Also convert the TX side to take advantage of bus_dmamap_load_mbuf_sg(9).
- Move the comment regarding dc_start_locked() to the right spot.
emaste [Mon, 24 Mar 2008 16:38:47 +0000 (16:38 +0000)]
Diff reduction to Adaptec driver build 15317 (refactoring and code shuffling):
- Resource allocation in aac_alloc (moved from from aac_init)
- Interrupt setup in aac_setup_intr (from aac_attach)
- Container probing in aac_get_container_info (from aac_startup and
aac_handle_aif)
- Firmware status check moved to aac_check_firmware from aac_init
csjp [Mon, 24 Mar 2008 13:49:17 +0000 (13:49 +0000)]
Introduce support for zero-copy BPF buffering, which reduces the
overhead of packet capture by allowing a user process to directly "loan"
buffer memory to the kernel rather than using read(2) to explicitly copy
data from kernel address space.
The user process will issue new BPF ioctls to set the shared memory
buffer mode and provide pointers to buffers and their size. The kernel
then wires and maps the pages into kernel address space using sf_buf(9),
which on supporting architectures will use the direct map region. The
current "buffered" access mode remains the default, and support for
zero-copy buffers must, for the time being, be explicitly enabled using
a sysctl for the kernel to accept requests to use it.
The kernel and user process synchronize use of the buffers with atomic
operations, avoiding the need for system calls under load; the user
process may use select()/poll()/kqueue() to manage blocking while
waiting for network data if the user process is able to consume data
faster than the kernel generates it. Patchs to libpcap are available
to allow libpcap applications to transparently take advantage of this
support. Detailed information on the new API may be found in bpf(4),
including specific atomic operations and memory barriers required to
synchronize buffer use safely.
These changes modify the base BPF implementation to (roughly) abstrac
the current buffer model, allowing the new shared memory model to be
added, and add new monitoring statistics for netstat to print. The
implementation, with the exception of some monitoring hanges that break
the netstat monitoring ABI for BPF, will be MFC'd.
Zerocopy bpf buffers are still considered experimental are disabled
by default. To experiment with this new facility, adjust the
net.bpf.zerocopy_enable sysctl variable to 1.
Changes to libpcap will be made available as a patch for the time being,
and further refinements to the implementation are expected.
Sponsored by: Seccuris Inc.
In collaboration with: rwatson
Tested by: pwood, gallatin
MFC after: 4 months [1]
[1] Certain portions will probably not be MFCed, specifically things
that can break the monitoring ABI.
ru [Mon, 24 Mar 2008 12:33:28 +0000 (12:33 +0000)]
Fix splitting into words of the .for expression to allow for
spaces in values. Without this change, the following valid
call broke due to parsing of .MAKEFLAGS in bsd.symver.mk:
cd /usr/src/lib/libc && make -n DEBUG_FLAGS="-DFOO -DBAR"
jeff [Mon, 24 Mar 2008 04:22:58 +0000 (04:22 +0000)]
- Greatly simplify vget() by removing the guarantee that any new
references to a vnode with VI_OWEINACT set will force the vinactive()
call. The kernel makes no guarantees about which reference was the
last to close a file or when the actual inactive processing will
happen. The previous code was designed to preserve existing semantics
in the face of shared locks, however, this was unnecessary.
jeff [Mon, 24 Mar 2008 04:17:35 +0000 (04:17 +0000)]
- Don't acquire the vnode interlock in _vn_lock() unless no lock type
is requested. Handle this case specially before the while loop.
- Use the held vnode lock to check for VI_DOOMED. The vnode lock and
interlock must both be held to set VI_DOOMED so either one held, even
shared, is sufficient to check it.
jeff [Mon, 24 Mar 2008 04:11:40 +0000 (04:11 +0000)]
- Remove an old comment; vnodes have been working without Giant for
years now.
- Clarify the locking required for VI_DOOMED in preparation for
simplifications to vget() and vn_lock().
peter [Sun, 23 Mar 2008 23:09:06 +0000 (23:09 +0000)]
First pass at (possibly futile) microoptimizing of cpu_switch. Results
are mixed. Some pure context switch microbenchmarks show up to 29%
improvement. Pipe based context switch microbenchmarks show up to 7%
improvement. Real world tests are far less impressive as they are
dominated more by actual work than switch overheads, but depending on
the machine in question, workload, kernel options, phase of moon, etc, a
few percent gain might be seen.
Summary of changes:
- don't reload MSR_[FG]SBASE registers when context switching between
non-threaded userland apps. These typically cost 120 clock cycles each
on an AMD cpu (less on Barcelona/Phenom). Intel cores are probably no
faster on this.
- The above change only helps unthreaded userland apps that tend to use
the same value for gsbase. Threaded apps will get no benefit from this.
- reorder things like accessing the pcb to be in memory order, to give
prefetching a better chance of working. Operations are now in increasing
memory address order, rather than reverse or random.
- Push some lesser used code out of the main code paths. Hopefully
allowing better code density in cache lines. This is probably futile.
- (part 2 of previous item) Reorder code so that branches have a more
realistic static branch prediction hint. Both Intel and AMD cpus
default to predicting branches to lower memory addresses as being
taken, and to higher memory addresses as not being taken. This is
overridden by the limited dynamic branch prediction subsystem. A trip
through userland might overflow this.
- Futule attempt at spreading the use of the results of previous operations
in new operations. Hopefully this will allow the cpus to execute in
parallel better.
- stop wasting 16 bytes at the top of kernel stack, below the PCB.
- Never load the userland fs/gsbase registers for kthreads, but preserve
curpcb->pcb_[fg]sbase as caches for the cpu. (Thanks Jeff!)
Microbenchmarking this code seems to be really sensitive to things like
scheduling luck, timing, cache behavior, tlb behavior, kernel options,
other random code changes, etc.
While it doesn't help heavy userland workloads much, it does help high
context switch loads a little, and should help those that involve
switching via kthreads a bit more.
A special thanks to Kris for the testing and reality checks, and Jeff for
tormenting me into doing this. :)
alc [Sun, 23 Mar 2008 23:04:09 +0000 (23:04 +0000)]
Correct an error in pmap_mincore() when applied to a 2MB page mapping:
Use PG_PS_FRAME, not PG_FRAME, to obtain the physical address of the
2MB physical page from the PDE.
alc [Sun, 23 Mar 2008 20:38:01 +0000 (20:38 +0000)]
To date, we have assumed that the TLB will only set the PG_M bit in a
PTE if that PTE has the PG_RW bit set. However, this assumption does
not hold on recent processors from Intel. For example, consider a PTE
that has the PG_RW bit set but the PG_M bit clear. Suppose this PTE
is cached in the TLB and later the PG_RW bit is cleared in the PTE,
but the corresponding TLB entry is not (yet) invalidated.
Historically, upon a write access using this (stale) TLB entry, the
TLB would observe that the PG_RW bit had been cleared and initiate a
page fault, aborting the setting of the PG_M bit in the PTE. Now,
however, P4- and Core2-family processors will set the PG_M bit before
observing that the PG_RW bit is clear and initiating a page fault. In
other words, the write does not occur but the PG_M bit is still set.
The real impact of this difference is not that great. Specifically,
we should no longer assert that any PTE with the PG_M bit set must
also have the PG_RW bit set, and we should ignore the state of the
PG_M bit unless the PG_RW bit is set. However, these changes enable
me to remove a work-around from pmap_promote_pde(), the superpage
promotion procedure.
(Note: The AMD processors that we have tested, including the latest,
the Phenom, still exhibit the historical behavior.)
Acknowledgments: After I observed the problem, Stephan (ups) was
instrumental in characterizing the exact behavior of Intel's recent
TLBs.
kib [Sun, 23 Mar 2008 13:45:24 +0000 (13:45 +0000)]
Yield the cpu in the kernel while iterating the list of the
vnodes belonging to the mountpoint. Also, yield when in the
softdep_process_worklist() even when we are not going to sleep due to
buffer drain.
It is believed that the ULE fixed the problem [1], but the yielding
seems to be needed at least for the 4BSD case.
Discussed: on stable@, with bde
Reviewed by: tegge, jeff [1]
MFC after: 2 weeks
cperciva [Sun, 23 Mar 2008 13:41:54 +0000 (13:41 +0000)]
When updating the install list for files which have had local changes
merged with upgrade changes, don't try to compute the SHA256 hash of
files which don't exist.
kib [Sun, 23 Mar 2008 07:07:27 +0000 (07:07 +0000)]
Prevent the overflow in the calculation of the next page directory.
The overflow causes the wraparound with consequent corruption of the
(almost) whole address space mapping.
As Alan noted, pmap_copy() does not require the wrap-around checks
because it cannot be applied to the kernel's pmap. The checks there are
included for consistency.
yongari [Sun, 23 Mar 2008 05:31:35 +0000 (05:31 +0000)]
For MSI capable hardwares, enable MSI enable bit in RL_CFG2
register. If MSI was disabled by hw.re.msi_disable tunable
expliclty clear the MSI enable bit.
yongari [Sun, 23 Mar 2008 05:06:16 +0000 (05:06 +0000)]
VLAN hardware tag information should be set for all desciptors of a
multi-descriptor transmission attempt. Datasheet said nothing about
this requirements. This should fix a long-standing VLAN hardware
tagging issues with re(4).
yongari [Sun, 23 Mar 2008 04:59:13 +0000 (04:59 +0000)]
Always honor configured VLAN/checksum offload capabilities.
Previously re(4) used to blindly enable VLAN hardware tag stripping
and Rx checksum offload regardless of enabled optional features of
interface.
jeff [Sun, 23 Mar 2008 01:44:28 +0000 (01:44 +0000)]
- Only return 1 from sync_vnode() in cases where the vnode is still
at the head of the sync list. This prevents sched_sync() from
re-queueing a vnode which may have been freed already.
marcel [Sun, 23 Mar 2008 01:31:59 +0000 (01:31 +0000)]
Redefine G_PART_SCHEME_DECLARE() from populating a private linker set
to declaring a proper module. The module event handler is part of the
gpart core and will add the scheme to an internal list on module load
and will remove the scheme from the internal list on module unload.
This makes it possible to dynamically load and unload partitioning
schemes.
marcel [Sun, 23 Mar 2008 01:23:35 +0000 (01:23 +0000)]
Add g_retaste(), which given a class will present all non-open providers
to it for tasting. This is useful when the class, through means outside
the scope of GEOM, can claim providers previously unclaimed.
The g_retaste() function posts an event which is handled by the
g_retaste_event().
qingli [Sat, 22 Mar 2008 18:13:39 +0000 (18:13 +0000)]
Reuse the mbuf that was just retrieved from the receive ring if mbuf
exhaustion is encountered. There was a fix made previously for this
problem but the solution (breaking out of the receive loop) does not
seem to work. mbuf reuse strategy is already adopted by other drivers
such as if_bge. The problem was recreated and the patch is also
verified in the same test environment.
sam [Sat, 22 Mar 2008 16:55:51 +0000 (16:55 +0000)]
add hints to specify how NPE ports are mapped to MAC+PHY; these
could be commented out as they just duplicate the defaults that
are built into the code
sam [Sat, 22 Mar 2008 16:53:28 +0000 (16:53 +0000)]
Improve mac+phy configuration so that hints can be used to describe
layouts different than the defaults:
o hint.npe.0.mac="A", "B", etc. specifies the window for MAC register accesses
o hint.npe.0.mii="A", "B", etc. specifies PHY registers
o hint.npe.1.phy=%d specifies the PHY to map to a port
This allows devices like NSLU to be setup w/o code changes and will
also be used for forthcoming support for more Avila boards.
sam [Sat, 22 Mar 2008 16:24:02 +0000 (16:24 +0000)]
Defer state change on disassociate to avoid unnecessarily dropping the
lease: track the current bssid and if it changes (as reported in an
assoc/reassoc) event only then kick the state machine. This gives us
immediate response when roaming but otherwise causes us to fallback on
the normal state machine.
stefanf [Sat, 22 Mar 2008 14:06:01 +0000 (14:06 +0000)]
Reset the internal state used for the 'getopts' built-in when 'shift' or 'set'
are used to modify the arguments. Not doing so caused random memory reads or
null pointer dereferences when 'getopts' was called again later (SUSv3 says
getopts produces unspecified results in this case).
remko [Sat, 22 Mar 2008 12:50:43 +0000 (12:50 +0000)]
In route.c in newroute() there's a call to exit(0) if the command was
'get'. Since rtmsg() always gets called and returns 0 on success and -1
on failure, it's possible to exit with a suitable exit code by calling
exit(ret != 0) instead, as is done at the end of newroute().
jeff [Sat, 22 Mar 2008 09:15:16 +0000 (09:15 +0000)]
- Complete part of the unfinished bufobj work by consistently using
BO_LOCK/UNLOCK/MTX when manipulating the bufobj.
- Create a new lock in the bufobj to lock bufobj fields independently.
This leaves the vnode interlock as an 'identity' lock while the bufobj
is an io lock. The bufobj lock is ordered before the vnode interlock
and also before the mnt ilock.
- Exploit this new lock order to simplify softdep_check_suspend().
- A few sync related functions are marked with a new XXX to note that
we may not properly interlock against a non-zero bv_cnt when
attempting to sync all vnodes on a mountlist. I do not believe this
race is important. If I'm wrong this will make these locations easier
to find.
alfred [Sat, 22 Mar 2008 07:29:45 +0000 (07:29 +0000)]
Fix a race where timeout/untimeout could cause crashes for Giant locked
code.
The bug:
There exists a race condition for timeout/untimeout(9) due to the
way that the softclock thread dequeues timeouts.
The softclock thread sets the c_func and c_arg of the callout to
NULL while holding the callout lock but not Giant. It then drops
the callout lock and acquires Giant.
It is at this point where untimeout(9) on another cpu/thread could
be called.
Since c_arg and c_func are cleared, untimeout(9) does not touch the
callout and returns as if the callout is canceled.
The softclock then tries to acquire Giant and likely blocks due to
the other cpu/thread holding it.
The other cpu/thread then likely deallocates the backing store that
c_arg points to and finishes working and hence drops Giant.
Softclock resumes and acquires giant and calls the function with
the now free'd c_arg and we have corruption/crash.
The fix:
We need to track curr_callout even for timeout(9) (LOCAL_ALLOC)
callouts. We need to free the callout after the softclock processes
it to deal with the race here.
remko [Fri, 21 Mar 2008 20:16:25 +0000 (20:16 +0000)]
Replace reference from vinum.8 to gvinum.8, it was advised in the PR to
replace this with vinum.4, but that's the kernel interface manual, which
is not appropriate in my understanding. I think that gvinum is a suitable
replacement for this.
PR: docs/121938
Submitted by: "Federico" <federicogalvezdurand at yahoo dot com>
MFC after: 3 days
kib [Fri, 21 Mar 2008 12:38:44 +0000 (12:38 +0000)]
Reduce contention on the vnode interlock by not acquiring the BO_LOCK
around the check for the BV_BKGRDINPROG in the brelse() and bqrelse().
See the comment for the explanation why it is safe.
kib [Fri, 21 Mar 2008 12:33:00 +0000 (12:33 +0000)]
Reduce the acquisition of the vnode interlock in the ffs_read() and
ffs_extread() when setting the IN_ACCESS flag by checking whether the
IN_ACCESS is already set. The possible race there is admissible.
jeff [Fri, 21 Mar 2008 10:00:05 +0000 (10:00 +0000)]
- Reduce contention on the global bdonelock and bpinlock by using
a pool mutex to protect these sleep/wakeup/counter races. This
still is preferable to bloating each bio with a mtx.
jeff [Fri, 21 Mar 2008 08:23:25 +0000 (08:23 +0000)]
- Add a new td flag TDF_NEEDSUSPCHK that is set whenever a thread needs
to enter thread_suspend_check().
- Set TDF_ASTPENDING along with TDF_NEEDSUSPCHK so we can move the
thread_suspend_check() to ast() rather than userret().
- Check TDF_NEEDSUSPCHK in the sleepq_catch_signals() optimization so
that we don't miss a suspend request. If this is set use the
expensive signal path.
- Set NEEDSUSPCHK when creating a new thread in thr in case the
creating thread is due to be suspended as well but has not yet.
davidxu [Fri, 21 Mar 2008 02:31:55 +0000 (02:31 +0000)]
Resolve __error()'s PLT early so that it needs not to be resolved again,
otherwise rwlock is recursivly called when signal happens and the __error
was never resolved before.
jhb [Thu, 20 Mar 2008 21:53:27 +0000 (21:53 +0000)]
Explicitly use spinlock_enter/exit rather than locking the icu_lock spin
lock in the 8259A drivers as these drivers are only used on UP systems.
This slightly reduces the penalty of an SMP kernel (such as GENERIC) on
a UP x86 machine.
jhb [Thu, 20 Mar 2008 21:24:32 +0000 (21:24 +0000)]
Implement a BUS_BIND_INTR() method in the bus interface to bind an IRQ
resource to a CPU. The default method is to pass the request up to the
parent similar to BUS_CONFIG_INTR() so that all busses don't have to
explicitly implement bus_bind_intr. A bus_bind_intr(9) wrapper routine
similar to bus_setup/teardown_intr() is added for device drivers to use.
Unbinding an interrupt is done by binding it to NOCPU. The IRQ resource
must be allocated, but it can happen in any order with respect to
bus_setup_intr(). Currently it is only supported on amd64 and i386 via
nexus(4) methods that simply call the intr_bind() routine.
emaste [Thu, 20 Mar 2008 20:33:48 +0000 (20:33 +0000)]
Restore creation of passthrough devices with newer controller firmware by
putting the correct size in the fib header. Presumably the older firmware
silently ignored a bad size field.
(This change tested with a 3805 controller. Passthrough devices were
created when running firmware build 12814, but not 15323 or later. With
this change they're created for both old and new firmware versions.)
emaste [Thu, 20 Mar 2008 17:59:19 +0000 (17:59 +0000)]
Add ioctls FSACTL_SEND_LARGE_FIB, FSACTL_SEND_RAW_SRB,
FSACTL_LNX_SEND_LARGE_FIB, and FSACTL_LNX_SEND_RAW_SRB, and correct size
checks on FIBs passed in from userspace. Both changes were obtained from
Adaptec's driver build 15317. Adaptec's commandline RAID tool arcconf uses
these ioctls when creating a RAID-10 array (and probably other operations
too).
rdivacky [Thu, 20 Mar 2008 17:03:55 +0000 (17:03 +0000)]
o Add stub support for some new futex operations,
so the annoying message is not printed.
o Don't warn about FUTEX_FD not being implemented
and return ENOSYS instead of 0 (eg. success).
o Clear FUTEX_PRIVATE_FLAG as we actually implement
only private futexes so there is no reason to
return ENOSYS when app asks for a private futex.
We don't reject shared futexes because they worked
just fine with our implementation so far.
sam [Thu, 20 Mar 2008 16:19:25 +0000 (16:19 +0000)]
Workaround design botch in usb: blindly mixing bus_dma with PIO does not
work on architectures with a write-back cache as the PIO writes end up
in the cache which the sync(BUS_DMASYNC_POSTREAD) in usb_transfer_complete
then discards; compensate in the xfer methods that do PIO by pushing the
writes out of the cache before usb_transfer_complete is called.
kib [Thu, 20 Mar 2008 16:08:42 +0000 (16:08 +0000)]
Do not dereference cdev->si_cdevsw, use the dev_refthread() to properly
obtain the reference. In particular, this fixes the panic reported in
the PR. Remove the comments stating that this needs to be done.
sam [Thu, 20 Mar 2008 16:04:13 +0000 (16:04 +0000)]
Correct cache handling for xfer requests marked URQ_REQUEST: many (if not
all uses) involve a read but usbd_start_transfer only does a PREWRITE; change
this to BUS_DMASYNC_PREREAD | BUS_DMASYNC_PREWRITE as I'm not sure if any
users do write+read.
remko [Thu, 20 Mar 2008 12:56:49 +0000 (12:56 +0000)]
Alert properly when we have stale mounts left after interupting
a tinybsd build. If we do not do this, we can accidentally remove
critical files from directories like /lib (if mounted).
PR: misc/121763
Submitted by: Richard Arends < richard at unixguru dot nl >
MFC after: 3 days