luigi [Mon, 27 Feb 2012 19:05:01 +0000 (19:05 +0000)]
A bunch of netmap fixes:
USERSPACE:
1. add support for devices with different number of rx and tx queues;
2. add better support for zero-copy operation, adding an extra field
to the netmap ring to indicate how many buffers we have already processed
but not yet released (with help from Eddie Kohler);
3. The two changes above unfortunately require an API change, so while
at it add a version field and some spares to the ioctl() argument
to help detect mismatches.
4. update the manual page for the two changes above;
5. update sample applications in tools/tools/netmap
KERNEL:
1. simplify the internal structures moving the global wait queues
to the 'struct netmap_adapter';
2. simplify the functions that map kring<->nic ring indexes
3. start exploring the impact of micro-optimizations (prefetch etc.)
in the ixgbe driver.
Using 'legacy' descriptors on the tx ring and prefetching slots gives
about 20% speedup at 900 MHz. Another 7-10% would come from removing
the explicit calls to bus_dmamap* in the core (they are effectively
NOPs in this case, but it takes an expensive load of the per-buffer
dma maps to figure out that they are all NULL).
Rx performance not investigated.
I am postponing the MFC so I can import a few more improvements
before merging.
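A minimal sketch of the bookkeeping behind USERSPACE item 2, assuming a
ring that tracks a count of slots the application has processed but not
yet released. The structure and function names (toy_ring, toy_ring_hold,
toy_ring_release, toy_ring_first_held) are hypothetical and are not the
netmap API; only the idea of the extra counter comes from the message above.

```c
#include <stdint.h>

/* Hypothetical ring state; field names are illustrative, not netmap's. */
struct toy_ring {
	uint32_t num_slots;	/* total slots in the ring */
	uint32_t cur;		/* next slot the application will look at */
	uint32_t avail;		/* slots ready for the application */
	uint32_t reserved;	/* slots processed but still held by the app */
};

/* Consume up to n slots but keep holding their buffers (zero-copy). */
uint32_t
toy_ring_hold(struct toy_ring *r, uint32_t n)
{
	if (n > r->avail)
		n = r->avail;
	r->cur = (r->cur + n) % r->num_slots;
	r->avail -= n;
	r->reserved += n;
	return (n);
}

/* Give back up to n held slots so they can be reused. */
uint32_t
toy_ring_release(struct toy_ring *r, uint32_t n)
{
	if (n > r->reserved)
		n = r->reserved;
	r->reserved -= n;
	return (n);
}

/* Index of the oldest slot still held by the application. */
uint32_t
toy_ring_first_held(const struct toy_ring *r)
{
	return ((r->cur + r->num_slots - r->reserved) % r->num_slots);
}
```

The point of the separate counter is that advancing cur no longer implies
the buffers behind it are free; they stay owned by the application until
it explicitly releases them.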
jhb [Mon, 27 Feb 2012 17:33:16 +0000 (17:33 +0000)]
- Panic up front if a kernel does not include 'device atpic' and an
APIC is not found.
- Don't panic if lapic_enable_cmc() is called and the APIC is not enabled.
This can happen due to booting a kernel with APIC disabled on a CPU that
supports CMCI.
- Wrap a long line.
jhb [Mon, 27 Feb 2012 16:08:18 +0000 (16:08 +0000)]
Clear a device's description string anytime its driver changes.
Descriptions are specific to drivers and we don't change drivers on attached
devices. This fixes a few places where we were not clearing the description
when detaching a driver (e.g. when device_attach() failed). While here, fix
a few other nits:
- Remove spurious call to remove a device's driver from
devclass_driver_deleted(). device_detach() removes it already.
- Fix a typo.
mav [Mon, 27 Feb 2012 10:31:54 +0000 (10:31 +0000)]
Rework CPU load balancing in SCHED_ULE:
- In sched_pickcpu() be more careful taking previous CPU on SMT systems.
Do it only if all other logical CPUs of that physical one are idle to avoid
extra resource sharing.
- In sched_pickcpu() change general logic of CPU selection. First
look for idle CPU, sharing last level cache with previously used one,
skipping SMT CPU groups. If none found, search all CPUs for the least loaded
one, where the thread with its priority can run now. If none found, search
just for the least loaded CPU.
- Make cpu_search() compare lowest/highest CPU load when comparing CPU
groups with equal load. That makes it possible to differentiate 1+1 and
2+0 loads.
- Make cpu_search() prefer the specified (previous) CPU or group if load
is equal. This improves cache affinity for more complicated topologies.
- Randomize CPU selection if the above factors are equal. Previous code
tended to prefer CPUs with lower IDs, causing unneeded collisions.
- Rework periodic balancer in sched_balance_group(). With cpu_search()
more intelligent now, make the balancing process flat, removing recursion
over the topology tree. That fixes the double swap problem and makes load
distribution more even and predictable.
All together this gives 10-15% performance improvement in many tests on
CPUs with SMT, such as Core i7, when the number of threads is less than the
number of logical CPUs. In some tests it also has a positive effect on
systems without SMT.
Reviewed by: jeff
Tested by: flo, hackers@
MFC after: 1 month
Sponsored by: iXsystems, Inc.
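A minimal sketch of the tie-breaking described for cpu_search(), assuming
a search for the least loaded group: lowest total load wins, equal totals
are separated by the lowest per-CPU load (so 2+0 is preferred over 1+1
because it still has an idle CPU), and a remaining tie goes to the
previous group. All names are hypothetical, the direction of the
tie-break is an assumption, and the randomization step is left out.

```c
#include <stddef.h>

#define TOY_GROUP_CPUS 2

/* Hypothetical CPU group: per-CPU run-queue lengths. */
struct toy_group {
	int load[TOY_GROUP_CPUS];
};

/* Sum of the loads of all CPUs in the group. */
int
toy_group_total(const struct toy_group *g)
{
	int sum = 0;

	for (size_t i = 0; i < TOY_GROUP_CPUS; i++)
		sum += g->load[i];
	return (sum);
}

/* Load of the least loaded CPU in the group. */
int
toy_group_min(const struct toy_group *g)
{
	int min = g->load[0];

	for (size_t i = 1; i < TOY_GROUP_CPUS; i++)
		if (g->load[i] < min)
			min = g->load[i];
	return (min);
}

/*
 * Return the index of the preferred group, or -1 if n is 0.
 * prev is the index of the previously used group, or -1 for none.
 */
int
toy_pick_least_loaded(const struct toy_group *g, size_t n, int prev)
{
	int best = -1;

	for (size_t i = 0; i < n; i++) {
		if (best < 0) {
			best = (int)i;
			continue;
		}
		int ti = toy_group_total(&g[i]);
		int tb = toy_group_total(&g[best]);
		int mi = toy_group_min(&g[i]);
		int mb = toy_group_min(&g[best]);

		if (ti < tb || (ti == tb && mi < mb) ||
		    (ti == tb && mi == mb && (int)i == prev))
			best = (int)i;
	}
	return (best);
}
```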
delphij [Mon, 27 Feb 2012 05:49:00 +0000 (05:49 +0000)]
Drop setuid status while doing file operations to prevent potential
information leak. This changeset is intended to be a minimal one
to make backports easier.
phk [Sun, 26 Feb 2012 20:56:49 +0000 (20:56 +0000)]
Also call the low-level driver if ->c_iflag & (IXON|IXOFF|IXANY) changes.
Uftdi(4) examines (c_iflag & (IXON|IXOFF)) to control hw XON-XOFF support.
This is obviously no good, if changes to those bits are not communicated
down the stack.
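A minimal sketch of the kind of check involved, assuming the tty layer
compares the old and new termios settings before deciding whether to call
the low-level driver. The function name is hypothetical; IXON, IXOFF and
IXANY are the standard termios input flags.

```c
#include <termios.h>

/* Return nonzero if any software flow-control bit differs. */
int
toy_flow_bits_changed(const struct termios *old, const struct termios *cur)
{
	return (((old->c_iflag ^ cur->c_iflag) &
	    (IXON | IXOFF | IXANY)) != 0);
}
```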
kib [Sun, 26 Feb 2012 13:55:43 +0000 (13:55 +0000)]
Add SO_PROTOCOL/SO_PROTOTYPE socket SOL_SOCKET-level option to get the
socket protocol number. This is useful since the socket type can
be implemented by different protocols in the same protocol family,
e.g. SOCK_STREAM may be provided by both TCP and SCTP.
Submitted by: Jukka A. Ukkonen <jau iki fi>
PR: kern/162352
Discussed with: bz
Reviewed by: glebius
MFC after: 2 weeks
adrian [Sun, 26 Feb 2012 06:04:44 +0000 (06:04 +0000)]
Add in some debugging code to check whether the current rate table has
been bait-and-switched from the rate control code.
This will avoid the panic that I saw and will avoid sending invalid rates
(eg 11a/11g OFDM rates when in 11b, on 11b-only NICs (AR5211)) where the
rate table is not "big".
It also will point out situations where this occurs for the 11n NICs
which will have sufficiently large rate tables that "invalid rix" doesn't
occur.
I'll try to follow this up with a commit that adds a current operating mode
check. The "rix" is only relevant to the current operating mode and rate
table.
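A minimal sketch of the two checks described, assuming the rate control
code remembers which rate table it computed its index against. The
structure and function names are hypothetical and do not come from the
ath(4) driver.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical rate table; only the entry count matters here. */
struct toy_rate_table {
	int rate_count;
};

/*
 * A rate index is usable only if the table in use now is the one the
 * rate control code saw, and the index is within that table.
 */
bool
toy_rix_usable(const struct toy_rate_table *current,
    const struct toy_rate_table *seen_by_rc, uint8_t rix)
{
	if (current == NULL || current != seen_by_rc)
		return (false);	/* table was switched underneath us */
	return (rix < current->rate_count);
}
```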
adrian [Sat, 25 Feb 2012 19:12:54 +0000 (19:12 +0000)]
Attempt to further fix some of the concurrency/reset issues that occur.
* ath_reset() is being called in softclock context, which may have the
thing sleep on a lock. To avoid this, since we really _shouldn't_
be sleeping on any locks, break out the no-loss reset path into a tasklet
and call that from:
+ ath_calibrate()
+ ath_watchdog()
This has the added advantage that it'll end up also doing the frame
RX cleanup from within the taskqueue context, rather than the softclock
context.
* Shuffle around the taskqueue_block() call to be before we grab the lock
and disable interrupts.
The trouble here is that taskqueue_block() doesn't block currently
queued (but not yet running) tasks so calling it doesn't guarantee
no further tasks (that weren't running on _A_ CPU at the time of this
call) will complete. Calling taskqueue_drain() on these tasks won't
work because if any _other_ thread calls taskqueue_enqueue() for whatever
reason, everything gets very angry and stops working.
This slightly changes the race condition enough to let ath_rx_tasklet()
run before we try disabling it, and thus quietens the warnings a bit.
The (more) true solution will be doing something like the following:
* having a taskqueue_blocked mask in ath_softc;
* having an interrupt_blocked mask in ath_softc;
* only calling taskqueue_drain() on each individual task _after_ the
lock has been acquired - that way no further tasklet scheduling
is going to occur.
* Then once the tasks have been blocked _and_ the interrupt has been
disabled, call taskqueue_drain() on each, ensuring that anything
that _was_ scheduled or running is removed.
The trouble is if something calls taskqueue_enqueue() on a task
after taskqueue_block() has been called but BEFORE taskqueue_drain()
has been called, ta_pending will be set to 1 and taskqueue_drain()
will sit there stuck in msleep() until you hard-kill the machine.
alc [Sat, 25 Feb 2012 17:49:59 +0000 (17:49 +0000)]
Simplify vmspace_fork()'s control flow by copying immutable data before
the vm map locks are acquired. Also, eliminate redundant initialization
of the new vm map's timestamp.
mm [Sat, 25 Feb 2012 10:58:02 +0000 (10:58 +0000)]
Update libarchive to 3.0.3
Some of the new features:
- New readers: RAR, LHA/LZH, CAB, 7-Zip
- New writers: ISO9660, XAR
- Improvements to many formats, especially including ISO9660 and Zip
- Stackable write filters to write, e.g., tar.gz.uu in a single pass
- Exploit seekable input; new "seekable" Zip reader can exploit the Zip
Central Directory when it's available; the old "streamable" Zip reader
is still fully supported for cases where seeking is not possible.
Full release notes available at:
https://github.com/libarchive/libarchive/wiki/ReleaseNotes
trociny [Sat, 25 Feb 2012 10:15:41 +0000 (10:15 +0000)]
When detaching a unix domain socket, uipc_detach() checks the
unp->unp_vnode pointer to detect if there is a vnode associated with
(bound to) this socket and does the necessary cleanup if there is.
The issue is that after a forced unmount this check may be too late as
the unp_vnode is reclaimed and the reference is stale.
To fix this, provide a helper function that is called on a socket vnode
reclamation to do the necessary cleanup.
yongari [Sat, 25 Feb 2012 04:54:51 +0000 (04:54 +0000)]
Use correct Config registers for RTL8139 family. Unlike the RTL8168 and
RTL810x families, RTL8139 has a different register map for Config
registers.
While here, follow the lead of re(4) in WOL configuration.
- Disable WOL_UCAST and WOL_MCAST capabilities by default.
- Config5 register write does not need to unlock EEPROM access
on RTL8139 family but unlocking EEPROM access does not affect
its operation and makes it consistent with re(4).
Reported by: Matt Renzelmann mjr <> cs dot wisc dot edu
davidxu [Sat, 25 Feb 2012 02:12:17 +0000 (02:12 +0000)]
In revision 231989, we pass a 16-bit clock ID into the kernel. However,
according to the POSIX document, clock IDs may be dynamically allocated,
so they are unlikely to stay below 64K forever. To make it future
compatible, we pack all timeout information into a new structure called
_umtx_time, and use the fourth argument as a size indication. A zero means
it is old code using a timespec as the timeout value; the new structure
also includes flags and a clock ID, so its size argument differs from
before and is non-zero. With this change, it is possible for a thread to
sleep on any supported clock, though current kernel code does not have
such a POSIX clock driver system.
thompsa [Fri, 24 Feb 2012 17:50:36 +0000 (17:50 +0000)]
Only look for a usable MAC address for the bridge ID from ports within our
bridge. This allows us to have more than one independent bridge in the same
STP domain.
jhb [Fri, 24 Feb 2012 17:26:06 +0000 (17:26 +0000)]
Adjust the nfs_skip_wcc_data_onerr setting so that it does not block
post-op attributes for ENOENT errors now that the name caching logic
depends on working post-op attributes.
bz [Fri, 24 Feb 2012 14:13:06 +0000 (14:13 +0000)]
Update scripts to work around two sh(1) bugs found in stable/8:
1) _x=$((_x + 1)) does not work while x=$((x + 1)) does.
2) Parameter Expansion, esp. "${x%%bar}" does not work if quoted.
Correct typos and improve some details forwarding.sh already
had in initiator, esp. related to ipfw accepting if the default
is deny.
Add an extra stat call to the "delay" function in addition to the
touch; together they are still a lot faster than sleep 1 but seem
to help a lot more in mitigating the unrelated kernel race seen.
kib [Fri, 24 Feb 2012 10:41:58 +0000 (10:41 +0000)]
Place the if() at the right location, to activate the v_writecount
accounting for shared writeable mappings for all filesystems, not only
for the bypass layers.
Submitted by: alc
Pointy hat to: kib
MFC after: 20 days
adrian [Fri, 24 Feb 2012 05:40:36 +0000 (05:40 +0000)]
Hold IF_LOCK when manipulating the interface flags.
It doesn't _really_ help all that much, I'll commit something to
sys/net/if.c at some point explaining why, but the lock should be held
when checking/manipulating/branching because of said lock.
marius [Fri, 24 Feb 2012 00:42:50 +0000 (00:42 +0000)]
Forced commit to denote that the commit message of r231621 should have read:
- As it turns out, MSI-X is broken for at least LSI SAS1068E when passed
through by VMware so blacklist their PCI-PCI bridge for MSI/MSI-X here.
Note that besides there currently not being a quirk type that disables
MSI-X only and there being no evidence that MSI doesn't work with the VMware
pass-through, it's really questionable whether MSI generally works in
that setup as VMware only mention three known working devices [1, p. 4].
Also note that this quirk entry currently doesn't affect the devices
emulated by VMware in any way as these don't claim to support MSI/MSI-X to
begin with. [2]
While at it, make the PCI quirk table const and static.
- Remove some duplicated empty lines.
- Use DEVMETHOD_END.
jkim [Fri, 24 Feb 2012 00:02:46 +0000 (00:02 +0000)]
- Add support for Family 12h, 14h and 15h processors.
- Remove all attempts to guess physical temperature using DiodeOffset.
There are too many reports that it varies wildly depending on motherboard.
Instead, if it is known to scale well and its offset is known from other
temperature sensors on board, the user may set "dev.amdtemp.0.sensor_offset"
tunable to compensate for the difference. Document the caveats in amdtemp(4).
- Add a quirk for Socket AM2 Revision G processors. These processors are
known to have a different offset according to Linux k8temp driver.
- Warn about Family 10h Erratum 319. These processors have broken sensors.
- Report temperature in a more logical order under dev.amdtemp node. For
example, "dev.amdtemp.0.sensor0.core0" is now "dev.amdtemp.0.core0.sensor0".
- Replace K8, K10 and K11 with official processor names in amdtemp(4).
kib [Thu, 23 Feb 2012 21:07:16 +0000 (21:07 +0000)]
Account the writeable shared mappings backed by file in the vnode
v_writecount. Keep the amount of the virtual address space used by
the mappings in the new vm_object un_pager.vnp.writemappings
counter. The vnode v_writecount is incremented when writemappings gets
non-zero value, and decremented when writemappings is returned to
zero.
Writeable shared vnode-backed mappings are accounted for in vm_mmap(),
and vm_map_insert() is instructed to set MAP_ENTRY_VN_WRITECNT flag on
the created map entry. During deferred map entry deallocation,
vm_map_process_deferred() checks for MAP_ENTRY_VN_WRITECNT and
decrements writemappings for the vm object.
Now, the writeable mount cannot be demoted to read-only while
writeable shared mappings of the vnodes from the mount point
exist. Also, execve(2) fails for such files with ETXTBSY, as it
should.
Noted by: tegge
Reviewed by: tegge (long time ago, early version), alc
Tested by: pho
MFC after: 3 weeks
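A minimal sketch of the counting rule described above: the per-object
byte counter changes on every mapping, but the vnode's write count only
moves on the transitions from zero and back to zero. The structures and
function names are hypothetical simplifications, not the kernel's
vm_object or vnode.

```c
#include <stddef.h>

/* Hypothetical vnode: only the write count matters here. */
struct toy_vnode {
	int writecount;
};

/* Hypothetical VM object backed by a vnode. */
struct toy_object {
	struct toy_vnode *vp;
	size_t writemappings;	/* bytes of writeable shared mappings */
};

/* Account for a new writeable shared mapping of the given size. */
void
toy_add_writemapping(struct toy_object *obj, size_t bytes)
{
	if (bytes == 0)
		return;
	if (obj->writemappings == 0)
		obj->vp->writecount++;	/* first writer mapping */
	obj->writemappings += bytes;
}

/* Account for the removal of a writeable shared mapping. */
void
toy_remove_writemapping(struct toy_object *obj, size_t bytes)
{
	if (bytes == 0 || bytes > obj->writemappings)
		return;		/* ignore inconsistent requests */
	obj->writemappings -= bytes;
	if (obj->writemappings == 0)
		obj->vp->writecount--;	/* last writer mapping gone */
}
```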
mm [Thu, 23 Feb 2012 18:51:24 +0000 (18:51 +0000)]
To improve control over the use of mount(8) inside a jail(8), introduce
a new jail parameter node with the following parameters:
allow.mount.devfs:
allow mounting the devfs filesystem inside a jail
allow.mount.nullfs:
allow mounting the nullfs filesystem inside a jail
Both parameters are disabled by default (equals the behavior before
devfs and nullfs in jails). Administrators have to explicitly allow
mounting devfs and nullfs for each jail. The value "-1" of the
devfs_ruleset parameter is removed in favor of the new allow setting.
kmacy [Thu, 23 Feb 2012 18:21:37 +0000 (18:21 +0000)]
When using flowtable, llentries can outlive the interface with which they're associated,
at which point the lle_tbl pointer points to freed memory and the llt_free pointer is no longer
valid.
Move the free pointer into the llentry itself and update the initialization sites.
rmacklem [Thu, 23 Feb 2012 16:47:05 +0000 (16:47 +0000)]
hrs@ reported a panic to freebsd-stable@ under the subject line
"panic in 8.3-PRERELEASE" on Feb. 22, 2012. This panic was caused
by use of a mix of tsleep() and msleep() calls on the same event
in the new NFS server DRC code. It did "mtx_unlock(); tsleep();"
in two places, which kib@ noted introduced a slight risk that the
wakeup() would occur before the tsleep(), resulting in a 10sec
delay before waking up. This patch fixes the problem by replacing
"mtx_unlock(); tsleep();" with mtx_sleep(..PDROP..). It also
changes a nfsmsleep() call to mtx_sleep() so that the code uses
mtx_sleep() consistently within the file.
Tested by: hrs (in progress)
Reviewed by: jhb
MFC after: 5 days
kib [Thu, 23 Feb 2012 11:50:23 +0000 (11:50 +0000)]
Allow the parent to gather the exit status of the children reparented
to the debugger. When reparenting for debugging, keep the child in
the new orphan list of the old parent. When looping over the children in
kern_wait(), iterate over both the children list and the orphan list to
search for the process by pid.
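A minimal sketch of the two-list lookup, assuming singly linked lists
and a parent that owns both. The structures and the function name are
hypothetical; the kernel uses its own list macros and locking, which are
omitted here.

```c
#include <stddef.h>

/* Hypothetical process entry, linked into one of two lists. */
struct toy_proc {
	int pid;
	struct toy_proc *next_sibling;	/* link in the children list */
	struct toy_proc *next_orphan;	/* link in the orphan list */
};

/* Hypothetical parent with its real children and its orphans. */
struct toy_parent {
	struct toy_proc *children;
	struct toy_proc *orphans;	/* reparented to a debugger */
};

/* Find a process the parent may wait for, searching both lists. */
struct toy_proc *
toy_find_waitable(const struct toy_parent *pp, int pid)
{
	struct toy_proc *p;

	for (p = pp->children; p != NULL; p = p->next_sibling)
		if (p->pid == pid)
			return (p);
	for (p = pp->orphans; p != NULL; p = p->next_orphan)
		if (p->pid == pid)
			return (p);
	return (NULL);
}
```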
adrian [Thu, 23 Feb 2012 08:32:54 +0000 (08:32 +0000)]
Use the passed-in channel rather than ic->ic_curchan.
I'm not sure _why_ the ic is NULL here, but I've seen it occasionally do
this after I've been tinkering with things for a while. It ends up
crashing in a call to ath_chan_set() via the net80211 scan code and scan
task.
yongari [Thu, 23 Feb 2012 08:22:44 +0000 (08:22 +0000)]
Add check for IFF_DRV_RUNNING flag after serving an interrupt and
don't give RX path more priority than TX path.
Also remove the infinite loop in the interrupt handler and limit the number
of iterations to 32. This change addresses system load fluctuations
under high network load.
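A minimal sketch of a bounded interrupt loop of this kind, assuming one
pending event is served per pass and the running flag is re-checked
after each one. The device structure and names are hypothetical; only
the limit of 32 passes and the re-check of the running state come from
the message above.

```c
#define TOY_INTR_MAX_LOOPS 32

/* Hypothetical device state. */
struct toy_dev {
	int pending;	/* events waiting to be served */
	int running;	/* stands in for IFF_DRV_RUNNING */
};

/*
 * Serve pending events, but never more than TOY_INTR_MAX_LOOPS per
 * call, and stop early if the device is no longer running.
 * Returns the number of events served.
 */
int
toy_intr(struct toy_dev *sc)
{
	int n;

	for (n = 0; n < TOY_INTR_MAX_LOOPS; n++) {
		if (!sc->running || sc->pending == 0)
			break;
		sc->pending--;	/* serve one event */
	}
	return (n);
}
```

Bounding the loop trades a little latency under sustained load for a
handler that always returns, which gives the rest of the system a chance
to run.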
yongari [Thu, 23 Feb 2012 06:35:18 +0000 (06:35 +0000)]
With r232015, sf(4) gets correct speed/duplex of established link.
Add a stricter speed check in sf_miibus_statchg() and do not touch
MAC config registers when the driver has lost link.
yongari [Thu, 23 Feb 2012 06:13:12 +0000 (06:13 +0000)]
Remove taskqueue based MII stat change handler.
Driver does not need deferred link state change processing.
While I'm here, do not report current link status if interface is
not UP.
yongari [Thu, 23 Feb 2012 05:23:21 +0000 (05:23 +0000)]
Introduce sf_ifmedia_upd_locked() and have driver reset PHY before
switching to selected media. While here, set if_drv_flags before
switching to selected media.