jhb [Thu, 22 Mar 2007 16:09:23 +0000 (16:09 +0000)]
- Simplify the #ifdef's for adaptive mutexes and rwlocks by conditionally
defining a macro earlier in the file.
- Add NO_ADAPTIVE_RWLOCKS option to disable adaptive spinning for rwlocks.
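The pattern described above, deciding once near the top of the file and testing a single derived macro afterwards, might look roughly like the sketch below. SMP and NO_ADAPTIVE_RWLOCKS stand in for kernel options; the derived macro name ADAPTIVE_RWLOCKS and the helper rw_spins_adaptively() are assumptions made for illustration only.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Decide once whether adaptive spinning is compiled in. The rest of
 * the file then tests only ADAPTIVE_RWLOCKS (an assumed name) instead
 * of repeating the compound condition.
 */
#if defined(SMP) && !defined(NO_ADAPTIVE_RWLOCKS)
#define	ADAPTIVE_RWLOCKS
#endif

/* Hypothetical helper: reports whether the spinning path was built. */
static bool
rw_spins_adaptively(void)
{
#ifdef ADAPTIVE_RWLOCKS
	return (true);
#else
	return (false);
#endif
}
```

Built without SMP defined, or with NO_ADAPTIVE_RWLOCKS defined, the helper reports false.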
glebius [Thu, 22 Mar 2007 13:21:24 +0000 (13:21 +0000)]
Move the dom_dispose and pru_detach calls in sofree() earlier. Only after
calling pru_detach can we be absolutely sure that we don't have any
references to the socket in the stack.
This closes a race between lockless sbdestroy() and data arriving on socket.
glebius [Thu, 22 Mar 2007 10:51:03 +0000 (10:51 +0000)]
When working on an RTM_CHANGE do the route editing in the following
sequence. First, if rt_ifa is going to be changed, then call
ifa_rtrequest(RTM_DELETE). Second, if gateway is going to be changed,
then call rt_setgate(). Third, change rt_ifa.
With this change we are able to change a link level route to a
gateway one, which wasn't possible before.
glebius [Thu, 22 Mar 2007 10:37:53 +0000 (10:37 +0000)]
Remove the global list of all llinfo_arp entries and use a per-instance
callout for expiry of the ARP entries. Since we no longer abuse the IPv4
radix head lock, we can now enter arp_rtrequest() with a lock held on
an arbitrary rt_entry.
jhb [Wed, 21 Mar 2007 22:22:13 +0000 (22:22 +0000)]
Rename the cv_*wait*() functions to _cv_*wait*() and change their second
argument from a mutex to a lock_object. Add cv_*wait*() wrapper macros
that accept either a mutex, rwlock, or sx lock as the second argument and
convert it to a lock_object and then call _cv_*wait*(). Basically, the
visible difference is that you can now use rwlocks and sx locks with
condition variables using the same API as with mutexes.
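A minimal user-space sketch of the wrapper idea follows. The structures are toy stand-ins rather than the kernel definitions; the only property relied on is that each lock type embeds a member named lock_object. The fields lo_name, cv_last_lock and the *_state members are invented for the example, and only cv_wait() is shown.

```c
#include <assert.h>
#include <string.h>

/* Toy stand-ins for the kernel types; only the shared member matters. */
struct lock_object { const char *lo_name; };
struct mtx    { struct lock_object lock_object; int mtx_state; };
struct rwlock { struct lock_object lock_object; int rw_state; };
struct sx     { struct lock_object lock_object; int sx_state; };
struct cv     { const char *cv_last_lock; };

/*
 * The renamed worker takes the generic lock_object. A real
 * implementation would sleep; this sketch only records which lock
 * it was handed.
 */
static void
_cv_wait(struct cv *cvp, struct lock_object *lock)
{
	cvp->cv_last_lock = lock->lo_name;
}

/* Wrapper macro: accepts any lock type with a 'lock_object' member. */
#define	cv_wait(cvp, lock)	_cv_wait((cvp), &(lock)->lock_object)
```

Callers keep writing cv_wait(&cv, &lock) whether lock is a struct mtx, struct rwlock, or struct sx, because the macro resolves the member by name at compile time.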
jhb [Wed, 21 Mar 2007 22:18:10 +0000 (22:18 +0000)]
Make use of 'lock_object' being the same field name in the witness_check*()
macros.
- witness_check() replaces witness_check_mtx() and
witness_check_exclusive_sx() and checks for an exclusive acquire of
either a mutex, rwlock, or sx lock.
- witness_check_shared() replaces witness_check_shared_sx() and checks for
a shared acquire of either a rwlock or sx lock.
jhb [Wed, 21 Mar 2007 19:32:08 +0000 (19:32 +0000)]
If vn_open() fails during kern_open(), don't fdrop() the new file object
until after the call to fdclose(). This closes an obscure race that
could result in the later call to fdclose() actually closing a different
file descriptor if another thread close()'s the file descriptor being
opened before fdrop() is called, so the fdrop() in kern_open() frees the
file object, then the second thread (or a third) creates a new file
descriptor which reuses both the same index and the same file pointer
thus tricking fdclose() in the first thread into thinking that the
original file was still open.
jhb [Wed, 21 Mar 2007 18:40:31 +0000 (18:40 +0000)]
Fix an off-by-one error in iwi_init_fw_dma(). It didn't reuse the existing
DMA memory for a firmware load if it was the exact size needed, thus in the
common case the driver was constantly free'ing and reallocating the DMA
buffer and it would eventually begin to fail. With this fix, iwi0 reuses
the same buffer the entire time and no longer fails to load the firmware
after the machine has been up for a while.
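The class of bug described above can be sketched as a one-character comparison error. The helper name fw_dma_can_reuse() and its parameters are hypothetical and are not taken from the driver.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical reuse check. An off-by-one form such as 'have > need'
 * rejects a buffer that is exactly the required size, forcing a free
 * and reallocation on every firmware load.
 */
static bool
fw_dma_can_reuse(size_t have, size_t need)
{
	return (have >= need);
}
```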
andre [Wed, 21 Mar 2007 18:25:28 +0000 (18:25 +0000)]
Make TCP_DROP_SYNFIN a standard part of TCP. Disabled by default, it
doesn't impede normal operation and is only a few lines of
code. Its close relatives blackhole and log_in_vain aren't options
either.
andre [Wed, 21 Mar 2007 18:05:54 +0000 (18:05 +0000)]
Remove tcp_minmssoverload DoS detection logic. The problem it tried to
protect us from wasn't really there and it only bloats the code. Should
the problem surface in the future we can simply resurrect it from cvs
history.
jhb [Wed, 21 Mar 2007 15:39:11 +0000 (15:39 +0000)]
Change acpi's handling of suballocating system resources to be a little
simpler. It now can just use rman_is_region_manager() during
acpi_release_resource() to see if the resource is suballocated from
a system resource. Also, the driver no longer needs MD knowledge about
how to set up bus space tags and handles when doing a suballocation, but
can simply rely on bus_activate_resource() in the parent setting all that
up.
jhb [Wed, 21 Mar 2007 15:36:38 +0000 (15:36 +0000)]
Change the amd64, i386, and ia64 nexus drivers to set up bus space tags and
handles when activating a resource via bus_activate_resource() rather than
doing some of the work in bus_alloc_resource() and some of it in
bus_activate_resource().
One note is that when using isa_alloc_resourcev() on PC-98, drivers now
need to just use bus_release_resource() without explicitly calling
bus_deactivate_resource() first. nyan@ has already fixed all of the PC-98
drivers.
sam [Wed, 21 Mar 2007 03:42:51 +0000 (03:42 +0000)]
Overhaul driver/subsystem api's:
o make all crypto drivers have a device_t; pseudo drivers like the s/w
crypto driver synthesize one
o change the api between the crypto subsystem and drivers to use kobj;
cryptodev_if.m defines this api
o use the fact that all crypto drivers now have a device_t to add support
for specifying which of several potential devices to use when doing
crypto operations
o add new ioctls that allow user apps to select a specific crypto device
to use (previous ioctls maintained for compatibility)
o overhaul crypto subsystem code to eliminate lots of cruft and hide
implementation details from drivers
o bring in numerous fixes from Michael Richardson/hifn; mostly for
795x parts
o add an optional mechanism for mmap'ing the hifn 795x public key h/w
to user space for use by openssl (not enabled by default)
o update crypto test tools to use new ioctl's and add cmd line options
to specify a device to use for tests
These changes will also enable much future work on improving the core
crypto subsystem; including proper load balancing and interposing code
between the core and drivers to dispatch small operations to the s/w
driver as appropriate.
These changes were instigated by the work of Michael Richardson.
jhb [Tue, 20 Mar 2007 21:53:31 +0000 (21:53 +0000)]
Add a new apic0 pseudo-device to claim memory resources for the memory
address ranges used by local and I/O APICs in the system. Some systems
also reserve these ranges as system resources via either PnPBIOS or
ACPI, so this device currently attaches after acpi0 and legacy0 so that
the system resources are given precedence.
jhb [Tue, 20 Mar 2007 21:08:39 +0000 (21:08 +0000)]
Add a new ram0 pseudo-device that claims memory resources for physical
addresses corresponding to system RAM. On amd64 ram0 uses the SMAP
and claims all the type 1 SMAP regions. On i386 ram0 uses the
dump_avail[] array. Note that on i386 we have to ignore regions above
4G in PAE kernels since bus resources use longs.
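The i386 restriction can be illustrated with a small sketch that walks a dump_avail[]-style array and skips regions a 32-bit resource range cannot describe. The array layout (start/end pairs terminated by an entry whose end is 0), the helper name ram_count_claimable(), and the choice to skip only regions that start at or above 4G are assumptions for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define	FOUR_GB	(UINT64_C(1) << 32)

/*
 * Count the regions that could be claimed. 'avail' holds (start, end)
 * pairs and ends with an entry whose end is 0, mimicking the layout
 * assumed here for dump_avail[].
 */
static size_t
ram_count_claimable(const uint64_t *avail)
{
	size_t n = 0;

	for (const uint64_t *p = avail; p[1] != 0; p += 2) {
		/* Skip anything that starts at or above 4G. */
		if (p[0] >= FOUR_GB)
			continue;
		n++;
	}
	return (n);
}
```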
jkim [Tue, 20 Mar 2007 20:22:45 +0000 (20:22 +0000)]
- Add macros for newly added CPUID bits in the corresponding header files.
- Use correct capitalization in xTPR as Intel uses in their documents.
- Use proper description instead of vendor code name in comment.
jhb [Tue, 20 Mar 2007 20:21:44 +0000 (20:21 +0000)]
Tweak the probe/attach order of devices on the x86 nexus devices.
Various BIOS-related pseudo-devices are added at an order of 5. acpi0 is
added at an order of 10, and legacy0 is added at an order of 11.
bms [Tue, 20 Mar 2007 13:15:20 +0000 (13:15 +0000)]
Increase default size of raw IP send and receive buffers to the same as
udp_sendspace, to avoid a situation where jumbograms (datagrams > 9KB)
are unnecessarily fragmented.
A common use case for this is OSPF link-state database synchronization
during adjacency bringup on a high speed network with a large MTU.
It is not possible to auto-tune this setting until a socket is bound to
a given interface, and because the laddr part of the inpcb tuple may be
overridden, it makes no sense to do so. Applications may request a larger
socket buffer size by using the SO_SNDBUF and SO_RCVBUF socket options.
Certain applications such as Quagga ospfd do not probe for interface MTU
and therefore do not increase SO_SNDBUF in this use case.
XORP is not affected by this problem as it preemptively uses SO_SNDBUF
and SO_RCVBUF to account for any possible additional latency in XRL IPC.
PR: kern/108375
Requested by: Vladimir Ivanov
MFC after: 1 week
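An application that wants larger buffers than the default can ask for them explicitly. The sketch below uses the standard setsockopt(2) call with the options as they are spelled in <sys/socket.h>, SO_SNDBUF and SO_RCVBUF; the helper name set_sock_bufs() is made up for the example, and the kernel may clamp or reject the requested size depending on system limits.

```c
#include <assert.h>
#include <sys/types.h>
#include <sys/socket.h>

/*
 * Ask for 'bytes' of send and receive buffer space on socket 's'.
 * Returns 0 on success, -1 if either setsockopt(2) call fails.
 */
static int
set_sock_bufs(int s, int bytes)
{
	if (setsockopt(s, SOL_SOCKET, SO_SNDBUF, &bytes, sizeof(bytes)) == -1)
		return (-1);
	if (setsockopt(s, SOL_SOCKET, SO_RCVBUF, &bytes, sizeof(bytes)) == -1)
		return (-1);
	return (0);
}
```

A routing daemon would typically call this right after socket(2), passing a size derived from the largest datagram it expects to send.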
rrs [Tue, 20 Mar 2007 10:23:11 +0000 (10:23 +0000)]
- window update sacks sent incorrectly after
shutdown which caused extra abort from peer.
- RTT time calculation was not being done in
express sack handling since it referred to an unused
variable (rto_pending). Removed variable.
- socket buffer high water access macro-ized.
kmacy [Tue, 20 Mar 2007 06:21:47 +0000 (06:21 +0000)]
cxgb_stop is only called from cxgb_ioctl so:
- don't acquire port lock, already held in ioctl
- rename to cxgb_stop_locked
- switch callout_drain to callout_stop to avoid a hang from having the port lock held
jasone [Tue, 20 Mar 2007 03:44:10 +0000 (03:44 +0000)]
Avoid using vsnprintf(3) unless MALLOC_STATS is defined, in order to
avoid substantial potential bloat for static binaries that do not
otherwise use any printf(3)-family functions. [1]
Rearrange arena_run_t so that the region bitmask can be minimally sized
according to constraints related to each bin's size class. Previously,
the region bitmask was the same size for all run headers, which wasted
a measurable amount of memory.
Rather than making runs for small objects as large as possible, make
runs as small as possible such that header overhead stays below a
certain bound. There are two exceptions that override the header
overhead bound:
1) If the bound is impossible to honor, it is relaxed on a
per-size-class basis. Since there is one bit of header
overhead per object (plus a constant), it is impossible to
achieve a header overhead less than or equal to 1/(# of bits
per object). For the current setting of maximum 0.5% header
overhead, this relaxation comes into play for {2, 4, 8,
16}-byte objects, for which header overhead is (on 64-bit
systems) {7.1, 4.3, 2.2, 1.2}%, respectively.
2) There is still a cap on small run size, still set to 64kB.
This comes into play for {1024, 2048}-byte objects, for which
header overhead is {1.6, 3.1}%, respectively.
In practice, this reduces the run sizes, which makes worst case
low-water memory usage due to fragmentation less bad. It also reduces
worst case high-water run fragmentation due to non-full runs, but this
is only a constant improvement (most important to small short-lived
processes).
Reduce the default chunk size from 2MB to 1MB. Benchmarks indicate that
the external fragmentation reduction makes 1MB the new sweet spot (as
small as possible without adversely affecting performance).
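The first exception can be checked with a little arithmetic. With one header bit per object, the message puts the floor on header overhead at 1/(bits per object), so a 0.5% (1/200) bound is out of reach whenever an object is 200 bits or smaller. The sketch below encodes only that test; the function and macro names are hypothetical and it does not model the constant part of the header or the 64kB cap.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Maximum header overhead of 0.5%, expressed as 1/200. */
#define	OVERHEAD_BOUND_DENOM	200

/*
 * Overhead cannot drop to 1/(bits per object) or below, so the bound
 * is unreachable when 1/(8 * size) >= 1/200, i.e. 8 * size <= 200.
 */
static bool
bound_needs_relaxing(size_t obj_size)
{
	assert(obj_size > 0);
	return (obj_size * 8 <= OVERHEAD_BOUND_DENOM);
}
```

This agrees with the size classes listed in the message: 16-byte objects (128 bits) need the relaxation, while 32-byte objects (256 bits) do not.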
njl [Tue, 20 Mar 2007 00:58:19 +0000 (00:58 +0000)]
If we got an OBE/IBF event, we failed to re-enable the GPE. This would
cause the EC to stop handling future events because the GPE stayed masked.
Set a flag when queueing a GPE handler since it will ultimately re-enable
the GPE. In all other cases, re-enable it ourselves. I reworked the
patch from the submitter.
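The control flow described above can be sketched with toy state. Everything here (struct toy_ec, its fields, and both functions) is invented for illustration and is not the driver's code; the point is only that exactly one party re-enables the GPE on every path.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy EC state; 'gpe_enabled' models the GPE mask bit. */
struct toy_ec {
	bool gpe_enabled;
	bool handler_queued;	/* flag: the queued handler owns re-enabling */
};

/* Interrupt path: the GPE arrives masked and must end up re-enabled. */
static void
toy_ec_gpe_intr(struct toy_ec *ec, bool queue_handler)
{
	ec->gpe_enabled = false;
	if (queue_handler) {
		/* The handler will re-enable the GPE when it runs. */
		ec->handler_queued = true;
		return;
	}
	/* All other cases, including OBE/IBF events: re-enable here. */
	ec->gpe_enabled = true;
}

/* Deferred handler: does its work, then re-enables the GPE. */
static void
toy_ec_gpe_handler(struct toy_ec *ec)
{
	assert(ec->handler_queued);
	ec->handler_queued = false;
	ec->gpe_enabled = true;
}
```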
bms [Tue, 20 Mar 2007 00:36:10 +0000 (00:36 +0000)]
Implement reference counting for ifmultiaddr, in_multi, and in6_multi
structures. Detect when ifnet instances are detached from the network
stack and perform appropriate cleanup to prevent memory leaks.
This has been implemented in such a way as to be backwards ABI compatible.
Kernel consumers are changed to use if_delmulti_ifma(); in_delmulti()
is unable to detect interface removal by design, as it performs searches
on structures which are removed with the interface.
With this architectural change, the panics FreeBSD users have experienced
with carp and pfsync should be resolved.
Obtained from: p4 branch bms_netdev
Reviewed by: andre
Sponsored by: Garance A Drosehn
Idea from: NetBSD
MFC after: 1 month
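As a rough illustration of the reference-counting idea, the sketch below keeps a membership record alive after its interface goes away and frees it only when the last reference is dropped. The toy_ifma type and all four functions are hypothetical; the kernel structures and locking are not reproduced.

```c
#include <assert.h>
#include <stdlib.h>

/* Toy multicast membership record; names are illustrative only. */
struct toy_ifma {
	unsigned refcount;
	void *ifp;		/* NULL once the interface has detached */
};

static struct toy_ifma *
toy_ifma_alloc(void *ifp)
{
	struct toy_ifma *m = calloc(1, sizeof(*m));

	if (m != NULL) {
		m->refcount = 1;	/* the creator holds the first reference */
		m->ifp = ifp;
	}
	return (m);
}

static void
toy_ifma_ref(struct toy_ifma *m)
{
	m->refcount++;
}

/* Interface detach: the record survives but no longer points at it. */
static void
toy_ifma_detach(struct toy_ifma *m)
{
	m->ifp = NULL;
}

/* Drop one reference; returns 1 if the record was freed. */
static int
toy_ifma_rele(struct toy_ifma *m)
{
	assert(m->refcount > 0);
	if (--m->refcount == 0) {
		free(m);
		return (1);
	}
	return (0);
}
```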
jkim [Mon, 19 Mar 2007 23:17:39 +0000 (23:17 +0000)]
Revert a couple of changes from 1.51 and 1.52. Reading link status with BMSR
is okay for most of the chipsets but BCM5701 PHY does not seem to like it.
Set media to IFM_NONE if link is not up instead of the previous value.
brian [Mon, 19 Mar 2007 18:51:02 +0000 (18:51 +0000)]
When we write extended attributes, assert that the inode hasn't
already been deleted. The assertion is important to show that
we won't end up accounting for extended attribute blocks (using
fs_pendingblocks) in our subsequent call to fs_alloc().
bms [Mon, 19 Mar 2007 18:39:36 +0000 (18:39 +0000)]
Clean up the ether_input() path by using the M_PROMISC flag.
Main points of this change:
* Drop frames immediately if the interface is not marked IFF_UP.
* Always trim off the frame checksum if present.
* Always use M_VLANTAG in preference to passing 802.1Q frames
to consumers.
* Use __func__ consistently for KASSERT().
* Use the M_PROMISC flag to detect situations where ether_input()
may reenter itself on the same call graph with the same mbuf which
was promiscuously received on behalf of subsystems such as
netgraph, carp, and vlan.
* 802.1P frames (that is, VLAN frames with an ID of 0) will now be
passed to layer 3 input paths.
* Deal with the special case for CARP in a sane way.
This is a significant rewrite of code on the critical path. Please report
any issues to me if they arise. Frames will now only pass through dummynet
if M_PROMISC is cleared, to avoid problems with re-entry.
The handling of CARP needs to be revisited architecturally. The M_PROMISC
flag may potentially be demoted to a link-layer flag only as it is in
NetBSD, where the idea originated.
Discussed on: net
Idea from: NetBSD
Reviewed by: yar
MFC after: 1 month
andre [Mon, 19 Mar 2007 18:35:13 +0000 (18:35 +0000)]
Maintain a pointer and offset pair into the socket buffer mbuf chain to
avoid traversal of the entire socket buffer for larger offsets on stream
sockets.
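The idea of caching a pointer and offset pair can be shown on a plain linked list of chunks. This is a user-space sketch under simplifying assumptions: struct chunk stands in for an mbuf, the names sbuf_hint, ptr, ptroff and sb_locate() are invented, and nothing here handles data being dropped from the front of the buffer, which in a real socket buffer would require adjusting or resetting the cached pair.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for an mbuf: a link and a payload length. */
struct chunk {
	struct chunk *next;
	size_t len;
};

struct sbuf_hint {
	struct chunk *head;	/* first chunk in the buffer */
	struct chunk *ptr;	/* cached chunk, or NULL */
	size_t ptroff;		/* stream offset at which 'ptr' starts */
};

/*
 * Return the chunk containing stream offset 'off' and store the offset
 * within that chunk in '*inner'. Returns NULL if 'off' is past the end.
 */
static struct chunk *
sb_locate(struct sbuf_hint *sb, size_t off, size_t *inner)
{
	struct chunk *c = sb->head;
	size_t base = 0;

	/* Start from the cached position when it is not past the target. */
	if (sb->ptr != NULL && sb->ptroff <= off) {
		c = sb->ptr;
		base = sb->ptroff;
	}
	while (c != NULL && off >= base + c->len) {
		base += c->len;
		c = c->next;
	}
	if (c != NULL) {
		/* Remember where we ended up for the next lookup. */
		sb->ptr = c;
		sb->ptroff = base;
		*inner = off - base;
	}
	return (c);
}
```

For a stream socket that reads at steadily increasing offsets, each lookup then costs a few steps from the cached chunk instead of a walk from the head.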
rrs [Mon, 19 Mar 2007 11:11:16 +0000 (11:11 +0000)]
Adds a hash table to speed local address lookup
on a per VRF basis (BSD has only one VRF currently).
Hash table is sized to 16 but may need to be adjusted
for machines with large numbers of addresses.
Reviewed by: gnn
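A 16-bucket chained hash keyed on the address is one way such a per-VRF lookup could be structured. The sketch below is illustrative only: the toy_* types, the IPv4-only key, and the hash function are assumptions and do not reflect the SCTP code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define	TOY_ADDR_HASH_SIZE	16	/* must be a power of two */

struct toy_addr {
	struct toy_addr *next;
	uint32_t addr;
};

struct toy_vrf {
	struct toy_addr *bucket[TOY_ADDR_HASH_SIZE];
};

/* Fold the address and mask it down to a bucket index. */
static unsigned
toy_addr_hash(uint32_t addr)
{
	return ((addr ^ (addr >> 16)) & (TOY_ADDR_HASH_SIZE - 1));
}

/* Push the address onto the front of its bucket's chain. */
static void
toy_addr_insert(struct toy_vrf *vrf, struct toy_addr *a)
{
	unsigned h = toy_addr_hash(a->addr);

	a->next = vrf->bucket[h];
	vrf->bucket[h] = a;
}

/* Walk one bucket's chain instead of the whole address list. */
static struct toy_addr *
toy_addr_lookup(const struct toy_vrf *vrf, uint32_t addr)
{
	struct toy_addr *a;

	for (a = vrf->bucket[toy_addr_hash(addr)]; a != NULL; a = a->next)
		if (a->addr == addr)
			return (a);
	return (NULL);
}
```

With many local addresses the chains grow, which is why the bucket count may need to be raised on such machines.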
rrs [Mon, 19 Mar 2007 06:53:02 +0000 (06:53 +0000)]
- errno -> becomes error in sctp_output.c and sctputil.c
- SB_CLEAR macro defined and used for sb clearing.
- Fix for CMT express_sack_handling did not do proper
pseudo-cumack updates.
- Get rid of extraneous function that was never used ip_2_ip6_hdr()
- Fixed source address selection bug (initialization problem).
- Source address selection debug added.
rik [Sun, 18 Mar 2007 23:28:53 +0000 (23:28 +0000)]
Give a packet a chance to appear with the correct input interface
when multiple interfaces in the same bridge share the same MAC.
This commit does not solve the entire problem, only the case where the
packet arrived from such an interface.
PR: kern/109815
MFC after: 7 days
Submitted by: Eygene Ryabinkin and rik@
Discussed with: bms@, thompsa@, yar@
njl [Sun, 18 Mar 2007 01:03:03 +0000 (01:03 +0000)]
Disable burst mode by default. Testing has shown that while it works on
most systems, it causes the EC not to respond for some Acer and Compaq/HP
laptops. This is the default value for Linux also. For systems that need
it, burst mode can be enabled via the tunable/sysctl:
debug.acpi.ec.burst="1"