Randall Stewart [Thu, 3 Feb 2011 19:22:21 +0000 (19:22 +0000)]
1) Move per John Baldwin to mp_maxid
2) Some signed/unsigned errors found by Mac OS compiler (from Michael)
3) a couple of copyright updates on the effected files.
The FDT describes the host controller directly. There's no need to
get properties from the parent. The parent is in fact the FDT bus
itself and will therefore not have the properties we're looking
for.
Alan Cox [Thu, 3 Feb 2011 14:42:46 +0000 (14:42 +0000)]
Eliminate unnecessary page hold_count checks. These checks predate
r90944, which introduced a general mechanism for handling the freeing
of held pages.
Randall Stewart [Thu, 3 Feb 2011 11:52:22 +0000 (11:52 +0000)]
Fix the per CPU stats so that:
1) They don't use the giant "MAX_CPU" define and instead
are allocated dynamically based on mp_ncpus
2) Will zero with the netstat -z -s -p sctp
3) Will be properly handled by both the sctp_init and finish
(the multi-net stuff was incorrectly bzero'ing in sctp_init
the wrong size.. the bzero is now moved to the right places).
And of course the free is put in at the very end.
Setup another socketpair between parent and child, so that primary sandboxed
worker can ask the main privileged process to connect in worker's behalf
and then we can migrate descriptor using this socketpair to worker.
This is not really needed now, but will be needed once we start to use
capsicum for sandboxing.
Randall Stewart [Thu, 3 Feb 2011 10:05:30 +0000 (10:05 +0000)]
Adds an experimental option to create a pool of
threads. These serve as input threads and are queued
packets based on the V-tag number. This is similar to
what a modern card can do with queue's for TCP... but
alas modern cards know nothing about SCTP.
Ed Maste [Thu, 3 Feb 2011 02:14:53 +0000 (02:14 +0000)]
Revert part of r173264. Both aac_ioctl_sendfib and aac_ioctl_send_raw_srb
make use of the aac_ioctl_event callback, if aac_alloc_command fails. This
can end up in an infinite loop in the while loop in aac_release_command.
Further investigation into the issue mentioned by Scott Long [1] will be
necessary.
Put the general logic for being a CPU hog into a new function
should_yield(). Use this in various places. Encapsulate the common
case of check-and-yield into a new function maybe_yield().
Change several checks for a magic number of iterations to use
should_yield() instead.
- Rename proto_descriptor_{send,recv}() functions to
proto_connection_{send,recv} and change them to return proto_conn
structure. We don't operate directly on descriptors, but on
proto_conns.
- Add wrap method to wrap descriptor with proto_conn.
- Remove methods to send and receive descriptors and implement this
functionality as additional argument to send and receive methods.
Add proto_connect_wait() to wait for connection to finish.
If timeout argument to proto_connect() is -1, then the caller needs to use
this new function to wait for connection.
This change is in preparation for capsicum, where sandboxed worker wants
to ask main process to connect in worker's behalf and pass descriptor
to the worker. Because we don't want the main process to wait for the
connection, it will start async connection and pass descriptor to the
worker who will be responsible for waiting for the connection to finish.
Remove OpenSolaris include path referring to a non-existing directory
never committed from p4 dtrace branch.
[The correct include path is referenced from every opensolaris compat
consumer's module Makefile, so it doesn't serve any purpose anyway.]
Reported by: arundel on freebsd-hackers@ via clang
Approved by: kib (mentor)
MFC after: 1 week
Randall Stewart [Wed, 2 Feb 2011 11:13:23 +0000 (11:13 +0000)]
1) Allow a chunk to track the cwnd it was at when sent.
2) Add separate max-bursts for retransmit and hb. These
are set to sysctlable values but not settable via the
socket api. This makes sure we don't blast out HB's or
fast-retransmits.
3) Determine on the first data transmission on a net if
its local-lan (by being under or over a RTT). This
can later be used to think about different algorithms
based on locallan vs big-i (experimental)
4) The cwnd should NOT be allowed to grow when an ECNEcho
is seen (TCP has this same bug). We fix this in SCTP
so an ECNe being seen prevents an advance of cwnd.
5) CWR's should not be sent multiple times to the
same network, instead just updating the TSN being
transmitted if needed.
John Baldwin [Tue, 1 Feb 2011 18:21:45 +0000 (18:21 +0000)]
- Set the next_alloc fields for an i-node after allocating a new block
so that future allocations start with most recently allocated block
rather than the beginning of the filesystem.
- Fix ext2_alloccg() to properly scan for 8 block chunks that are not
aligned on 8-bit boundaries. Previously this was causing new blocks
to be allocated in a highly fragmented fashion (block 0 of a file at
lbn N, block 1 at lbn N + 8, block 2 at lbn N + 16, etc.).
- Cosmetic tweaks to the currently-disabled fancy realloc sysctls.
PR: kern/153584
Discussed with: bde
Tested by: Pedro F. Giffuni giffunip at yahoo, Zheng Liu (lz)
John Baldwin [Tue, 1 Feb 2011 15:48:27 +0000 (15:48 +0000)]
Output an appropriate amount of padding to line up per-CPU state columns
rather than using a terminal sequence to move the cursor when drawing the
initial screen.
Martin Matuska [Tue, 1 Feb 2011 14:28:50 +0000 (14:28 +0000)]
For ZFS, change the type of clock_t to int64_t.
The clock_t type in OpenSolaris is long (int64_t on amd64).
On FreeBSD clock_t is int32_t. The clock_t type is used in several places
in the ZFS code to store system uptime in milliseconds ("seconds * hz").
With hz=1000 we have a 32-bit integer overflow in 24 days, 20 hours,
31 minutes and 23.648 seconds. This has a user reported negative impact
on l2arc_feed_thread() and may cause unexpected results from other functions
using clock_t.
The unp_gc() function drops and reaquires lock between scan and
collect phases. The unp_discard() function executes
unp_externalize_fp(), which might make the socket eligible for gc-ing,
and then, later, taskqueue will close the socket. Since unp_gc()
dropped the list lock to do the malloc, close might happen after the
mark step but before the collection step, causing collection to not
find the socket and miss one array element.
I believe that the race was there before r216158, but the stated
revision made the window much wider by postponing the close to
taskqueue sometimes.
Only process as much array elements as we find the sockets during
second phase of gc [1]. Take linkage lock and recheck the eligibility
of the socket for gc, as well as call fhold() under the linkage lock.
Reported and tested by: jmallett
Submitted by: jmallett [1]
Reviewed by: rwatson, jeff (possibly)
MFC after: 1 week
Algorithm modules can define their own private congestion signal types in the
top 8 bits of the 32 bit signal bit field space for internal use. These private
signals should not be leaked outside of a module.
Given that many algorithm modules use the NewReno hook functions to simplify
their implementation, the obvious place such a leak would show up is in the
NewReno cong_signal hook function.
- Show the full number of significant bits in the signal type definitions in
<netinet/cc.h>.
- Add a bitmask to simplify figuring out if a given signal is in the private or
public bit range.
- Add a sanity check in newreno_cong_signal() to ensure private signals are not
being leaked into the hook function.
Sponsored by: FreeBSD Foundation
Discussed with: David Hayes <dahayes at swin edu au>
MFC after: 1 week
X-MFC with: r215166
Adrian Chadd [Tue, 1 Feb 2011 08:03:01 +0000 (08:03 +0000)]
Include some preliminary TX HT rate scenario setup code.
The AR5416 and later TX descriptors have new fields for supporting
11n bits (eg 20/40mhz mode, short/long GI) and enabling/disabling
RTS/CTS protection per rate.
These functions will be responsible for initialising the TX descriptors
for the AR5416 and later chips for both HT and legacy frames.
Beacon frames will remain using the non-11n TX descriptor setup for now;
Linux ath9k does much the same.
Note that these functions aren't yet used anywhere; a few more framework
changes are needed before all of the right rate information is available
for TX.
Import an implementation of the CAIA-Hamilton-Delay (CHD) congestion control
algorithm described in the paper "Improved coexistence and loss tolerance for
delay based TCP congestion control" by Hayes and Armitage. It is implemented as
a kernel module compatible with the recently committed modular congestion
control framework.
CHD enhances the approach taken by the Hamilton-Delay (HD) algorithm to provide
tolerance to non-congestion related packet loss and improvements to coexistence
with loss-based congestion control algorithms. A key idea in improving
coexistence with loss-based congestion control algorithms is the use of a shadow
window, which attempts to track how NewReno's congestion window (cwnd) would
evolve. At the next packet loss congestion event, CHD uses the shadow window to
correct cwnd in a way that reduces the amount of unfairness CHD experiences when
competing with loss-based algorithms.
In collaboration with: David Hayes <dahayes at swin edu au> and
Grenville Armitage <garmitage at swin edu au>
Sponsored by: FreeBSD Foundation
Reviewed by: bz and others along the way
MFC after: 3 months
Adrian Chadd [Tue, 1 Feb 2011 06:59:44 +0000 (06:59 +0000)]
* Add a rather hacky "does this speak the 11n TX descriptor format"
function; which will be later used by the TX path to determine
whether to use the extended features or not.
* Break out the descriptor chaining logic into a separate function;
again so it can be switched out later on for the 11n version when
needed.
* Refactor out the encryption-swizzling code that's common in the
raw and normal TX path.
Import a clean-room implementation of the Hamilton-Delay (HD) congestion control
algorithm based on the paper "A strategy for fair coexistence of loss and
delay-based congestion control algorithms" by Budzisz, Stanojevic, Shorten and
Baker. It is implemented as a kernel module compatible with the recently
committed modular congestion control framework.
HD uses a probabilistic approach to reacting to delay-based congestion. The
probability of reducing cwnd is zero when the queuing delay is very small,
increasing to a maximum at a set threshold, then back down to zero again when
the queuing delay is high. Normal operation keeps the queuing delay below the
set threshold. However, since loss-based congestion control algorithms push the
queuing delay high when probing for bandwidth, having the probability of
reducing cwnd drop back to zero for high delays allows HD to compete with
loss-based algorithms.
In collaboration with: David Hayes <dahayes at swin edu au> and
Grenville Armitage <garmitage at swin edu au>
Sponsored by: FreeBSD Foundation
Reviewed by: bz and others along the way
MFC after: 3 months
Import a clean-room implementation of the VEGAS congestion control algorithm
based on the paper "TCP Vegas: end to end congestion avoidance on a global
internet" by Brakmo and Peterson. It is implemented as a kernel module
compatible with the recently committed modular congestion control framework.
VEGAS uses network delay as a congestion indicator and unlike regular loss-based
algorithms, attempts to keep the network operating with stable queuing delays
and no congestion losses. By keeping network buffers used along the path within
a set range, queuing delays are kept low while maintaining high throughput.
In collaboration with: David Hayes <dahayes at swin edu au> and
Grenville Armitage <garmitage at swin edu au>
Sponsored by: FreeBSD Foundation
Reviewed by: bz and others along the way
MFC after: 3 months
Adrian Chadd [Tue, 1 Feb 2011 03:51:35 +0000 (03:51 +0000)]
Add a new capability which reports the number of spatial streams a device supports.
The higher levels (net80211, if_ath, ath_rate) need this to make correct
choices about what MCS capabilities to advertise and what MCS rates are
able to be TXed.
Pyun YongHyeon [Mon, 31 Jan 2011 20:00:43 +0000 (20:00 +0000)]
alc_rev was used without initialization such that it failed to
apply AR8152 v1.0 specific initialization code. Fix this bug by
explicitly reading PCI device revision id via PCI accessor.
Reported by: Gabriel Linder ( linder.gabriel <> gmail dot com )
Implement two new functions for sending descriptor and receving descriptor
over UNIX domain sockets and socket pairs.
This is in preparation for capsicum.
- Use pjdlog for assertions and aborts as this will log assert/abort message
to syslog if we run in background.
- Asserts in proto.c that method we want to call is implemented and remove
dummy methods from protocols implementation that are only there to abort
the program with nice message.
Adrian Chadd [Mon, 31 Jan 2011 15:42:42 +0000 (15:42 +0000)]
Don't incorrectly set the burst duration setting in the TX descriptor.
After inspecting the ath9k source, it seems the AR5416 and later MACs
don't take an explicit RTS/CTS duration. A per-scenario (ie, what multi-
rate retry became) rts/cts control flag and packet duration is provided;
the hardware then apparently fills in whatever details are required.
The per-rate sp/lpack duration calculation just isn't used anywhere
in the ath9k TX packet length calculations.
The burst duration register controls something different; it seems to
be involved with RTS/CTS protection of 11n aggregate frames and is set
via a call to ar5416Set11nBurstDuration().
I've done some light testing with rts/cts protected frames and nothing
seems to break; but this may break said RTS/CTS and CTS-to-self protection.
Warner Losh [Mon, 31 Jan 2011 15:17:47 +0000 (15:17 +0000)]
Move the architecture guessing from Makefile.inc1 to Makefile. We
need to do this because variables specified on the command line
override those specified in the Makefile. This is why we also moved
from TARGET to _TARGET in Makefile, and then set TARGET on the command
line when we fork a submake with Makefile.inc1.
This makes mips/mips work again, even without the workaround committed to
lib/libc/Makefile.
Randall Stewart [Mon, 31 Jan 2011 11:50:11 +0000 (11:50 +0000)]
More ECN fixes:
1) We now remove ECN-Nonce since it will no longer continue as a I-D
2) Eliminate last_tsn_echo, this tied us to an assoc not the net
and thus we were not doing m-homing on the ECN-Echo senders side right.
3) Increment the count going out even if the TSN in lower in the pending
ECN-Echo, this way the receiver knows exactly how many packets were
marked even with network re-ordering
4) Fix so we DO NOT stop doing delayed sack if a ECN Echo is in queue
MFC after: 1 month
Bjoern A. Zeeb [Mon, 31 Jan 2011 00:09:52 +0000 (00:09 +0000)]
Update interface stats counters to match the current format in linux and
try to export as much information as we can match.
Requested on: Debian GNU/kFreeBSD list (debian-bsd lists.debian.org) 2010-12
Tested by: Mats Erik Andersson (mats.andersson gisladisker.se)
MFC after: 10 days
Make ldd(1) work when versioned dependency file is cannot be loaded.
Instead of aborting in locate_dependency(), propagate the error to
caller. The rtld startup function does the right thing with an error
from rtld_verify_versions(), depending on the mode of operation.
Reported by: maho
In collaboration with: kan
MFC after: 1 week
Bernhard Schmidt [Sun, 30 Jan 2011 14:22:45 +0000 (14:22 +0000)]
Fix the 'scan hang' issue.
When requesting a scan and one is already in progess, e.g. while in scan
state, we happily wait for a scan done notification. Though, this
notification might never be sent, e.g. if we are trying to find a network
to associate to and there is none. Instead of always waiting for a
notification just do so if a new scan has been started. For both cases the
scan cache is used to report available networks even if the content might
not be fresh.
Bernhard Schmidt [Sun, 30 Jan 2011 14:00:50 +0000 (14:00 +0000)]
Change return code semantics of start_scan_locked(). Instead of reporting
if a scan is running, report if a scan has been started. The return value
itself is not (yet) used anywhere in the tree and it is also not exported
to userspace.
Bernhard Schmidt [Sun, 30 Jan 2011 13:17:45 +0000 (13:17 +0000)]
When doing a scan while being associated it is possible that the scan
is deferred for the time it takes to flush the TX queue. This work being
done the scan then is continued, but only if it is marked to do so. As
the 'ifconfig scan' request is meant to be used after the interface is
brought up, request a background scan by default. This behaviour is
already documented in manual page.
This fixes on possible case where 'ifconfig scan' hangs infinitely.
Marcel Moolenaar [Sat, 29 Jan 2011 21:14:29 +0000 (21:14 +0000)]
Don't operate on the parent of the PCI node. It's the PCI node itself
that represents the host controller. This makes the FDT PCI support
working an a bare-bones manner. This needs a lot more work, of which
the beginning are at the end of the file, compiled-out with #if 0.
The intend being that both the Marvell PCIE and Freescale PCI/PCIX/PCIE
duplicate the same platform-independent domain initialization, that
should be moved into an unified implementation in the FDT code. Handling
of resources requires help from the platform. A unified implementation
allows us to properly support PCI devices listed in the device tree and
configured according to the device tree specification.
Marcel Moolenaar [Sat, 29 Jan 2011 20:58:38 +0000 (20:58 +0000)]
Fix the interrupt code, broken 7 months ago. The interrupt framework
already supported nested PICs, but was limited to having a nested
AT-PIC only. With G5 support the need for nested OpenPIC controllers
needed to be added. This was done the wrong way and broke the MPC8555
eval system in the process.
OFW, as well as FDT, describe the interrupt routing in terms of a
controller and an interrupt pin on it. This needs to be mapped to a
flat and global resource: the IRQ. The IRQ is the same as the PCI
intline and as such needs to be representable in 8 bits. Secondly,
ISA support pretty much dictates that IRQ 0-15 should be reserved
for ISA interrupts, because of the internal workins of south bridges.
Both were broken.
This change reverts revision 209298 for a big part and re-implements
it simpler. In particular:
o The id() method of the PIC I/F is removed again. It's not needed.
o The openpic_attach() function has been changed to take the OFW
or FDT phandle of the controller as a second argument. All bus
attachments that previously used openpic_attach() as the attach
method of the device I/F now implement as bus-specific method
and pass the phandle_t to the renamed openpic_attach().
o Change powerpc_register_pic() to take a few more arguments. In
particular:
- Pass the number of IPIs specificly. The number of IRQs carved
out for a PIC is the sum of the number of int. pins and IPIs.
- Pass a flag indicating whether the PIC is an AT-PIC or not.
This tells the interrupt framework whether to assign IRQ 0-15
or some other range.
o Until we implement proper multi-pass bus enumeration, we have to
handle the case where we need to map from PIC+pin to IRQ *before*
the PIC gets registered. This is done in a similar way as before,
but rather than carving out 256 IRQs per PIC, we carve out 128
IRQs (124 pins + 4 IPIs). This is supposed to handle the G5 case,
but should really be fixed properly using multiple passes.
o Have the interrupt framework set root_pic in most cases and not
put that burden in PIC drivers (for the most part).
o Remove powerpc_ign_lookup() and replace it with powerpc_get_irq().
Remove IGN_SHIFT, INTR_INTLINE and INTR_IGN.
Related to the above, fix the Freescale PCI controller driver, broken
by the FDT code. Besides not attaching properly, bus numbers were
assigned improperly and enumeration was broken in general. This
prevented the AT PIC from being discovered and interrupt routing to
work properly. Consequently, the ata(4) controller stopped functioning.
Fix the driver, and FDT PCI support, enough to get the MPC8555CDS
going again. The FDT PCI code needs a whole lot more work.
No breakages are expected, but lackiong G5 hardware, it's possible
that there are unpleasant side-effects. At least MPC85xx support is
back to where it was 7 months ago -- it's amazing how badly support
can be broken in just 7 months...