Alexander Motin [Fri, 3 Jun 2011 13:49:18 +0000 (13:49 +0000)]
Update disk's stripesize and stripeoffset parameters on provider open.
They are media-dependent and may change in run-time, same as sectorsize
and/or mediasize.
SCSI devices return physical sector size and offset via READ CAPACITY(16)
command and so can not report it until media inserted or at least until
probe sequence completed. UNMAP support is also reported there.
Ruslan Ermilov [Fri, 3 Jun 2011 13:45:11 +0000 (13:45 +0000)]
Don't use col(1) since grotty(1) never outputs reverse line feeds,
and because col(1) mangles ANSI color escapes if enabled. Spaces
to tabs compression is now done by passing -h to grotty(1).
Ruslan Ermilov [Fri, 3 Jun 2011 12:02:53 +0000 (12:02 +0000)]
Re-enable SGR support (ANSI color escapes) in grotty(1) by default.
Our man(1) and bsd.doc.mk still disable it for POLA reasons via the
-c option to grotty(1).
Ruslan Ermilov [Fri, 3 Jun 2011 11:58:17 +0000 (11:58 +0000)]
Don't pass -o1- to groff(1) by default. If ms(7) formatted document
uses the .RP macro, a separate cover page is created as page 0 which
is not otherwise output. The bug was hiding by a hack in troffrc
that disables SGR support in grotty(1), which I'm going to remove now.
For POLA reasons, still disable SGR support in grotty(1), by passing
-P-c to groff(1). If we want SGR sequences in these documents, this
can be removed.
Alexander Motin [Fri, 3 Jun 2011 07:25:36 +0000 (07:25 +0000)]
Increase maximum supported number of ranges per TRIM command from 256 to 512
to use full potential of Intel X25-M SSDs. On synthetic test with 32K ranges
it gives about 20% speedup, which probably costs more then 2K of RAM.
Ruslan Ermilov [Fri, 3 Jun 2011 05:16:33 +0000 (05:16 +0000)]
Added support for the MANWIDTH environment variable:
If set to a numeric value, used as the width manpages should be
displayed. Otherwise, if set to a special value ``tty'', and
output is to a terminal, the pages may be displayed over the
whole width of the screen.
Use stripesize and stripeoffset in the automatic calculation of
partition offsets. If user requests specific alignment and
provider's stripesize is not zero, then use a least common multiple
from the stripesize and user specified value.
Also fix "gpart resize" implementation: do not try to align the partition
size, because the start offset may be not aligned. Instead align the
end offset and then calculate size. Also use stripesize and stripeoffset
for "gpart resize" command.
Alexander Motin [Thu, 2 Jun 2011 20:56:42 +0000 (20:56 +0000)]
When possible, join ranges of subsequest BIO_DELETE requests to handle more
(up to 2048 instead of 256 or even 64) of them with single TRIM request.
OCZ Vertex2/Vertex3 SSDs can handle no more then 64 ranges per TRIM request.
Due to lack of BIO_DELETE clustering now, it means that we could delete no
more then 2MB per request (on FS with 32K block) with limited request rate.
This change increases delete rate on Vertex2 from 250MB/s to 950MB/s.
Rick Macklem [Thu, 2 Jun 2011 20:15:32 +0000 (20:15 +0000)]
Fix the nfs related daemons so that they don't intermittently
fail with "bind: address already in use". This problem was reported
to the freebsd-stable@ mailing list on Feb. 19 under the subject
heading "statd/lockd startup failure" by george+freebsd at m5p dot com.
The problem is that the first combination of {udp,tcp X ipv4,ipv6}
would select a port# dynamically, but one of the other three combinations
would have that port# already in use. The patch is somewhat involved
because it was requested by dougb@ that the four combinations use the
same port# wherever possible. The patch splits the create_service()
function into two functions. The first goes as far as bind(2) in a
loop for up to GETPORT_MAXTRY - 1 times, attempting to use the same port#
for all four cases. If these attempts fail, the last attempt allows
the 4 cases to use different port #s. After this function has succeeded,
the second function, called complete_service(), does the rest of what
create_service() did.
The three daemons mountd, rpc.lockd and rpc.statd all have a
create_service() function that is patched in a similar way. However,
create_service() has non-trivial differences for the three daemons
that made it impractical to share the same functions between them.
Rick Macklem [Thu, 2 Jun 2011 19:49:47 +0000 (19:49 +0000)]
Fix the nfs related daemons so that they don't intermittently
fail with "bind: address already in use". This problem was reported
to the freebsd-stable@ mailing list on Feb. 19 under the subject
heading "statd/lockd startup failure" by george+freebsd at m5p dot com.
The problem is that the first combination of {udp,tcp X ipv4,ipv6}
would select a port# dynamically, but one of the other three combinations
would have that port# already in use. The patch is somewhat involved
because it was requested by dougb@ that the four combinations use the
same port# wherever possible. The patch splits the create_service()
function into two functions. The first goes as far as bind(2) in a
loop for up to GETPORT_MAXTRY - 1 times, attempting to use the same port#
for all four cases. If these attempts fail, the last attempt allows
the 4 cases to use different port #s. After this function has succeeded,
the second function, called complete_service(), does the rest of what
create_service() did.
The three daemons mountd, rpc.lockd and rpc.statd all have a
create_service() function that is patched in a similar way. However,
create_service() has non-trivial differences for the three daemons
that made it impractical to share the same functions between them.
Rick Macklem [Thu, 2 Jun 2011 19:33:33 +0000 (19:33 +0000)]
Fix the nfs related daemons so that they don't intermittently
fail with "bind: address already in use". This problem was reported
to the freebsd-stable@ mailing list on Feb. 19 under the subject
heading "statd/lockd startup failure" by george+freebsd at m5p dot com.
The problem is that the first combination of {udp,tcp X ipv4,ipv6}
would select a port# dynamically, but one of the other three combinations
would have that port# already in use. The patch is somewhat involved
because it was requested by dougb@ that the four combinations use the
same port# wherever possible. The patch splits the create_service()
function into two functions. The first goes as far as bind(2) in a
loop for up to GETPORT_MAXTRY - 1 times, attempting to use the same port#
for all four cases. If these attempts fail, the last attempt allows
the 4 cases to use different port #s. After this function has succeeded,
the second function, called complete_service(), does the rest of what
create_service() did.
The three daemons mountd, rpc.lockd and rpc.statd all have a
create_service() function that is patched in a similar way. However,
create_service() has non-trivial differences for the three daemons
that made it impractical to share the same functions between them.
Temporarily back out those parts of r222613 related to parsing the memory
map. They cause non-understood boot failures on some Apple machines with
more than 2 GB of RAM (like my work desktop).
The POWER7 has only 32 SLB slots instead of 64, like other supported
64-bit PowerPC CPUs. Add infrastructure to support variable numbers of
SLB slots and move the user slot from 63 to 0, so that it is always
available.
Bjoern A. Zeeb [Thu, 2 Jun 2011 14:25:27 +0000 (14:25 +0000)]
Write the multi step netconfig to a temporary file and only move that
to the final name if netconfig was completely finished. This fixes
reentrance problems even better than r222611.
Suggested by: nwhitehorn
Reviewed by: nwhitehorn
Sponsored by: The FreeBSD Foundation
Sponsored by: iXsystems
If running under a hypervisor, don't yell at the user about starting
unknown CPU types, instead relying on the hypervisor to have given us a
reasonable environment.
Include the modules area in the mapped kernel code. This fixes the kernel's
access to modules and loader metadata when started from real mode, but
without a direct map.
MFpseries:
Renovate and improve the AIM Open Firmware support:
- Add RTAS (Run-Time Abstraction Services) support, found on all IBM systems
and some Apple ones
- Improve support for 32-bit real mode Open Firmware systems
- Pull some more OF bits over from the AIM directory
- Fix memory detection on IBM LPARs and systems with more than one /memory
node (by andreast@)
Bjoern A. Zeeb [Thu, 2 Jun 2011 14:08:50 +0000 (14:08 +0000)]
Empty the network configuration only after the user decided to pick an
interface. Otherwise an accidental start of the netowrk configuration
and immediate cancel after the install has finished removes the previously
configured settings.
Discussed with: nwhitehorn
Sponsored by: The FreeBSD Foundation
Sponsored by: iXsystems
Ed Maste [Thu, 2 Jun 2011 00:43:16 +0000 (00:43 +0000)]
There are a couple of structs in mfireg.h with members named 'class'.
These cause problems when trying to include the header in a C++ project.
Rename them to 'evt_class', and track the change in mfi and mfiutil.
Jack F Vogel [Thu, 2 Jun 2011 00:34:57 +0000 (00:34 +0000)]
First off: update the driver README, the old one was horribly
crusty, and this still isn't perfect, but its at least a bit
more recent.
Secondly, a few improvements to the driver from Andrew Boyer,
support hint to allow devices to not attach, add VLAN_HWTSO
capability so vlans can use TSO, fix in the interrupt handler
to make sure the stack TX queue is processed. Oh, and also
make sure IPv6 does not cause a re-init in the ioctl routine.
Thanks for your efforts Andrew!
Thanks to Claudio Jeker for noticing the ixgbe_xmit() routine
was not correctly swapping the dma map from the first to the
last descriptor in a multi-descriptor transmission, corrected
this.
In the VOP_PUTPAGES() implementations, change the default error from
VM_PAGER_AGAIN to VM_PAGER_ERROR for the uwritten pages. Return
VM_PAGER_AGAIN for the partially written page. Always forward at least
one page in the loop of vm_object_page_clean().
VM_PAGER_ERROR causes the page reactivation and does not clear the
page dirty state, so the write is not lost.
The change fixes an infinite loop in vm_object_page_clean() when the
filesystem returns permanent errors for some page writes.
Adrian Chadd [Wed, 1 Jun 2011 20:09:49 +0000 (20:09 +0000)]
Flesh out the radar detection related operations for the ath driver.
This is in no way a complete DFS/radar detection implementation!
It merely creates an abstracted interface which allows for future
development of the DFS radar detection code.
Note: Net80211 already handles the bulk of the DFS machinery,
all we need to do here is figure out that a radar event has occured
and inform it as such. It then drives the DFS state engine for us.
The "null" DFS radar detection module is included by default;
it doesn't require a device line.
This commit:
* Adds a simple abstracted layer for radar detection state -
sys/dev/ath/ath_dfs/;
* Implements a null DFS module which doesn't do anything;
(ie, implements the exact behaviour at the moment);
* Adds hooks to the ath driver to process received radar events
and gives the DFS module a chance to determine whether
a radar has been detected.
Adrian Chadd [Wed, 1 Jun 2011 20:01:02 +0000 (20:01 +0000)]
Add some missing DFS chipset functionality to the FreeBSD HAL.
Please note - this doesn't in any way constitute a full DFS
implementation, it merely adds the relevant capability bits and
radar detection threshold register access.
The particulars:
* Add new capability bits outlining what the DFS capabilities
are of the various chipsets.
* Add HAL methods to set and get the radar related register values.
* Add AR5212 and AR5416+ DFS radar related register value
routines.
* Add a missing HAL phy error code that's related to radar event
processing.
* Add HAL_PHYERR_PARAM, a data type that encapsulates the radar
register values.
The AR5212 routines are just for completeness. The AR5416 routines
are a super-set of those; I may later on do a drive-by pass to
tidy up duplicate code.
Robert Watson [Wed, 1 Jun 2011 20:00:25 +0000 (20:00 +0000)]
Add an optional netisr dispatch point at ether_input(), but set the
default dispatch method to NETISR_DISPATCH_DIRECT in order to force
direct dispatch. This adds a fairly negligble overhead without
changing default behavior, but in the future will allow deferred or
hybrid dispatch to other worker threads before link layer processing
has taken place.
For example, this could allow redistribution using RSS hashes
without ethernet header cache line hits, if the NIC was unable to
adequately implement load balancing to too small a number of input
queues -- perhaps due to hard queueset counts of 1, 3, or 8, but in
a modern system with 16-128 threads. This can happen on highly
threaded systems, where you want want an ithread per core,
redistributing work to other queues, but also on virtualised systems
where hardware hashing is (or is not) available, but only a single
queue has been directed to one VCPU on a VM.
Note: this adds a previously non-present assertion about the
equivalence of the ifnet from which the packet is received, and the
ifnet stamped in the mbuf header. I believe this assertion to
generally be true, but we'll find out soon -- if it's not, we might
have to add additional overhead in some cases to add an m_tag with
the originating ifnet pointer stored in it.
O_FORWARD_IP is only action which depends from the result of lookup of
dynamic rules. We are doing forwarding in the following cases:
o For the simple ipfw fwd rule, e.g.
fwd 10.0.0.1 ip from any to any out xmit em0
fwd 127.0.0.1,3128 tcp from any to any 80 in recv em1
o For the dynamic fwd rule, e.g.
fwd 192.168.0.1 tcp from any to 10.0.0.3 3333 setup keep-state
When this rule triggers it creates a dynamic rule, but this
dynamic rule should forward packets only in forward direction.
o And the last case that does not work before - simple fwd rule which
triggers when some dynamic rule is already executed.
Pyun YongHyeon [Wed, 1 Jun 2011 18:42:44 +0000 (18:42 +0000)]
Poke correct GPIO pins for newer axe(4) controllers with Marvell
PHY. Newer models seem to use different LED mode that requires
enabling both GPIO1 and GPIO2.
Jayachandran C. [Wed, 1 Jun 2011 10:23:03 +0000 (10:23 +0000)]
Add .interp back into INITIAL_READONLY_SECTIONS in MIPS n64 ABI.
The binutils update in r218822 caused the MIPS n64 dynamic binaries to
fail because the ".interp" section is not in the initial sections.
This happens because elf64bmip-defs.sh overrides INITIAL_READONLY_SECTIONS
to add ".MIPS.options" sections instead of the ".reginfo" section used
by n32.
This used to work fine, but after r218822, INITIAL_READONLY_SECTIONS also
contains the .interp section, so the override has to be done differently.
Kenneth D. Merry [Tue, 31 May 2011 22:39:32 +0000 (22:39 +0000)]
Fix a bug introduced in revision 222537.
In msgbuf_reinit() and msgbuf_init(), we weren't initializing the mutex.
Depending on the contents of memory, the LO_INITIALIZED flag might be
set on the mutex (either due to a warm reboot, and the message buffer
remaining in place, or due to garbage in memory) and in that case, with
INVARIANTS turned on, we would trigger an assertion that the mutex had
already been initialized.
Fix this by bzeroing the message buffer mutex for the _init() and _reinit()
paths.
Pyun YongHyeon [Tue, 31 May 2011 18:45:15 +0000 (18:45 +0000)]
If driver is not running, disable interrupts and do not try to
process received frames. Previously it was possible to handle RX
interrupts even if controller is not fully initialized. This
resulted in non-working driver after system is up and running.
Rick Macklem [Tue, 31 May 2011 18:27:18 +0000 (18:27 +0000)]
Add a sentence to the umount.8 man page to clarify the behaviour
for forced dismount when used on an NFS mount point. Requested by
Jeremy Chadwick.
This is a content change.
Rick Macklem [Tue, 31 May 2011 17:43:25 +0000 (17:43 +0000)]
Fix the new NFS client so that it doesn't do an NFSv3
Pathconf RPC for cases where the reply doesn't include
the answer. This fixes a problem reported by avg@ where
the NFSv3 Pathconf RPC would fail when "ls -l" did an
lpathconf(2) for _PC_ACL_NFS4.
Kenneth D. Merry [Tue, 31 May 2011 17:29:58 +0000 (17:29 +0000)]
Fix apparent garbage in the message buffer.
While we have had a fix in place (options PRINTF_BUFR_SIZE=128) to fix
scrambled console output, the message buffer and syslog were still getting
log messages one character at a time. While all of the characters still
made it into the log (courtesy of atomic operations), they were often
interleaved when there were multiple threads writing to the buffer at the
same time.
This fixes message buffer accesses to use buffering logic as well, so that
strings that are less than PRINTF_BUFR_SIZE will be put into the message
buffer atomically. So now dmesg output should look the same as console
output.
subr_msgbuf.c: Convert most message buffer calls to use a new spin
lock instead of atomic variables in some places.
Add a new routine, msgbuf_addstr(), that adds a
NUL-terminated string to a message buffer. This
takes a priority argument, which allows us to
eliminate some races (at least in the the string
at a time case) that are present in the
implementation of msglogchar(). (dangling and
lastpri are static variables, and are subject to
races when multiple callers are present.)
msgbuf_addstr() also allows the caller to request
that carriage returns be stripped out of the
string. This matches the behavior of msglogchar(),
but in testing so far it doesn't appear that any
newlines are being stripped out. So the carriage
return removal functionality may be a candidate
for removal later on if further analysis shows
that it isn't necessary.
subr_prf.c: Add a new msglogstr() routine that calls
msgbuf_logstr().
Rename putcons() to putbuf(). This now handles
buffered output to the message log as well as
the console. Also, remove the logic in putcons()
(now putbuf()) that added a carriage return before
a newline. The console path was the only path that
needed it, and cnputc() (called by cnputs())
already adds a carriage return. So this
duplication resulted in kernel-generated console
output lines ending in '\r''\r''\n'.
Refactor putchar() to handle the new buffering
scheme.
Add buffering to log().
Change log_console() to use msglogstr() instead of
msglogchar(). Don't add extra newlines by default
in log_console(). Hide that behavior behind a
tunable/sysctl (kern.log_console_add_linefeed) for
those who would like the old behavior. The old
behavior led to the insertion of extra newlines
for log output for programs that print out a
string, and then a trailing newline on a separate
write. (This is visible with dmesg -a.)
msgbuf.h: Add a prototype for msgbuf_addstr().
Add three new fields to struct msgbuf, msg_needsnl,
msg_lastpri and msg_lock. The first two are needed
for log message functionality previously handled
by msglogchar(). (Which is still active if
buffering isn't enabled.)
Include sys/lock.h and sys/mutex.h for the new
mutex.
John Baldwin [Tue, 31 May 2011 15:41:10 +0000 (15:41 +0000)]
- Document the -H option and 'H' key alongside other options and keys
rather than at the bottom of the manpage.
- Remove an obsolete comment about SWAIT being a stale state. It was
resurrected for a different purpose in FreeBSD 5 to mark idle ithreads.
- Add a comment documenting that the SLEEP and LOCK states typically
display the name of the event being waited on with lock names being
prefixed with an asterisk and sleep event names not having a prefix.
Nathan Whitehorn [Tue, 31 May 2011 15:11:43 +0000 (15:11 +0000)]
On multi-core, multi-threaded PPC systems, it is important that the threads
be brought up in the order they are enumerated in the device tree (in
particular, that thread 0 on each core be brought up first). The SLIST
through which we loop to start the CPUs has all of its entries added with
SLIST_INSERT_HEAD(), which means it is in reverse order of enumeration
and so AP startup would always fail in such situations (causing a machine
check or RTAS failure). Fix this by changing the SLIST into an STAILQ,
and inserting new CPUs at the end.
John Baldwin [Tue, 31 May 2011 15:11:23 +0000 (15:11 +0000)]
Add a new option to toggle the display of the system idle process (per-CPU
idle threads). The process is displayed by default (subject to whether or
not system processes are displayed) to preserve existing behavior. The
system idle process can be hidden via the '-z' command line argument or the
'z' key while top is running. When it is hidden, top more closely matches
the behavior of FreeBSD <= 4.x where idle time was not accounted to any
process.
Bjoern A. Zeeb [Tue, 31 May 2011 15:05:29 +0000 (15:05 +0000)]
Remove some further INET related symbols from pf to allow the module
to not only compile bu load as well for testing with IPv6-only kernels.
For the moment we ignore the csum change in pf_ioctl.c given the
pending update to pf45.
Reported by: dru
Sponsored by: The FreeBSD Foundation
Sponsored by: iXsystems
MFC after: 20 days
Bjoern A. Zeeb [Tue, 31 May 2011 15:02:30 +0000 (15:02 +0000)]
Start teaching pc-sysinstall about IPv6.
Add some additional empty string checks for IPv4 and try to configure
a netmask along with the address rather than doing things twice.
Contrary to AUTO-DHCP, IPv6-SLAAC will accept static configuration
as well, which we will use at least for resolv.conf currently and
if we were given a static address configure that as an alias as well.
The pc-sysinstaller changes going along were committed to PC-BSD as r10773.
Reviewed by: kmoore
Sponsored by: The FreeBSD Foundation
Sponsored by: iXsystems
MFC after: 20 days
Bjoern A. Zeeb [Tue, 31 May 2011 14:40:21 +0000 (14:40 +0000)]
Conditionally compile in the af_inet and af_inet6, af_nd6 modules.
If compiled in for dual-stack use, test with feature_present(3)
to see if we should register the IPv4/IPv6 address family related
options.
In case there is no "inet" support we would love to go with the
usage() and make the address family mandatory (as it is for anything
but inet in theory). Unfortunately people are used to
ifconfig IF up/down
etc. as well, so use a fallback of "link". Adjust the man page
to reflect these minor details.
Improve error handling printing a warning in addition to the usage
telling that we do not know the given address family in two places.
Reviewed by: hrs, rwatson
Sponsored by: The FreeBSD Foundation
Sponsored by: iXsystems
MFC after: 2 weeks
Alexander Motin [Tue, 31 May 2011 09:22:52 +0000 (09:22 +0000)]
Add quirks to hint 4K physical sector (Advanced Format) for ATA disks not
reporting it properly (none? of known disks now).
Hitachi and WDC AF disks seem could be identified more or less formally.
For Seagate and Samsung enumerate some found models/series.
For other disks it can be forced with kern.cam.ada.X.quirks=1 tunable.
Imagine situation where a security problem is found in setuid binary.
User upgrades his system to fix the problem, but if he has any ZFS snapshots
for the file system which contains problematic binary, any user can mount the
snapshot and execute vulnerable binary.
Prevent this from happening by always mounting snapshots with setuid turned off.
Bjoern A. Zeeb [Tue, 31 May 2011 00:25:52 +0000 (00:25 +0000)]
No logner set an IPv4 loopback address by default in defaults/rc.conf.
If not specified, network.subr will add it automatically if we have
INET support (1).
In network.subr only call the address family up/down functions
if the respective AF is available.
Switch to new kern.features variables for inet and inet6 as the
inet sysctl tree is also available for IPv6-only kernels leading
to unexpected results.
Suggested by: hrs (1)
Reviewed by: hrs
Sponsored by: The FreeBSD Foundation
Sponsored by: iXsystems
MFC after: 20 days
Jilles Tjoelker [Mon, 30 May 2011 21:41:06 +0000 (21:41 +0000)]
posix_spawn(): Do not fail when trying to close an fd that is not open.
As noted in Austin Group issue #370 (an interpretation has been issued),
failing posix_spawn() because an fd specified with
posix_spawn_file_actions_addclose() is not open is unnecessarily harsh, and
there are existing implementations that do not fail posix_spawn() for this
reason.
Navdeep Parhar [Mon, 30 May 2011 21:34:44 +0000 (21:34 +0000)]
- Specialized ingress queues that take interrupts for other ingress
queues. Try to have a set of these per port when possible, fall back
to sharing a common pool between all ports otherwise.
- One control queue per port (used to be one per hardware channel).
Navdeep Parhar [Mon, 30 May 2011 21:07:26 +0000 (21:07 +0000)]
L2 table code. This is enough to get the T4's switch + L2 rewrite
filters working. (All other filters - switch without L2 info rewrite,
steer, and drop - were already fully-functional).
Some contrived examples of "switch" filters with L2 rewriting:
# cxgbetool t4nex0 iport 0 dport 80 action switch vlan +9 eport 3
Intercept all packets received on physical port 0 with TCP port 80 as
destination, insert a vlan tag with VID 9, and send them out of port 3.
# cxgbetool t4nex0 sip 192.168.1.1/32 ivlan 5 action switch \
vlan =9 smac aa:bb:cc:dd:ee:ff eport 0
Intercept all packets (received on any port) with source IP address
192.168.1.1 and VLAN id 5, rewrite the VLAN id to 9, rewrite source mac
to aa:bb:cc:dd:ee:ff, and send it out of port 0.
Adrian Chadd [Mon, 30 May 2011 15:06:57 +0000 (15:06 +0000)]
Enable setting the short-GI bit when TX'ing HT rates but only if the
hardware supports it.
Since ni->ni_htcap in hostap mode is what the remote end has advertised,
not what has been negotiated/decided, we need to check ourselves what
the current channel width is and what the hardware supports before
enabling short-GI.
It's important that short-GI isn't enabled when it isn't negotiated
and when the hardware doesn't support it (ie, short-gi for 20mhz channels
on any chip < AR9287.)
I've quickly verified this on the AR9285 in 11n mode.
Robert Watson [Mon, 30 May 2011 09:43:55 +0000 (09:43 +0000)]
Decompose the current single inpcbinfo lock into two locks:
- The existing ipi_lock continues to protect the global inpcb list and
inpcb counter. This lock is now relegated to a small number of
allocation and free operations, and occasional operations that walk
all connections (including, awkwardly, certain UDP multicast receive
operations -- something to revisit).
- A new ipi_hash_lock protects the two inpcbinfo hash tables for
looking up connections and bound sockets, manipulated using new
INP_HASH_*() macros. This lock, combined with inpcb locks, protects
the 4-tuple address space.
Unlike the current ipi_lock, ipi_hash_lock follows the individual inpcb
connection locks, so may be acquired while manipulating a connection on
which a lock is already held, avoiding the need to acquire the inpcbinfo
lock preemptively when a binding change might later be required. As a
result, however, lookup operations necessarily go through a reference
acquire while holding the lookup lock, later acquiring an inpcb lock --
if required.
A new function in_pcblookup() looks up connections, and accepts flags
indicating how to return the inpcb. Due to lock order changes, callers
no longer need acquire locks before performing a lookup: the lookup
routine will acquire the ipi_hash_lock as needed. In the future, it will
also be able to use alternative lookup and locking strategies
transparently to callers, such as pcbgroup lookup. New lookup flags are,
supplementing the existing INPLOOKUP_WILDCARD flag:
INPLOOKUP_RLOCKPCB - Acquire a read lock on the returned inpcb
INPLOOKUP_WLOCKPCB - Acquire a write lock on the returned inpcb
Callers must pass exactly one of these flags (for the time being).
Some notes:
- All protocols are updated to work within the new regime; especially,
TCP, UDPv4, and UDPv6. pcbinfo ipi_lock acquisitions are largely
eliminated, and global hash lock hold times are dramatically reduced
compared to previous locking.
- The TCP syncache still relies on the pcbinfo lock, something that we
may want to revisit.
- Support for reverting to the FreeBSD 7.x locking strategy in TCP input
is no longer available -- hash lookup locks are now held only very
briefly during inpcb lookup, rather than for potentially extended
periods. However, the pcbinfo ipi_lock will still be acquired if a
connection state might change such that a connection is added or
removed.
- Raw IP sockets continue to use the pcbinfo ipi_lock for protection,
due to maintaining their own hash tables.
- The interface in6_pcblookup_hash_locked() is maintained, which allows
callers to acquire hash locks and perform one or more lookups atomically
with 4-tuple allocation: this is required only for TCPv6, as there is no
in6_pcbconnect_setup(), which there should be.
- UDPv6 locking remains significantly more conservative than UDPv4
locking, which relates to source address selection. This needs
attention, as it likely significantly reduces parallelism in this code
for multithreaded socket use (such as in BIND).
- In the UDPv4 and UDPv6 multicast cases, we need to revisit locking
somewhat, as they relied on ipi_lock to stablise 4-tuple matches, which
is no longer sufficient. A second check once the inpcb lock is held
should do the trick, keeping the general case from requiring the inpcb
lock for every inpcb visited.
- This work reminds us that we need to revisit locking of the v4/v6 flags,
which may be accessed lock-free both before and after this change.
- Right now, a single lock name is used for the pcbhash lock -- this is
undesirable, and probably another argument is required to take care of
this (or a char array name field in the pcbinfo?).
This is not an MFC candidate for 8.x due to its impact on lookup and
locking semantics. It's possible some of these issues could be worked
around with compatibility wrappers, if necessary.
Reviewed by: bz
Sponsored by: Juniper Networks, Inc.
Robert Watson [Mon, 30 May 2011 09:34:15 +0000 (09:34 +0000)]
Rework TIMEWAIT regression test so that kernel-allocated port numbers are
used rather than a fixed userspace one, avoiding conflicts between the two
test runs.
MFC after: 3 days
Sponsored by: Juniper Networks, Inc.
Robert Watson [Mon, 30 May 2011 09:04:35 +0000 (09:04 +0000)]
In the tcpdrop regression test, allow the kernel to allocate us a port
rather than using a fixed port number. This means that the regression test
can be run many times in a row without waiting on TIMEWAIT to release a
hard-coded port number.
MFC after: 3 days
Sponsored by: Juniper Networks, Inc.
Jayachandran C. [Mon, 30 May 2011 06:23:51 +0000 (06:23 +0000)]
Fix read_ivar implementation for MMC and SD.
1. Both mmc_read_ivar() and sdhci_read_ivar() use the expression
'*(int *)result = val' to assign to result which is uintptr_t *.
This does not work on big-endian 64 bit systems.
2. The media_size ivar is declared as 'off_t' which does not fit
into uintptr_t in 32bit systems, change this to long.
Submitted by: kanthms at netlogicmicro com (initial version)