yongari [Wed, 22 Jun 2011 01:42:52 +0000 (01:42 +0000)]
MFC r222219,222221,222223,222226-222227,222231,222516:
Merge all relevant changes from HEAD to fix long standing
instability issues of msk(4). To get desired effect of this
merge, cold restarting is required because incorrectly programmed
registers are not reset to default value.
PR: kern/114631, kern/116853, kern/139093, kern/144206,
kern/147824, kern/151169, kern/154591, kern/155636,
kern/156493
r222219:
Do not blindly clear the entire GPHY control register. It seems some
bits of the register are used for other purposes, such that clearing
these bits caused unexpected results such as corrupted RX
frames or missing LE status updates. For old controllers like
Yukon EC it had no effect but it caused all kinds of trouble on
Yukon Supreme.
This change shall improve stability of controllers like Yukon
Ultra, Ultra2, Extreme, Optima and Supreme.
r222221:
Rework store and forward configuration of TX MAC FIFO. Basically it
enables store and forward mode except for jumbo frame on Yukon
Ultra.
r222223:
Do not configure RAM registers for controllers that do not have
them. These registers are defined only for Yukon XL, Yukon EC and
Yukon FE.
r222226:
Make sure to enable all clocks before accessing registers.
Releasing PHY from power down/COMA is done after enabling all
clocks. While I'm here remove unnecessary controller reset.
r222227:
Do not touch ASF related register for controllers that do not have
these registers. Also disable Watchdog of ASF microcontroller.
r222231:
When the MTU is changed, check whether the driver should be reinitialized
or not. If reinitialization is required, clear the driver running flag.
r222516:
Correctly check MAC running status before disabling TX/RX MACs.
yongari [Wed, 22 Jun 2011 00:48:13 +0000 (00:48 +0000)]
MFC r205091,216860:
r205091:
Implement Rx checksum offloading for Yukon EC, Yukon Ultra,
Yukon FE and Yukon Ultra2. These controllers provide a very simple
checksum computation mechanism and it requires additional pseudo
header checksum computation in the upper stack. Even though I couldn't
see much performance difference with/without Rx checksum offloading
it may help notebook based controllers.
Actually the controller can compute two checksum values by giving
different starting positions of checksum computation on a received
frame. However, Marvell's checksum offloading engine has long been
known to have several silicon bugs, so don't blindly trust the
computed partial checksum value. Instead, compute the partial checksum
twice by giving the same checksum computation position and compare
the results. If the values differ it's a clear indication of a
hardware bug. This configuration loses IP checksum offloading
capability but I think it's better to take the safe route.
Note, Rx checksum offloading for Yukon XL is still disabled due to a
known silicon bug.
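
The compare-twice idea can be sketched as follows. This is a minimal
illustration in Python rather than driver code; the function names are
hypothetical and the real driver reads both values from the RX list
element.

```python
def hw_checksums_agree(csum1: int, csum2: int) -> bool:
    """Both values were computed from the same start position, so a
    healthy checksum engine must report the same 16-bit result twice."""
    return (csum1 & 0xFFFF) == (csum2 & 0xFFFF)


def rx_partial_checksum(csum1: int, csum2: int):
    """Return the partial checksum to pass up the stack, or None to
    signal that the stack should verify the frame in software."""
    if hw_checksums_agree(csum1, csum2):
        return csum1 & 0xFFFF
    # A mismatch indicates a hardware bug, so do not trust either value.
    return None
```

Returning None on a mismatch costs a software checksum for that frame
but never hands a possibly corrupt result to the stack.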
r216860:
Fix endianness bug introduced in r205091.
After the controller updates the control word in an RX LE, the driver
converts it to host byte order. The checksum value in the control word
is stored in big endian form by the controller. r205091 didn't account
for the host byte order conversion, so the checksum value was
incorrectly interpreted on big endian architectures, which in turn
caused all TCP/UDP frames to be dropped. Make RX checksum offload work
on any architecture by swapping the checksum value.
Reported by: Sreekanth M. ( kanthms <> netlogicmicro dot com )
Tested by: Sreekanth M. ( kanthms <> netlogicmicro dot com )
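
The effect of reading a big endian 16-bit field with the wrong byte
order, and of undoing it with a swap, can be shown in a few lines. This
is only an illustration in Python: the sample value 0x1234 and the
helper swap16() are made up for the example and are not taken from the
driver.

```python
import struct


def swap16(value: int) -> int:
    """Exchange the two bytes of a 16-bit value."""
    return ((value & 0x00FF) << 8) | ((value >> 8) & 0x00FF)


# Hypothetical checksum 0x1234 as the controller would store it (big endian).
raw = bytes([0x12, 0x34])

misread = struct.unpack("<H", raw)[0]   # decoded with the opposite byte order
correct = struct.unpack(">H", raw)[0]   # decoded as big endian

# Swapping the misread value recovers the checksum the controller computed.
recovered = swap16(misread)
```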
yongari [Wed, 22 Jun 2011 00:38:25 +0000 (00:38 +0000)]
MFC r222542:
If driver is not running, disable interrupts and do not try to
process received frames. Previously it was possible to handle RX
interrupts even if controller is not fully initialized. This
resulted in non-working driver after system is up and running.
yongari [Wed, 22 Jun 2011 00:35:42 +0000 (00:35 +0000)]
MFC r222581:
Poke correct GPIO pins for newer axe(4) controllers with Marvell
PHY. Newer models seem to use different LED mode that requires
enabling both GPIO1 and GPIO2.
yongari [Wed, 22 Jun 2011 00:16:40 +0000 (00:16 +0000)]
MFC r221818:
Add initial BCM5719 support. TSO and jumbo frames are intentionally
disabled for the BCM5719 A0 revision due to known hardware errata.
Many thanks to Broadcom for continuing support of FreeBSD.
mav [Tue, 21 Jun 2011 08:37:55 +0000 (08:37 +0000)]
MFC r222897:
Intel NM10 chipset's SATA controller has the same PCI ID and revision as
ICH7's, but has only 2 SATA ports instead of 4. The worst part is that the
SStatus and SError registers for the missing ports are not implemented and
return wrong values (0xffffffff), which caused an infinite reset loop.
For now just ignore that SError value, as I found no better way to identify
these controllers.
jhb [Mon, 20 Jun 2011 18:08:34 +0000 (18:08 +0000)]
MFC 221346,223049:
Handle a rare edge case with nearly full TCP receive buffers. If a TCP
buffer fills up causing the remote sender to enter into persist mode, but
there is still room available in the receive buffer when a window probe
arrives (either due to window scaling, or due to the local application
very slowly draining data from the receive buffer), then the single byte
of data in the window probe is accepted. However, this can cause rcv_nxt
to be greater than rcv_adv. This condition will only last until the next
ACK packet is pushed out via tcp_output(), and since the previous ACK
advertised a zero window, the ACK should be pushed out while the TCP
pcb is write-locked. To guarantee this, advance the advertised window
(rcv_adv) even if we advertise a zero window.
During the window while rcv_nxt is greater than rcv_adv, a few places
would compute the remaining receive window via rcv_adv - rcv_nxt.
However, this value was then (uint32_t)-1. On a 64 bit machine this
could expand to a positive 2^32 - 1 when cast to a long. In particular,
when calculating the receive window in tcp_output(), the result would be
that the receive window was computed as 2^32 - 1 resulting in advertising
a far larger window to the remote peer than actually existed.
Fix various places that compute the remaining receive window to either
assert that it is not negative (i.e. rcv_nxt <= rcv_adv), or treat the
window as full if rcv_nxt is greater than rcv_adv.
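
The wraparound described above can be reproduced with explicit 32-bit
masking. The sketch below is in Python, where integers do not overflow,
so MASK32 emulates uint32_t arithmetic. The helper names are
hypothetical and safe_window() shows only the "treat the window as
full" variant of the fix.

```python
MASK32 = 0xFFFFFFFF


def naive_window(rcv_adv: int, rcv_nxt: int) -> int:
    """rcv_adv - rcv_nxt in unsigned 32-bit arithmetic, as it looks
    after being widened to a 64-bit long."""
    return (rcv_adv - rcv_nxt) & MASK32


def seq_gt(a: int, b: int) -> bool:
    """True if sequence number a is after b, modulo 2**32."""
    diff = (a - b) & MASK32
    return diff != 0 and diff < 0x80000000


def safe_window(rcv_adv: int, rcv_nxt: int) -> int:
    """Remaining receive window, treated as full (0) when rcv_nxt
    has moved past rcv_adv."""
    if seq_gt(rcv_nxt, rcv_adv):
        return 0
    return (rcv_adv - rcv_nxt) & MASK32
```

With rcv_nxt one byte past rcv_adv, naive_window() yields 2^32 - 1
while safe_window() yields 0.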
bz [Mon, 20 Jun 2011 08:37:20 +0000 (08:37 +0000)]
MFC r223057:
Add a new option -P to suppress getservbyport(3) calls when printing rules.
This allows one to force consistent printing of numeric port numbers like
we do with -n for other tools like netstat (just that -n was already taken)
rather than the service names.
rmacklem [Sun, 19 Jun 2011 02:39:02 +0000 (02:39 +0000)]
MFC: r222722
Add support for flock(2) locks to the new NFSv4 client. I think this
should be ok, since the client now delays NFSv4 Close operations
until VOP_INACTIVE()/VOP_RECLAIM(). As such, there should be no
risk that the NFSv4 Open is closed while an associated byte range lock
still exists.
rmacklem [Sun, 19 Jun 2011 02:24:36 +0000 (02:24 +0000)]
MFC: r222719
The new NFSv4 client was erroneously using "p" instead of
"p_leader" for the "id" for POSIX byte range locking. I think
this would only have affected processes created by rfork(2)
with the RFTHREAD flag specified. This patch fixes that by
passing the "id" down through the various functions from
nfs_advlock().
rmacklem [Sun, 19 Jun 2011 01:44:50 +0000 (01:44 +0000)]
MFC: r222663
Modify the new NFS server so that the NFSv3 Pathconf RPC
doesn't return an error when the underlying file system
lacks support for any of the four _PC_xxx values used, by
falling back to default values.
bz [Sat, 18 Jun 2011 22:12:17 +0000 (22:12 +0000)]
MFC r219722 (by jhb):
Preserve errno in an error case.
Submitted by: gcooper
MFC r222899:
Contrary to when returning in all-good cases at the end of functions we
did not free memory (1) or close the file descriptor (2) in error cases.
bschmidt [Sat, 18 Jun 2011 12:32:48 +0000 (12:32 +0000)]
MFC r220895,221634:
Now that all bits are in for 1030/6230 adapters enable those.
While here pull the adapter names from the Linux driver and sort
the list by ID.
bschmidt [Sat, 18 Jun 2011 12:10:06 +0000 (12:10 +0000)]
MFC r220866-220867:
- Pull some features out of the firmware:
- If an ENH_SENS TLV section exists, the firmware is capable of doing
enhanced sensitivity calibration.
- Newer devices/firmwares have more calibration commands therefore
hardcoding the noise gain/reset commands no longer works. It is
supposed to use the next index after the newest calibration type
support. Read the command index of the TLV section if available.
- Enable DC calibration for all 6000 series devices, except those
with an internal PA.
- Override the chainmask also for the 6050.
bschmidt [Sat, 18 Jun 2011 12:07:06 +0000 (12:07 +0000)]
MFC r220729:
Add some new features:
- 6000 series devices need enhanced sensitivity calibration.
- 6000 series devices need a different setting for the shadow reg.
- set the IWN_FLAG_HAS_11N bit if the EEPROM says the device has 11n
support.
bschmidt [Sat, 18 Jun 2011 12:03:30 +0000 (12:03 +0000)]
MFC r220727-220728:
- Read RX/TX chainmasks directly from the EEPROM. Some chips are known to
have wrong/broken information stored, keep the hardcoded values for
those.
- Bring over the HAL/OPS changes, instead of two const structs it is now
slightly more dynamic.
bschmidt [Sat, 18 Jun 2011 12:00:49 +0000 (12:00 +0000)]
MFC r220721,220723-220726:
- Rename some stuff in favour of the OpenBSD names:
- prefer EDCA over WME
- qid for a TXQ ID
- reg for register values
- Shuffle code around a bit. Mostly to group functionally connected things,
otherwise to get the same order as the OpenBSD code.
- Sync debug and error messages with OpenBSD. The device capability
announcements are now hidden behind bootverbose.
- Sync comments with OpenBSD.
- Whitespace sync, some parts more style(9) conformant than others.
bschmidt [Sat, 18 Jun 2011 11:56:40 +0000 (11:56 +0000)]
MFC r220720:
Fix WME/QoS handling:
- move the TX queue selection into iwn_tx_data/iwn_tx_data_raw
- extract traffic identifier and use it
- do not expect ACKs for frames marked as such
bschmidt [Sat, 18 Jun 2011 11:51:17 +0000 (11:51 +0000)]
MFC r220691-220694,220700-220702,220704,220710-220711:
- Remove the flags argument of iwn_dma_contig_alloc(), it is always set
as BUS_DMA_NOWAIT. While here also set BUS_DMA_COHERENT.
- OpenBSD uses IWN_RBUF_SIZE not MJUMPAGESIZE for the RX path, also replace
caddr_t with void * to be in sync.
- In case a new mbuf can't be loaded, reuse the old one.
- scratch_paddr has the same address pre-assigned, use that instead.
- Rewrite DMA segment handling to be more in line with the OpenBSD code.
Also change the m_len == 0 hack to have less code churn.
- Make sure to destroy all DMA tags and maps.
- Unify TX/RX ring allocation, finish the descriptor DMA stuff before
starting with data.
- Add missing bus_dmamap_sync calls as well as remove two duplicate ones.
- Prevent double-free, also use the same error codes as OpenBSD.
- Replace RX/TX ring allocation error messages with something more sane
and remove those where the caller already prints one.
bschmidt [Sat, 18 Jun 2011 11:44:54 +0000 (11:44 +0000)]
MFC r220689:
RSSI related syncs with the OpenBSD code:
- read RSSI only for the active chains
- cast RSSI/NF to int8_t before passing it up to radiotap
- remove the htole64() for the timestamp
bschmidt [Sat, 18 Jun 2011 11:36:57 +0000 (11:36 +0000)]
MFC r220674:
Revert some of local calibration changes in favour of the OpenBSD
implementation. This includes the fix required for the 6050 series
devices.
bschmidt [Sat, 18 Jun 2011 11:33:55 +0000 (11:33 +0000)]
MFC r220667+220668:
Split up the watchdog and calibration callouts. This allows us to use
different timing for each and to remove some monitor mode specific hacks
(monitor mode has no calibration).
bschmidt [Sat, 18 Jun 2011 11:29:44 +0000 (11:29 +0000)]
MFC r220661:
Fixes for firmware handling:
- there is a local variable for sc->fw_dma, use that instead
- OpenBSD uses 5*hz to wait for firmware to be loaded
- in case the firmware module contains invalid data, actually release it
bschmidt [Sat, 18 Jun 2011 11:23:42 +0000 (11:23 +0000)]
MFC r220636:
Instead of trying to figure out which rxon.flags to clear, restart
from scratch. Remove htole16() calls, rxon.chan is an uint8_t,
ieee80211_chan2ieee() does return an ic_ieee as an int, but I heavily
doubt a htole16() will buy us anything here.
bschmidt [Sat, 18 Jun 2011 11:19:12 +0000 (11:19 +0000)]
MFC r220634:
Reuse net80211 code:
- IWN_TXOP_TO_US is equal to IEEE80211_TXOP_TO_US
- use IEEE80211_DUR_TU
- ieee80211_add_rates/ieee80211_add_xrates are public, use them
- copied ieee80211_add_ssid since it is not public
rmacklem [Fri, 17 Jun 2011 16:23:50 +0000 (16:23 +0000)]
MFC: r222627
Fix the nfs related daemons so that they don't intermittently
fail with "bind: address already in use". This problem was reported
to the freebsd-stable@ mailing list on Feb. 19 under the subject
heading "statd/lockd startup failure" by george+freebsd at m5p dot com.
The problem is that the first combination of {udp,tcp X ipv4,ipv6}
would select a port# dynamically, but one of the other three combinations
would have that port# already in use. The patch is somewhat involved
because it was requested by dougb@ that the four combinations use the
same port# wherever possible. The patch splits the create_service()
function into two functions. The first goes as far as bind(2) in a
loop for up to GETPORT_MAXTRY - 1 times, attempting to use the same port#
for all four cases. If these attempts fail, the last attempt allows
the 4 cases to use different port #s. After this function has succeeded,
the second function, called complete_service(), does the rest of what
create_service() did.
The three daemons mountd, rpc.lockd and rpc.statd all have a
create_service() function that is patched in a similar way. However,
create_service() has non-trivial differences for the three daemons
that made it impractical to share the same functions between them.
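
The retry strategy described above can be sketched as follows. This is
a simplified model in Python, not the daemons' C code: bind_all(),
try_bind and release are hypothetical names, and sockets are abstracted
behind callbacks so the logic stands on its own. One assumption is
labeled in the code: on the last attempt a combination that cannot get
the shared port falls back to any free port.

```python
def bind_all(combos, try_bind, release, max_try):
    """Try to bind every combo in combos to one common port.

    try_bind(combo, port) returns the bound port or None on failure;
    port 0 asks for a dynamically chosen port.
    release(combo, port) undoes a successful bind.
    Returns {combo: port}, or None if even the last attempt failed.
    """
    for attempt in range(1, max_try + 1):
        last_attempt = attempt == max_try
        bound = {}
        shared_port = 0
        for combo in combos:
            port = try_bind(combo, shared_port)
            if port is None and last_attempt and shared_port != 0:
                # Assumption: only the final attempt may use a different port.
                port = try_bind(combo, 0)
            if port is None:
                break
            bound[combo] = port
            if shared_port == 0:
                shared_port = port   # the first combo picks the port
        else:
            return bound             # every combo is bound
        # At least one combo failed: undo this attempt and start over.
        for combo, port in bound.items():
            release(combo, port)
    return None
```

Releasing the partial binds before retrying lets the next attempt pick
a fresh dynamic port for the first combination.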
rmacklem [Fri, 17 Jun 2011 16:03:00 +0000 (16:03 +0000)]
MFC: r222624
Fix the nfs related daemons so that they don't intermittently
fail with "bind: address already in use". This problem was reported
to the freebsd-stable@ mailing list on Feb. 19 under the subject
heading "statd/lockd startup failure" by george+freebsd at m5p dot com.
The problem is that the first combination of {udp,tcp X ipv4,ipv6}
would select a port# dynamically, but one of the other three combinations
would have that port# already in use. The patch is somewhat involved
because it was requested by dougb@ that the four combinations use the
same port# wherever possible. The patch splits the create_service()
function into two functions. The first goes as far as bind(2) in a
loop for up to GETPORT_MAXTRY - 1 times, attempting to use the same port#
for all four cases. If these attempts fail, the last attempt allows
the 4 cases to use different port #s. After this function has succeeded,
the second function, called complete_service(), does the rest of what
create_service() did.
The three daemons mountd, rpc.lockd and rpc.statd all have a
create_service() function that is patched in a similar way. However,
create_service() has non-trivial differences for the three daemons
that made it impractical to share the same functions between them.
mav [Fri, 17 Jun 2011 07:05:47 +0000 (07:05 +0000)]
MFC r219969:
Make `geom XXX list` and `geom XXX status` outputs more consistent:
Add -a options to print all geoms, not only ones with providers.
Add -g option for `status` to report geom's names, not provider's.
Make `status` by default report provider's status (if present), not geom's.
Make `status` report consumer's statuses, not only "synchronized" field.
mav [Fri, 17 Jun 2011 06:59:49 +0000 (06:59 +0000)]
MFC r219974, r220209, r220210, r220790:
Add new RAID GEOM class, that is going to replace ataraid(4) in supporting
various BIOS-based software RAIDs. Unlike ataraid(4) this implementation
does not depend on legacy ata(4) subsystem and can be used with any disk
drivers, including new CAM-based ones (ahci(4), siis(4), mvs(4), ata(4)
with `options ATA_CAM`). To make code more readable and extensible, this
implementation follows modular design, including core part and two sets
of modules, implementing support for different metadata formats and RAID
levels.
Support for the following popular metadata formats is now implemented:
Intel, JMicron, NVIDIA, Promise (also used by AMD/ATI) and SiliconImage.
The following RAID levels are now supported:
RAID0, RAID1, RAID1E, RAID10, SINGLE, CONCAT.
For all of these RAID levels and metadata formats this class supports
the full cycle of volume operations: reading, writing, creation, deletion,
disk removal and insertion, rebuilding, dirty shutdown detection
and resynchronization, bad sector recovery, faulty disks tracking,
hot-spare disks. For Intel and Promise formats there is support for
multiple volumes per disk set.
See the graid(8) manual page for additional details.
Co-authored by: imp
Sponsored by: Cisco Systems, Inc. and iXsystems, Inc.
mav [Fri, 17 Jun 2011 06:23:58 +0000 (06:23 +0000)]
MFC r219970:
Introduce a new type of BIO_GETATTR -- GEOM::setstate, used to inform the
lower GEOM about the state of its providers from the point of view of the
upper layers.
Make geom_disk use the led(4) subsystem to indicate states as follows:
FAILED - "1" (on), REBUILD - "f5" (slow blink), RESYNC - "f1" (fast blink),
ACTIVE - "0" (off).
LED name should be set for each disk via kern.geom.disk.%s.led sysctl.
mav [Fri, 17 Jun 2011 05:55:41 +0000 (05:55 +0000)]
MFC r219950:
Change the BIO_GETATTR("GEOM::kerneldump") API to make set_dumper() called
by the consumer (geom_dev) instead of the provider (geom_disk). This allows
any geom to insert its code into the dump call chain, implementing more
sophisticated functionality than just disk partitioning.
rmacklem [Thu, 16 Jun 2011 19:47:56 +0000 (19:47 +0000)]
MFC: r222541
Add a sentence to the umount.8 man page to clarify the behaviour
for forced dismount when used on an NFS mount point.
This is a content change.
rmacklem [Thu, 16 Jun 2011 19:32:00 +0000 (19:32 +0000)]
MFC: r222540
Fix the new NFS client so that it doesn't do an NFSv3
Pathconf RPC for cases where the reply doesn't include
the answer. This fixes a problem reported by avg@ where
the NFSv3 Pathconf RPC would fail when "ls -l" did an
lpathconf(2) for _PC_ACL_NFS4.
delphij [Thu, 16 Jun 2011 01:52:42 +0000 (01:52 +0000)]
MFC r222795 (jkim) + 222967:
Validate INT 15h and 16h vectors more strictly. Traditionally these entry
points are fixed addresses and the (U)EFI CSM specification also mandates
that. Unfortunately, the (U)EFI CSM specification does not say whether the
service routine is to be called via the interrupt vector table or by
jumping directly to the entry point. As a result, some CSMs seem to install
two routines that act differently depending on how they are executed.
When INT 15h is used, it calls a function pointer (which is probably a UEFI
service function). When it jumps directly to the entry point, it executes
a simple and traditional INT 15h service routine. Therefore, there are
actually two possible fixes, i.e., this fix or jumping directly to the fixed
entry point. However, we chose this fix because a) keyboard typematic
support via BIOS is becoming extremely rare and b) we cannot support a
random service routine installed by a firmware or a boot loader. This
should fix the Lenovo X220 laptop, specifically.
trociny [Wed, 15 Jun 2011 20:34:40 +0000 (20:34 +0000)]
MFC r222454:
In soreceive_generic(), if MSG_WAITALL is set but the request is
larger than the receive buffer, we have to receive in sections.
When notifying the protocol that some data has been drained the
lock is released for a moment. On return we block waiting for the
rest of the data. There is a race: data can arrive while the
lock is released, and then the connection stalls in sbwait.
Fix this by checking for data before blocking and skipping the block
if there is some.
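
The "check before blocking" pattern can be illustrated with a condition
variable. This is a userland analogy in Python, not the socket buffer
code; RecvBuffer and its methods are hypothetical names.

```python
import threading


class RecvBuffer:
    """Toy receive buffer protected by a condition variable."""

    def __init__(self):
        self.cond = threading.Condition()
        self.data = bytearray()

    def append(self, chunk: bytes) -> None:
        with self.cond:
            self.data += chunk
            self.cond.notify_all()

    def wait_for_data(self, timeout=None) -> bool:
        """Block only if nothing is buffered. Data that arrived while
        the lock was dropped is noticed before sleeping."""
        with self.cond:
            while not self.data:
                if not self.cond.wait(timeout):
                    return False     # timed out with no data
            return True
```

Testing the predicate before the first wait is what closes the race: a
wakeup that fired while the lock was released is not needed to make
progress.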
jhb [Tue, 14 Jun 2011 18:58:57 +0000 (18:58 +0000)]
MFC 222530:
Add a new option to toggle the display of the system idle process (per-CPU
idle threads). The process is displayed by default (subject to whether or
not system processes are displayed) to preserve existing behavior. The
system idle process can be hidden via the '-z' command line argument or the
'z' key while top is running. When it is hidden, top more closely matches
the behavior of FreeBSD <= 4.x where idle time was not accounted to any
process.
jhb [Tue, 14 Jun 2011 18:54:31 +0000 (18:54 +0000)]
MFC 222750:
Clear the device_t pointer in 'struct resource' when releasing a device
as otherwise the sysctl to export rman info can dereference a stale
pointer.
mm [Tue, 14 Jun 2011 10:50:01 +0000 (10:50 +0000)]
MFC 222343, 222518, 222835
MFC r222343 (pjd):
Silence warnings about unsupported value types.
MFC r222518 (pjd):
Imagine a situation where a security problem is found in a setuid binary.
The user upgrades the system to fix the problem, but if there are any
ZFS snapshots of the file system which contains the problematic binary,
any user can mount the snapshot and execute the vulnerable binary.
Prevent this from happening by always mounting snapshots with
setuid turned off.
MFC r222835:
Silence notice on pool creation, import and access.
kib [Mon, 13 Jun 2011 19:40:09 +0000 (19:40 +0000)]
Cherry-pick a single bit from r222958. Do not pass '3' as the sleepflag
to bufobj_wwait() in the ffs_syncvnode(). It only mangles the priority
argument of msleep().
kib [Mon, 13 Jun 2011 19:33:13 +0000 (19:33 +0000)]
MFC r222586:
Fix an infinite loop in vm_object_page_clean() when the
filesystem returns permanent errors for some page writes.
To accommodate the stable/8 locking requirements, the vm page queue lock
is taken around the loop in vnode_pager_undirty_pages() which modifies
the m->dirty field.
jh [Mon, 13 Jun 2011 15:53:56 +0000 (15:53 +0000)]
MFC r219925:
Recognize "ro", "rdonly", "norw", "rw" and "noro" as equal options in
vfs_equalopts(). This allows vfs_sanitizeopts() to filter redundant
occurrences of these options. It was possible that for example both "ro"
and "rw" options became active concurrently.
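
How treating these names as one option removes conflicting entries can
be sketched as follows. This is a simplified model in Python, not the
kernel code: it assumes the sanitizing pass keeps the last occurrence
among options considered equal, and it models only the read-only and
read-write family named above.

```python
# Names that all select the read-only/read-write state of a mount.
RO_RW_FAMILY = {"ro", "rdonly", "norw", "rw", "noro"}


def equal_opts(a: str, b: str) -> bool:
    """True if both names refer to the same underlying option."""
    if a == b:
        return True
    return a in RO_RW_FAMILY and b in RO_RW_FAMILY


def sanitize_opts(opts):
    """Drop every option that is overridden by a later equal option."""
    kept = []
    for i, opt in enumerate(opts):
        if any(equal_opts(opt, later) for later in opts[i + 1:]):
            continue
        kept.append(opt)
    return kept
```

With this model, a list containing both "ro" and "rw" keeps only
whichever was given last.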
rmacklem [Sun, 12 Jun 2011 02:05:59 +0000 (02:05 +0000)]
MFC: r222466
Modify the umount(8) command so that it doesn't do
a sync(2) syscall before unmount(2) for the "-f" case.
This avoids a forced dismount from getting stuck for
an NFS mountpoint in sync() when the server is not
responsive. With this commit, forced dismounts should
normally work for the NFS clients, but can take up to
about 1 minute to complete.
rmacklem [Sun, 12 Jun 2011 01:48:31 +0000 (01:48 +0000)]
MFC: r222464
Add a check for MNTK_UNMOUNTF at the beginning of nfs_sync()
in the old NFS client so that a forced dismount doesn't
get stuck in the VFS_SYNC() call that happens before
VFS_UNMOUNT() in dounmount(). Analogous to r222329 for the new NFS client.
An additional change is needed before forced dismounts will work.
gjb [Sat, 11 Jun 2011 00:30:56 +0000 (00:30 +0000)]
MFC 222758, 222759, 222770:
- Document that when running 'su -m <user> -c <command>', <command> is
run within a shell as <user>.
- Bump date
- Attempt to clear up some confusion in the following example, by
stating the '-c' argument is passed to the shell, not to su(1), which
would indicate the login class.
jhb [Fri, 10 Jun 2011 19:16:26 +0000 (19:16 +0000)]
MFC 222532:
- Document the -H option and 'H' key alongside other options and keys
rather than at the bottom of the manpage.
- Remove an obsolete comment about SWAIT being a stale state. It was
resurrected for a different purpose in FreeBSD 5 to mark idle ithreads.
- Add a comment documenting that the SLEEP and LOCK states typically
display the name of the event being waited on with lock names being
prefixed with an asterisk and sleep event names not having a prefix.
jhb [Fri, 10 Jun 2011 19:13:22 +0000 (19:13 +0000)]
MFC 221079:
Generate the network byte order version of the window size structure in a
temporary variable on the stack and then copy that into the output buffer
so that the htons() conversions use aligned accesses.
jhb [Fri, 10 Jun 2011 19:12:00 +0000 (19:12 +0000)]
MFC 222660:
- Rename the Cronyx Omega2-PCI entry to Exar XR17C158 since that is the
real owner of the device ID. Also rename the associated config
function while here.
- Add support for the 2-port and 4-port Exar parts as well: Exar XR17C/D152
and Exar XR17C154.
jhb [Fri, 10 Jun 2011 19:03:17 +0000 (19:03 +0000)]
MFC 222254:
Fix an issue with critical sections and SMP rendezvous handlers.
Specifically, a critical_exit() call that drops the nesting level to zero
has a brief window where the pending preemption flag is set and the
nesting level is set to zero. This is done purposefully to avoid races
where a preemption scheduled by an interrupt could be lost otherwise (see
revision 144777). However, this does mean that if an interrupt fires
during this window and enters and exits a critical section, it may preempt
from the interrupt context. This is generally fine as the interrupt code
is careful to arrange critical sections so that they are not exited until
it is safe to preempt (e.g. interrupts EOI'd and masked if necessary).
However, the SMP rendezvous IPI handler does not quite follow this rule,
and in general a rendezvous can never be preempted. Rendezvous handlers
are also not permitted to schedule threads to execute, so they will not
typically trigger preemptions. SMP rendezvous handlers may use
spinlocks (carefully) such as the rm_cleanIPI() handler used in rmlocks,
but using a spinlock also enters and exits a critical section. If the
interrupted top-half code is in the brief window of critical_exit() where
the nesting level is zero but a preemption is pending, then releasing the
spinlock can trigger a preemption. Because we know that SMP rendezvous
handlers can never schedule a thread, we know that a critical_exit() in
an SMP rendezvous handler will only preempt in this edge case. We also
know that the top-half thread will happily handle the deferred preemption
once the SMP rendezvous has completed, so the preemption will not be lost.
This makes it safe to employ a workaround where we use a nested critical
section in the SMP rendezvous code itself around rendezvous action
routines to prevent any preemptions during an SMP rendezvous. The
workaround intentionally avoids checking for a deferred preemption
when leaving the critical section on the assumption that if there is a
pending preemption it will be handled by the interrupted top-half code.
jhb [Fri, 10 Jun 2011 18:55:58 +0000 (18:55 +0000)]
MFC 222032:
Fix a race in the SMP rendezvous code. Specifically, the write by the
last CPU to finish the rendezvous action may become visible to
different CPUs at different times. As a result, the CPU that initiated
the rendezvous may exit the rendezvous and drop the lock allowing another
rendezvous to be initiated on the same CPU or a different CPU. In that
case the exit sentinel may be cleared before all CPUs have noticed causing
those CPUs to hang forever.
Work around this by using a generation count to notice when this race
occurs and to exit the rendezvous in that case.
jhb [Fri, 10 Jun 2011 18:51:22 +0000 (18:51 +0000)]
MFC 220794:
When checking to see if a window update should be sent to the remote peer,
don't force a window update if the window would not actually grow due to
window scaling. Specifically, if the window scaling factor is larger than
2 * MSS, then after the local reader has drained 2 * MSS bytes from the
socket, a window update can end up advertising the same window. If this
happens, the supposed window update actually ends up being a duplicate ACK.
This can result in an excessive number of duplicate ACKs when using a
higher maximum socket buffer size.
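
The arithmetic behind this can be worked through with small helpers.
The sketch is in Python; the MSS of 1460 bytes and the shift of 14 are
illustrative values, and the function names are hypothetical.

```python
def advertised(window_bytes: int, scale: int) -> int:
    """Window the peer actually sees: the header carries the byte
    count shifted right by scale, so low-order bits are lost."""
    return (window_bytes >> scale) << scale


def update_grows_window(old_space: int, drained: int, scale: int) -> bool:
    """True only if draining changes the advertised window."""
    new_space = old_space + drained
    return advertised(new_space, scale) > advertised(old_space, scale)


MSS = 1460            # illustrative segment size
SCALE = 14            # scaling factor 2**14 = 16384 > 2 * MSS
OLD_SPACE = 3 * (1 << SCALE)

# Draining 2 * MSS bytes does not reach the next multiple of 16384.
grows_with_large_scale = update_grows_window(OLD_SPACE, 2 * MSS, SCALE)
# With a small scaling factor the same drain is visible to the peer.
grows_with_small_scale = update_grows_window(OLD_SPACE, 2 * MSS, 2)
```

When the advertised value does not change, the "update" would be
indistinguishable from a duplicate ACK.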
jhb [Fri, 10 Jun 2011 18:46:40 +0000 (18:46 +0000)]
MFC 221209:
TCP reuses t_rxtshift to determine the backoff timer used for both the
persist state and the retransmit timer. However, the code that implements
"bad retransmit recovery" only checks t_rxtshift to see if an ACK has been
received during the first retransmit timeout window. As a result, if
ticks has wrapped over to a negative value and a socket is in the persist
state, it can incorrectly treat an ACK from the remote peer as a
"bad retransmit recovery" and restore saved values such as snd_ssthresh and
snd_cwnd. However, if the socket has never had a retransmit timeout, then
these saved values will be zero, so snd_ssthresh and snd_cwnd will be set
to 0.
If the socket is in fast recovery (this can be caused by excessive
duplicate ACKs such as those fixed by 220794), then each ACK that arrives
triggers either NewReno or SACK partial ACK handling which clamps snd_cwnd
to be no larger than snd_ssthresh. In effect, the socket's send window
is permanently stuck at 0 even though the remote peer is advertising a
much larger window and pending data is only sent via TCP window probes
(so one byte every few seconds).
Fix this by adding a new TCP pcb flag (TF_PREVVALID) that indicates that
the various snd_*_prev fields in the pcb are valid and only perform
"bad retransmit recovery" if this flag is set in the pcb. The flag is set
on the first retransmit timeout that occurs and is cleared on subsequent
retransmit timeouts or when entering the persist state.
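
The role of the flag can be modeled with a small state sketch. This is
Python, not the kernel code: the class, the prev_valid field (standing
in for TF_PREVVALID), the starting values and the boolean
in_bad_rexmit_window (standing in for the ticks comparison) are all
assumptions made for illustration.

```python
MSS = 1460  # illustrative segment size


class TcpControlBlock:
    def __init__(self):
        self.rxtshift = 0
        self.prev_valid = False          # stands in for TF_PREVVALID
        self.snd_cwnd = 10 * MSS
        self.snd_ssthresh = 64 * MSS
        self.snd_cwnd_prev = 0           # nothing saved yet
        self.snd_ssthresh_prev = 0

    def retransmit_timeout(self):
        self.rxtshift += 1
        if self.rxtshift == 1:
            # First timeout: save state so a bad retransmit can be undone.
            self.snd_cwnd_prev = self.snd_cwnd
            self.snd_ssthresh_prev = self.snd_ssthresh
            self.prev_valid = True
        else:
            self.prev_valid = False
        self.snd_cwnd = MSS

    def persist_timeout(self):
        # The persist timer shares the backoff counter but saves nothing.
        self.rxtshift += 1
        self.prev_valid = False

    def ack_received(self, in_bad_rexmit_window: bool):
        if self.rxtshift == 1 and in_bad_rexmit_window and self.prev_valid:
            self.snd_cwnd = self.snd_cwnd_prev
            self.snd_ssthresh = self.snd_ssthresh_prev
        self.rxtshift = 0
```

Without the prev_valid test, an ACK arriving in the persist state could
restore the never-saved zero values into snd_cwnd and snd_ssthresh.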
rmacklem [Fri, 10 Jun 2011 13:28:14 +0000 (13:28 +0000)]
MFC: r222389
Fix the new NFS client so that it handles NFSv4 state
correctly during a forced dismount. This required that
the exclusive and shared (refcnt) sleep lock functions check
for MNTK_UNMOUNTF before sleeping, so that they won't block
while nfscl_umount() is getting rid of the state. As
such, a "struct mount *" argument was added to the locking
functions. I believe the only remaining case where a forced
dismount can get hung in the kernel is when a thread is
already attempting to do a TCP connect to a dead server
when the krpc client structure called nr_client is NULL.
This will only happen just after a "mount -u" with options
that force a new TCP connection is done, so it shouldn't
be a problem in practice.