mav [Thu, 1 Nov 2012 00:09:01 +0000 (00:09 +0000)]
Only four specific ATA PIO commands transfer several sectors per DRQ block
(interrupt). All other ATA PIO commands transfer one sector or 512 bytes
at one time. Hardcode these exceptions in ata(4) with ATA_CAM option.
This fixes timeout of READ LOG EXT command used by `smartctl -x /dev/adaX`.
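
The rule above can be sketched as a small lookup. This is an illustrative Python model, not the ata(4) code. The commit does not name the four commands, so the opcode set below (the READ/WRITE MULTIPLE family from the ATA command set) is an assumption, as are the function and parameter names.

```python
SECTOR_SIZE = 512

# Assumed set: the commit message does not list the four commands.
# These are the READ/WRITE MULTIPLE opcodes from the ATA command set.
MULTI_SECTOR_PIO = {
    0xC4,  # READ MULTIPLE
    0x29,  # READ MULTIPLE EXT
    0xC5,  # WRITE MULTIPLE
    0x39,  # WRITE MULTIPLE EXT
}


def drq_block_bytes(command, multi_count, remaining_bytes):
    """Return how many bytes to move for the next DRQ block (interrupt).

    multi_count is the number of sectors per block configured for the
    MULTIPLE commands (hypothetical parameter name).
    """
    sectors = multi_count if command in MULTI_SECTOR_PIO else 1
    return min(remaining_bytes, sectors * SECTOR_SIZE)
```

With this model a READ LOG EXT (opcode 0x2F) request is always serviced one sector per interrupt, regardless of the configured multi-sector count.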
jfv [Wed, 31 Oct 2012 23:50:36 +0000 (23:50 +0000)]
A few important fixes:
- Testing TSO6 has led me to discover that HW RSC is
a problematic feature. It is ONLY designed to work
with IPv4 in the first place, and when IP forwarding
is done it can't be disabled the way LRO is in the stack.
Also, initial testing we've done at Intel shows
equal performance using TSO[46] on TX and LRO
on RX. If you ran older code on 82599 or later hardware
you could actually have seen degraded performance for
this reason. So I am disabling the feature by default
and all our adapters will now use LRO instead.
- If you have flow control off and multiple queues, it
was possible for all RX movement to stall when the
buffer of one queue became full. To eliminate
this problem a feature bit is now set that will allow
packets to be dropped when full rather than stall.
Note, the default is to have flow control on, which
keeps this from happening.
- Because of the recent fixes in the stack, LRO is now
auto-disabled when problematic, so I have decided to
enable it by default in the capabilities in the driver.
- Some customers use 1G modules; a couple of small tweaks
in the media code now properly support those.
- A note: we have now done some testing of TSO6 and using
LRO with IPv6 and it all works great!! Seeing line rate
in both directions in best cases. Thanks bz for your
excellent work!!
mav [Wed, 31 Oct 2012 22:11:51 +0000 (22:11 +0000)]
ASUS EeePC 1001px has a strange variant of the ALC269 CODEC that mutes the
speaker if the mixer at NID 15, unused in that configuration, is muted.
Probably the CODEC incorrectly reports its internal connections. Hide that
muter from the driver to avoid muting and make the built-in speaker work.
There are several different CODECs sharing this ID and I do not have enough
information about them and the bug to implement a more universal solution.
Tested by: Big Yuuta <init.py@gmail.com>
MFC after: 2 weeks
adrian [Wed, 31 Oct 2012 21:03:55 +0000 (21:03 +0000)]
HAL updates!
* Add some more ANI spur immunity levels.
* For AR5111 radios attached to an AR5212, limit the 5GHz channels
that are available. A later revision of the AR5111 supports the 4.9GHz
PSB channels but right now there's no check in place for the radio
revision.
If someone wants PSB support on AR5212+AR5111 radios then please let
me know and I'll add the relevant version check.
adrian [Wed, 31 Oct 2012 20:58:24 +0000 (20:58 +0000)]
Add the emulation PCI device id - these days, 0xabcd shows up all over
the internet as "AR9380 and later which didn't get its PCI ID written
in at power-on", so it's hardly an unknown constant.
jfv [Wed, 31 Oct 2012 18:16:42 +0000 (18:16 +0000)]
Restore code that was lost somewhere in the past.
It was designed to keep duplicate null vlan tags from
being added. This problem occurs when doing vlans purely
via the switch. Reported by external customer.
attilio [Wed, 31 Oct 2012 18:07:18 +0000 (18:07 +0000)]
Rework the known mutexes that benefit from staying on their own
cache line to use struct mtx_padalign instead of manual frobbing.
The sole exceptions are the nvme and sfxge drivers, where the author
redefined CACHE_LINE_SIZE manually, so they need to be analyzed and
dealt with separately.
jimharris [Wed, 31 Oct 2012 17:12:12 +0000 (17:12 +0000)]
Pad and align the callout_cpu mtx to its own cacheline to reduce false
sharing especially on the default CPU 0 callout_cpu structure.
This will be followed up by attilio@ with a conversion to the new struct
mtx_padalign but doing this manual conversion first gives an easy MFC
candidate since mtx_padalign is a more extensive system change.
attilio [Wed, 31 Oct 2012 13:38:56 +0000 (13:38 +0000)]
Give mtx(9) the ability to crunch different types of structures, with the
only constraint that they have a lock cookie named mtx_lock.
This name then becomes reserved for the structs that want to use the
mtx(9) KPI, and other locking primitives cannot reuse it for their
members.
Namely, such structs are the current struct mtx and the new
struct mtx_padalign. The new structure defines an object which has
the same layout as struct mtx but will be allocated in
areas aligned to the cache line size and will be as big as a cache line.
This is supposed to give higher performance for highly contended mutexes,
both spin and sleep (because of the adaptive spinning), where the cache
line contention results in too much traffic on the system bus.
The struct mtx_padalign can be used in a completely transparent way
with the mtx(9) KPI.
At the moment, a possibility to MFC the patch should be carefully
evaluated because this patch breaks the low level KPI
(not its representation though).
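
The effect of padding and aligning a lock to a cache line can be illustrated with simple address arithmetic. This is a sketch only, assuming a 64-byte cache line (the real CACHE_LINE_SIZE is per-architecture) and hypothetical helper names; it models layout, not the mtx(9) implementation.

```python
CACHE_LINE_SIZE = 64  # assumed value for illustration


def padded_size(size, line=CACHE_LINE_SIZE):
    """Round size up to a whole number of cache lines."""
    return (size + line - 1) // line * line


def lines_touched(offset, size, line=CACHE_LINE_SIZE):
    """Return the cache line indexes covered by [offset, offset + size)."""
    return set(range(offset // line, (offset + size - 1) // line + 1))


def shares_line(a_off, a_size, b_off, b_size, line=CACHE_LINE_SIZE):
    """True if the two objects have at least one cache line in common."""
    return bool(lines_touched(a_off, a_size, line) &
                lines_touched(b_off, b_size, line))
```

A 32-byte lock at offset 0 followed directly by another field shares a line with it; once the lock is padded to `padded_size(32)` and the neighbour starts at offset 64, the two no longer overlap.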
adrian [Wed, 31 Oct 2012 06:27:58 +0000 (06:27 +0000)]
I give up - introduce a TX lock to serialise TX operations.
I've tried serialising TX using queues and such but unfortunately
due to how this interacts with the locking going on elsewhere in the
networking stack, the TX task gets delayed, resulting in quite a
noticeable throughput loss:
* baseline TCP for 2x2 11n HT40 is ~ 170mbit/sec;
* TCP for TX task in the ath taskq, with the RX also going on - 80mbit/sec;
* TCP for TX task in a separate, second taskq - 100mbit/sec.
So for now I'm going with the Linux wireless stack approach - lock tx
early. The Linux code does it in the wireless stack, before the 802.11
state stuff happens and before it's punted to the driver.
But TX locking needs to also occur at the driver layer as the TX
completion code _also_ begins to drain the ifnet TX queue.
Whilst I'm here, add some KTR traces for the TX path.
Note:
* This really should be done at the net80211 layer (as well, at least.)
But that'll have to wait for a little more thought to happen.
jmallett [Wed, 31 Oct 2012 04:23:36 +0000 (04:23 +0000)]
If the CF physical base is 0, attach no CF devices. This fixes a warning
about a 0 passed to cvmx_phys_to_ptr on systems without a CF interface,
such as the RSYS4GBE.
davide [Wed, 31 Oct 2012 03:55:33 +0000 (03:55 +0000)]
- Do not put in the mntqueue half-constructed vnodes.
- Change the code so that it relies on vfs_hash rather than on a
home-made hashtable.
- There's no need to inline fnv_32_buf().
Reviewed by: delphij
Tested by: pho
Sponsored by: iXsystems inc.
davide [Wed, 31 Oct 2012 03:34:07 +0000 (03:34 +0000)]
Fix panic due to page faults while in kernel mode, under conditions of
VM pressure. The reason is that in some codepaths pointers to stack
variables were passed from one thread to another.
In collaboration with: pho
Reported by: pho's stress2 suite
Sponsored by: iXsystems inc.
dim [Tue, 30 Oct 2012 22:09:53 +0000 (22:09 +0000)]
Pull in r165377 from upstream llvm trunk:
X86: fcmov doesn't handle all possible EFLAGS, fall back to a branch
for the others.
Otherwise it will try to use SSE patterns and fail horribly if sse is
disabled.
Fixes PR14035.
This should fix the following assertion failure:
Assertion failed: (Reg >= X86::FP0 && Reg <= X86::FP6 && "Expected FP
register!"), function getFPReg, file
contrib/llvm/lib/Target/X86/X86FloatingPoint.cpp, line 330.
which can show up when compiling contrib/compiler-rt, using -march=i686
through -march=pentium3 (CPUs which do support fcmov, but don't support
SSE2).
trasz [Tue, 30 Oct 2012 21:32:10 +0000 (21:32 +0000)]
Fix problem with geom_label(4) not recognizing UFS labels on filesystems
extended using growfs(8). The problem here is that geom_label checks if
the filesystem size recorded in UFS superblock is equal to the provider
(i.e. device) size. This check cannot be removed due to backward
compatibility. On the other hand, in most cases growfs(8) cannot set
fs_size in the superblock to match the provider size, because, differently
from newfs(8), it cannot recompute cylinder group sizes.
To fix this problem, add another superblock field, fs_providersize, used
only for this purpose. The geom_label(4) will attach if either fs_size
(filesystem created with newfs(8)) or fs_providersize (filesystem expanded
using growfs(8)) matches the device size.
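
The attach decision described above reduces to a two-way comparison. A minimal sketch, assuming all three sizes are expressed in the same unit and using a hypothetical function name; it is not the geom_label(4) code:

```python
def label_attaches(fs_size, fs_providersize, provider_size):
    """Decide whether a UFS label should attach to a provider.

    fs_size         -- filesystem size from the superblock
    fs_providersize -- provider size recorded by growfs(8), 0 if never grown
    provider_size   -- actual size of the provider (device)
    All values are assumed to be in the same unit.
    """
    return provider_size == fs_size or provider_size == fs_providersize
```

A filesystem created with newfs(8) matches through `fs_size`; one expanded with growfs(8) matches through `fs_providersize` even though its `fs_size` is slightly smaller than the device.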
PR: kern/165962
Reviewed by: mckusick
Sponsored by: FreeBSD Foundation
hselasky [Tue, 30 Oct 2012 16:56:16 +0000 (16:56 +0000)]
If a USB mass storage device doesn't respond properly
to the initial SCSI INQUIRY command, enable all quirks.
This fixes detection of some Transcend TS2GUFM devices.
attilio [Tue, 30 Oct 2012 15:10:50 +0000 (15:10 +0000)]
Fixup r240246: hwpmc needs to retain the pinning until ASTs have been
executed. This means past the point where userret() is generally
executed.
Skip the td_pinned check if a callchain tracing is currently happening
and add a more robust check to pmc_capture_user_callchain() in order to
catch td_pinned leak past ast() in hwpmc case.
mav [Tue, 30 Oct 2012 12:44:30 +0000 (12:44 +0000)]
Set all pins initial connection status to unknown (2) and then update it
with the real value in regular way if sensing is supported. This fixes
minor inconsistency when playback redirection appeared in undefined state
on boot if headphones were not connected.
yongari [Tue, 30 Oct 2012 07:55:03 +0000 (07:55 +0000)]
TSO engine of L1 requires a separate DMA descriptor for TCP
payload. This means driver has to split a TX buffer into two
pieces of TX buffers when the TX buffer contains both
ethernet/IP/TCP header and partial TCP payload. The controller
does not require the whole header to be in a single TX buffer, but the
driver enforces that in order to compute the IP/TCP header size/offset,
a required parameter for configuring the DMA descriptor for TSO.
While here, slightly reorder DMA descriptor setup to enhance
readability and remove unnecessary code for TSO(upper stack never
requests TSO when the frame length is less than or equal to MTU).
jmallett [Tue, 30 Oct 2012 06:29:17 +0000 (06:29 +0000)]
Remove oct_read64 and oct_write64 and use their equivalents from the Simple
Executive, which are used everywhere else in the Octeon port. While here,
remove other unused things from octeon_pcmap_regs.h.
gonzo [Tue, 30 Oct 2012 01:52:49 +0000 (01:52 +0000)]
Separate interrupts enable/disable logic from setting port parameters.
Otherwise setting baud rate in TTY mode effectively disables TX/RX
interrupts and renders port unusable.
mav [Mon, 29 Oct 2012 21:08:06 +0000 (21:08 +0000)]
Minor addition to r242323:
As with BIO_WRITE, report success if at least one subdisk succeeded with
BIO_DELETE. But unlike BIO_WRITE, don't fail the disk on BIO_DELETE error.
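
The completion rule can be sketched as follows. This is a simplified model with hypothetical names, assuming each subdisk reports an errno-style integer (0 for success); it is not the GEOM RAID code.

```python
def raid_bio_status(op, subdisk_errors):
    """Combine per-subdisk results for a mirrored request.

    op             -- "write" or "delete"
    subdisk_errors -- list of errno values, 0 meaning success
    Returns (request_error, indexes_of_subdisks_to_fail).
    """
    first_error = next((e for e in subdisk_errors if e != 0), 0)
    any_success = any(e == 0 for e in subdisk_errors)
    # Both operations succeed as long as one subdisk succeeded.
    request_error = 0 if any_success else first_error
    # Only a failed write marks the subdisk as failed; a failed delete does not.
    to_fail = [i for i, e in enumerate(subdisk_errors)
               if e != 0 and op == "write"]
    return request_error, to_fail
```

The asymmetry reflects that a failed delete leaves the data on the subdisk intact, whereas a failed write leaves it out of sync with its peers.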
mav [Mon, 29 Oct 2012 18:04:38 +0000 (18:04 +0000)]
Add basic BIO_DELETE support to GEOM RAID class for all RAID levels.
If at least one subdisk in the volume supports it, BIO_DELETE requests
will be propagated down. Unfortunately, for RAID levels with redundancy,
unmapped blocks will be mapped back during the first rebuild/resync process.
nwhitehorn [Mon, 29 Oct 2012 14:27:28 +0000 (14:27 +0000)]
Work around broken device tree on last-generation PowerPC iMacs
(PowerMac12,1), which have a mac-io MPIC cell that identifies itself
as the root PIC despite the actual root PIC being on the northbridge.
No CPC945 systems have a mac-io PIC that does anything so just don't
attach on CPC945 (U4) systems.
mav [Mon, 29 Oct 2012 14:18:54 +0000 (14:18 +0000)]
Make GEOM RAID more aggressive in marking volumes as clean on shutdown
and move that action from shutdown_pre_sync to shutdown_post_sync stage
to avoid extra flapping.
ZFS tends to not close devices on shutdown, which does not allow GEOM RAID
to shut down gracefully. To handle that, mark the volume as clean as soon as
shutdown time comes and there are no active writes.
andre [Mon, 29 Oct 2012 13:16:33 +0000 (13:16 +0000)]
Forced commit to provide the correct commit message to r242251:
Defer sending an independent window update if a delayed ACK is pending,
saving a packet. The window update then gets piggy-backed on the next
already scheduled ACK.
andre [Mon, 29 Oct 2012 12:31:12 +0000 (12:31 +0000)]
In soreceive_stream() don't drop an already dequeued mbuf chain by
overwriting the return mbuf pointer with newly received data after
a loop. Instead append the new mbuf chain to the existing one.
Fix up sb_lastrecord when dequeuing mbuf's so that sbappend_stream()
doesn't get confused.
For the remainder copy case in the mbuf delivery part deduct the
copied length len instead of the whole mbuf length. Additionally
don't depend on 'n' being available which isn't true in the
case of MSG_PEEK.
Fix the MSG_WAITALL case by comparing against sb_hiwat. Before
it was looping for every receive as sb_lowat normally is zero.
Add comment about issue with (MSG_WAITALL | MSG_PEEK) which isn't
properly handled.
Submitted by: trociny (except for the change in last paragraph)
andre [Mon, 29 Oct 2012 12:14:57 +0000 (12:14 +0000)]
Add logging for socket attach failures in sonewconn() during accept(2).
Include the pointer to the PCB so it can be attributed to a particular
application by matching it against "netstat -A" output.
alc [Mon, 29 Oct 2012 06:15:04 +0000 (06:15 +0000)]
Replace the page hold queue, PQ_HOLD, by a new page flag, PG_UNHOLDFREE,
because the queue itself serves no purpose. When a held page is freed,
inserting the page into the hold queue has the side effect of setting the
page's "queue" field to PQ_HOLD. Later, when the page is unheld, it will
be freed because the "queue" field is PQ_HOLD. In other words, PQ_HOLD is
used as a flag, not a queue. So, this change replaces it with a flag.
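
The deferred-free behaviour described above can be modelled in a few lines. This is an illustrative sketch with a simplified page object and hypothetical function names; the flag's bit value is arbitrary and the model is not the VM system's code.

```python
PG_UNHOLDFREE = 0x1  # illustrative bit value


class Page:
    def __init__(self):
        self.hold_count = 0
        self.flags = 0
        self.freed = False


def page_free(page):
    """Free the page now, or remember that a free is pending if it is held."""
    if page.hold_count > 0:
        page.flags |= PG_UNHOLDFREE
    else:
        page.freed = True


def page_unhold(page):
    """Drop one hold; complete a pending free when the last hold goes away."""
    page.hold_count -= 1
    if page.hold_count == 0 and page.flags & PG_UNHOLDFREE:
        page.flags &= ~PG_UNHOLDFREE
        page.freed = True
```

No queue is involved at any point: a single bit carries the "free me when unheld" state.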
To accommodate the new page flag, make the page's "flags" field wider and
"oflags" field narrower.
attilio [Mon, 29 Oct 2012 01:35:17 +0000 (01:35 +0000)]
Compilers have precise knowledge of the content of sched_pin() and
sched_unpin() as they are static inline functions. This way they
can do two dangerous things:
- Reorder instructions around both of them, moving operations that are
supposed to be on the safe path (i.e. per-cpu accesses) out of it.
- Cache the value of td_pinned in CPU registers, not making it visible
in kernel context to the scheduler once it is scanning the runqueue,
as td_pinned is not marked volatile.
In order to avoid both possible bugs explicitly, protect the safe path
with compiler memory barriers. This will prevent the compiler from
reordering and caching td_pinned operations.
Generally this could lead to suboptimal code traversing the pinnings
but this is not the case as can be easily verified:
http://lists.freebsd.org/pipermail/svn-src-projects/2012-October/005797.html
jmallett [Mon, 29 Oct 2012 00:51:53 +0000 (00:51 +0000)]
Use Simple Executive LED display routines, which correctly use the LED base
address passed from the bootloader, rather than using a hard-coded value.
Make FreeBSD announce itself on the LED display similar to other kernels.
Remove uses of the previous LED routines, which were under-used and only used
in drivers for what seem like debugging purposes, despite those drivers being
widely-tested.
Remove several inlines for accessing memory that duplicate other functions
which are now used instead, as they are now entirely unused.
adrian [Sun, 28 Oct 2012 21:13:12 +0000 (21:13 +0000)]
Begin fleshing out some software queue awareness for TIM handling with
the power save queue.
* introduce some new ATH_NODE lock protected fields, tracking the
net80211 psq and TIM state;
* when doing buffer transitions - ie, when sending and completing
buffers - check the state of the SWQ and update the TIM appropriately.
* when clearing the TIM bit, if the SWQ is not empty then delay clearing
it.
This is racy, but it's no less racy than the current net80211 power
save queue management code. Specifically, with multiple TX threads,
it's quite plausible that parallel state updates will race and the
TIM will be left in an inconsistent state. I'll address that in
a follow-up commit.
andre [Sun, 28 Oct 2012 19:58:20 +0000 (19:58 +0000)]
If the user has closed the socket then drop a persisting connection
after a much reduced timeout.
Typically web servers close their sockets quickly under the assumption
that the TCP connection goes away as well. That is not entirely true
however. If the peer closed the window we're going to wait for a long
time with lots of data in the send buffer.
andre [Sun, 28 Oct 2012 19:47:46 +0000 (19:47 +0000)]
Increase the initial CWND to 10 segments as defined in IETF TCPM
draft-ietf-tcpm-initcwnd-05. It explains why the increased initial
window improves the overall performance of many web services without
risking congestion collapse.
As long as it remains a draft it is placed under a sysctl marking it
as experimental:
net.inet.tcp.experimental.initcwnd10 = 1
When it becomes an official RFC soon the sysctl will be changed to
the RFC number and moved to net.inet.tcp.
This implementation differs from the RFC draft in that it is a bit
more conservative in the case of packet loss on SYN or SYN|ACK because
we haven't reduced the default RTO to 1 second yet. Also the restart
window isn't yet increased as allowed. Both will be adjusted with
upcoming changes.
It is enabled by default. In Linux it has been enabled since kernel 3.0.
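
For reference, the initial window computation from the draft can be sketched as below. The 10-segment formula is the one given in draft-ietf-tcpm-initcwnd; the fallback is the RFC 3390 formula; the one-segment case after a lost SYN or SYN|ACK follows RFC 5681. The function and parameter names are hypothetical and this is not the tcp(4) code.

```python
def initial_cwnd(mss, initcwnd10=True, syn_retransmitted=False):
    """Return the initial congestion window in bytes."""
    if syn_retransmitted:
        # Conservative case: SYN or SYN|ACK was lost, start with one segment.
        return mss
    if initcwnd10:
        # draft-ietf-tcpm-initcwnd: min(10*MSS, max(2*MSS, 14600))
        return min(10 * mss, max(2 * mss, 14600))
    # RFC 3390: min(4*MSS, max(2*MSS, 4380))
    return min(4 * mss, max(2 * mss, 4380))
```

For a typical Ethernet MSS of 1460 bytes this gives 14600 bytes with the sysctl enabled and 4380 bytes without it.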
andre [Sun, 28 Oct 2012 19:16:22 +0000 (19:16 +0000)]
Simplify and enhance the window change/update acceptance logic,
especially in the presence of bi-directional data transfers.
snd_wl1 tracks the right edge, including data in the reassembly
queue, of valid incoming data. This makes it like rcv_nxt plus
reassembly. It never goes backwards to prevent older, possibly
reordered segments from updating the window.
snd_wl2 tracks the left edge of sent data. This makes it a duplicate
of snd_una. However joining them right now is difficult due to
separate update dependencies in different places in the code flow.
snd_wnd tracks the current send window advertised by the peer. In
tcp_output() the effective window is calculated by subtracting the
already in-flight data, snd_nxt less snd_una, from it.
ACK's become the main clock of window updates and will always update
the window when the left edge of what we sent is advanced. The ACK
clock is the primary signaling mechanism in ongoing data transfers.
This works reliably even in the presence of reordering, reassembly
and retransmitted segments. The ACK clock is most important because
it determines how much data we are allowed to inject into the network.
Zero window updates, which get us out of persist mode, are crucial. Here
a segment that neither moves ACK nor SEQ but enlarges WND is accepted.
When the ACK clock is not active (that is we're not or no longer
sending any data) any segment that moves the extended right SEQ edge,
including out-of-order segments, updates the window. This gives us
updates especially during ping-pong transfers where the peer isn't
done consuming the already acknowledged data from the receive buffer
while responding with data.
The SSH protocol is a prime candidate to benefit from the improved
bi-directional window update logic as it has its own windowing
mechanism on top of TCP and is frequently sending back protocol ACK's.
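
The effective window calculation mentioned for tcp_output() (snd_wnd less the in-flight data, snd_nxt minus snd_una) can be sketched with 32-bit sequence arithmetic. The helper names are hypothetical and the clamp to zero is an assumption for the case where the peer shrinks its window; this is a model, not the stack's code.

```python
SEQ_MOD = 1 << 32  # TCP sequence numbers wrap at 2**32


def seq_sub(a, b):
    """Distance from sequence number b forward to a, modulo 2**32."""
    return (a - b) % SEQ_MOD


def effective_send_window(snd_wnd, snd_nxt, snd_una):
    """Bytes we may still inject: advertised window minus in-flight data."""
    in_flight = seq_sub(snd_nxt, snd_una)
    return max(0, snd_wnd - in_flight)
```

The modular subtraction keeps the result correct when snd_nxt has wrapped past zero while snd_una has not.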
andre [Sun, 28 Oct 2012 19:02:07 +0000 (19:02 +0000)]
For retransmits of SYN|ACK from the syncache use the slightly more
aggressive special tcp_syn_backoff[] retransmit schedule instead of
the normal tcp_backoff[] schedule for established connections.
andre [Sun, 28 Oct 2012 18:56:57 +0000 (18:56 +0000)]
When retransmitting SYN in TCPS_SYN_SENT state use TCPTV_RTOBASE,
the default retransmit timeout, as base to calculate the backoff
time until next try instead of the TCP_REXMTVAL() macro which only
works correctly when we already have measured an actual RTT+RTTVAR.
Before it would cause the first retransmit at RTOBASE, the next
four at the same time (!) about 200ms later, and then another one
again RTOBASE later.
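
The intended schedule, a fixed base multiplied by a backoff table entry, can be sketched as follows. The table values used in the usage note are illustrative only and are not the contents of the kernel's backoff tables; the function name is hypothetical.

```python
def syn_rexmt_timeouts(rto_base, backoff, tries):
    """Return the timeout before each retransmit attempt.

    rto_base -- fixed base timeout, independent of any measured RTT
    backoff  -- table of multipliers; the last entry is reused once exhausted
    tries    -- number of retransmit attempts to compute
    """
    last = len(backoff) - 1
    return [rto_base * backoff[min(i, last)] for i in range(tries)]
```

With a base of 3 and an illustrative doubling table `[1, 2, 4, 8]`, five attempts are spaced 3, 6, 12, 24 and 24 time units apart, rather than clustering as they did when the base came from an unmeasured RTT estimate.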
trasz [Sun, 28 Oct 2012 18:53:28 +0000 (18:53 +0000)]
Fix two problems that caused instant panic when the device mounted
with softupdates went away. Note that this does not fix the problem
entirely; I'm committing it now to make it easier for someone to pick
up the work.
adrian [Sun, 28 Oct 2012 18:46:06 +0000 (18:46 +0000)]
Add a temporary (for values of "temporary") work around for hotplug
support with ath(4) and VIMAGE.
Right now the VIMAGE code doesn't supply a default vnet context during:
* hotplug attach;
* any device detach.
It special cases kldload/boot time probing (by setting the context to
vnet0) but that doesn't occur when probing devices during a bus rescan -
eg, adding a cardbus card.
These will eventually go away when the VIMAGE support extends to providing
default contexts to hotplug attach/detach.
andre [Sun, 28 Oct 2012 18:38:51 +0000 (18:38 +0000)]
Improve m_cat() by being able to also merge contents from M_EXT
mbuf's by doing proper testing with M_WRITABLE().
In m_collapse() replace an incomplete manual check for M_RDONLY
with the M_WRITABLE() macro that also tests for shared buffers
and other cases that make a particular mbuf immutable.
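
The difference between checking only M_RDONLY and a full writability test can be sketched as below. The flag bit values are illustrative, the function name is hypothetical, and only the read-only and shared-external-buffer cases named above are modelled.

```python
# Illustrative bit values, not the kernel's definitions.
M_EXT = 0x1
M_RDONLY = 0x2


def mbuf_writable(flags, ext_refcount=1):
    """True if the mbuf's data may be modified in place."""
    if flags & M_RDONLY:
        return False
    # External storage is only writable while nobody else references it.
    if flags & M_EXT and ext_refcount != 1:
        return False
    return True
```

A check for M_RDONLY alone would wrongly treat an M_EXT mbuf with a shared buffer as safe to modify.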
andre [Sun, 28 Oct 2012 18:33:52 +0000 (18:33 +0000)]
Allow arbitrary MSS sizes and don't mind about the cluster size anymore.
We've got more cluster sizes for quite some time now and the originally
imposed limits and the previously codified thoughts on efficiency gains
are no longer true.
andre [Sun, 28 Oct 2012 18:07:34 +0000 (18:07 +0000)]
Change the syncache count reporting the current number of entries
from an unprotected u_int that reports garbage on SMP to a function
based sysctl obtaining the current value from UMA.
Also read back the actual cache_limit after page size rounding by UMA.
andre [Sun, 28 Oct 2012 17:40:35 +0000 (17:40 +0000)]
Prevent a flurry of forced window updates when an application is
doing small reads on a (partially) filled receive socket buffer.
Normally one would send a window update every time the available
space in the socket buffer increases by two times MSS. This leads
to a flurry of window updates that do not provide any meaningful
new information to the sender. There still is available space in
the window and the sender can continue sending data. All window
updates then get carried by the regular ACKs. Only when the socket
buffer was (almost) full and the window closed accordingly does a window
update deliver new information and allow the sender to start
sending more data again.
Send window updates only every two MSS when the socket buffer
has less than 1/8 space available, or the available space in the
socket buffer increased by 1/4 its full capacity, or the socket
buffer is very small. The next regular data ACK will carry and
report the exact window size again.
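
The new condition can be written as a single predicate. This is a sketch with hypothetical names; in particular, the commit does not say what "very small" means, so the threshold of eight segments below is an assumption.

```python
def should_send_window_update(adv, recwin, hiwat, mss):
    """Decide whether to force a standalone window update.

    adv    -- how far the window could be opened beyond what was advertised
    recwin -- space currently available in the receive socket buffer
    hiwat  -- full capacity of the receive socket buffer
    mss    -- maximum segment size
    """
    if adv < 2 * mss:
        return False
    return (recwin <= hiwat // 8       # buffer was (almost) full
            or adv >= hiwat // 4       # space grew by a quarter of capacity
            or hiwat <= 8 * mss)       # assumed meaning of "very small"
```

With a 64 KB buffer and a 1460-byte MSS, an application draining 3000 bytes at a time from a half-empty buffer no longer triggers an update; the same read does trigger one when less than 8 KB was free.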
andre [Sun, 28 Oct 2012 17:30:28 +0000 (17:30 +0000)]
When SYN or SYN/ACK had to be retransmitted RFC5681 requires us to
reduce the initial CWND to one segment. This reduction got lost
some time ago due to a change in initialization ordering.
Additionally in tcp_timer_rexmt() avoid entering fast recovery when
we're still in TCPS_SYN_SENT state.
andre [Sun, 28 Oct 2012 17:25:08 +0000 (17:25 +0000)]
When SYN or SYN/ACK had to be retransmitted RFC5681 requires us to
reduce the initial CWND to one segment. This reduction got lost
some time ago due to a change in initialization ordering.
Additionally in tcp_timer_rexmt() avoid entering fast recovery when
we're still in TCPS_SYN_SENT state.
hselasky [Sun, 28 Oct 2012 14:37:17 +0000 (14:37 +0000)]
Implement support for the so-called USB feedback endpoint for USB
audio devices. This endpoint gives clues to the USB host about the
actual data rate on asynchronous endpoints and makes the more
expensive USB audio devices usable under FreeBSD.
The Linux USB audio driver was used as reference for the
automagic shift of the received value.