andre [Thu, 7 Sep 2006 12:53:01 +0000 (12:53 +0000)]
Second step of TSO (TCP segmentation offload) support in our network stack.
TSO is only used if we are in a pure bulk sending state. The presence of
TCP-MD5, SACK retransmits, SACK advertisements, IPSEC and IP options prevents
using TSO. With TSO the TCP header is the same (except for the sequence number)
for all generated packets. This makes it impossible to transmit any options
which vary per generated segment or packet.
The length of TSO bursts is limited to TCP_MAXWIN.
The sysctl net.inet.tcp.tso globally controls the use of TSO and is enabled
by default.
TSO enabled sends originating from tcp_output() have the CSUM_TCP and CSUM_TSO
flags set, m_pkthdr.csum_data filled with the header pseudo-checksum and
m_pkthdr.tso_segsz set to the segment size (net payload size, not counting
IP+TCP headers or TCP options).
IPv6 currently lacks a pseudo-header checksum function and thus doesn't support
TSO yet.
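
The eligibility rules above can be sketched in user-space C. This is only an
illustration, not the code in tcp_output(): struct send_state, its fields,
tso_eligible() and tso_burst_len() are hypothetical names, SKETCH_TCP_MAXWIN
is a local stand-in for TCP_MAXWIN, and the "more than one segment" test is
an assumed reading of "pure bulk sending state".

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define SKETCH_TCP_MAXWIN 65535 /* local stand-in for TCP_MAXWIN */

/* Hypothetical summary of one send attempt; not a kernel structure. */
struct send_state {
    bool sysctl_tso;   /* net.inet.tcp.tso is on */
    bool ifcap_tso;    /* outgoing interface announced TSO */
    bool tcp_md5;      /* TCP-MD5 signature in use */
    bool sack_rexmit;  /* this send is a SACK retransmit */
    bool sack_advert;  /* SACK blocks must be advertised */
    bool ipsec;        /* IPSEC applies to this connection */
    bool ip_options;   /* IP options are present */
    size_t len;        /* payload bytes ready to send */
    size_t segsz;      /* payload bytes per segment */
};

/* True if nothing forces a header that differs per segment. */
static bool
tso_eligible(const struct send_state *s)
{
    assert(s != NULL);
    if (!s->sysctl_tso || !s->ifcap_tso)
        return false;
    if (s->tcp_md5 || s->sack_rexmit || s->sack_advert ||
        s->ipsec || s->ip_options)
        return false;
    /* Assumption: offload only pays off for more than one segment. */
    return s->len > s->segsz;
}

/* Clamp a burst to the TCP_MAXWIN limit named in the commit message. */
static size_t
tso_burst_len(const struct send_state *s)
{
    assert(s != NULL);
    return s->len > SKETCH_TCP_MAXWIN ? SKETCH_TCP_MAXWIN : s->len;
}
```

With these assumptions, a 100000 byte send on a TSO-capable interface is
eligible and clamped to 65535 bytes. Setting any one of the blocking flags
makes tso_eligible() return false.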
Make "make TARGET=foo" work correctly. Before, it would fail to set
TARGET_ARCH correctly. Now it does, even for pc98. We should suggest
TARGET=foo in preference to TARGET_ARCH because the former is
unambiguous and the latter isn't, so update the docs.
This means that a long standing gripe I've had with this comes to a
close. I can build pc98 without specifying both things. make TARGET=arm
works (rather than trying to build an arm:amd64 image and dying badly
in the attempt).
If you specify only TARGET_ARCH, then you get the old behavior.
# we can likely simplify the UNIVERSE target now to use this, but I'm not
# up for breaking that tonight :-).
# We should consider adding some kind of sanity check for TARGET_ARCH
# and TARGET.
# Note to the md5 tracking club: $FreeBSD$ changes md5 after every commit
# so you need to checkout -kk to get $FreeBSD$ instead of the actual value
# of the keyword.
andre [Wed, 6 Sep 2006 22:07:14 +0000 (22:07 +0000)]
Make TSO (TCP segmentation offload) capabilities visible and accessible with
'ifconfig em0 tso' and 'ifconfig em0 -tso'. TSO for IPv4 and IPv6 is always
enabled or disabled together. The driver may enable only one if it doesn't
support both.
Document 'tso' and '-tso' in the ifconfig(8) man pages.
Use sysctl_handle_long() instead of duplicating its logic for
kern.ipc.maxsockbuf so that this sysctl works for 32-bit binaries running
on amd64 via compat/freebsd32.
andre [Wed, 6 Sep 2006 21:51:59 +0000 (21:51 +0000)]
First step of TSO (TCP segmentation offload) support in our network stack.
o add IFCAP_TSO[46] for drivers to announce this capability for IPv4 and IPv6
o add CSUM_TSO flag to mbuf pkthdr csum_flags field
o add tso_segsz field to mbuf pkthdr
o enhance ip_output() packet length check to allow for large TSO packets
o extend tcp_maxmtu[46]() with a flag pointer to pass interface capabilities
o adjust all callers of tcp_maxmtu[46]() accordingly
Discussed on: -current, -net
Sponsored by: TCP/IP Optimization Fundraise 2005
While convenient, avoid using alloca() for reasons specified in
the BUGS section of the alloca(3) manpage. In particular, when
the number of TCP sockets is several tens of thousands, trying to
"sysctl -a" would SIGSEGV on the net.inet.tcp.pcblist entry (it
would exceed the stacksize ulimit, in an undetectable manner).
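
The general pattern behind this change can be sketched as follows. This is
not the code from the commit: dup_buffer() is a hypothetical helper that
shows a heap allocation whose failure can be checked, where an oversized
alloca() would overrun the stack with no error to test for.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical helper: return a heap copy of len bytes from src,
 * or NULL if the allocation fails. The caller must free() the result.
 */
static char *
dup_buffer(const char *src, size_t len)
{
    char *buf;

    assert(src != NULL);
    /* Ask for at least one byte so a zero length is well defined. */
    buf = malloc(len != 0 ? len : 1);
    if (buf == NULL)
        return NULL; /* detectable, unlike a stack overrun from alloca() */
    memcpy(buf, src, len);
    return buf;
}
```

The caller checks the result for NULL and reports an error rather than
crashing. The cost is an explicit free() on every exit path.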
Remove call to fdfree() for the AIO daemons to prevent kernel panics
with linprocfs. This call is not needed since file descriptor sharing
was removed in v1.125.
Reviewed by: alc, davidxu, ambrisko
MFC after: 3 days
Refine previous revision to allow acpi_wakecode.h to be safely built
from both the acpi module build directory and a kernel build directory.
The latter didn't work when one attempted to build a kernel which had
"device acpi" with the "make kernel-toolchain buildkernel" command
because a cross-compiler couldn't find anything in the standard system
include path (it's empty in the kernel-toolchain case).
Fix this by passing a better root path to kernel headers (src/sys)
which works for both cases, kernel and module (-I@ only worked for
module).
Also, while here, pass -nostdinc (and a different spelling for icc) --
it's a feature that the kernel source tree is self-contained, and this
change enforces this.
o Back out rev. 1.125 of in_pcb.c. It appeared to behave extremely
badly under high load. For example, with 40k sockets and 25k tcptw
entries, a connect() syscall can run for seconds. Debugging showed
that it iterates the cycle millions of times and purges thousands of
tcptw entries at a time.
Besides practical unusability, this change is architecturally
wrong. First, in_pcblookup_local() is used in connect() and bind()
syscalls. No purging of stale entries should be done here. Second,
it is a layering violation.
o Restore the tcptw purging cycle in tcp_timer_2msl_tw(),
that was removed in rev. 1.78 by rwatson. The commit log of this
revision says nothing about the reason the cycle was removed. Now
we need this cycle, since the major cleaner of stale tcptw structures
is removed.
o Disable probably necessary, but now unused
tcp_twrecycleable() function.
The poison pill of death: adding a target mode reply handler and target
resources to a non-FC card killed us dead. Sorry for the breakage since
last July 12.
Fix problems with destroy and forcible destroy functionality:
- hold/release device in start/done routines, this will probably slow
down things a bit, but previous code was racy;
- only release device if g_gate_destroy() failed - if it succeeded device
is dead and there is nothing to release;
- various other changes which make forcible destruction reliable.
- Move descriptions of BOOT_COMCONSOLE_PORT, BOOT_COMCONSOLE_SPEED,
and LOADER_TFTP_SUPPORT options into the world section since boot
blocks are built as part of the world.
- Document BOOT_PXELDR_ALWAYS_SERIAL and BOOT_PXELDR_PROBE_KEYBOARD
options of pxeboot(8).
- Make the PROBE_KEYBOARD option better resemble the -P option in
boot2, i.e., if keyboard isn't present then boot with both
RB_SERIAL and RB_MULTIPLE set.
FreeBSD by default "disables" hyper-threading cores by not scheduling
any threads to them. However, it still counts those cores as "active but
permanently idle" when calculating system-wide CPU statistics. This is
incorrect, since it skews statistics quite a bit and creates real problems
for certain types of applications (monitoring applications for example),
by making them believe that the system does have enough idle CPU resources,
while in fact it does not.
Correct the problem by not calling performance counting routines on "disabled"
cores. The cleaner solution would be to just disable APIC timer interrupts on
those cores completely, but ENOTIME here and it is not clear if the
additional complexity is really worth the minor performance gain.
GC dead code. If we want to stay polite to the foreign compilers,
we can find another way to issue an #error, but using a preprocessed
assembler for that purpose and clobbering libc.a with an empty .o
just for the sake of #error reporting is way too much of a burden.
- Make net.inet.tcp.maxtcptw modifiable at run time.
- If net.inet.tcp.maxtcptw was ever set explicitly, do
not change it if kern.ipc.maxsockets is changed.
Some minor corrections:
* Expose functions for setting the "skip file" dev/ino information
* Expose functions for setting/querying the block size on reads
* Correctly propagate errors out of archive_read_close/archive_write_close
* Update manpage with information about new functions
sam [Mon, 4 Sep 2006 20:12:45 +0000 (20:12 +0000)]
sigh, put back buffer overflow fix of 1.1.11 that seems to have
not gone into the 0.9.4 release; don't put it on the vendor branch
so we won't lose it on the next import if they continue to lose it
marius [Mon, 4 Sep 2006 16:45:08 +0000 (16:45 +0000)]
- Talk about chips rather than chip sets as AMD LANCE and PCnet are
single-chip.
- Add some more rationale about le(4).
- Add/un-comment hardware notes for C-Bus and ISA adapters.
If building the module as part of the kernel build, determine
the "device isa" presence out of the opt_isa.h in the kernel
build directory, rather than always assuming its presence.
sparc64 is still special cased and is not affected by this
change.
Avoid an infinite loop in empty_both_buffers() by adding a timeout.
This helps systems that don't actually have atkbd controllers, such as the Intel
SBX82 blade, boot without device.hints hacks.
Hardware for this fix provided by iXsystems.
PR: 94822
Submitted by: Devon H. O'Dell <devon.odell@coyotepoint.com>
MFC After: 3 days
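
The idea of bounding a drain loop can be sketched as below. This is not the
atkbd code: drain_with_timeout() and the data_pending_fn callback are
hypothetical, and the "timeout" is modelled as a simple retry budget rather
than a real delay.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical probe: returns true while the device still reports data. */
typedef bool (*data_pending_fn)(void *ctx);

/*
 * Poll until the device reports no pending data or the retry budget
 * runs out. Returns true if the buffers drained, false on timeout.
 */
static bool
drain_with_timeout(data_pending_fn pending, void *ctx, int max_tries)
{
    assert(pending != NULL);
    while (max_tries-- > 0) {
        if (!pending(ctx))
            return true;
    }
    return false; /* gave up: the controller is probably absent */
}

/* Simulates missing hardware that always claims data is pending. */
static bool
always_pending(void *ctx)
{
    (void)ctx;
    return true;
}

/* Simulates a real controller with *ctx bytes left to drain. */
static bool
pending_n_times(void *ctx)
{
    int *left = ctx;

    if (*left > 0) {
        (*left)--;
        return true;
    }
    return false;
}
```

With always_pending() the loop returns false once the budget is used up
and no longer spins forever. With pending_n_times() it returns true as
soon as the simulated buffer is empty.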
marius [Sun, 3 Sep 2006 21:20:21 +0000 (21:20 +0000)]
Do as the USII CPU manual suggests and leave interrupts enabled
for a bit before retrying to resend an IPI in order to avoid
deadlocks if the other CPU is also trying to send one.
OpenSolaris uses a delay of 1 microsecond here but waiting 2
microseconds with interrupts enabled like Linux does shouldn't
hurt and is a bit safer.
up the default msgbuf limit to 64k.. a verbose boot on i386 on modern
hardware returns almost 48k of data... to change the default per platform,
it should be done in DEFAULTS, not here...
add a newbus method for obtaining the bus's bus_dma_tag_t... This is
required by arches like sparc64 (not yet implemented) and sun4v where there
are separate IOMMUs for each PCI bus... For all other arches, it will
end up returning NULL, which makes it a no-op...
Convert a few drivers (the ones we've been working w/ on sun4v) to the
new convention... Eventually all drivers will need to replace the parent
tag of NULL, w/ bus_get_dma_tag(dev), though dev is usually different for
each driver, and will require hand inspection...
Break out typedefs from bus_dma.h to _bus_dma.h so that we can get the
typedef for bus_dma_tag_t in sys/bus.h w/o polluting the namespace...
This is in preparation for adding bus_get_dma_tag to sys/bus.h...