yongari [Thu, 24 Dec 2009 17:22:15 +0000 (17:22 +0000)]
Implement RX interrupt moderation using one-shot timer interrupt.
Unlike TX interrupt, ST201 does not provide any mechanism to
suppress RX interrupts. ste(4) can generate more than 70k RX
interrupts under heavy RX traffics such that these excessive
interrupts make system useless to process other useful things.
Maybe this was the major reason why polling support code was
introduced to ste(4).
The STE_COUNTDOWN register provides a programmable counter that
will generate an interrupt upon its expiration. We program
STE_DMACTL register to use 3.2us clock rate to drive the counter
register. Whenever ste(4) serves RX interrupt, the driver rearm
the timer to expire after STE_IM_RX_TIMER_DEFAULT time and disables
further generation of RX interrupts. This trick seems to work well
and ste(4) generates less than 8k RX interrupts even under 64 bytes
UDP torture test. Combined with TX interrupts, the total number of
interrupts are less than 10k which looks reasonable on heavily
loaded controller.
The default RX interrupt moderation time is 150us. Users can change
the value at any time with dev.ste.%d.int_rx_mod sysctl node.
Setting it 0 effectively disables the RX interrupt moderation
feature. Now we have both TX/RX interrupt moderation code so remove
loop of interrupt handler which resulted in sub-optimal performance
as well as more register accesses.
marius [Thu, 24 Dec 2009 15:14:35 +0000 (15:14 +0000)]
Revert r183628 as with the current ata(4) ATAPI DMA with AcerLabs
M5229 appears to be once again fixed. If this happens to return
we probably should disable ATAPI DMA in ataacerlabs(4) instead
just like the Linux libATA does.
jilles [Thu, 24 Dec 2009 15:14:22 +0000 (15:14 +0000)]
sh: Remove setting variables from dotcmd/exportcmd.
It is already done by evalcommand(), unless special-ness has been removed,
in which case variable assignments should not persist. (These are currently
always special builtins, but this will change later: command builtin,
command substitution.)
This also fixes a memory leak when calling . with variable assignments.
Example:
valgrind --leak-check=full sh -c 'x=1 . /dev/null; x=2'
mav [Thu, 24 Dec 2009 13:38:02 +0000 (13:38 +0000)]
As soon as geom_raid3 reports it's own stripe as sector size, report largest
underlying provider's stripe, multiplied by number of data disks in array,
due to transformation done, as array stripe.
marius [Thu, 24 Dec 2009 12:27:22 +0000 (12:27 +0000)]
- Don't check for a valid interrupt controller on every interrupt
in intr_execute_handlers(). If we managed to get here without an
associated interrupt controller we have way bigger problems.
While at it predict stray vector interrupts as false as they are
rather unlikely.
- Don't blindly call the clear function of an interrupt controller
when adding a handler in inthand_add() as interrupt controllers
like the one driven by upa(4) are auto-clearing and thus provide
NULL instead.
delphij [Thu, 24 Dec 2009 00:43:44 +0000 (00:43 +0000)]
Adapt OpenBSD pf's "sloopy" TCP state machine which is useful for Direct
Server Return mode, where not all packets would be visible to the load
balancer or gateway.
This commit should be reverted when we merge future pf versions. The
benefit it would provide is that this version does not break any existing
public interface and thus won't be a problem if we want to MFC it to
earlier FreeBSD releases.
rpaulo [Wed, 23 Dec 2009 23:16:54 +0000 (23:16 +0000)]
Intel XScale hwpmc(4) support.
This brings hwpmc(4) support for 2nd and 3rd generation XScale cores.
Right now it's enabled by default to make sure we test this a bit.
When the time comes it can be disabled by default.
Tested on Gateworks boards.
marius [Wed, 23 Dec 2009 22:25:23 +0000 (22:25 +0000)]
- By re-arranging the code in OF_decode_addr() somewhat and accepting
a bit of a detour we can just iterate through the banks array instead
of having to calculate every offset. This change is inspired by the
powerpc version of this function.
- Add support for the JBus to EBus bridges which hang off of nexus(4).
marius [Wed, 23 Dec 2009 22:02:34 +0000 (22:02 +0000)]
- Add support for the IOMMUs of Fire JBus to PCIe and Oberon Uranus
to PCIe bridges.
- Add support for talking the PROM mappings over to the kernel IOTSB
just like we do with the kernel TSB in order to allow OFW drivers
to continue to work.
- Change some members, parameters and variables to unsigned where
more appropriate.
marius [Wed, 23 Dec 2009 21:38:59 +0000 (21:38 +0000)]
- Add quirk handling for ALi M5229, mainly setting the magic "force
enable IDE I/O" bit which prevents data access traps with revision
0xc8 in Fire-based machines when pci(4) enables PCIM_CMD_PORTEN.
- Like for sun4v also don't add the PCI side of host-PCIe bridges to
the bus on sun4u as they don't have configuration space implement
there either.
jhb [Wed, 23 Dec 2009 21:11:03 +0000 (21:11 +0000)]
Fix a bug in gzipfs that prevented lseek() from working and add lseek()
support to bzip2fs. This fixes problems with loading compressed amd64
kernel modules containing debug symbols.
Submitted by: David Naylor naylor.b.david (gmail)
MFC after: 1 week
marius [Wed, 23 Dec 2009 21:04:31 +0000 (21:04 +0000)]
Hook ebus(4) and isa(4) up to the sun4v LINT build in order to
ensure that their compilation doesn't break as they are expected
to work as-is now (but aren't actually run-time tested).
marius [Wed, 23 Dec 2009 20:42:14 +0000 (20:42 +0000)]
Don't probe the bq4802 variant found in Ultra 25 and 45 for now as
this chip isn't MC146818 compatible and requires different handlers
(but which I can't test due to lack of such hardware).
marius [Wed, 23 Dec 2009 20:23:04 +0000 (20:23 +0000)]
Don't use an out register to hold the vector number across the call
of the interrupt handler in intr_fast() as the handler might clobber
it (no in-tree handler currently does but an upcoming one will).
While at it, tidy the register usage in the interrupt counting code.
yongari [Wed, 23 Dec 2009 19:38:22 +0000 (19:38 +0000)]
We don't need to generate DMA complete interrupt for every
transmitted frames. So request interrupt for every 16th frames. Due
to the limitation of hardware we can't suppress the interrupt as
driver should have to check TX status register. The TX status
register can store up to 31 TX status so driver can't send more
than 31 frames without reading TX status register.
With this change controller would not generate TX completion
interrupt for every frame, so reclaim transmitted frames in
ste_tick().
yongari [Wed, 23 Dec 2009 19:21:37 +0000 (19:21 +0000)]
Remove unused duplicated register definition. It seems the
definition was made to access STE_ASICCTL register as 16bits but
ste(4) always access the register as 32bits so it was never used
before.
luigi [Wed, 23 Dec 2009 18:53:11 +0000 (18:53 +0000)]
mostly style changes, such as removal of trailing whitespace,
reformatting to avoid unnecessary line breaks, small block
restructuring to avoid unnecessary nesting, replace macros
with function calls, etc.
As a side effect of code restructuring, this commit fixes one bug:
previously, if a realloc() failed, memory was leaked. Now, the
realloc is not there anymore, as we first count how much memory
we need and then do a single malloc.
yongari [Wed, 23 Dec 2009 18:42:25 +0000 (18:42 +0000)]
Report the correct result of mii_mediachg(). Previously it always
used to return success without respect to the result.
While I'm here use mii_mediachg() in ste_init_locked which allows
driver to use currently configured media. ste_ifmedia_upd() is
supposed to be called whenever user changes current media settings.
yongari [Wed, 23 Dec 2009 18:24:22 +0000 (18:24 +0000)]
Overhaul RX filter programming.
o Let RX filter handler program promiscuous/multicast filter as
well as broadcasting.
o Remove unnecessary register access.
o Simplify ioctl handler and have set_rxfilter to handle
IFF_PROMISC and IFF_ALLMULTI change instead of directly
programming the controller.
o Removed unnecessary error variable reinitialization in ioctl
handler.
o Add IFF_DRV_RUNNING check before programming multicast filter.
o Configure maximum allowed frame length before enabling MAC.
Datasheet didn't say the exact ordering of programming sequence
but it looks more natural to set maximum allowed frame length
first prior to enabling controller.
yongari [Wed, 23 Dec 2009 17:54:24 +0000 (17:54 +0000)]
Reimplement controller reset. Datasheet says full reset takes about
1ms. Since we switched to memory register mapping make sure to
flush PCI posted write by reading the register again.
While I'm here add additional delays in loop while driver waits the
completion of the reset.
rwatson [Wed, 23 Dec 2009 12:33:59 +0000 (12:33 +0000)]
When warning about possible netisr configuration problems during boot,
report using "netisr_init" rather than "netisr2", which was the development
name for the project.
marcel [Wed, 23 Dec 2009 06:52:12 +0000 (06:52 +0000)]
Calculate the average CPU clock frequency and export that through
the hw.freq.cpu sysctl variable. This can be used by ports that
need to know "the" CPU frequency.
marcel [Wed, 23 Dec 2009 04:48:42 +0000 (04:48 +0000)]
Export the bus, cpu and itc frequencies under the hw.freq sysctl node.
The frequencies are in MHz (i.e. a value of 1000 represents 1GHz). The
frequencies are rounded to the nearest whole MHz.
While here, rename and re-type bus_frequency, processor_frequency and
itc_frequency to bus_freq, cpu_freq and itc_freq and make them static.
As unsigned integers, the hw.freq.cpu sysctl can more easily be made
generic (across all architectures) making porting easier.
yongari [Tue, 22 Dec 2009 23:57:10 +0000 (23:57 +0000)]
Reimplement Tx status error handler as recommended by datasheet.
If ste(4) encounter TX underrun or excessive collisions the TX MAC
of controller is stalled so driver should wake it up again. TX
underrun requires increasing TX threshold value to minimize
further TX underruns. Previously ste(4) used to reset controller
to recover from TX underrun, excessive collision and reclaiming
error. However datasheet says only TX underrun requires resetting
entire controller. So implement ste_restart_tx() that restarts TX
MAC and do not perform full reset except TX underrun case.
Now ste(4) uses CSR_READ_2 instead of CSR_READ_1 to read
STE_TX_STATUS register. This way ste(4) will also read frame id
value and we can write the same value back to STE_TX_FRAMEID
register instead of overwriting it to 0. The datasheet was wrong
in write back of STE_TX_STATUS so add some comments why we do so.
Also always invoke ste_txeoc() after ste_txeof() in ste_poll as
without reading TX status register can stall TX MAC.
marius [Tue, 22 Dec 2009 21:49:53 +0000 (21:49 +0000)]
- Add support for the JBus to EBus bridges which hang off of nexus(4)
and are found in sun4u and sun4v machines based on the Fire ASIC.
- Initialize the configuration space of the PCI to EBus variant the
same way as OpenSolaris does.
marius [Tue, 22 Dec 2009 21:48:18 +0000 (21:48 +0000)]
- Add macros for the states of the interrupt clear registers.
- Change INTMAP_VEC() to take an INO as its second argument rather
than an INR. The former is what I actually intended with this
macro and how it's currently used.
yongari [Tue, 22 Dec 2009 21:39:34 +0000 (21:39 +0000)]
Prefer memory space register mapping over io space. If memory space
mapping fails fall back to old io space mapping.
While I'm here use PCIR_BAR macro.
marius [Tue, 22 Dec 2009 21:02:46 +0000 (21:02 +0000)]
Enroll these drivers in multipass probing. The motivation behind this
is that the JBus to EBus bridges share the interrupt controller of a
sibling JBus to PCIe bridge (at least as far as the OFW device tree
is concerned, in reality they are part of the same chip) so we have to
probe and attach the latter first. That happens to be also the case
due to the fact that the JBus to PCIe bridges appear first in the OFW
device tree but it doesn't hurt to ensure the right order.
yongari [Tue, 22 Dec 2009 20:57:30 +0000 (20:57 +0000)]
Instead of relying on hard resetting of controller to stop
receiving incoming traffics, try harder to gracefully stop active
DMA cycles and then stop MACs. This is the way what datasheet
recommends and seems to work reliably. Resetting controller while
active DMAs are in progress is bad thing as we can't predict how
DMAs touche allocated TX/RX buffers. This change ensures controller
stop state before attempting to release allocated TX/RX buffers.
Also update MAC statistics which could have been updated during the
wait time of MAC stop.
While I'm here remove unnecessary controller resets in various
location. ste(4) no longer relies on hard controller reset to stop
controller and resetting controller also clears all configured
settings which makes it hard to implement WOL in near future.
Now resetting a controller is performed in ste_init_locked().
bms [Tue, 22 Dec 2009 20:40:22 +0000 (20:40 +0000)]
Use ALLOW_NEW_SOURCES and BLOCK_OLD_SOURCES to signal a join or leave
with SSM MLDv2 by default.
This is current practice and complies with RFC 4604, as well as being
required by production IPv6 networks in Japan.
The behaviour may be disabled by setting the net.inet6.mld.use_allow
sysctl/tunable to 0.
yongari [Tue, 22 Dec 2009 20:11:56 +0000 (20:11 +0000)]
Reimplement miibus_statchg method. Don't rely on link state change
interrupt. If we want to use link state change interrupt ste(4)
should also implement auto-negotiation complete handler as well as
various PHY access handling. Now link state change is handled by
mii(4) polling so it will automatically update link state UP/DOWN
events which in turn make ste(4) usable with lagg(4).
r199559 added a private timer to drive watchdog and the timer also
used to drive MAC statistics update. Because the MAC statistics
update is called whenever statistics counter reaches near-full, it
drove watchdog timer too fast such that it caused false watchdog
timeouts under heavy TX traffic conditions.
Fix the regression by separating ste_stats_update() from driving
watchdog timer and introduce a new function ste_tick() that handles
periodic job such as driving watchdog, MAC statistics update and
link state check etc.
While I'm here clear armed watchdog timer in ste_stop().
yongari [Tue, 22 Dec 2009 19:32:16 +0000 (19:32 +0000)]
Introduce sc_flags member variable and use it to keep track of
link state and PHY related information.
Remove ste_link and ste_one_phy variable of softc as it's not used
anymore.
While I'm here add IFF_DRV_RUNNING check in ste_start_locked().
luigi [Tue, 22 Dec 2009 19:01:47 +0000 (19:01 +0000)]
merge code from ipfw3-head to reduce contention on the ipfw lock
and remove all O(N) sequences from kernel critical sections in ipfw.
In detail:
1. introduce a IPFW_UH_LOCK to arbitrate requests from
the upper half of the kernel. Some things, such as 'ipfw show',
can be done holding this lock in read mode, whereas insert and
delete require IPFW_UH_WLOCK.
2. introduce a mapping structure to keep rules together. This replaces
the 'next' chain currently used in ipfw rules. At the moment
the map is a simple array (sorted by rule number and then rule_id),
so we can find a rule quickly instead of having to scan the list.
This reduces many expensive lookups from O(N) to O(log N).
3. when an expensive operation (such as insert or delete) is done
by userland, we grab IPFW_UH_WLOCK, create a new copy of the map
without blocking the bottom half of the kernel, then acquire
IPFW_WLOCK and quickly update pointers to the map and related info.
After dropping IPFW_LOCK we can then continue the cleanup protected
by IPFW_UH_LOCK. So userland still costs O(N) but the kernel side
is only blocked for O(1).
4. do not pass pointers to rules through dummynet, netgraph, divert etc,
but rather pass a <slot, chain_id, rulenum, rule_id> tuple.
We validate the slot index (in the array of #2) with chain_id,
and if successful do a O(1) dereference; otherwise, we can find
the rule in O(log N) through <rulenum, rule_id>
All the above does not change the userland/kernel ABI, though there
are some disgusting casts between pointers and uint32_t
Operation costs now are as follows:
Function Old Now Planned
-------------------------------------------------------------------
+ skipto X, non cached O(N) O(log N)
+ skipto X, cached O(1) O(1)
XXX dynamic rule lookup O(1) O(log N) O(1)
+ skipto tablearg O(N) O(1)
+ reinject, non cached O(N) O(log N)
+ reinject, cached O(1) O(1)
+ kernel blocked during setsockopt() O(N) O(1)
-------------------------------------------------------------------
The only (very small) regression is on dynamic rule lookup and this will
be fixed in a day or two, without changing the userland/kernel ABI
yongari [Tue, 22 Dec 2009 18:57:07 +0000 (18:57 +0000)]
Add bus_dma(9) and endianness support to ste(4).
o Sorted includes and added missing header files.
o Added basic endianness support. In theory ste(4) should work on
any architectures.
o Remove the use of contigmalloc(9), contigfree(9) and vtophys(9).
o Added 8 byte alignment limitation of TX/RX descriptor.
o Added 1 byte alignment requirement for TX/RX buffers.
o ste(4) controllers does not support DAC. Limit DMA address space
to be within 32bit address.
o Added spare DMA map to gracefully recover from DMA map failure.
o Removed dead code for checking STE_RXSTAT_DMADONE bit. The bit
was already checked in each iteration of loop so it can't be true.
o Added second argument count to ste_rxeof(). It is used to limit
number of iterations done in RX handler. ATM polling is the only
consumer.
o Removed ste_rxeoc() which was added to address RX stuck issue
(cvs rev 1.66). Unlike TX descriptors, ST201 supports chaining
descriptors to form a ring for RX descriptors. If RX descriptor
chaining is not supported it's possible for controller to stop
receiving incoming frames once controller pass the end of RX
descriptor which in turn requires driver post new RX
descriptors to receive more frames. For TX descriptors which
does not support chaning, we exactly do manual chaining in
driver by concatenating new descriptors to the end of previous
TX chain.
Maybe the workaround was borrowed from other drivers that does
not support RX descriptor chaining, which is not valid for ST201
controllers. I still have no idea how this address RX stuck
issue and I can't reproduce the RX stuck issue on DFE-550TX
controller.
o Removed hw.ste_rxsyncs sysctl as the workaround was removed.
o TX/RX side bus_dmamap_load_mbuf_sg(9) support.
o Reimplemented optimized ste_encap().
o Simplified TX logic of ste_start_locked().
o Added comments for TFD/RFD requirements.
o Increased number of RX descriptors to 128 from 64. 128 gave much
better performance than 64 under high network loads.
jhb [Tue, 22 Dec 2009 15:47:40 +0000 (15:47 +0000)]
- Rename the __tcpi_(snd|rcv)_mss fields of the tcp_info structure to remove
the leading underscores since they are now implemented.
- Implement the tcpi_rto and tcpi_last_data_recv fields in the tcp_info
structure.
luigi [Tue, 22 Dec 2009 13:53:34 +0000 (13:53 +0000)]
some mostly cosmetic changes in preparation for upcoming work:
+ in many places, replace &V_layer3_chain with a local
variable chain;
+ bring the counter of rules and static_len within ip_fw_chain
replacing static variables;
+ remove some spurious comments and extern declaration;
+ document which lock protects certain data structures