hiren [Mon, 12 Jan 2015 08:33:04 +0000 (08:33 +0000)]
DCTCP (Data Center TCP) implementation.
DCTCP congestion control algorithm aims to maximise throughput and minimise
latency in data center networks by utilising the proportion of Explicit
Congestion Notification (ECN) marked packets received from capable hardware as a
congestion signal.
Highlights:
Implemented as a mod_cc(4) module.
ECN (Explicit congestion notification) processing is done differently from
RFC3168.
Takes one-sided DCTCP into consideration, where only one of the sides is using
DCTCP and the other is using standard ECN.
IETF draft: http://tools.ietf.org/html/draft-bensley-tcpm-dctcp-00
Thesis report by Midori Kato: https://eggert.org/students/kato-thesis.pdf
Submitted by: Midori Kato <katoon@sfc.wide.ad.jp> and
Lars Eggert <lars@netapp.com>
with help and modifications from
hiren
Differential Revision: https://reviews.freebsd.org/D604
Reviewed by: gnn
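As a rough illustration of the congestion signal described above, the sketch below follows the estimator from the DCTCP literature: the sender keeps a running estimate (alpha) of the fraction of ECN-marked bytes and shrinks the window in proportion to it. The function names, the floating-point arithmetic and the gain of 1/16 are assumptions made for illustration; this is not the mod_cc(4) module's code.

```c
#include <stdint.h>

/* Assumed estimation gain g; 1/16 is a commonly cited value. */
#define DCTCP_G (1.0 / 16.0)

/*
 * Update the running estimate of the marked fraction:
 *   alpha = (1 - g) * alpha + g * F
 * where F = bytes_marked / bytes_total over the last window.
 */
static double
dctcp_update_alpha(double alpha, uint64_t bytes_marked, uint64_t bytes_total)
{
    double frac;

    if (bytes_total == 0)
        return (alpha);
    frac = (double)bytes_marked / (double)bytes_total;
    return ((1.0 - DCTCP_G) * alpha + DCTCP_G * frac);
}

/* Reduce the window in proportion to alpha: cwnd = cwnd * (1 - alpha / 2). */
static uint32_t
dctcp_reduce_cwnd(uint32_t cwnd, double alpha)
{
    return ((uint32_t)((double)cwnd * (1.0 - alpha / 2.0)));
}
```

With alpha at 1.0 (every byte marked) the window is halved, as with standard ECN; with alpha near 0 the reduction is correspondingly small.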
kib [Mon, 12 Jan 2015 07:48:22 +0000 (07:48 +0000)]
Fix several issues with /dev/mem and /dev/kmem devices on amd64.
For /dev/mem, when the requested physical address is not accessible by the
direct map, do temporary remapping with the caching attribute
'uncached'. Limit the accessible addresses by MAXPHYADDR, since the
architecture disallows writing non-zero into reserved bits of PTEs
(or setting garbage into NX).
For /dev/kmem, only access existing kernel mappings for the direct map
region. For all other addresses, obtain a physical address of the
mapping and fall back to the /dev/mem mechanism. This ensures that
/dev/kmem i/o does not fault even if the accessed region is changed in
parallel, by using either the direct map or a temporary mapping.
For both devices, operate on one page per iteration. Do not return
an error if any bytes were transferred; return the (partial) byte count
to userspace instead.
Reviewed by: alc
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
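The page-per-iteration behaviour with a partial byte count can be sketched as follows. This is a userland illustration only: sketch_mem, sketch_copy_page(), copy_by_pages() and the 4096-byte page size are hypothetical stand-ins, not the amd64 mem(4) code.

```c
#include <stddef.h>
#include <string.h>

#define SKETCH_PAGE_SIZE 4096u                      /* assumed page size */
#define SKETCH_MEM_LIMIT (2 * SKETCH_PAGE_SIZE)     /* hypothetical end of memory */

static unsigned char sketch_mem[SKETCH_MEM_LIMIT];

/* Hypothetical per-page accessor: fails for addresses past the limit. */
static int
sketch_copy_page(size_t addr, void *dst, size_t len)
{
    if (addr >= SKETCH_MEM_LIMIT)
        return (-1);
    memcpy(dst, &sketch_mem[addr], len);
    return (0);
}

/*
 * Copy 'len' bytes starting at 'addr', one page per iteration.
 * If some bytes were already moved when a failure occurs, report the
 * partial count and no error; report the error only if nothing moved.
 */
static size_t
copy_by_pages(size_t addr, void *dst, size_t len, int *errorp)
{
    size_t chunk, done = 0;
    int error = 0;

    while (done < len) {
        /* Never cross a page boundary within one step. */
        chunk = SKETCH_PAGE_SIZE - (addr + done) % SKETCH_PAGE_SIZE;
        if (chunk > len - done)
            chunk = len - done;
        error = sketch_copy_page(addr + done, (char *)dst + done, chunk);
        if (error != 0)
            break;
        done += chunk;
    }
    *errorp = (done > 0) ? 0 : error;
    return (done);
}
```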
yongari [Mon, 12 Jan 2015 07:43:19 +0000 (07:43 +0000)]
Receive filter configuration is done in nge_rxfilter(). Remove
unnecessary filter configuration code in nge_init_locked().
While I'm here, add a check for the driver running state in multicast
filter handling. Also remove an unnecessary assignment to the error
variable, since it is cleared at function entry.
kib [Mon, 12 Jan 2015 07:36:25 +0000 (07:36 +0000)]
For x86, read MAXPHYADDR, defined in SDM vol 3 4.1.4 Enumeration of Paging
Features by CPUID as CPUID.80000008H:EAX[7:0], into variable cpu_maxphyaddr.
Reviewed by: alc
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
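For reference, extracting the field named above from a raw EAX value looks roughly like this. The helper names are hypothetical, and the caller is assumed to have already executed CPUID leaf 80000008H; only the bit positions come from the commit message.

```c
#include <stdint.h>

/* MAXPHYADDR (physical address width in bits) is EAX[7:0]. */
static inline unsigned
maxphyaddr_from_eax(uint32_t eax)
{
    return (eax & 0xff);
}

/* Mask of the valid physical address bits for a given width. */
static inline uint64_t
phys_addr_mask(unsigned maxphyaddr)
{
    return (maxphyaddr >= 64 ? ~0ULL : (1ULL << maxphyaddr) - 1);
}
```

An address with any bit set outside phys_addr_mask() would land in the reserved bits of a page table entry.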
hselasky [Mon, 12 Jan 2015 06:34:23 +0000 (06:34 +0000)]
Increase the maximum number of dynamic USB quirks. USB memory stick
devices which don't support the synchronize cache SCSI command are
likely to also not support the prevent-allow medium removal SCSI
command.
loos [Mon, 12 Jan 2015 03:23:16 +0000 (03:23 +0000)]
Add support to turn off Beaglebone with poweroff(8) or shutdown(8) -p.
To cut off the power we need to start the shutdown sequence by writing
the OFF bit on the PMIC.
Once the PMIC is programmed, the SoC needs to toggle the PMIC_PWR_ENABLE
pin when it is ready for the PMIC to cut off the power. This is done by
triggering the ALARM2 interrupt on the SoC RTC.
The RTC driver only works in power management mode, which means it won't
provide any kind of time keeping functionality. It only implements a way
to trigger the ALARM2 interrupt when requested.
ian [Mon, 12 Jan 2015 02:42:33 +0000 (02:42 +0000)]
Handle dma mappings with more than one segment for rpi sdhci.
The driver inherently does dma in 512 byte chunks, but it's possible that
such a buffer can span two physically discontiguous pages (such as when
a userland program does IO on the raw /dev/mmcsdN devices). Now the driver
can handle a buffer that's split across two pages.
It could in theory handle any number of segments now, but as long as IO is
being done in 512 byte blocks it will never need more than two.
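The two-segment bound follows from simple arithmetic, sketched below. The 4096-byte page size and the helper name are assumptions for illustration; the count is the worst case, reached when the pages are not physically contiguous.

```c
#include <stddef.h>
#include <stdint.h>

#define SKETCH_PAGE_SIZE 4096u  /* assumed page size */

/* Number of pages (worst-case DMA segments) touched by a buffer. */
static unsigned
pages_spanned(uintptr_t vaddr, size_t len)
{
    uintptr_t first, last;

    if (len == 0)
        return (0);
    first = vaddr / SKETCH_PAGE_SIZE;
    last = (vaddr + len - 1) / SKETCH_PAGE_SIZE;
    return ((unsigned)(last - first + 1));
}
```

Since 512 is smaller than the page size, a 512 byte buffer can straddle at most one page boundary, so the result is never more than 2.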
kib [Sun, 11 Jan 2015 22:16:31 +0000 (22:16 +0000)]
Reduce the size of the interposing table and the amount of
cancellation-handling code in libthr. Translate some syscalls
into their more generic counterparts, and remove the translated syscalls
from the table.
List of the affected syscalls:
creat, open -> openat
raise -> thr_kill
sleep, usleep -> nanosleep
pause -> sigsuspend
wait, wait3, waitpid -> wait4
Suggested and reviewed by: jilles (previous version)
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
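One of the translations listed above, usleep -> nanosleep, can be sketched as follows. The helper names are hypothetical and this is not the libc/libthr code; it only shows why the specific call needs no interposing entry of its own once it is expressed through the generic one.

```c
#define _POSIX_C_SOURCE 200809L
#include <time.h>

/* Hypothetical helper: convert a microsecond interval to a timespec. */
static struct timespec
usec_to_timespec(unsigned long usec)
{
    struct timespec ts;

    ts.tv_sec = (time_t)(usec / 1000000UL);
    ts.tv_nsec = (long)(usec % 1000000UL) * 1000L;
    return (ts);
}

/* usleep()-style call expressed through the more generic nanosleep(). */
static int
sketch_usleep(unsigned long usec)
{
    struct timespec ts = usec_to_timespec(usec);

    return (nanosleep(&ts, NULL));
}
```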
ian [Sun, 11 Jan 2015 21:27:46 +0000 (21:27 +0000)]
Check for and handle failures of bus_dmamap_load(). The driver currently
requires that each 512 byte IO be in a single contiguous buffer, but if a
buffer crosses a page boundary and the physical pages aren't contiguous
you can get an EFBIG failure (too many segments).
The driver really should handle multiple segment IO, but before adding that
I wanted to make sure that it's handling failure properly while the failure
is easily recreatable.
ian [Sun, 11 Jan 2015 21:25:03 +0000 (21:25 +0000)]
Handle the possibility that SDHCI_PLATFORM_START_TRANSFER() can fail, by
moving the handling of curcmd->error != 0 to the end of the interrupt
handler. Also make sdhci_finish_data() idempotent by moving the setting
of slot->data_done = 1 down past the point where the busdma buffer is
unmapped. This allows for the possibility that the finish routine can
get called from multiple places when handling errors.
kib [Sun, 11 Jan 2015 20:27:15 +0000 (20:27 +0000)]
Right now, for non-coherent DMARs, the page table update code flushes the
cache for the whole page containing the modified pte. Moreover, only the last
page in a series of consecutive pages is flushed (i.e. the affected
mappings should be larger than 2MB).
Avoid the excessive flushing and do the missed necessary flushing by
splitting invalidation and unmapping. For now, flush exactly the
range of the changed pte. This is still somewhat bigger than
necessary, since a pte is 8 bytes, while a cache flush line is at least
32 bytes.
The originator of the issue reports that after the change,
'dmar_bus_dmamap_unload went from 13,288 cycles down to
3,257. dmar_bus_dmamap_load_buffer went from 9,686 cycles down to
3,517. and I am now able to get line 1GbE speed with Netperf TCP
(even with 1K message size).'
Diagnosed and tested by: Nadav Amit <nadav.amit@gmail.com>
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
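To give a feel for the saving, the sketch below counts the cache lines that cover a run of changed ptes. The helper name is hypothetical and the line size is a parameter assumed to be a power of two; only the 8-byte pte size comes from the commit message.

```c
#include <stddef.h>
#include <stdint.h>

#define SKETCH_PTE_SIZE 8u  /* pte size stated in the commit message */

/*
 * Number of cache lines covering 'count' ptes starting at address
 * 'first_pte'. 'line_size' must be a power of two.
 */
static size_t
cache_lines_for_ptes(uintptr_t first_pte, size_t count, size_t line_size)
{
    uintptr_t start, end;

    if (count == 0)
        return (0);
    start = first_pte & ~(uintptr_t)(line_size - 1);    /* round down */
    end = first_pte + count * SKETCH_PTE_SIZE;          /* exclusive */
    return ((end - start + line_size - 1) / line_size);
}
```

With 64-byte lines, a single changed pte needs one line flushed, whereas flushing the whole 4096-byte page table page touches 64 lines.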
gjb [Sun, 11 Jan 2015 19:01:28 +0000 (19:01 +0000)]
Fix the release notes article.xml file to conform to
FDP style, specifically reindenting the entire file for
tag alignment and rewrapping lines where necessary.
mav [Sun, 11 Jan 2015 16:36:39 +0000 (16:36 +0000)]
When aggregating TRIM segments, move the new one to the list end.
A new segment at the list head may block all TRIM requests until the txg of that
segment can be processed. In my random I/O tests this change reduced the peak
TRIM list length from 650 to 450 segments. Hopefully it should reduce TRIM
burstiness when list processing is unblocked.
np [Sun, 11 Jan 2015 07:51:58 +0000 (07:51 +0000)]
cxgb: replace r273280 with a more comprehensive fix.
Poll for link state when the link is down, even for interrupt capable
PHYs.
Allow PHYs to report a dubious "partial" link. If this state is seen 3
consecutive times (each check is ~1s apart) then reset the PHY. This is
a workaround for a situation where repeatedly toggling the link from the
peer gets the AEL2005 PHY into a state where it never establishes a PCS
block lock even when everything is in order.
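The "3 consecutive times" rule amounts to a small counter that any other link state clears. The sketch below uses hypothetical names (link_state, link_monitor, link_poll) and is not the cxgb driver code; it assumes it is called once per link check.

```c
#include <stdbool.h>

/* Hypothetical link states as reported by the PHY. */
enum link_state { LINK_DOWN, LINK_PARTIAL, LINK_UP };

#define PARTIAL_LINK_LIMIT 3    /* consecutive checks, ~1s apart */

struct link_monitor {
    unsigned partial_count;     /* consecutive "partial" reports seen */
};

/* Returns true when the PHY should be reset. */
static bool
link_poll(struct link_monitor *lm, enum link_state st)
{
    if (st != LINK_PARTIAL) {
        /* Any other state breaks the run. */
        lm->partial_count = 0;
        return (false);
    }
    if (++lm->partial_count < PARTIAL_LINK_LIMIT)
        return (false);
    lm->partial_count = 0;
    return (true);
}
```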
mav [Sun, 11 Jan 2015 00:26:18 +0000 (00:26 +0000)]
Add LBA as secondary sort key for synchronous I/O requests.
On FreeBSD, gethrtime() is implemented via getnanouptime(), which has 1ms (1/hz)
precision. This makes collisions of the primary sort key (timestamp) quite likely.
In such situations, sorting by a secondary key of LBA is much more reasonable
than by a totally meaningless zio pointer value.
With this change, on multi-threaded synchronous ZVOL reads I've measured a 10%
throughput increase and an average latency reduction.
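A two-key comparison of this kind can be sketched as below. The io_req structure and function name are hypothetical simplifications, not the ZFS queue code; a real ordered container would typically still need a final tie-break for requests that match on both keys.

```c
#include <stdint.h>

/* Hypothetical, simplified I/O request. */
struct io_req {
    uint64_t timestamp;     /* primary key, coarse (1/hz) resolution */
    uint64_t lba;           /* secondary key */
};

/* Order by timestamp first, then by LBA when timestamps collide. */
static int
io_req_compare(const struct io_req *a, const struct io_req *b)
{
    if (a->timestamp != b->timestamp)
        return (a->timestamp < b->timestamp ? -1 : 1);
    if (a->lba != b->lba)
        return (a->lba < b->lba ? -1 : 1);
    return (0);
}
```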
imp [Sat, 10 Jan 2015 23:43:39 +0000 (23:43 +0000)]
Use .MAKE.LEVEL being defined as a bootstrap aid when providing
fallback targets to build the aic generated files. fmake doesn't like
the current construct, and since it doesn't have .MAKE.LEVEL, just
don't provide the fallback targets for fmake. This gives a little
extra compatibility to old systems trying to build new kernels at
almost no cost to the current code.
kib [Sat, 10 Jan 2015 23:12:49 +0000 (23:12 +0000)]
Fix calculation of requester for PCI device behind PCIe/PCI bridge.
In my case, on the test machine, I have a hierarchy of
pcib2 (PCIe port on host bridge with PCIe capability) -> pci2 ->
pcib3 (ITE PCIe/PCI bridge) -> pci3 -> em1
The device to check PCIe capability is pcib2 and not pcib3, as it is
currently done in the code. Also, in case of the bridge, we shall
step to pcib2 for the loop iteration, since pcib3 does not carry PCIe
capability info and would force wrong recalculation of rid.
Also change the returned requester to the PCIe bus which provides the port
for the bridge. This only results in changing the
hw.busdma.pciX.X.X.X.bounce tunable to force an identity-mapped context
for the device.
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
kib [Sat, 10 Jan 2015 22:57:08 +0000 (22:57 +0000)]
Print rid when announcing DMAR context creation. Print sid when a fault
occurs. This makes it possible to connect the dots in case the requester is
calculated erroneously.
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
dim [Sat, 10 Jan 2015 22:22:42 +0000 (22:22 +0000)]
Add the llvm-symbolizer tool, which enables the sanitizers to report
more complete debugging information. This tool is only enabled when
WITH_CLANG_EXTRAS is on.
Submitted by: Dan McGregor <danismostlikely@gmail.com>
rwatson [Sat, 10 Jan 2015 10:41:23 +0000 (10:41 +0000)]
Garbage collect m_copymdata(), an mbuf utility routine introduced
in FreeBSD 7 that has not been used since. It contains a number
of unresolved bugs including an inverted bcopy() and incorrect
handling of read-only mbufs using internal storage. Removing this
unused code is substantially easier than fixing it in order to
update it to the coming mbuf world order -- but it can always be
restored from revision history if it turns out to prove useful for
future work.