Xin LI [Thu, 9 Oct 2014 06:02:53 +0000 (06:02 +0000)]
MFV r272802:
- Limit ARC for zdb at 256MB. zdb do not typically revisit data
in the ARC.
- Increase default max_inflight from 200 to 1000 (can be overriden
by -I) so we can queue more I/Os when doing scrubbing.
- Print status while loading meataslabs for leak detection.
Illumos issues:
5169 zdb should limit its ARC size
5170 zdb -c should create more scrub i/os by default
5171 zdb should print status while loading metaslabs for leak detection
Xin LI [Thu, 9 Oct 2014 05:48:06 +0000 (05:48 +0000)]
5169 zdb should limit its ARC size
5170 zdb -c should create more scrub i/os by default
5171 zdb should print status while loading metaslabs for leak detection
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Paul Dagnelie <paul.dagnelie@delphix.com>
Reviewed by: Bayard Bell <Bayard.Bell@nexenta.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Author: Matthew Ahrens <mahrens@delphix.com>
Adrian Chadd [Thu, 9 Oct 2014 05:33:25 +0000 (05:33 +0000)]
Add a bus method to fetch the VM domain for the given device/bus.
* Add a bus_if.m method - get_domain() - returning the VM domain or
ENOENT if the device isn't in a VM domain;
* Add bus methods to print out the domain of the device if appropriate;
* Add code in srat.c to save the PXM -> VM domain mapping that's done and
expose a function to translate VM domain -> PXM;
* Add ACPI and ACPI PCI methods to check if the bus has a _PXM attribute
and if so map it to the VM domain;
* (.. yes, this works recursively.)
* Have the pci bus glue print out the device VM domain if present.
Note: this is just the plumbing to start enumerating information -
it doesn't at all modify behaviour.
Check for mbuf copy failure when there are multiple multicast sockets
This partitular case is the only path where the mbuf could be NULL.
udp_append() checked for a NULL mbuf only after invoking the tunneling
callback. Our only in tree tunneling callback - SCTP - assumed a non
NULL mbuf, and it is a bit odd to make the callbacks responsible for
checking this condition.
This also reduces the differences between the IPv4 and IPv6 code.
The M_FLOWID flag should be propagated to the new mbuf pkthdr in
m_move_pkthdr() and m_dup_pkthdr(). The new mbuf already got the
existing flowid value, but would be ignored since the flag was
not set.
Fix draining in ttydev_leave():
1. ERESTART is not only returned when the revoke count changed. It
is also returned when a signal is received. While a change in
the revoke count should be ignored, a signal should not.
2. Waiting until the output queue is entirely drained can cause a
hang when the underlying device is stuck or broken.
Have tty_drain() take care of this by telling it when we're leaving.
When leaving, tty_drain() will use a timed wait to address point 2
above and it will check the revoke count to handle point 1 above.
The timeout is set to 1 second, which is arbitrary and long enough
to expect a change in the output queue.
Properly NUL-terminate the on-stack buffer for reading /boot.config
or /boot/config. In qemu, on a warm boot, the stack is not all zeroes
and we parse beyond the file's contents.
John Baldwin [Wed, 8 Oct 2014 21:53:24 +0000 (21:53 +0000)]
Rewrite timeout(9) to be callout(9)-centric instead. Move the description
of timeout(9) to the end and mark it prominently as deprecated. Document
somewhat how times are specified for the 'sbt' variants. Better explain
how using callout_init_*() to associate a lock with a callout resolves
common races.
When tunneling interface is going to insert mbuf into netisr queue after stripping
outer header, consider it as new packet and clear the protocols flags.
This fixes problems when IPSEC traffic goes through various tunnels and router
doesn't send ICMP/ICMPv6 errors.
Add an argument to the x86 pmap_invalidate_cache_range() to request
forced invalidation of the cache range regardless of the presence of
self-snoop feature. Some recent Intel GPUs in some modes are not
coherent, and dirty lines in CPU cache must be flushed before the
pages are transferred to GPU domain.
Reviewed by: alc (previous version)
Tested by: pho (amd64)
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
John Baldwin [Wed, 8 Oct 2014 16:22:59 +0000 (16:22 +0000)]
Add schedgraph traces for callout handlers. Specifically, a callwheel logs
a running event each time it executes a callout function. The event
includes the function pointer, argument, and whether or not it was run from
hardware interrupt context. The callwheel is marked idle when each handler
completes. This effectively logs the duration of each callout routine in
the graph.
Michael Tuexen [Wed, 8 Oct 2014 15:30:59 +0000 (15:30 +0000)]
Ensure that the list of streams sent in a stream reset parameter fits
in an mbuf-cluster.
Thanks to Peter Bostroem for drawing my attention to this part of the code.
Michael Tuexen [Wed, 8 Oct 2014 15:29:49 +0000 (15:29 +0000)]
Ensure that the number of stream reported in srs_number_streams is
consistent with the amount of data provided in the SCTP_RESET_STREAMS
socket option.
Thanks to Peter Bostroem from Google for drawing my attention to
this part of the code.
Kashyap D Desai [Wed, 8 Oct 2014 09:37:47 +0000 (09:37 +0000)]
In the passthru IOCTL path, the mfi command pool was freely accessible N times
where as there are limited number(32) of mfi commands in the pool.
The mfi command pool is now restricted to 27 simultaneous accesses by using
a counting semaphore while calling the passthru function.
In the mrsas_cam.c source file there was a same function name mrsas_poll(),
which was same as the mrsas_poll() implemented in the mrsas.c file for the
polling interface.
To clearly distinguish the functionality by usage we have renamed the former
as mrsas_cam_poll().
In the passthru function let's say it has got an mfi command from the pool
but it has failed in one of the DMA function call which will lead to leak
an mfi command because in the ERROR case it directly returns and not freeing up
the occupied mfi command.
Kashyap D Desai [Wed, 8 Oct 2014 09:35:52 +0000 (09:35 +0000)]
d_poll() callback function is the entry point for poll system call for the application.
It is meant to notify the applications which will be waiting for some
controller events to be occured.
Kashyap D Desai [Wed, 8 Oct 2014 09:34:25 +0000 (09:34 +0000)]
Extended MSI-x vectors support for Invader and Fury(12Gb/s HBA).
This Driver will create multiple MSI-x vector depending upon what FW expose.
As of now 12 Gbp/s MR controller (Invader and Fury) expose 96 msix vector.
As of now 6 Gbp/s MR controller (Thunderbolt) expose 16 msix vector.
Kashyap D Desai [Wed, 8 Oct 2014 09:19:35 +0000 (09:19 +0000)]
This is a feature provided to run 32-bit linux binaries on FreeBSD 64bit
machine, for which 32bit compatibilty code has been added.
As in linux there is only one device entry that is used to fire IOCTL commands,
a new device entry megaraid_sas_ioctl_node is added for solely this
purpose.
From one dev node i.e mrgaraid_sa_ioctl_node we have to find out the
controller instance in case of multicontroller, for which one management info
structure has been added.
Kashyap D Desai [Wed, 8 Oct 2014 08:48:18 +0000 (08:48 +0000)]
Current MegaRAID firmware and hence the driver only supported 64VDs.
E.g: If the user wants to create more than 64VD on a controller,
it is not possible on current firmware/driver.
New feature and requirement to support upto 256VD, firmware/driver/apps need changes.
In addition to that, there must be a backward compatibility of the new driver with the
older firmware and vice versa.
RAID map is the interface between Driver and FW to fetch all required
fields(attributes) for each Virtual Drives.
In the earlier design driver was using the FW copy of RAID map where as
in the new design the Driver will keep the RAID map copy of its own; on which
it will operate for any raid map access in fast path.
Local driver raid map copy will provide ease of access through out the code
and provide generic interface for future FW raid map changes.
For the backward compatibility driver will notify FW that it supports 256VD
to the FW in driver capability field.
Based on the controller properly returned by the FW, the Driver will know
whether it supports 256VD or not and will copy the RAID map accordingly.
At any given time, driver will always have old or new Raid map.
Reviewed by : ambrisko
MFC after : 2 weeks
Sponsored by: AVAGO Technologies
Pyun YongHyeon [Wed, 8 Oct 2014 05:47:01 +0000 (05:47 +0000)]
Add support for QAC AR816x/AR817x Gigabit/Fast Ethernet controllers.
These controllers seem to have the same feature of AR813x/AR815x and
improved RSS support(4 TX queues and 8 RX queues). alc(4) supports
all hardware features except RSS. I didn't implement RX checksum
offloading for AR816x/AR817x just because I couldn't get
confirmation from the Vendor whether AR816x/AR817x corrected its
predecessor's RX checksum offloading bug on fragmented packets.
This change adds supports for the following controllers.
o AR8161 PCIe Gigabit Ethernet controller
o AR8162 PCIe Fast Ethernet controller
o AR8171 PCIe Gigabit Ethernet controller
o AR8172 PCIe Fast Ethernet controller
o Killer E2200 Gigabit Ethernet controller
Tested by: Many
Relnotes: yes
MFC after: 2 weeks
HW donated by: Qualcomm Atheros Communications, Inc.
Pyun YongHyeon [Wed, 8 Oct 2014 05:34:39 +0000 (05:34 +0000)]
Add new quirk PCI_QUIRK_MSI_INTX_BUG to pci(4).
QAC AR816x/E2200 controller has a silicon bug that MSI interrupt
does not assert if PCIM_CMD_INTxDIS bit of command register is set.
Pyun YongHyeon [Wed, 8 Oct 2014 01:03:32 +0000 (01:03 +0000)]
Fix a long standing bug in MAC statistics register access. One
additional register was erroneously added in the MAC register set
such that 7 TX statistics counters were wrong.
Sean Bruno [Tue, 7 Oct 2014 21:50:28 +0000 (21:50 +0000)]
Implement PLPMTUD blackhole detection (RFC 4821), inspired by code
from xnu sources. If we encounter a network where ICMP is blocked
the Needs Frag indicator may not propagate back to us. Attempt to
downshift the mss once to a preconfigured value.
Default this feature to off for now while we do not have a full PLPMTUD
implementation in our stack.
Adds the following new sysctl's for control:
net.inet.tcp.pmtud_blackhole_detection -- turns on/off this feature
net.inet.tcp.pmtud_blackhole_mss -- mss to try for ipv4
net.inet.tcp.v6pmtud_blackhole_mss -- mss to try for ipv6
Adds the following new sysctl's for monitoring:
-- Number of times the code was activated to attempt a mss downshift
net.inet.tcp.pmtud_blackhole_activated
-- Number of times the blackhole mss was used in an attempt to downshift
net.inet.tcp.pmtud_blackhole_min_activated
-- Number of times that we failed to connect after we downshifted the mss
net.inet.tcp.pmtud_blackhole_failed
Gavin Atkinson [Tue, 7 Oct 2014 19:07:50 +0000 (19:07 +0000)]
Support the Vodafone R215 LET USB dongle, which is apparently a rebadged
E5372 with different product IDs.
Interestingly, the standard E5372 IDs (12d1:1506) are currently listed in
u3g.c and are the same as the E3131. However, the R215/E5372 is an NCM
device and works well with cdce(4) whereas the E3131 isn't. More work
may be needed to better identify the other device IDs.
Bjoern A. Zeeb [Tue, 7 Oct 2014 18:00:34 +0000 (18:00 +0000)]
Since introducing the extra mapping in r250103 for architectural performance
events we have actually counted 'Branch Instruction Retired' when people
asked for 'Unhalted core cycles' using the 'unhalted-core-cycles' event mask
mnemonic.
Andriy Gapon [Tue, 7 Oct 2014 16:08:21 +0000 (16:08 +0000)]
l2arc_write_buffers: reduce headroom value
FreeBSD has ARC_BUFC_NUMMETADATALISTS metadata lists and ARC_BUFC_NUMDATALISTS
data lists (currently both are 16) while illumos has just a single list
of each kind.
headroom determines how much data is scanned on a single list
during each run of the l2arc feed thread.
Because FreeBSD has more lists we proportionally decrease the limit.
Andriy Gapon [Tue, 7 Oct 2014 14:30:24 +0000 (14:30 +0000)]
reduce L2ARC_WRITE_SIZE on FreeBSD
FreeBSD has ARC_BUFC_NUMMETADATALISTS metadata lists and ARC_BUFC_NUMDATALISTS
data lists (currently both are 16) while illumos has just a single list
of each kind.
L2ARC_WRITE_SIZE determines the default value of l2arc_write_max which
defines limits on how much data is scanned and written to a cache device
during each run of the l2arc feed thread. The limits are applied on the
per buffer list basis.
Because FreeBSD has more lists we proportionally reduce the limits.
POSIX treats negative time_t as undefined (i.e. may be valid too,
depends on system's policy we don't have) and we don't set EOVERFLOW
in mktime/timegm as POSIX requires to surely distinguish -1 return
as valid negative time from -1 as error return.
Mark Johnston [Mon, 6 Oct 2014 21:52:40 +0000 (21:52 +0000)]
Treat D keywords as identifiers in certain postfix expressions. This allows
one to, for example, access the "provider" field of a struct g_consumer,
even though "provider" is a D keyword.
Neel Natu [Mon, 6 Oct 2014 20:48:01 +0000 (20:48 +0000)]
Inject #UD into the guest when it executes either 'MONITOR' or 'MWAIT'.
The hypervisor hides the MONITOR/MWAIT capability by unconditionally setting
CPUID.01H:ECX[3] to 0 so the guest should not expect these instructions to
be present anyways.
John Baldwin [Mon, 6 Oct 2014 18:16:45 +0000 (18:16 +0000)]
Properly set the timeout in a query_state. The global query_timeout
configuration value is an integer count of seconds, it is not a timeval.
Using memcpy() to copy a timeval from it put garbage into the tv_usec
field.
Luigi Rizzo [Mon, 6 Oct 2014 15:48:28 +0000 (15:48 +0000)]
Add netmap support to libpcap. Tcpdump and other native pcap application can now
run directly on netmap ports using netmap:foo or valeXX:YY device names.
Modifications to existing code are small and trivial, the netmap-specific
code is all in a new file.
Please be aware that in netmap mode the physical interface is disconnected from
the host stack, so libpcap will steal the traffic not just make a copy.
For the full version of the code (including linux and autotools support) see
https://code.google.com/p/netmap-libpcap/
Craig Rodrigues [Mon, 6 Oct 2014 14:43:02 +0000 (14:43 +0000)]
MFV:
use calloc in get_line() when allocating line to ensure it is fully initialized,
fixes a later uninitialized value in copy_param() (FreeBSD #193499).
PR: 193499
Submitted by: Thomas E. Dickey <tom@invisible-island.net>
Alexander Motin [Mon, 6 Oct 2014 12:20:46 +0000 (12:20 +0000)]
Add support for MaxBurstLength and Expected Data transfer Length parameters.
Before this change target could send R2T request for write transfer of any
size, that could violate iSCSI RFC, which allows initiator to limit maximum
R2T size by negotiating MaxBurstLength connection parameter.
Also report an error in case of write underflow, when initiator provides
less data than initiator expects. Previously in such case our target
sent R2T request for non-existing data, violating the RFC, and confusing
some initiators. SCSI specs don't explicitly define how write underflows
should be handled and there are different oppinions, but reporting error
is hopefully better then violating iSCSI RFC with unpredictable results.
we can't easily predict (in current parsing model)
if the keyword is ipfw(8) reserved keyword or port name.
Checking proto database via getprotobyname() consumes a lot of
CPU and leads to tens of seconds for parsing large ruleset.
Use list of reserved keywords and check them as pre-requisite
before doing getprotobyname().
Alexander Motin [Mon, 6 Oct 2014 10:58:54 +0000 (10:58 +0000)]
Use r271207 optimization only for MSI-enabled HBAs.
It was found that VirtualBox' AHCI does not allow nterrupt to be cleared
before the interrupt status register is read, causing interrupt storm.
AHCI specification allows to skip this register use when multi-vector MSI
is enabled and so interrupting port is known. For single-vector MSI that
is not stated explicitly, but if the port is only one, it is obviously
known too.