adrian [Fri, 8 Mar 2013 20:23:55 +0000 (20:23 +0000)]
Bring over my initial work from the net80211 TX locking branch.
This patchset implements a new TX lock, covering both the per-VAP (and
thus per-node) TX locking and the serialisation through to the underlying
physical device.
This implements the hard requirement that frames to the underlying physical
device are scheduled to the underlying device in the same order that they
are processed at the VAP layer. This includes adding extra encapsulation
state (such as sequence numbers and CCMP IV numbers.) Any order mismatch
here will result in dropped packets at the receiver.
There are multiple transmit contexts from the upper protocol layers as well
as the "raw" interface via the management and BPF transmit paths.
All of these need to be correctly serialised or bad behaviour will result
under load.
The specifics:
* add a new TX IC lock - it will eventually just be used for serialisation
to the underlying physical device but for now it's used for both the
VAP encapsulation/serialisation and the physical device dispatch.
This lock is specifically non-recursive.
* Methodize the parent transmit, vap transmit and ic_raw_xmit function
pointers; use lock assertions in the parent/vap transmit routines.
* Add a lock assertion in ieee80211_encap() - the TX lock must be held
here to guarantee sensible behaviour.
* Refactor out the packet sending code from ieee80211_start() - now
ieee80211_start() is just a loop over the ifnet queue and it dispatches
each VAP packet send through ieee80211_start_pkt().
Yes, I will likely rename ieee80211_start_pkt() to something that
better reflects its status as a VAP packet transmit path. More on
that later.
* Add locking around the management and BAR TX sending - to ensure that
encapsulation and TX are done hand-in-hand.
* Add locking in the mesh code - again, to ensure that encapsulation
and mesh transmit are done hand-in-hand.
* Add locking around the power save queue and ageq handling, when
dispatching to the parent interface.
* Add locking around the WDS handoff.
* Add a note in the mesh dispatch code that the TX path needs to be
re-thought-out - right now it's doing a direct parent device transmit
rather than going via the vap layer. It may "work", but it's likely
incorrect (as it bypasses any possible per-node power save and
aggregation handling.)
Why not a per-VAP or per-node lock?
Because in order to ensure per-VAP ordering, we'd have to hold the
VAP lock across parent->if_transmit(). There are a few problems
with this:
* There's some state being setup during each driver transmit - specifically,
the encryption encap / CCMP IV setup. That should eventually be dragged
back into the encapsulation phase but for now it lives in the driver TX path.
This should be locked.
* Two drivers (ath, iwn) re-use the node->ni_txseqs array in order to
allocate sequence numbers when doing transmit aggregation. This should
also be locked.
* Drivers may have multiple frames queued already - so when one calls
if_transmit(), it may end up dispatching multiple frames for different
VAPs/nodes, each needing a different lock when handling that particular
end destination.
So to be "correct" locking-wise, we'd end up needing to grab a VAP or
node lock inside the driver TX path when setting up crypto / AMPDU sequence
numbers, and we may already _have_ a TX lock held - mostly for the same
destination vap/node, but sometimes it'll be for others. That could lead
to LORs and thus deadlocks.
So for now, I'm sticking with an IC TX lock. It has the advantage of
papering over the above and it also has the added advantage that I can
assert that it's being held when doing a parent device transmit.
I'll look at splitting the locks out a bit more later on.
General outstanding net80211 TX path issues / TODO:
* Look into separating out the VAP serialisation and the IC handoff.
It's going to be tricky as parent->if_transmit() doesn't give me the
opportunity to split queuing from driver dispatch. See above.
* Work with monthadar to fix up the mesh transmit path so it doesn't go via
the parent interface when retransmitting frames.
* Push the encryption handling back into the driver, if it's at all
architectually sane to do so. I know it's possible - it's what mac80211
in Linux does.
* Make ieee80211_raw_xmit() queue a frame into VAP or parent queue rather
than doing a short-cut direct into the driver. There are QoS issues
here - you do want your management frames to be encapsulated and pushed
onto the stack sooner than the (large, bursty) amount of data frames
that are queued. But there has to be a saner way to do this.
* Fragments are still broken - drivers need to be upgraded to an if_transmit()
implementation and then fragmentation handling needs to be properly fixed.
Tested:
* STA - AR5416, AR9280, Intel 5300 abgn wifi
* Hostap - AR5416, AR9160, AR9280
* Mesh - some testing by monthadar@, more to come.
marius [Fri, 8 Mar 2013 13:11:45 +0000 (13:11 +0000)]
Merge r247814 from x86 modulo whitespace bug:
Turn on the CTL disable tunable by default.
This will allow GENERIC configurations to boot on small memory boxes, but
not require end users who want to use CTL to recompile their kernel. They
can simply set kern.cam.ctl.disable=0 in loader.conf.
andre [Fri, 8 Mar 2013 10:37:17 +0000 (10:37 +0000)]
Move the callout subsystem initialization to its own SYSINIT()
from being indirectly called via cpu_startup()+vm_ksubmap_init().
The boot order position remains the same at SI_SUB_CPU.
Allocation of the callout array is changed to stardard kernel malloc
from a slightly obscure direct kernel_map allocation.
kern_timeout_callwheel_alloc() is renamed to callout_callwheel_init()
to better describe its purpose.
kern_timeout_callwheel_init() is removed simplifying the per-cpu
initialization.
andre [Fri, 8 Mar 2013 10:14:58 +0000 (10:14 +0000)]
Move the auto-sizing of the callout array from init_param2() to
kern_timeout_callwheel_alloc() where it is actually used.
This is a mechanical move and no tuning parameters are changed.
The pre-allocated callout array is only used for legacy timeout(9)
calls and is only allocated and active on cpu0. Eventually all
remaining users of timeout(9) should switch to the callout_* API.
obrien [Thu, 7 Mar 2013 22:54:28 +0000 (22:54 +0000)]
Fix GCC build:
/usr/src/sys/modules/nvme/../../dev/nvme/nvme.c:211: warning: format '%qx' expects type 'long unsigned int', but argument 9 has type 'long long unsigned int' [-Wformat]
dim [Thu, 7 Mar 2013 22:16:35 +0000 (22:16 +0000)]
Make ctfconvert work correctly on clang-compiled object files. Clang
puts the full original source filename in the STT_FILE entry of the ELF
symbol table, while gcc saves only the basename.
Since the DWARF DW_AT_name attribute contains the full source filename,
both for clang and gcc, ctfconvert takes just the basename of it, for
matching with the STT_FILE entry. So when attempting to match with such
an entry, use its basename, if necessary.
dim [Thu, 7 Mar 2013 21:37:23 +0000 (21:37 +0000)]
Make c99(1) invoke /usr/bin/cc with argv[0] set to "/usr/bin/cc" instead
of just "cc", since there is no reason to cause additional path searches
in this case.
dim [Thu, 7 Mar 2013 21:34:16 +0000 (21:34 +0000)]
Make c89(1) invoke /usr/bin/cc with argv[0] also set to /usr/bin/cc,
similar to what c99(1) does, to prevent "c89: illegal option -- 1"
messages, when clang is /usr/bin/cc.
lstewart [Thu, 7 Mar 2013 04:42:20 +0000 (04:42 +0000)]
The hashmask returned by hashinit() is a valid index in the returned hash array.
Fix a siftr(4) potential memory leak and INVARIANTS triggered kernel panic in
hashdestroy() by ensuring the last array index in the flow counter hash table is
flushed of entries.
ian [Thu, 7 Mar 2013 02:53:29 +0000 (02:53 +0000)]
Call sched_prio() to immediately change the priority of the thread in
response to an rtprio_thread() call, when the priority is different
than the old priority, and either the old or the new priority class is
not RTP_PRIO_NORMAL (timeshare).
The reasoning for the second half of the test is that if it's a change in
timeshare priority, then the scheduler is going to adjust that priority
in a way that completely wipes out the requested change anyway, so
what's the point? (If that's not true, then allowing a thread to change
its own timeshare priority would subvert the scheduler's adjustments and
let a cpu-bound thread monopolize the cpu; if allowed at all, that
should require priveleges.)
On the other hand, if either the old or new priority class is not
timeshare, then the scheduler doesn't make automatic adjustments, so we
should honor the request and make the priority change right away. The
reason the old class gets caught up in this is the very reason for this
change: when thread A changes the priority of its child thread B from
idle back to timeshare, thread B never actually gets moved to a
timeshare-range run queue unless there are some idle cycles available
to allow it to first get scheduled again as an idle thread.
mav [Wed, 6 Mar 2013 22:40:47 +0000 (22:40 +0000)]
Reduce minimal time intervals of setitimer(2) from 1/HZ to 1/(16*HZ) by
using callout_reset_sbt() instead of callout_reset(). We can't remove
lower limit completely in this case because of significant processing
overhead, caused by unability to use direct callout execution due to using
process mutex in callout handler for sending SEGALRM signal. With support
of periodic events that would allow unprivileged user to abuse the system.
des [Wed, 6 Mar 2013 13:48:49 +0000 (13:48 +0000)]
Forced commit to note that this file had not been regenerated since 5.8
due to issues with the configure script incorrectly detecting utmp and
lastlog despite the fact that FreeBSD 10 does not have them any more.
bryanv [Wed, 6 Mar 2013 07:17:53 +0000 (07:17 +0000)]
Remove the virtio dependency entry for the VirtIO device drivers. This
will prevent the kernel from linking if the device driver are included
without the virtio module. Remove pci and scbus for the same reason.
Also explain the relationship and necessity of the virtio and virtio_pci
modules. Currently in FreeBSD, we only support VirtIO PCI, but it could
be replaced with a different interface (like MMIO) and the device
(network, block, etc) will still function.
Requested by: luigi
Approved by: grehan (mentor)
MFC after: 3 days
hrs [Wed, 6 Mar 2013 04:58:48 +0000 (04:58 +0000)]
Fix SIGSEGV when set_short_delay() is called when ifi->ifi_ra_timer is NULL.
This can happen in a short period when a prefix is changed by a rtmsg and a
new interface arrives.
jkim [Tue, 5 Mar 2013 23:05:43 +0000 (23:05 +0000)]
Update the manual page to reflect reality. With r138509 and r152355,
"nostrictjoliet" option for mount_cd9660(8) was completely replaced with
"brokenjoliet" somehow.
bapt [Tue, 5 Mar 2013 13:31:06 +0000 (13:31 +0000)]
Add the ability to correctly read pkg.conf is exists.
Only look for boostrap useful options:
- PACKAGESITE
- ABI
- MIRROR_TYPE
- ASSUME_ALWAYS_YES
While here makes PACKAGESITE expand the ${ABI} variable.
Allow to deactivate any SRV record look up (MIRROR_TYPE=none)
Use the same mechanism as for pkgng itself: first get configuration out of
environment variable and fallback on pkg.conf if exists.
dumbbell [Tue, 5 Mar 2013 11:02:05 +0000 (11:02 +0000)]
g_label_ntfs.c: Mark structures as __packed
Without this, read data is mis-interpreted. This could trigger a panic,
as was the case on one computer where computed "recsize" was zero,
leading to a call to g_read_page() asking for 0 bytes.
bryanv [Tue, 5 Mar 2013 07:00:05 +0000 (07:00 +0000)]
Only set the barrier flag if the feature was negotiated
When the VirtIO barrier feature is not negotiated, the driver
must enforce the proper ordering for BIO_ORDERED BIOs. All the
in-flight BIOs must complete before starting the BIO, and the
ordered BIO must complete before subsequent BIOs can start.
jfv [Mon, 4 Mar 2013 23:15:07 +0000 (23:15 +0000)]
Fix a small, but important bug, a task drain was mistakenly
being compiled only when setting LEGACY_TX, this means you would
not get the drain when needed on detach!!
Thanks to Bryan Venteicher (bryanv@freebsd.org) for catching this
little gremlin!! :)
jfv [Mon, 4 Mar 2013 23:07:40 +0000 (23:07 +0000)]
First, sync to internal shared code, and then
Fixes:
- flow control - don't override user value on re-init
- fix to make 1G optics work correctly
- change to interrupt enabling - some bits were incorrect
for certain hardware.
- certain stats fixes, remove a duplicate increment of
ierror, thanks to Scott Long for pointing these out.
- shared code link interface changed, requiring some
core code changes to accomodate this.
- add an m_adj() to ETHER_ALIGN on the recieve side, this
was requested by Mike Karels, thanks Mike.
- Multicast code corrections also thanks to Mike Karels.
gibbs [Mon, 4 Mar 2013 22:07:36 +0000 (22:07 +0000)]
Fix assertion failure when using userland DTrace probes from
the pid provider on a kernel compiled with INVARIANTS.
sys/cddl/contrib/opensolaris/uts/intel/dtrace/fasttrap_isa.c:
In fasttrap_probe_pid(), attempts to write to the
address space of the thread that fired the probe
must be performed with the process of the thread
held. Use _PHOLD() to ensure this is the case.
In fasttrap_probe_pid(), use proc_write_regs() instead
of calling set_regs() directly. proc_write_regs()
performs invariant checks to verify the calling
environment of set_regs(). PROC_LOCK()/UNLOCK() around
the call to proc_write_regs() so that it's invariants
are satisfied.
ken [Mon, 4 Mar 2013 21:18:45 +0000 (21:18 +0000)]
Re-enable CTL in GENERIC on i386 and amd64, but turn on the CTL disable
tunable by default.
This will allow GENERIC configurations to boot on small memory boxes, but
not require end users who want to use CTL to recompile their kernel. They
can simply set kern.cam.ctl.disable=0 in loader.conf.
The eventual solution to the memory usage problem is to change the way
CTL allocates memory to be more configurable, but this should fix things
for small memory situations in the mean time.
UPDATING: Explain the change in the CTL configuration, and
how users can enable CTL if they would like to use
it.
sys/conf/options: Add a new option, CTL_DISABLE, that prevents CTL
from initializing.
ctl.c: If CTL_DISABLE is turned on, don't initialize.
i386/conf/GENERIC,
amd64/conf/GENERIC: Re-enable device ctl, and add the CTL_DISABLE
option.
davide [Mon, 4 Mar 2013 19:10:39 +0000 (19:10 +0000)]
MFcalloutng:
Dcoument the new functions added to condvar(9), sleep(9), sleepqueue(9)
KPIs. Also document recent changes in timeout(9) and eventtimers(4).
davide [Mon, 4 Mar 2013 16:55:16 +0000 (16:55 +0000)]
MFcalloutng:
- Rewrite kevent() timeout implementation to allow sub-tick precision.
- Make the interval timings for EVFILT_TIMER more accurate. This also
removes an hack introduced in r238424.
Sponsored by: Google Summer of Code 2012, iXsystems inc.
Tested by: flo, marius, ian, markj, Fabian Keil
davide [Mon, 4 Mar 2013 16:41:27 +0000 (16:41 +0000)]
MFcalloutng:
Fix kern_select() and sys_poll() so that they can handle sub-tick
precision for timeouts (in the same fashion it was done for nanosleep()
in r247797).
Sponsored by: Google Summer of Code 2012, iXsystems inc.
Tested by: flo, marius, ian, markj, Fabian Keil
davide [Mon, 4 Mar 2013 15:57:41 +0000 (15:57 +0000)]
MFcalloutng:
kern_nanosleep() is now converted to use tsleep_sbt(). With this change
nanosleep() and usleep() can handle sub-tick precision for timeouts.
Also, try to help coalesce of events passing as argument to tsleep_bt()
a precision value calculated as a percentage of the sleep time.
This percentage is default 5%, but it can tuned according to users
need via the sysctl interface.
Sponsored by: Google Summer of Code 2012, iXsystems inc.
Tested by: flo, marius, ian, markj, Fabian Keil
davide [Mon, 4 Mar 2013 14:00:58 +0000 (14:00 +0000)]
MFcalloutng (r244249, r244306 by mav):
- Switch syscons from timeout() to callout_reset_flags() and specify that
precision is not important there -- anything from 20 to 30Hz will be fine.
- Reduce syscons "refresh" rate to 1-2Hz when console is in graphics mode
and there is nothing to do except some polling for keyboard. Text mode
refresh would also be nice to have adaptive, but this change at least
should help laptop users who running X.
Sponsored by: Google Summer of Code 2012, iXsystems inc.
Tested by: flo, marius, ian, markj, Fabian Keil
attilio [Mon, 4 Mar 2013 13:10:59 +0000 (13:10 +0000)]
Merge from vmcontention:
As vm objects are type-stable there is no need to initialize the
resident splay tree pointer and the cache splay tree pointer in
_vm_object_allocate() but this could be done in the init UMA zone
handler.
The destructor UMA zone handler, will further check if the condition is
retained at every destruction and catch for bugs.
davide [Mon, 4 Mar 2013 12:48:41 +0000 (12:48 +0000)]
MFcalloutng:
Introduce sbt variants of msleep(), msleep_spin(), pause(), tsleep() in
the KPI, allowing to specify timeout in 'sbintime_t' rather than ticks.
Sponsored by: Google Summer of Code 2012, iXsystems inc.
Tested by: flo, marius, ian, markj, Fabian Keil
davide [Mon, 4 Mar 2013 11:22:19 +0000 (11:22 +0000)]
MFcalloutng (r244355):
Make loadavg calculation callout direct. There are several reasons for it:
- it is very simple and doesn't worth context switch to SWI;
- since SWI is no longer used here, we can remove twelve years old hack,
excluding this SWI from from the loadavg statistics;
- it fixes problem when eventtimer (HPET) shares interrupt with some other
device, and that interrupt thread counted as permanent loadavg of 1; now
loadavg accounted before that interrupt thread is scheduled.
Sponsored by: Google Summer of Code 2012, iXsystems inc.
Tested by: flo, marius, ian, Fabian Keil, markj
davide [Mon, 4 Mar 2013 11:09:56 +0000 (11:09 +0000)]
- Make callout(9) tickless, relying on eventtimers(4) as backend for
precise time event generation. This greatly improves granularity of
callouts which are not anymore constrained to wait next tick to be
scheduled.
- Extend the callout KPI introducing a set of callout_reset_sbt* functions,
which take a sbintime_t as timeout argument. The new KPI also offers a
way for consumers to specify precision tolerance they allow, so that
callout can coalesce events and reduce number of interrupts as well as
potentially avoid scheduling a SWI thread.
- Introduce support for dispatching callouts directly from hardware
interrupt context, specifying an additional flag. This feature should be
used carefully, as long as interrupt context has some limitations
(e.g. no sleeping locks can be held).
- Enhance mechanisms to gather informations about callwheel, introducing
a new sysctl to obtain stats.
This change breaks the KBI. struct callout fields has been changed, in
particular 'int ticks' (4 bytes) has been replaced with 'sbintime_t'
(8 bytes) and another 'sbintime_t' field was added for precision.
Together with: mav
Reviewed by: attilio, bde, luigi, phk
Sponsored by: Google Summer of Code 2012, iXsystems inc.
Tested by: flo (amd64, sparc64), marius (sparc64), ian (arm),
markj (amd64), mav, Fabian Keil