Reimplement and/or implement vnet list locking by replacing a mostly
unused custom mutex/condvar-based sleep locks with two locks: an
rwlock (for non-sleeping use) and sxlock (for sleeping use). Either
acquired for read is sufficient to stabilize the vnet list, but both
must be acquired for write to modify the list.
Replace previous no-op read locking macros, used in various places
in the stack, with actual locking to prevent race conditions. Callers
must declare when they may perform unbounded sleeps or not when
selecting how to lock.
Refactor vnet sysinits so that the vnet list and locks are initialized
before kernel modules are linked, as the kernel linker will use them
for modules loaded by the boot loader.
Update various consumers of these KPIs based on whether they may sleep
or not.
sam [Sat, 18 Jul 2009 20:19:53 +0000 (20:19 +0000)]
Move code that does payload realigment to a new routine, ieee80211_realign,
so it can be reused. While here rewrite the logic to always use a single mbuf.
Fix a problem, whereby misbehaving IPv6 applications, which don't include
a valid zone ID or interface identifier in a v6 multicast leave, would
trigger a fairly paranoid KASSERT().
Observed with Boost++ regression tests on ref8.freebsd.org.
- Fix the issue with read access count modification on RAID-5 plexes properly.
If the access counts were not increased and decreased in equal numbers by
gvinum consumers, the read access count would be inconsistent with the write
access count. Instead, modify the read access count with the write access
count directly to prevent any inconsistencies.
An addendum to r195649, "Add support to the virtual memory system for
configuring machine-dependent memory attributes...":
Don't set the memory attribute for a "real" page that is allocated to
a device object in vm_page_alloc(). It is a pointless act, because
the device pager replaces this "real" page with a "fake" page and sets
the memory attribute on that "fake" page.
Eliminate pointless code from pmap_cache_bits() on amd64.
Employ the "Self Snoop" feature supported by some x86 processors to
avoid cache flushes in the pmap.
Store accurate offset information in CTF data. A large number of
structs had incorrect member offsets, limiting dtrace's usefulness
when working with them. An example of incorrect info (struct
rtentry) from before this fix:
<1738> STRUCT rtentry (200 bytes)
rt_nodes type=1731 off=0
rt_gateway type=849 off=65280 <== WRONG, should be 8 * 96
rt_flags type=3 off=65344 <== wrong again, and so on..
...
Approved by: re (kib), gnn (mentor)
MFC after: 2 weeks
Patch the regular nfs client in a manner analagous to
r195704 for the experimental client. The patch avoids calling vn_lock()
for the case where nfs_nget() has acquired the same vnode as dvp,
since nfs_nget() has already locked the vnode.
Reviewed by: kib, jhb
Approved by: re (kensmith), kib (mentor)
Import OpenBSM 1.1p1 from vendor branch to 8-CURRENT, populating
contrib/openbsm and a subset also imported into sys/security/audit.
This patch release addresses several minor issues:
- Fixes to AUT_SOCKUNIX token parsing.
- IPv6 support for au_to_me(3).
- Improved robustness in the parsing of audit_control, especially long
flags/naflags strings and whitespace in all fields.
- Add missing conversion of a number of FreeBSD/Mac OS X errnos to/from BSM
error number space.
MFC after: 3 weeks
Obtained from: TrustedBSD Project
Sponsored by: Apple, Inc.
Approved by: re (kib)
dtrace_gethrtime: improve scaling of TSC ticks to nanoseconds
Currently dtrace_gethrtime uses formula similar to the following for
converting TSC ticks to nanoseconds:
rdtsc() * 10^9 / tsc_freq
The dividend overflows 64-bit type and wraps-around every 2^64/10^9 = 18446744073 ticks which is just a few seconds on modern machines.
Now we instead use precalculated scaling factor of
10^9*2^N/tsc_freq < 2^32 and perform TSC value multiplication separately
for each 32-bit half. This allows to avoid overflow of the dividend
described above.
The idea is taken from OpenSolaris.
This has an added feature of always scaling TSC with invariant value
regardless of TSC frequency changes. Thus the timestamps will not be
accurate if TSC actually changes, but they are always proportional to
TSC ticks and thus monotonic. This should be much better than current
formula which produces wildly different non-monotonic results on when
tsc_freq changes.
Also drop write-only 'cp' variable from amd64 dtrace_gethrtime_init()
to make it identical to the i386 twin.
r195699 introduced an assertion regarding when progbits data in kernel
modules was present, which turns out to be false in some situations.
Back out the assertion.
Reported by: Luiz Otavio O Souza <lists.br at gmail.com>,
Florian Smeets <flo at kasimir.com>
Approved by: re (kensmith) (implicit)
Fix the experimental nfs client so that it does not cause a
"share->excl" panic when doing a lookup of dotdot at the root
of a server's file system. The patch avoids calling vn_lock()
for that case, since nfscl_nget() has already acquired a lock
for the vnode.
Use PBDRY flag for msleep(9) in NFS and NLM when sleeping thread owns
kernel resources that block other threads, like vnode locks. The SIGSTOP
sent to such thread (process, rather) shall not stop it until thread
releases the resources.
Tested by: pho
Reviewed by: jhb
Approved by: re (kensmith)
Add new msleep(9) flag PBDY that shall be specified together with
PCATCH, to indicate that thread shall not be stopped upon receipt of
SIGSTOP until it reaches the kernel->usermode boundary.
Also change thread_single(SINGLE_NO_EXIT) to only stop threads at
the user boundary unconditionally.
Tested by: pho
Reviewed by: jhb
Approved by: re (kensmith)
Move the repeated code to calculate the number of the threads in the
process that still need to be suspended or exited from thread_single
into the new function calc_remaining().
Tested by: pho
Reviewed by: jhb
Approved by: re (kensmith)
When wakeup(9) is going to notify swapper, assert that wait channel is not
equal to &proc0. It shall be not, since proc0 stack is not swappable, and
kick_proc0() is wakeup(&proc0).
Build on Jeff Roberson's linker-set based dynamic per-CPU allocator
(DPCPU), as suggested by Peter Wemm, and implement a new per-virtual
network stack memory allocator. Modify vnet to use the allocator
instead of monolithic global container structures (vinet, ...). This
change solves many binary compatibility problems associated with
VIMAGE, and restores ELF symbols for virtualized global variables.
Each virtualized global variable exists as a "reference copy", and also
once per virtual network stack. Virtualized global variables are
tagged at compile-time, placing the in a special linker set, which is
loaded into a contiguous region of kernel memory. Virtualized global
variables in the base kernel are linked as normal, but those in modules
are copied and relocated to a reserved portion of the kernel's vnet
region with the help of a the kernel linker.
Virtualized global variables exist in per-vnet memory set up when the
network stack instance is created, and are initialized statically from
the reference copy. Run-time access occurs via an accessor macro, which
converts from the current vnet and requested symbol to a per-vnet
address. When "options VIMAGE" is not compiled into the kernel, normal
global ELF symbols will be used instead and indirection is avoided.
This change restores static initialization for network stack global
variables, restores support for non-global symbols and types, eliminates
the need for many subsystem constructors, eliminates large per-subsystem
structures that caused many binary compatibility issues both for
monitoring applications (netstat) and kernel modules, removes the
per-function INIT_VNET_*() macros throughout the stack, eliminates the
need for vnet_symmap ksym(2) munging, and eliminates duplicate
definitions of virtualized globals under VIMAGE_GLOBALS.
Bump __FreeBSD_version and update UPDATING.
Portions submitted by: bz
Reviewed by: bz, zec
Discussed with: gnn, jamie, jeff, jhb, julian, sam
Suggested by: peter
Approved by: re (kensmith)
kan [Tue, 14 Jul 2009 21:19:13 +0000 (21:19 +0000)]
Second attempt at eliminating .text relocations in shared libraries
compiled with stack protector.
Use libssp_nonshared library to pull __stack_chk_fail_local symbol into
each library that needs it instead of pulling it from libc. GCC
generates local calls to this function which result in absolute
relocations put into position-independent code segment, making dynamic
loader do extra work every time given shared library is being relocated
and making affected text pages non-shareable.
- Change mmap() to fail requests with EINVAL that pass a length of 0. This
behavior is mandated by POSIX.
- Do not fail requests that pass a length greater than SSIZE_MAX
(such as > 2GB on 32-bit platforms). The 'len' parameter is actually
an unsigned 'size_t' so negative values don't really make sense.
Submitted by: Alexander Best alexbestms at math.uni-muenster.de
Reviewed by: alc
Approved by: re (kib)
MFC after: 1 week
- Do aggresive saturation on various polynomial interpolators.
This dramatically pushing 99.9% interpolations and quantizations
error _below_ -180dB on 32bit dynamic range, resulting extremely
high quality conversion.
- Use BSPLINE interpolator for filter oversampling factor greater or
equal than 64 (log2 6).
sam [Tue, 14 Jul 2009 17:11:06 +0000 (17:11 +0000)]
Updates, mostly to add 802.11s support:
o add missing Status and Reason codes
o parse/display Action frames
o parse/display Mesh data frames
o parse/display BA frames
Fix a buglet that slipped into r195654. My buildworld/buildkernel sanity
check missed this because cxgb's TOM is currently commented out of the build
system.
Submitted by: Navdeep Parhar <np at FreeBSD dot org>
Approved by: re (kensmith), kensmith (mentor temporarily unavailable)
Correct an error of omission in r195649 ("Add support to the virtual memory
system for configuring machine-dependent memory attributes: ..."). In
r195649, the "vm_cache_mode_t/vm_memattr_t" parameter was removed from
vm_phys_alloc_contig().
Fix a race in the manipulation of the V_tcp_sack_globalholes global variable,
which is currently not protected by any type of lock. When triggered, the bug
would sometimes cause a panic when the TCP activity to an affected machine
eventually slowed during a lull. The panic only occurs if INVARIANTS is compiled
into the kernel, and has laid dormant for some time as a result of INVARIANTS
being off by default except in FreeBSD-CURRENT.
Switch to atomic operations in the locations where the variable is changed.
Reads have not been updated to be protected by atomics, so there is a
possibility of accounting errors in any given calculation where the variable is
read. This is considered unlikely to occur in the wild, and will not cause
serious harm on rare occasions where it does.
Thanks to Robert Watson for debugging help.
Reported by: Kamigishi Rei <spambox at haruhiism dot net>
Tested by: Kamigishi Rei <spambox at haruhiism dot net>
Reviewed by: silby
Approved by: re (rwatson), kensmith (mentor temporarily unavailable)
Replace struct tcpopt with a proxy toeopt struct in the TOE driver interface to
the TCP syncache. This returns struct tcpopt to being private within the TCP
implementation, thus allowing it to be modified without ABI concerns.
The patch breaks the ABI. Bump __FreeBSD_version to 800103 accordingly. The cxgb
driver is the only TOE consumer affected by this change, and needs to be
recompiled along with the kernel.
1) Use our vendor domain at the pool.
2) Point people at the pool website and encourage
people to provide a server in the pool (as a
courtesy to the pool guys).
3) Fix a spelling.
4) Comment out the local clock and include a link
to documentation for use of the local clock on
the ntp.org site.
Add support to the virtual memory system for configuring machine-
dependent memory attributes:
Rename vm_cache_mode_t to vm_memattr_t. The new name reflects the
fact that there are machine-dependent memory attributes that have
nothing to do with controlling the cache's behavior.
Introduce vm_object_set_memattr() for setting the default memory
attributes that will be given to an object's pages.
Introduce and use pmap_page_{get,set}_memattr() for getting and
setting a page's machine-dependent memory attributes. Add full
support for these functions on amd64 and i386 and stubs for them on
the other architectures. The function pmap_page_set_memattr() is also
responsible for any other machine-dependent aspects of changing a
page's memory attributes, such as flushing the cache or updating the
direct map. The uses include kmem_alloc_contig(), vm_page_alloc(),
and the device pager:
kmem_alloc_contig() can now be used to allocate kernel memory with
non-default memory attributes on amd64 and i386.
vm_page_alloc() and the device pager will set the memory attributes
for the real or fictitious page according to the object's default
memory attributes.
Update the various pmap functions on amd64 and i386 that map pages to
incorporate each page's memory attributes in the mapping.
Notes: (1) Inherent to this design are safety features that prevent
the specification of inconsistent memory attributes by different
mappings on amd64 and i386. In addition, the device pager provides a
warning when a device driver creates a fictitious page with memory
attributes that are inconsistent with the real page that the
fictitious page is an alias for. (2) Storing the machine-dependent
memory attributes for amd64 and i386 as a dedicated "int" in "struct
md_page" represents a compromise between space efficiency and the ease
of MFCing these changes to RELENG_7.
This patch adds a host route to an interface address (that is assigned
to a non loopback/ppp link type) through the loopback interface. Prior
to the new L2/L3 rewrite, this host route was explicitly created when
processing the IPv6 address assignment. This loopback host route is
deleted when that IPv6 address is removed from the interface.
Add calls to the experimental nfs client for the case of an "intr" mount,
so that signals that aren't supposed to terminate RPCs in progress are
masked off during the RPC.
For single segment allocations the boundary field
of the BUSDMA tag should be zero. Currently all
single segment allocations are less than or equal
to 4096 bytes, so the limit does not kick in. If
any single segment USB allocations would be greater
than 4K, then it would be a problem.
When VM_MAP_WIRE_HOLESOK is not specified and vm_map_wire(9) encounters
non-readable and non-executable map entry, the entry is skipped from
wiring and loop is aborted. But, since MAP_ENTRY_WIRE_SKIPPED was not
set for the map entry, its wired_count is later erronously decremented.
vm_map_delete(9) for such map entry stuck in "vmmaps".
Properly set MAP_ENTRY_WIRE_SKIPPED when aborting the loop.
Reported by: John Marshall <john.marshall riverwillow com au>
Approved by: re (kensmith)
The patch breaks the ABI. Bump __FreeBSD_version to 800102 accordingly. User
space tools that rely on the size of any of these structs (e.g. sockstat) need
to be recompiled.
Rename option USBVERBOSE to USB_VERBOSE for 2 reasons:
1. USB_VERBOSE is more consistent with USB_DEBUG,
2. sys/dev/usb/usb_device.c uses option USB_VERBOSE and
not USBVERBOSE.
POLA with the USBVERBOSE option as it's found in 7-STABLE
has been considered but found insignificant in the face
of the USB stack overhaul.
Increase the size of the page table on 64-bit PowerPC machines as a
bandaid to prevent exhaustion of the primary and secondary hash groups
in the event of extreme stress on the PMAP layer (e.g. a forkbomb). This
wastes memory, and should be revised to properly handle PTEG spills instead.
Revert rev 192323 (nfs_common.c only):
The D-cache flushing added here was to deal with I-cache
incoherency observed on ia64. However, the problem was
in the implementation of pmap_enter_object() for ia64:
it was missing I-cache coherency logic for prefaulted
pages. After this got added in rev 195625, testing showed
that no D-cache flushing was required.
The SIGILL that was observed on Book-E (see commit log
for rev 192323) ended up not being related to I-cache
incoherency, but was found to be caused by bad memory.
This discovery further undermined the need for D-cache
flushing in the NFS I/O code, triggering the reversal.
In nvpair_native_embedded_array(), meaningless pointers are zeroed.
The programmer was aware that alignment was not guaranteed in the
packed structure and used bzero() to NULL out the pointers.
However, on ia64, the compiler is quite agressive in finding ILP
and calls to bzero() are often replaced by simple assignments (i.e.
stores). Especially when the width or size in question corresponds
with a store instruction (i.e. st1, st2, st4 or st8).
The problem here is not a compiler bug. The address of the memory
to zero-out was given by '&packed->nvl_priv' and given the type of
the 'packed' pointer the compiler could assume proper alignment for
the replacement of bzero() with an 8-byte wide store to be valid.
The problem is with the programmer. The programmer knew that the
address did not have the alignment guarantees needed for a regular
assignment, but failed to inform the compiler of that fact. In
fact, the programmer told the compiler the opposite: alignment is
guaranteed.
The fix is to avoid using a pointer of type "nvlist_t *" and
instead use a "char *" pointer as the basis for calculating the
address. This tells the compiler that only 1-byte alignment can
be assumed and the compiler will either keep the bzero() call
or instead replace it with a sequence of byte-wise stores. Both
are valid.
Remove build timestamps from the following files:
/boot/kernel/hptrr.ko
/etc/mail/*.cf
/lib/libcrypto.so.5
/usr/bin/ntpq
/usr/sbin/amd
/usr/sbin/iasl
/usr/sbin/ntpd
/usr/sbin/ntpdate
/usr/sbin/ntpdc
There does not appear to be any purpose to having these timestamps, and
they have the irritating consequence that the aforementioned files will
be different every time they are rebuilt.
After this commit, the only remaining build timestamps are in the kernel,
the boot loaders, /usr/include/osreldate.h (the year in the copyright
notice), and lib*.a (the timestamps on all of the included .o files).
Reviewed by: scottl (hptrr), gshapiro (sendmail), simon (openssl),
roberto (ntp), jkim (acpica)
Approved by: re (kib)
On exec(2), when loading the ELF image, pmap_enter_object() is
called to prefault pages. This is an obvious place for making
sure the I-cache is coherent. It was missing though. As such,
execution over NFS and ZFS file systems was failing. NFS was
fixed the wrong way (by flushing the D-cache as part of the
NFS code) in a previous commit. ZFS problems were encountered
after that and indicated that something else was wrong...
Re-factoring for adding weighted routes introduced a
fairly irritating bug where the system will panic
when RADIX_MPATH is enabled. This change fixes this.
Fix .Dd value -- our mdoc macros don't know how to parse the $Mdocdate$
tag, so the file was being treated as having no date (i.e., the current
date was being inserted).
Implementation of the upcoming Wireless Mesh standard, 802.11s, on the
net80211 wireless stack. This work is based on the March 2009 D3.0 draft
standard. This standard is expected to become final next year.
This includes two main net80211 modules, ieee80211_mesh.c
which deals with peer link management, link metric calculation,
routing table control and mesh configuration and ieee80211_hwmp.c
which deals with the actually routing process on the mesh network.
HWMP is the mandatory routing protocol on by the mesh standard, but
others, such as RA-OLSR, can be implemented.
Authentication and encryption are not implemented.
There are several scripts under tools/tools/net80211/scripts that can be
used to test different mesh network topologies and they also teach you
how to setup a mesh vap (for the impatient: ifconfig wlan0 create
wlandev ... wlanmode mesh).
A new build option is available: IEEE80211_SUPPORT_MESH and it's enabled
by default on GENERIC kernels for i386, amd64, sparc64 and pc98.
Drivers that support mesh networks right now are: ath, ral and mwl.
More information at: http://wiki.freebsd.org/WifiMesh
Please note that this work is experimental. Also, please note that
bridging a mesh vap with another network interface is not yet supported.
Many thanks to the FreeBSD Foundation for sponsoring this project and to
Sam Leffler for his support.
Also, I would like to thank Gateworks Corporation for sending me a
Cambria board which was used during the development of this project.
Reviewed by: sam
Approved by: re (kensmith)
Obtained from: projects/mesh11s
Get correct maxio from the controller and drop the tunable.
The default (64K) is too pessimistic for "new comm" hardware.
Also, this is bad because multiple controllers get limited by
the global tunable.
When amd64 CPU cannot load segment descriptor during trap return to
usermode, it generates GPF, that is mirrored to user mode as SIGSEGV.
The offending register in mcontext should contain the value loading of
which generated the GPF, and it is so on i386. On amd64, we currently
report segment descriptor in tf_err, while segment register contains the
corrected value loaded by trap handler.
Fix the issue by behaving like i386, reloading segment register in trap
frame after signal frame is pushed onto user stack.
Noted and tested by: pho
Approved by: re (kensmith)
Separate the parallel scsi knowledge out of the core of the XPT, and
modularize it so that new transports can be created.
Add a transport for SATA
Add a periph+protocol layer for ATA
Add a driver for AHCI-compliant hardware.
Add a maxio field to CAM so that drivers can advertise their max
I/O capability. Modify various drivers so that they are insulated
from the value of MAXPHYS.
The new ATA/SATA code supports AHCI-compliant hardware, and will override
the classic ATA driver if it is loaded as a module at boot time or compiled
into the kernel. The stack now support NCQ (tagged queueing) for increased
performance on modern SATA drives. It also supports port multipliers.
ATA drives are accessed via 'ada' device nodes. ATAPI drives are
accessed via 'cd' device nodes. They can all be enumerated and manipulated
via camcontrol, just like SCSI drives. SCSI commands are not translated to
their ATA equivalents; ATA native commands are used throughout the entire
stack, including camcontrol. See the camcontrol manpage for further
details. Testing this code may require that you update your fstab, and
possibly modify your BIOS to enable AHCI functionality, if available.
This code is very experimental at the moment. The userland ABI/API has
changed, so applications will need to be recompiled. It may change
further in the near future. The 'ada' device name may also change as
more infrastructure is completed in this project. The goal is to
eventually put all CAM busses and devices until newbus, allowing for
interesting topology and management options.
Few functional changes will be seen with existing SCSI/SAS/FC drivers,
though the userland ABI has still changed. In the future, transports
specific modules for SAS and FC may appear in order to better support
the topologies and capabilities of these technologies.
The modularization of CAM and the addition of the ATA/SATA modules is
meant to break CAM out of the mold of being specific to SCSI, letting it
grow to be a framework for arbitrary transports and protocols. It also
allows drivers to be written to support discrete hardware without
jeopardizing the stability of non-related hardware. While only an AHCI
driver is provided now, a Silicon Image driver is also in the works.
Drivers for ICH1-4, ICH5-6, PIIX, classic IDE, and any other hardware
is possible and encouraged. Help with new transports is also encouraged.
Since the nfscl_getclose() function both decremented open counts and,
optionally, created a separate list of NFSv4 opens to be closed, it
was possible for the associated OpenOwner to be free'd before the Open
was closed. The problem was that the Open was taken off the OpenOwner
list before the Close RPC was done and OpenOwners can be free'd once the
list is empty. This patch separates out the case of doing the Close RPC
into a separate function called nfscl_doclose() and simplifies nfsrpc_doclose()
so that it closes a single open instead of a list of them. This avoids
removing the Open from the OpenOwner list before doing the Close RPC.
The control terminal revocation at the session leader exit does not
correctly checks for reclaimed vnode, possibly calling VOP_REVOKE for
such vnode. If the terminal is already revoked, or devfs mount was
forcibly unmounted, the revocation of doomed ctty vnode causes panic.
Reported and tested by: lstewart
Approved by: re (kensmith)
MFC after: 2 weeks
Extend the cn_flags field of the struct componentname to 64 bits to have
more space for the flags, that is too close to be exhausted. While changing
the KBI for name(9), use unsigned int for symlinks count.