Robert Watson [Thu, 14 Feb 2008 20:57:38 +0000 (20:57 +0000)]
Add open_to_operation, a security regression test that opens files with
various open flags and then tests various operations to make sure that
they are properly constrained by open flags. Various I/O mechansms
are tried, including aio if compiled into the kernel or loaded as a
module. There's more to be done here but it's a useful start, running
about 220 individual tests.
Yaroslav Tykhiy [Thu, 14 Feb 2008 20:12:23 +0000 (20:12 +0000)]
No network addresses in the system isn't a good excuse
for rpcbind(8) to crash.
The crash was due to a boolean variable initialized
improperly. Besides fixing the initialization, pick
a better name for the variable so that its meaning is
clear and no more coding errors appear around it.
John Baldwin [Thu, 14 Feb 2008 20:01:52 +0000 (20:01 +0000)]
Make netstat -rn more resilient to having the routing table change out from
under it while running. Note that this is still not perfect:
- Try to do something intelligent if kvm_read() fails to read a routing
table structure such as an rtentry, radix_node, or ifnet.
- Don't follow left and right node pointers in radix_nodes unless
RNF_ACTIVE is set in rn_flags. This avoids walking through freed
radix_nodes.
Marcel Moolenaar [Thu, 14 Feb 2008 18:46:50 +0000 (18:46 +0000)]
On Montecito processors, the instruction cache is in fact not
coherent with the data caches. Implement a quick fix to allow
us to boot on Montecito, while I'm working on a better fix in
the mean time.
Yaroslav Tykhiy [Thu, 14 Feb 2008 17:04:31 +0000 (17:04 +0000)]
In the new order of things dictated by nmount(2), a read-only mount
is to be requested via a "ro" option. At the same time, MNT_RDONLY
is gradually becoming an indicator of the current state of the FS
instead of a command flag. Today passing MNT_RDONLY alone to the
kernel's mount machinery will lead to various glitches. (See the
PRs for examples.)
Therefore mount the root FS with a "ro" option instead of the
MNT_RDONLY flag. (Note that MNT_RDONLY still is added to the mount
flags internally, by vfs_donmount(), if "ro" was specified.)
To be able to pass "ro" cleanly to kernel_vmount(), teach the latter
function to accept options with NULL values.
Also correct the comment explaining how mount_arg() handles length
of -1.
PR: bin/106636 kern/120319
Submitted by: Jaakko Heinonen <see PR kern/120319 for email> (originally)
Andrew Gallatin [Thu, 14 Feb 2008 16:24:14 +0000 (16:24 +0000)]
Now that mxge supports MSI-X interrupts, reverse the logic and flag
legacy interrupts rather than MSI as a special case. Prior to this
commit, the interrupt handler was doing the slow handshaking with
the device to ensure the legacy interrupt was lowered in both
the legacy and MSI-X case. This handshaking was not
required for MSI-X.
Bruce Evans [Thu, 14 Feb 2008 13:44:03 +0000 (13:44 +0000)]
Use the expression fabs(x+0.0)+fabs(y+0.0) instad of a+b (where a is
|x| or |y| and b is |y| or |x|) when mixing NaN arg(s).
hypot*() had its own foot shooting for mixing NaNs -- it swaps the
args so that |x| in bits is largest, but does this before quieting
signaling NaNs, so on amd64 (where the result of adding NaNs depends
on the order) it gets inconsistent results if setting the quiet bit
makes a difference, just like a similar ia64 and i387 hardware comparison.
The usual fix (see e_powf.c 1.13 for more details) of mixing using
(a+0.0)+-(b+0.0) doesn't work on amd64 if the args are swapped (since
the rder makes a difference with SSE). Fortunately, the original args
are unchanged and don't need to be swapped when we let the hardware
decide the mixing after quieting them, but we need to take their
absolute value.
hypotf() doesn't seem to have any real bugs masked by this non-bug.
On amd64, its maximum error in 2^32 trials on amd64 is now 0.8422 ulps,
and on i386 the maximum error is unchanged and about the same, except
with certain CFLAGS it magically drops to 0.5 (perfect rounding).
Bruce Evans [Thu, 14 Feb 2008 12:56:35 +0000 (12:56 +0000)]
Forced commit to note that the minus sign in the fancy expression
(x+0.0)-(y+0.0) for mixing NaNs documented in a previous log message
didn't actually get committed. Apparently, adding 0.0 uniformizes
the order enough to give consistent results.
Bruce Evans [Thu, 14 Feb 2008 10:23:51 +0000 (10:23 +0000)]
Fix the hi+lo decomposition for 2/(3ln2). The decomposition needs to
be into 12+24 bits of precision for extra-precision multiplication,
but was into 13+24 bits. On i386 with -O1 the bug was hidden by
accidental extra precision, but on amd64, in 2^32 trials the bug
caused about 200000 errors of more than 1 ulp, with a maximum error
of about 80 ulps. Now the maximum error in 2^32 trials on amd64
is 0.8573 ulps. It is still 0.8316 ulps on i386 with -O1.
The nearby decomposition of 1/ln2 and the decomposition of 2/(3ln2) in
the double precision version seem to be sub-optimal but not broken.
Bruce Evans [Thu, 14 Feb 2008 09:42:24 +0000 (09:42 +0000)]
Use the expression (x+0.0)-(y+0.0) instead of x+y when mixing NaN arg(s).
This uses 2 tricks to improve consistency so that more serious problems
aren't hidden in simple regression tests by noise for the NaNs:
- for a signaling NaN, adding 0.0 generates the invalid exception and
converts to a quiet NaN, and doesn't have too many effects for other
types of args (it converts -0 to +0 in some rounding modes, but that
hopefully doesn't change the result after adding the NaN arg). This
avoids some inconsistencies on i386 and ia64. On these arches, the
result of an operation on 2 NaNs is apparently the largest or the
smallest of the NaNs as bits (consistently largest or smallest for
each arch, but the opposite). I forget which way the comparison
goes and if the sign bit affects it. The quiet bit is is handled
poorly by not always setting it before the comparision or ignoring
it. Thus if one of the args was originally a signaling NaN and the
other was originally a quiet NaN, then the result depends too much
on whether the signaling NaN has been quieted at this point, which
in turn depends on optimizations and promotions. E.g., passing float
signaling NaNs to double functions must quiet them on conversion;
on i387, loading a signaling NaN of type float or double (but not
long double) into a register involves a conversion, so it quiets
signaling NaNs, so if the addition has 2 register operands than it
only sees quiet NaNs, but if the addition has a memory operand then
it sees a signaling NaN iff it is in the memory operand.
- subtraction instead of addition is used to avoid a dubious optimization
in old versions of gcc. For SSE operations, mixing of NaNs apparently
always gives the target operand. This is not as good as the i387
and ia64 behaviour. It doesn't mix NaNs at all, and makes addition
not quite commutative. Old versions of gcc sometimes rewrite x+y
to y+x and thus give different results (in bits) for NaNs. gcc-3.3.3
rewrites x+y to y+x for one of pow() and powf() but not the other,
so starting from float NaN args x and y, powf(x, y) was almost always
different from pow(x, y).
These tricks won't give consistency of 2-arg float and double functions
with long double ones on amd64, since long double ones use the i387
which has different semantics from SSE.
Pyun YongHyeon [Thu, 14 Feb 2008 01:10:48 +0000 (01:10 +0000)]
Nuke local jumbo allocator and switch to use of UMA backed page
allocator for jumbo frame.
o Removed unneeded jlist lock which was used to manage jumbo
buffers.
o Don't reinitialize hardware if MTU was not changed.
o Added additional check for minimal MTU size.
o Added a new tunable hw.skc.jumbo_disable to disable jumbo frame
support for the driver. The tunable could be set for systems that
do not need to use jumbo frames and it would save
(9K * number of Rx descriptors) bytes kernel memory.
o Jumbo buffer allocation failure is no longer critical error for
the operation of sk(4). If sk(4) encounter the allocation failure
it just disables jumbo frame support and continues to work without
user intervention.
With these changes jumbo frame performance of sk(4) was slightly
increased and users should not encounter jumbo buffer allocation
failure. Previously sk(4) tried to allocate physically contiguous
memory, 3388KB for 256 Rx descriptors. Sometimes that amount of
contiguous memory region could not be available for running systems
which in turn resulted in failure of loading the driver.
Robert Watson [Thu, 14 Feb 2008 00:30:06 +0000 (00:30 +0000)]
In Coda, flush the attribute cache for a cnode when its fid is
changed, as its synthesized inode number may have changed and we
want stat(2) to pick up the new inode number.
John Baldwin [Wed, 13 Feb 2008 23:36:56 +0000 (23:36 +0000)]
Mark sleepqueue chain spin mutexes are recursable since the sleepq code
now recurses on them in sleepq_broadcast() and sleepq_signal() when
resuming threads that are fully asleep.
John Baldwin [Wed, 13 Feb 2008 21:34:06 +0000 (21:34 +0000)]
Add an automatic kernel module version dependency to prevent loading
modules using invalid ABI versions (e.g. a 7.x module with an 8.x kernel)
for a given kernel:
- Add a 'kernel' module version whose value is __FreeBSD_version.
- Add a version dependency on 'kernel' in every module that has an
acceptable version range of __FreeBSD_version up to the end of the
branch __FreeBSD_version is part of. E.g. a module compiled on 701000
would work on kernels with versions between 701000 and 799999 inclusive.
Colin Percival [Wed, 13 Feb 2008 20:46:23 +0000 (20:46 +0000)]
Improve conformance to the HTTP specification by using case-insensitive
comparisons for header keywords. Apparently some proxies use creative
capitalization.
Attilio Rao [Wed, 13 Feb 2008 20:44:19 +0000 (20:44 +0000)]
- Add real assertions to lockmgr locking primitives.
A couple of notes for this:
* WITNESS support, when enabled, is only used for shared locks in order
to avoid problems with the "disowned" locks
* KA_HELD and KA_UNHELD only exists in the lockmgr namespace in order
to assert for a generic thread (not curthread) owning or not the
lock. Really, this kind of check is bogus but it seems very
widespread in the consumers code. So, for the moment, we cater this
untrusted behaviour, until the consumers are not fixed and the
options could be removed (hopefully during 8.0-CURRENT lifecycle)
* Implementing KA_HELD and KA_UNHELD (not surported natively by
WITNESS) made necessary the introduction of LA_MASKASSERT which
specifies the range for default lock assertion flags
* About other aspects, lockmgr_assert() follows exactly what other
locking primitives offer about this operation.
- Build real assertions for buffer cache locks on the top of
lockmgr_assert(). They can be used with the BUF_ASSERT_*(bp)
paradigm.
- Add checks at lock destruction time and use a cookie for verifying
lock integrity at any operation.
- Redefine BUF_LOCKFREE() in order to not use a direct assert but
let it rely on the aforementioned destruction time check.
KPI results evidently broken, so __FreeBSD_version bumping and
manpage update result necessary and will be committed soon.
Side note: lockmgr_assert() will be used soon in order to implement
real assertions in the vnode namespace replacing the legacy and still
bogus "VOP_ISLOCKED()" way.
Robert Watson [Wed, 13 Feb 2008 19:50:17 +0000 (19:50 +0000)]
Update cache flushing behavior in light of recent namecache and
access cache improvements:
- Flush just access control state on CODA_PURGEUSER, not the full
namecache for /coda.
- When replacing a fid on a cnode as a result of, e.g.,
reintegration after offline operation, we no longer need to
purge the namecache entries associated with its vnode.
Bruce Evans [Wed, 13 Feb 2008 18:16:43 +0000 (18:16 +0000)]
Forced commit to note that the lost log message for the previous commit
said that the previous commit was almost a null forced commit too. It
just converted to __FBSDID(). I was going to change `huge' from its
double precision value of 1e300, but that seems to be unnecessary since
`huge' is only used to set FE_INEXACT, and any value with an exponent
larger than LDBL_MANT_DIG will do for that, while initializing a really
huge value in a portable way would require more code.
Bruce Evans [Wed, 13 Feb 2008 16:56:52 +0000 (16:56 +0000)]
On arches where long double is the same as double, alias ceil(), floor()
and trunc() to the corresponding long double functions. This is not
just an optimization for these arches. The full long double functions
have a wrong value for `huge', and the arches without full long doubles
depended on it being wrong.
Robert Watson [Wed, 13 Feb 2008 15:45:12 +0000 (15:45 +0000)]
Implement a rudimentary access cache for the Coda kernel module,
modeled on the access cache found in NFS, smbfs, and the Linux coda
module. This is a positive access cache of a single entry per file,
tracking recently granted rights, but unlike NFS and smbfs,
supporting explicit invalidation by the distributed file system.
For each cnode, maintain a C_ACCCACHE flag indicating the validity
of the cache, and a cached uid and mode tracking recently granted
positive access control decisions.
Prefer the cache to venus_access() in VOP_ACCESS() if it is valid,
and when we must fall back to venus_access(), update the cache.
Allow Venus to clear the access cache, either the whole cache on
CODA_FLUSH, or just entries for a specific uid on CODA_PURGEUSER.
Unlike the Coda module on Linux, we don't flush all entries on a
user purge using a generation number, we instead walk present
cnodes and clear only entries for the specific user, meaning it is
somewhat more expensive but won't hit all users.
Since the Coda module is agressive about not keeping around
unopened cnodes, the utility of the cache is somewhat limited for
files, but works will for directories. We should make Coda less
agressive about GCing cnodes in VOP_INACTIVE() in order to improve
the effectiveness of in-kernel caching of attributes and access
rights.
Robert Watson [Wed, 13 Feb 2008 13:06:22 +0000 (13:06 +0000)]
Rather than having the Coda module use its own namecache, use the global
VFS namecache, as is done by the Coda module on Linux. Unlike the Coda
namecache, the global VFS namecache isn't tagged by credential, so use
ore conservative flushing behavior (for now) when CODA_PURGEUSER is
issued by Venus.
This improves overall integration with the FreeBSD VFS, including
allowing __getcwd() to work better, procfs/procstat monitoring, and so
on. This improves shell behavior in many cases, and improves ".."
handling. It may lead to some slowdown until we've implemented a
specific access cache, which should net improve performance, but in the
mean time, lookup access control now always goes to Venus, whereas
previously it didn't.
Attilio Rao [Wed, 13 Feb 2008 13:02:12 +0000 (13:02 +0000)]
Fix a lock leak in the ntfs locking scheme:
When ntfs_ntput() reaches 0 in the refcount the inode lockmgr is not
released and directly destroyed. Fix this by unlocking the lockmgr() even
in the case of zero-refcount.
Bruce Evans [Wed, 13 Feb 2008 10:44:44 +0000 (10:44 +0000)]
Fix exp2*(x) on signaling NaNs by returning x+x as usual.
This has the side effect of confusing gcc-4.2.1's optimizer into more
often doing the right thing. When it does the wrong thing here, it
seems to be mainly making too many copies of x with dependency chains.
This effect is tiny on amd64, but in some cases on i386 it is enormous.
E.g., on i386 (A64) with -O1, the current version of exp2() should
take about 50 cycles, but took 83 cycles before this change and 66
cycles after this change. exp2f() with -O1 only speeded up from 51
to 47 cycles. (exp2f() should take about 40 cycles, on an Athlon in
either i386 or amd64 mode, and now takes 42 on amd64). exp2l() with
-O1 slowed down from 155 cycles to 123 for some args; this is unimportant
since the i386 exp2l() is a fake; the wrong thing for it seems to
involve branch misprediction.
Bruce Evans [Wed, 13 Feb 2008 08:36:13 +0000 (08:36 +0000)]
Rearrange the polynomial evaluation for better parallelism. This is
faster on all machines tested (old Celeron (P2), A64 (amd64 and i386)
and ia64) except on ia64 when compiled with -O1. It takes 2 more
multiplications, so it will be slower on old machines. The speedup
is about 8 cycles = 17% on A64 (amd64 and i386) with best CFLAGS
and some parallelism in the caller.
Move the evaluation of 2**k up a bit so that it doesn't compete too
much with the new polynomial evaluation. Unlike the previous
optimization, this rearrangement cannot change the result, so compilers
and CPU schedulers can do it, but they don't do it quite right yet.
This saves a whole 1 or 2 cycles on A64.
- mention new firmware images used in multi-slice mode
- mention LRO support
- describe multi-slice related tunables.
- correct DIAGNOSTICS section to reflect that missing firmware
is non-fatal.
John Baldwin [Wed, 13 Feb 2008 00:04:58 +0000 (00:04 +0000)]
Consolidate the code to generate a new XID for a NFS request into a
nfs_xid_gen() function instead of duplicating the logic in both
nfsm_rpchead() and the NFS3ERR_JUKEBOX handling in nfs_request().
MFC after: 1 week
Submitted by: mohans (a long while ago)
Make sure we restrict Linux only IPC calls from being executed
through the FreeBSD ABI. IPC_INFO, SHM_INFO, SHM_STAT were added
specifically for Linux binary support. They are not documented
as being a part of the FreeBSD ABI, also, the structures necessary
for them have been hidden away from the users for a long time.
Also, the Linux ABI layer uses it's own structures to populate the
responses back to the user to ensure that the ABI is consistent.
I think there is a bit more separation work that needs to happen.
Marcel Moolenaar [Tue, 12 Feb 2008 18:14:46 +0000 (18:14 +0000)]
Add PIC support for IPIs. When registering an interrupt handler,
the PIC also informs the platform at which IRQ level it can start
assigning IPIs, since this can depend on the number of IRQs
supported for external interrupts.
Bruce Evans [Tue, 12 Feb 2008 17:11:36 +0000 (17:11 +0000)]
Fix remainder() and remainderf() in round-towards-minus-infinity mode
when the result is +-0. IEEE754 requires (in all rounding modes) that
if the result is +-0 then its sign is the same as that of the first
arg, but in round-towards-minus-infinity mode an uncorrected implementation
detail always reversed the sign. (The detail is that x-x with x's
sign positive gives -0 in this mode only, but the algorithm assumed
that x-x always has positive sign for such x.)
remquo() and remquof() seem to need the same fix, but I cannot test them
yet.
Use long doubles when mixing NaN args. This trick improves consistency
of results on at least amd64, so that more serious problems like the
above aren't hidden in simple regression tests by noise for the NaNs.
On amd64, hardware remainder should be used since it is about 10 times
faster than software remainder and is already used for remquo(), but
it involves using the i387 even for floats and doubles, and the i387
does NaN mixing which is better than but inconsistent with SSE NaN mixing.
Software remainder() would probably have been inconsistent with
software remainderl() for the same reason if the latter existed.
Signaling NaNs cause further inconsistencies on at least ia64 and i386.
Scott Long [Tue, 12 Feb 2008 16:24:30 +0000 (16:24 +0000)]
If busdma is being used to realign dynamic buffers and the alignment is set to
PAGE_SIZE or less, the bounce page counting logic was flawed and wouldn't
reserve any pages. Adjust to be correct. Review of other architectures is
forthcoming.
Rafal Jaworowski [Tue, 12 Feb 2008 11:03:29 +0000 (11:03 +0000)]
Eliminate BUS_DMA <-> cache incoherencies in USB transfers.
With write-allocate cache we get into the following scenario:
1. data has been updated in the memory by the USB HC, but
2. D-cache holds an un-flushed value of it
3. when affected cache line is being replaced, the old (un-flushed) value is
flushed and overwrites the newly arrived
This is possible due to how write-allocate works with virtual caches (ARM for
example).
In case of USB transfers it leads to fatal tags discrepancies in umass(4)
operation, which look like the following:
umass0: Invalid CSW: tag 1 should be 2
(probe0:umass-sim0:0:0:0): Request completed with CAM_REQ_CMP_ERR
(probe0:umass-sim0:0:0:0): Retrying Command
umass0: Invalid CSW: tag 1 should be 3
(probe0:umass-sim0:0:0:0): Request completed with CAM_REQ_CMP_ERR
(probe0:umass-sim0:0:0:0): Retrying Command
umass0: Invalid CSW: tag 1 should be 4
(probe0:umass-sim0:0:0:0): Request completed with CAM_REQ_CMP_ERR
(probe0:umass-sim0:0:0:0): Retrying Command
umass0: Invalid CSW: tag 1 should be 5
(probe0:umass-sim0:0:0:0): Request completed with CAM_REQ_CMP_ERR
(probe0:umass-sim0:0:0:0): Retrying Command
umass0: Invalid CSW: tag 1 should be 6
(probe0:umass-sim0:0:0:0): Request completed with CAM_REQ_CMP_ERR
(probe0:umass-sim0:0:0:0): error 5
(probe0:umass-sim0:0:0:0): Retries Exausted
To eliminate this, a BUS_DMASYNC_PREREAD sync operation is required in
usbd_start_transfer().
Credits for nailing this down go to Grzegorz Bernacki gjb AT semihalf DOT com.
Kris Kennaway [Mon, 11 Feb 2008 23:23:21 +0000 (23:23 +0000)]
Switch the default NFS mount mode from UDP to TCP. UDP mounts are a
historical relic, and are no longer appropriate for either LAN or WAN
mounting. At modern (gigabit and 10 gigabit) LAN speeds packet loss
from socket buffer fill events is common, and sequence numbers wrap
quickly enough that data corruption is possible. TCP solves both of
these problems without imposing significant overhead.
Marius Strobl [Mon, 11 Feb 2008 21:40:22 +0000 (21:40 +0000)]
The Sun disk label only uses 16-bit fields for cylinders, heads and
sectors so the geometry of large IDE disks has to be adjusted. This
corresponds to what the OpenSolaris dad(7D) driver does except that
the latter only tweaks sectors and effectively limits the mediasize
to 128GB so the cylinders and heads fields won't ever overflow. Not
limiting the mediasize is a compromise between allowing to use Sun
disk label as far as possible and being able to use the entire disk
with another disk label.
This allows to use the full capacity of large IDE disks if they were
not labeled under (Open)Solaris (in both ways of the meaning).
Rafal Jaworowski [Mon, 11 Feb 2008 12:30:32 +0000 (12:30 +0000)]
Clean up PowerPC loader(8) build config.
Turn off TFTP support by default: when both TFTP and NFS are enabled in the
loader, strange interactions occur in the pure netbooting scenario (i.e.
loader is TFTP-ed, kernel+world mounted over NFS), leading to very slow access
to the NFS-exported files.
Bruce Evans [Mon, 11 Feb 2008 05:20:02 +0000 (05:20 +0000)]
Use double precision for z and thus for the entire calculation of
exp2(i/TBLSIZE) * p(z) instead of only for the final multiplication
and addition. This fixes the code to match the comment that the maximum
error is 0.5010 ulps (except on machines that evaluate float expressions
in extra precision, e.g., i386's, where the evaluation was already
in extra precision).
Fix and expand the comment about use of double precision.
The relative roundoff error from evaluating p(z) in non-extra precision
was about 16 times larger than in exp2() because the interval length
is 16 times smaller. Its maximum was at least P1 * (1.0 ulps) *
max(|z|) ~= log(2) * 1.0 * 1/32 ~= 0.0217 ulps (1.0 ulps from the
addition in (1 + P1*z) with a cancelation error when z ~= -1/32). The
actual final maximum was 0.5313 ulps, of which 0.0303 ulps must have
come from the additional roundoff error in p(z). I can't explain why
the additional roundoff error was almost 3/2 times larger than the rough
estimate.
Andrew Thompson [Mon, 11 Feb 2008 03:05:11 +0000 (03:05 +0000)]
Add a geom class to map Linux LVM logical volumes.
The logical disks will appear as /dev/lvm/<vol group>-<logical vol>, for
instance /dev/lvm/vg0-home. GLVM currently supports linear stripes with
segments on multiple physical disks. The metadata is read only, logical
volumes can not be allocated or resized.
Attilio Rao [Sun, 10 Feb 2008 15:50:21 +0000 (15:50 +0000)]
- Revert last ehci.c change
- Include lock.h in lockmgr.h as nested header in order to safely use
LOCK_FILE and LOCK_LINE. As long as this code will be replaced soon
we can tollerate for a while this namespace pollution even if the real
fix would be to let lockmgr() depend by lock.h as a separate header.
Robert Watson [Sun, 10 Feb 2008 11:18:12 +0000 (11:18 +0000)]
Since we're now actively maintaining the Coda module in the FreeBSD source
tree, restyle everything but coda.h (which is more explicitly shared
across systems) into a closer approximation to style(9).
Mitsuru IWASAKI [Sun, 10 Feb 2008 06:21:52 +0000 (06:21 +0000)]
Add `hw.ciss.nop_message_heartbeat' tunable (default disabled) for
NOP-message polling in ciss_periodic().
Note that setting the tunable to non-zero can be workaround only for
`ADAPTER HEARTBEAT FAILED' problem, and may freeze the system w/o
the problem.
Reviewed by: scottl
Reported by: Attila Nagy
MFC after: 3 days