Kirk McKusick [Tue, 22 Oct 2002 01:23:00 +0000 (01:23 +0000)]
This update further fine-tunes the locking of snapshot vnodes in
the ffs_copyonwrite routine to avoid a deadlock between the syncer
daemon trying to sync out a snapshot vnode and the bufdaemon
trying to write out a buffer containing the snapshot inode.
With any luck this will be the last snapshot race condition.
Kirk McKusick [Tue, 22 Oct 2002 01:14:25 +0000 (01:14 +0000)]
This update is a performance improvement when allocating blocks on
a full filesystem. Previously, if the allocation failed, we had to
fsync the file before rolling back any partial allocation of indirect
blocks. Most block allocation requests only need to allocate a single
data block and if that allocation fails, there is nothing to unroll.
So, before doing the fsync, we check to see if any rollback will
really be necessary. If none is necessary, then we simply return.
This update eliminates the flurry of disk activity that got triggered
whenever a filesystem would run out of space.
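The early-return check described above might be sketched as follows. This is an illustration only: `alloc_state`, `nindir_allocated`, `sync_and_rollback()` and `alloc_failed()` are hypothetical names, not the actual FFS code.

```c
#include <errno.h>

/* Hypothetical bookkeeping for one allocation request (not the real FFS structures). */
struct alloc_state {
	int nindir_allocated;	/* indirect blocks allocated so far by this request */
	int fsync_calls;	/* counts trips through the expensive path, for illustration */
};

/* Stand-in for the expensive path: fsync the file, then unroll the partial allocation. */
void
sync_and_rollback(struct alloc_state *as)
{
	as->fsync_calls++;
	as->nindir_allocated = 0;
}

/* Called when a block allocation fails because the filesystem is full. */
int
alloc_failed(struct alloc_state *as)
{
	/* Only a data block was requested: nothing to unroll, so skip the fsync. */
	if (as->nindir_allocated == 0)
		return (ENOSPC);
	sync_and_rollback(as);
	return (ENOSPC);
}
```

In the common single-data-block case the function returns without touching the disk, which is the saving the message describes.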
Kirk McKusick [Tue, 22 Oct 2002 01:06:44 +0000 (01:06 +0000)]
This update removes a race between unmount and lookup. The lookup
locks the mount point directory while waiting for vfs_busy to clear.
Meanwhile, the unmount, which holds the vfs_busy lock, tries to lock
the mount point vnode. The fix is to observe that it is safe for the
unmount to remove the vnode from the mount point without locking it.
The lookup will wait for the unmount to complete, then recheck the
mount point when the vfs_busy lock clears.
Kirk McKusick [Tue, 22 Oct 2002 00:59:49 +0000 (00:59 +0000)]
This checkin reimplements the io-request priority hack in a way
that works in the new threaded kernel. It was commented out of
the disksort routine earlier this year for the reasons given in
kern/subr_disklabel.c (which is where this code used to reside
before it moved to kern/subr_disk.c):
----------------------------
revision 1.65
date: 2002/04/22 06:53:20; author: phk; state: Exp; lines: +5 -0
Comment out Kirks io-request priority hack until we can do this in a
civilized way which doesn't cause grief.
The problem is that it is not generally safe to cast a "struct bio
*" to a "struct buf *". Things like ccd, vinum, ata-raid and GEOM
constructs bio's which are not entrails of a struct buf.
Also, curthread may or may not have anything to do with the I/O request
at hand.
The correct solution can either be to tag struct bio's with a
priority derived from the requesting threads nice and have disksort
act on this field, this wouldn't address the "silly-seek syndrome"
where two equal processes bang the diskheads from one edge to the
other of the disk repeatedly.
Alternatively, and probably better: a sleep should be introduced
either at the time the I/O is requested or at the time it is completed
where we can be sure to sleep in the right thread.
The sleep also needs to be in constant timeunits, 1/hz can be practicaly
any sub-second size, at high HZ the current code practically doesn't
do anything.
----------------------------
As suggested in this comment, it is no longer located in the disk sort
routine, but rather now resides in spec_strategy where the disk operations
are being queued by the thread that is associated with the process that
is really requesting the I/O. At that point, the disk queues are not
visible, so the I/O for positively niced processes is always slowed
down whether or not there is other activity on the disk.
On the issue of scaling HZ, I believe that the current scheme is
better than using a fixed quantum of time. As machines and I/O
subsystems get faster, the resolution on the clock also rises.
So, ten years from now we will be slowing things down for shorter
periods of time, but the proportional effect on the system will
be about the same as it is today. So, I view this as a feature
rather than a drawback. Hence this patch sticks with using HZ.
Sponsored by: DARPA & NAI Labs.
Reviewed by: Poul-Henning Kamp <phk@critter.freebsd.dk>
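To make the HZ argument concrete, here is a small arithmetic sketch. It assumes, purely for illustration, that a positively niced process is delayed by a number of clock ticks equal to its nice value; the message above does not state the exact formula, and `io_delay_usec()` is a hypothetical helper.

```c
/*
 * Illustration only: assume the delay per I/O request is `nice` clock
 * ticks. The formula is an assumption, not taken from the patch.
 */
unsigned long
io_delay_usec(int nice, int hz)
{
	/* Only positively niced processes are slowed down. */
	if (nice <= 0 || hz <= 0)
		return (0);
	/* One tick lasts 1000000 / hz microseconds. */
	return ((unsigned long)nice * 1000000UL / (unsigned long)hz);
}
```

Under that assumption a nice value of 10 yields a 100 ms delay at HZ=100 and a 10 ms delay at HZ=1000: the absolute delay shrinks as the clock resolution rises, which is the behaviour the message argues for.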
Semen Ustimenko [Tue, 22 Oct 2002 00:57:51 +0000 (00:57 +0000)]
Remove the OpenBSD compatibility stuff. Many changes to be more style(9)
compliant. Split two pieces of code into separate functions so as not to
exceed the line length due to indentation.
Robert Watson [Mon, 21 Oct 2002 23:51:18 +0000 (23:51 +0000)]
Add mac(9), a man page providing a basic introduction to the concepts
associated with the TrustedBSD MAC Framework, as well as some credits
to developers and contributors.
Julian Elischer [Mon, 21 Oct 2002 22:27:36 +0000 (22:27 +0000)]
Remove the process state PRS_WAIT.
It is never used. I left it there from pre-KSE days as I didn't know
if I'd need it or not, but now I know I don't. Its functionality
is in TDI_IWAIT in the thread.
Robert Watson [Mon, 21 Oct 2002 20:55:39 +0000 (20:55 +0000)]
Introduce mac_biba_copy() and mac_mls_copy(), which conditionally
copy elements of one Biba or MLS label to another based on the flags
on the source label element. Use this instead of
mac_{biba,mls}_{single,range}() to simplify the existing code, as
well as support partial label updates (we don't update if none is
requested).
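A minimal sketch of the flag-driven copy described above. The structure layout, the `LABEL_FLAG_*` values and `label_copy()` are hypothetical simplifications, not the actual mac_biba or mac_mls definitions.

```c
/* Hypothetical flags marking which elements of a label are valid. */
#define LABEL_FLAG_SINGLE	0x1
#define LABEL_FLAG_RANGE	0x2

/* Hypothetical, simplified label: one single element plus a range. */
struct label {
	int flags;
	int single;
	int range_low;
	int range_high;
};

/* Copy only the elements that the source label flags as present. */
void
label_copy(const struct label *src, struct label *dst)
{
	if (src->flags & LABEL_FLAG_SINGLE) {
		dst->single = src->single;
		dst->flags |= LABEL_FLAG_SINGLE;
	}
	if (src->flags & LABEL_FLAG_RANGE) {
		dst->range_low = src->range_low;
		dst->range_high = src->range_high;
		dst->flags |= LABEL_FLAG_RANGE;
	}
}
```

Elements whose flag is not set on the source are left untouched in the destination, which is what makes a partial label update possible.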
Ian Dowse [Mon, 21 Oct 2002 20:40:02 +0000 (20:40 +0000)]
Implement a new IP_SENDSRCADDR ancillary message type that permits
a server process bound to a wildcard UDP socket to select the IP
address from which outgoing packets are sent on a per-datagram
basis. When combined with IP_RECVDSTADDR, such a server process can
guarantee to reply to an incoming request using the same source IP
address as the destination IP address of the request, without having
to open one socket per server IP address.
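A userland sketch of how a server might attach the new ancillary message to an outgoing datagram. It only builds the `msghdr`; the helper name `build_srcaddr_msg()` and its parameters are hypothetical, and the fallback `#define` is a placeholder so that the sketch compiles on systems without `IP_SENDSRCADDR`.

```c
#include <sys/types.h>
#include <sys/socket.h>
#include <sys/uio.h>
#include <netinet/in.h>
#include <string.h>

/*
 * FreeBSD's <netinet/in.h> provides IP_SENDSRCADDR. The fallback value
 * below is a placeholder for other systems, not a real option there.
 */
#ifndef IP_SENDSRCADDR
#define IP_SENDSRCADDR 7
#endif

/*
 * Fill `msg` so that sendmsg() would send `len` bytes of `data` to `dst`
 * using `src` as the source address. `cbuf` must be suitably aligned and
 * hold at least CMSG_SPACE(sizeof(struct in_addr)) bytes.
 */
void
build_srcaddr_msg(struct msghdr *msg, struct iovec *iov, void *cbuf,
    struct sockaddr_in *dst, const struct in_addr *src,
    void *data, size_t len)
{
	struct cmsghdr *cm;

	memset(msg, 0, sizeof(*msg));
	memset(cbuf, 0, CMSG_SPACE(sizeof(*src)));

	iov->iov_base = data;
	iov->iov_len = len;
	msg->msg_name = dst;
	msg->msg_namelen = sizeof(*dst);
	msg->msg_iov = iov;
	msg->msg_iovlen = 1;
	msg->msg_control = cbuf;
	msg->msg_controllen = CMSG_SPACE(sizeof(*src));

	/* One control message carrying the desired source address. */
	cm = CMSG_FIRSTHDR(msg);
	cm->cmsg_level = IPPROTO_IP;
	cm->cmsg_type = IP_SENDSRCADDR;
	cm->cmsg_len = CMSG_LEN(sizeof(*src));
	memcpy(CMSG_DATA(cm), src, sizeof(*src));
}
```

A server would typically take `src` from the IP_RECVDSTADDR control message of the request it is answering, then pass the filled `msg` to sendmsg() on its wildcard-bound socket.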
Ian Dowse [Mon, 21 Oct 2002 20:10:05 +0000 (20:10 +0000)]
Remove the "temporary connection" hack in udp_output(). In order
to send datagrams from an unconnected socket, we used to first block
input, then connect the socket to the sendmsg/sendto destination,
send the datagram, and finally disconnect the socket and unblock
input.
We now use in_pcbconnect_setup() to check if a connect() would have
succeeded, but we never record the connection in the PCB (local
anonymous port allocation is still recorded, though). The result
from in_pcbconnect_setup() authorises the sending of the datagram
and selects the local address and port to use, so we just construct
the header and call ip_output().
GEOM does not (and shall not) propagate flags like D_MEMDISK, so we will
revert to checking the name to determine if our root device is a ramdisk,
md(4) specifically, to determine if we should attempt the root-mount RW.
Robert Watson [Mon, 21 Oct 2002 18:42:01 +0000 (18:42 +0000)]
Add compartment support to Biba and MLS policies. The logic of the
policies remains the same: subjects and objects are labeled for
integrity or sensitivity, and a dominance operator determines whether
or not subject/object accesses are permitted to limit inappropriate
information flow. Compartments are a non-hierarchical component of
the label, so add a bitfield to the label element for each, and a
set check as part of the dominance operator. This permits the
implementation of "need to know" elements of MLS.
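The dominance check with compartments might look roughly like the sketch below. It assumes a single 64-bit compartment bitfield and a hypothetical `label_elem` structure; the real policies may size and name these differently.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical label element: a hierarchical level plus a compartment set. */
struct label_elem {
	int level;
	uint64_t compartments;	/* one bit per compartment */
};

/*
 * a dominates b when a's level is at least b's level and a holds
 * every compartment that b holds (b's set is a subset of a's).
 */
bool
dominates(const struct label_elem *a, const struct label_elem *b)
{
	if (a->level < b->level)
		return (false);
	return ((b->compartments & ~a->compartments) == 0);
}
```

Because of the set check, two labels can be incomparable: a label at a higher level still fails to dominate one that carries a compartment it lacks, which is the "need to know" property.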
Robert Watson [Mon, 21 Oct 2002 18:05:12 +0000 (18:05 +0000)]
Demote sockets to single-label objects rather than maintaining a
range on them, leaving process credentials as the only kernel
objects with label ranges in the Biba and MLS policies. We
weren't using the range in any access control decisions, so this
lets us garbage collect effectively unused code.
Robert Watson [Mon, 21 Oct 2002 16:39:12 +0000 (16:39 +0000)]
Since the Biba and MLS access checks are identical to the open checks,
collapse the two cases more cleanly: rather than wrapping an access
check around open, simply provide the open implementation for the
access vector entry. No functional change.
Robert Watson [Mon, 21 Oct 2002 16:35:54 +0000 (16:35 +0000)]
Cleanup of relabel authorization checks -- almost identical logic,
we just break out some of the tests better. Minor change in that
we now better support incremental update of labels.
Ian Dowse [Mon, 21 Oct 2002 13:55:50 +0000 (13:55 +0000)]
Replace in_pcbladdr() with a more generic inner subroutine for
in_pcbconnect() called in_pcbconnect_setup(). This version performs
all of the functions of in_pcbconnect() except for the final
committing of changes to the PCB. In the case of an EADDRINUSE error
it can also provide to the caller the PCB of the duplicate connection,
avoiding an extra in_pcblookup_hash() lookup in tcp_connect().
This change will allow the "temporary connect" hack in udp_output()
to be removed and is part of the preparation for adding the
IP_SENDSRCADDR control message.
Andrew Gallatin [Mon, 21 Oct 2002 12:54:13 +0000 (12:54 +0000)]
Add some documentation of FreeBSD's special synchronization quirks
which may surprise developers coming from Solaris, or other platforms
which have a similar interface, but slightly different rules.
Murray Stokely [Mon, 21 Oct 2002 10:53:35 +0000 (10:53 +0000)]
Update comment to note that the third floppy (for modules) has been
implemented. Add a note reminding developers to update drivers.conf.5
if they add new functionality here.
Murray Stokely [Mon, 21 Oct 2002 10:48:19 +0000 (10:48 +0000)]
Note that support for the third 'drivers floppy' has been implemented.
Also point to the AWK scripts instead of the older Perl ones, now that
they've been rewritten.
Marcel Moolenaar [Mon, 21 Oct 2002 04:21:12 +0000 (04:21 +0000)]
Implement support for working on ELF corefiles. Use kvm_read() when reading
memory while mapping a virtual address to a physical address.
This allows us to work with virtual addresses for page tables,
provided it doesn't cause infinite recursion. Currently all
page tables are direct mapped.
Robert Watson [Mon, 21 Oct 2002 04:15:40 +0000 (04:15 +0000)]
Add a twiddle to create PTY's with a biba/equal or mls/equal label
instead of the default biba/high, mls/low, making it easier to use
ptys with these policies. This isn't the final solution, but does
help.
Robert Watson [Mon, 21 Oct 2002 03:54:24 +0000 (03:54 +0000)]
Unhook the per-policy parsing/printing MAC modules in libc to prepare
to bring in the new MAC label management API. With the new API
revision, we have only policy-agnostic code in libc and the base
kernel.
Brooks Davis [Mon, 21 Oct 2002 02:54:50 +0000 (02:54 +0000)]
Use if_printf(ifp, "blah") and device_printf(dev, "blah") instead of
printf("%s%d: blah", ifp->if_name, ifp->if_unit). This eliminates the
need to store the unit number in the softc.
David E. O'Brien [Mon, 21 Oct 2002 00:26:48 +0000 (00:26 +0000)]
Unbreak Alpha world.
We are seeing "/usr/libexec/ld-elf.so.1: groff: too few PT_LOAD segments",
however it appears that there really is only one PT_LOAD segment in the groff
binary. It is unclear if `rtld' or `ld' is at fault here -- but using a
RELENG_4 `ld' binary allows one to build a working dynamic groff binary.
Marcel Moolenaar [Sun, 20 Oct 2002 23:39:43 +0000 (23:39 +0000)]
In cb_dumphdr() we were calling buf_write() with di->priv as the
pointer to a dumperinfo instead of di. A brainfart, surely. This
bug went unnoticed for all this time because the pointer is only
used by buf_write() when it can write a completely filled buffer
to the dump device. This depends on the number of memory chunks
that need to be dumped. This has apparently been low enough that
it has never happened up until this point.
Thomas Moestl [Sun, 20 Oct 2002 23:13:05 +0000 (23:13 +0000)]
Fix the calculations of the length of the unread message buffer
contents. The code was subtracting two unsigned ints, storing the
result in a long, and expecting it to be the same as the result of a
signed subtraction; this only works on platforms where int and long
have the same size (due to overflows).
Instead, cast to long before the subtraction; the numbers are
guaranteed to be small enough so that there will be no overflows
because of that.
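The difference between the two forms can be shown with fixed-width types. In this sketch `uint32_t` stands in for unsigned int and `int64_t` for a 64-bit long; the function names are hypothetical.

```c
#include <stdint.h>

/* Buggy form: the subtraction wraps in 32 bits before it is widened. */
int64_t
diff_wrapped(uint32_t a, uint32_t b)
{
	return (a - b);
}

/* Fixed form: widen each operand first, then subtract as signed values. */
int64_t
diff_widened(uint32_t a, uint32_t b)
{
	return ((int64_t)a - (int64_t)b);
}
```

With a = 5 and b = 10, the first form yields 4294967291 while the second yields -5. Where int and long are both 32 bits the wrapped value is reinterpreted as -5, which is why the old code only worked on those platforms.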
Fix two instances of variant struct definitions in sys/netinet:
Remove the never completed _IP_VHL version, it has not caught on
anywhere and it would make us incompatible with other BSD netstacks
to retain this version.
Add a CTASSERT protecting sizeof(struct ip) == 20.
Don't let the size of struct ipq depend on the IPDIVERT option.
Robert Drehmel [Sun, 20 Oct 2002 22:50:13 +0000 (22:50 +0000)]
Do not try to work around ``poor (un)sign extension code''
creation by GCC-2.6.3. Casting pointers to unsigned char
to volatile pointers to unsigned char seemed to produce
better results on the ia32 architecture with old versions
of GCC.
The current FreeBSD system compiler GCC-3.2.1 emits
better sign extension code for non-volatile variables:
	volatile char c;
	int i = c;
is compiled to:
	...
	movb -1(%ebp), %al
	movsbl %al, %eax
	movl %eax, -8(%ebp)
	...
	char c;
	int i = c;
is compiled to:
	...
	movsbl -1(%ebp), %eax
	movl %eax, -8(%ebp)
	...
The same holds for zero-extension of dereferenced pointers
to volatile unsigned char.
When compiled on alpha or sparc64, the code produced for the
two examples above does not differ.
Robert Watson [Sun, 20 Oct 2002 22:39:55 +0000 (22:39 +0000)]
When packets pass in and out of six-to-four (STF) tunnels, perform
labeling checks and operations as with other network interfaces.
Eventually, if it proves desirable, we might want to offer special
casing of this or other tunnel interfaces where we have an existing
label of interest, rather than treating it as though it's an
entirely fresh mbuf in the incoming/outgoing encapsulation directions.
Robert Watson [Sun, 20 Oct 2002 22:27:59 +0000 (22:27 +0000)]
When a packet is sent via a FDDI interface, perform appropriate MAC
transmission checks; when it is received, label the packet appropriately.
Although we don't have a local FDDI setup to test this with, the
labeling and checks are identical to other interface classes.
Robert Watson [Sun, 20 Oct 2002 22:20:48 +0000 (22:20 +0000)]
When a packet is destined for delivery via an ATM medium, perform
appropriate interface transmission checks and delivery labeling. While
we don't have a local ATM configuration, this code is almost identical
to all other interface classes.
Sam Leffler [Sun, 20 Oct 2002 22:19:37 +0000 (22:19 +0000)]
Another baby step toward getting sysinstall working:
o fill in media s/h/c fields from new XML phk just added; need this because
sysinstall uses them in the fdisk look-alike
o add new tags to xml parser
o cleanup parser a touch; remove unused tags and move tag parsing stuff to
a table to simplify future additions
o redo callback to pass 64-bit values since mediasize overflows u_int32_t
o loosen parsing sanity checks a touch to deal with new xml we must handle
o move sector size probing to non-geom handling since we now get it from xml
o remove WHOLE_DISK_SLICE buggery now that we get mediasize from xml
Warner Losh [Sun, 20 Oct 2002 22:15:17 +0000 (22:15 +0000)]
devd. A daemon that hooks into the kernel's /dev/devctl to produce
arbitrary commands when devices come and go in the device tree (which is
different than the /dev directory).
This is an initial version. Much of the planned power isn't here.
Instead of doing the full matching, we always run /etc/devd-generic.
/etc/devd.generic will go away at some point, I think.
I'm committing it in this early state so I can start getting feedback
from early adopters.
Robert Watson [Sun, 20 Oct 2002 22:11:13 +0000 (22:11 +0000)]
Rename _POSIX_FOO_PRESENT and friends from POSIX.1e to _PC_FOO_PRESENT
and related friends. This would have been corrected had POSIX.1e
progressed to a standard.