Robert Watson [Sun, 2 Dec 2012 22:09:16 +0000 (22:09 +0000)]
Specifically point at the Handbook instructions for world updates in
UPDATING by URL.
As there has been some confusion over the need to run "mergemaster -p",
part of our standard upgrade procedure, following the recent addition of
an "auditdistd" user, add a note about it to UPDATING explicitly.
Adrian Chadd [Sun, 2 Dec 2012 06:24:08 +0000 (06:24 +0000)]
Delete the per-TXQ locks and replace them with a single TX lock.
I couldn't think of a way to maintain the hardware TXQ locks _and_ layer
on top of that per-TXQ software queuing and any other kind of fine-grained
locks (eg per-TID, or per-node locks.)
So for now, to facilitate some further code refactoring and development
as part of the final push to get software queue ps-poll and u-apsd handling
into this driver, just do away with them entirely.
I may eventually bring them back at some point, when it looks slightly more
architectually cleaner to do so. But as it stands at the present, it's
not really buying us much:
* in order to properly serialise things and not get bitten by scheduling
and locking interactions with things higher up in the stack, we need to
wrap the whole TX path in a long held lock. Otherwise we can end up
being pre-empted during frame handling, resulting in some out of order
frame handling between sequence number allocation and encryption handling
(ie, the seqno and the CCMP IV get out of sequence);
* .. so whilst that's the case, holding the lock for that long means that
we're acquiring and releasing the TXQ lock _inside_ that context;
* And we also acquire it per-frame during frame completion, but we currently
can't hold the lock for the duration of the TX completion as we need
to call net80211 layer things with the locks _unheld_ to avoid LOR.
* .. the other places were grab that lock are reset/flush, which don't happen
often.
My eventual aim is to change the TX path so all rejected frame transmissions
and all frame completions result in any ieee80211_free_node() calls to occur
outside of the TX lock; then I can cut back on the amount of locking that
goes on here.
There may be some LORs that occur when ieee80211_free_node() is called when
the TX queue path fails; I'll begin to address these in follow-up commits.
Rick Macklem [Sun, 2 Dec 2012 01:16:04 +0000 (01:16 +0000)]
Add an nfssvc() option to the kernel for the new NFS client
which dumps out the actual options being used by an NFS mount.
This will be used to implement a "-m" option for nfsstat(1).
- Add support for Etron EJ168 USB 3.0 Host Controllers.
This brand of controllers expects that the number of
contexts specified in the input slot context points
to an active endpoint context, else it refuses to
operate.
- Ring the correct doorbell when streams mode is used.
- Wrap one or two long lines.
Tested by: Markus Pfeiffer (DragonFlyBSD)
MFC after: 1 week
Protect against DoS attacks, such as being described in CVE-2010-2632.
The changes were derived from what has been committed to NetBSD, with
modifications. These are:
1. Preserve the existsing GLOB_LIMIT behaviour by including the number
of matches to the set of parameters to limit.
2. Change some of the limits to avoid impacting normal use cases:
GLOB_LIMIT_STRING - change from 65536 to ARG_MAX so that glob(3)
can still provide a full command line of expanded names.
GLOB_LIMIT_STAT - change from 128 to 1024 for no other reason than
that 128 feels too low (it's not a limit that impacts the
behaviour of the test program listed in CVE-2010-2632).
GLOB_LIMIT_PATH - change from 1024 to 65536 so that glob(3) can
still provide a fill command line of expanded names.
3. Protect against buffer overruns when we hit the GLOB_LIMIT_STAT or
GLOB_LIMIT_READDIR limits. We append SEP and EOS to pathend in
those cases. Return GLOB_ABORTED instead of GLOB_NOSPACE when we
would otherwise overrun the buffer.
This change also modifies the existing behaviour of glob(3) in case
GLOB_LIMIT is specifies by limiting the *new* matches and not all
matches. This is an important distinction when GLOB_APPEND is set or
when the caller uses a non-zero gl_offs. Previously pre-existing
matches or the value of gl_offs would be counted in the number of
matches even though the man page states that glob(3) would return
GLOB_NOSPACE when gl_matchc or more matches were found.
The limits that cannot be circumvented are GLOB_LIMIT_STRING and
GLOB_LIMIT_PATH all others can be crossed by simply calling glob(3)
again and with GLOB_APPEND set.
The entire description above applies only when GLOB_LIMIT has been
specified of course. No limits apply when this flag isn't set!
Andriy Gapon [Sat, 1 Dec 2012 18:16:14 +0000 (18:16 +0000)]
ioapic_program_intpin: program high bits before low bits
Programming the low bits has a side-effect if unmasking the pin if it is
not disabled. So if an interrupt was pending then it would be delivered
with the correct new vector but to the incorrect old LAPIC.
This fix could be made clearer by preserving the mask bit while
programming the low bits and then explicitly resetting the mask bit
after all the programming is done.
Probability to trip over the fixed bug could be increased by bootverbose
because printing of the interrupt information in ioapic_assign_cpu
lengthened the time window during which an interrupt could arrive while
a pin is masked.
Reported by: Andreas Longwitz <longwitz@incore.de>
Tested by: Andreas Longwitz <longwitz@incore.de>
MFC after: 12 days
Andriy Gapon [Sat, 1 Dec 2012 18:12:55 +0000 (18:12 +0000)]
gfs_file_inactive: replace bad code with ugly code
Also, make it explicit that V_XATTRDIR is not properly supported in gfs
code yet.
The bad code was plain incorrect: (a) it spoiled handling of v_usecount
reaching zero and (b) it leaked v_holdcnt.
The ugly code employs potentially unsafe locking tricks.
Ideally we should separate vnode lifecycle and gfs node lifecycle.
A gfs node should have its own reference count where its child nodes
should be accounted.
PR: kern/151111
Reviewed by: kib
MFC after: 13 days
In globextend(), take advantage of the fact that realloc(NULL, size) is
equivalent to malloc(size). This eliminates the conditional expression
used for calling either realloc() or malloc() when realloc() will do
all the time.
In globextend() when the pathv vector cannot be (re-)allocated, don't
free and clear the gl_pathv pointer in the glob_t structure. Such
breaks the invariant of the glob_t structure, as stated in the comment
right in front of the globextend() function. If gl_pathv was non-NULL,
then gl_pathc was > 0. Making gl_pathv a NULL pointer without also
setting gl_pathc to 0 is wrong.
Since we otherwise don't free the memory associated with a glob_t in
error cases, it's unlikely that this change will cause a memory leak
that wasn't already there to begin with. Callers of glob(3) must
call globfree(3) irrespective of whether glob(3) returned an error
or not.
Robert Watson [Sat, 1 Dec 2012 15:11:46 +0000 (15:11 +0000)]
Merge a number of changes required to hook up OpenBSM 1.2-alpha2's
auditdistd (distributed audit daemon) to the build:
- Manual cross references
- Makefile for auditdistd
- rc.d script, rc.conf entrie
- New group and user for auditdistd; associated aliases, etc.
The audit trail distribution daemon provides reliable,
cryptographically protected (and sandboxed) delivery of audit tails
from live clients to audit server hosts in order to both allow
centralised analysis, and improve resilience in the event of client
compromises: clients are not permitted to change trail contents
after submission.
Submitted by: pjd
Sponsored by: The FreeBSD Foundation (auditdistd)
Robert Watson [Sat, 1 Dec 2012 13:46:37 +0000 (13:46 +0000)]
Merge OpenBSM 1.2-alpha2 changes from contrib/openbsm to
src/sys/{bsm,security/audit}. There are a few tweaks to help with the
FreeBSD build environment that will be merged back to OpenBSM. No
significant functional changes appear on the kernel side.
Obtained from: TrustedBSD Project
Sponsored by: The FreeBSD Foundation (auditdistd)
Jack F Vogel [Sat, 1 Dec 2012 01:24:40 +0000 (01:24 +0000)]
Patch #12 OK, I said there was only 11 patches, but unfortunately
the revamped sysctl code did not work, and needed a change. This
makes the limit get set at the time that all sysctl stats are
created and is actually more elegant imho anyway.
Jack F Vogel [Sat, 1 Dec 2012 00:11:24 +0000 (00:11 +0000)]
Patch #11 - The final patch: this one greatly improves the
TX hot path by getting rid of index calculations and simply
managing pointers. Much of the creative code is due to my
coworker here at Intel, Alex Duyck, thanks Alex!
Also, this whole series of patches was given the critical
eye of Gleb Smirnoff and is all the better for it, thanks
Gleb!
Robert Watson [Sat, 1 Dec 2012 00:02:31 +0000 (00:02 +0000)]
Merge a number of post-1.2-alpha2 changes to OpenBSM into the OpenBSM
vendor area; these sort out various post-release issues, largely to do
with integrating OpenBSM with the base FreeBSD build. All of these
changes will appear in a future 1.2-alpha3:
Change 219846 on 2012/11/26 by rwatson@rwatson_cinnamon
Update several instances of Apple Computer to Apple; a change made
in the FreeBSD tree some years ago but not propagated to OpenBSM.
Change 219845 on 2012/11/26 by rwatson@rwatson_cinnamon
Remove Apple acknowledgement clause from file with Christian
Peron copyright (with permission from Christian).
Change 219836 on 2012/11/23 by rwatson@rwatson_cinnamon
Replace further instances of <> with "" for local includes in
auditdistd.
Change 219834 on 2012/11/23 by rwatson@rwatson_cinnamon
For current-directory headers, use #include "" rather than #include
<>.
Change 219832 on 2012/11/23 by rwatson@rwatson_cinnamon
Be more consistent with the remainder of OpenBSM and include
config/config.h rather than config.h.
Don't include config.h from synch.h, which is included only from
.c files that already include config.h.
Change 219831 on 2012/11/23 by pjd@pjd_anger
Add Xref to auditdistd(8).
Suggested by: rwatson
Obtained from: TrustedBSD Project
Sponsored by: The FreeBSD Foundation (auditdistd)
Jilles Tjoelker [Fri, 30 Nov 2012 23:51:33 +0000 (23:51 +0000)]
libc: Allow setting close-on-exec in fopen/freopen/fdopen.
This commit adds a new mode option 'e' that must follow any 'b', '+' and/or
'x' options. C11 is clear about the 'x' needing to follow 'b' and/or '+' and
that is what we implement; therefore, require a strict position for 'e' as
well.
For freopen() with a non-NULL path argument and fopen(), the close-on-exec
flag is set iff the 'e' mode option is specified. For freopen() with a NULL
path argument and fdopen(), the close-on-exec flag is turned on if the 'e'
mode option is specified and remains unchanged otherwise.
Although the same behaviour for fopen() can be obtained by open(O_CLOEXEC)
and fdopen(), this needlessly complicates the calling code.
Apart from the ordering requirement, the new option matches glibc.
Robert Watson [Fri, 30 Nov 2012 23:50:07 +0000 (23:50 +0000)]
Import OpenBSM 1.2-alpha2:
OpenBSM 1.2 alpha 2
- auditdistd, a distributed audit trail management daemon, has now been
merged. This allows trail files to be securely and reliably synced from
audited hosts to an audit server, and employs TLS encryption. Where
available, it uses Capsicum to sandbox the service. This work was
contributed by Pawel Jakub Dawidek under sponsorship from the FreeBSD
Foundation.
OpenBSM 1.2 alpha 1
- Add Capsicum-related error numbers for FreeBSD: ENOTCAPABLE, ECAPMODE.
- Add Capsicum, process descriptor audit events for FreeBSD.
- Allow 0% minspace.
- Fixes from the clang static analyser.
- Fix expiration of trail files when the host parameter is used.
- Various typo fixes.
- Support for Solaris privilege and privilege set tokens.
- Documentation for getachost(), improvements for getacfilesz().
- Fix a directory descriptor leak that happened when audit trail partitions
filled.
- Support for more Linux distributions with a partial contemporary endian.h.
- Improved escaping of XML-encapsulated BSM.
- A variety of minor documentation, style, and functional.
Obtained from: TrustedBSD Project
Sponsored by: The FreeBSD Foundation (auditdistd)
Jack F Vogel [Fri, 30 Nov 2012 23:45:55 +0000 (23:45 +0000)]
Patch #8 Performance changes - this one improves locality,
moving some counters and data to the ring struct from
the adapter struct, also compressing some data in the
move.
Jack F Vogel [Fri, 30 Nov 2012 23:28:01 +0000 (23:28 +0000)]
Patch #7 This is primarily about processing limit control.
- add a limit for both RX and TX, change the default to 256
- change the sysctl usage to be common, and now to be called
during init for each ring.
- the TX limit is not yet used, but the changes in the last
patch in this series uses the value.
- the motivation behind these changes is to improve data
locality in the final code.
- rxeof interface changes since it now gets limit from the
ring struct
Before the change directory descriptor was totally ignored,
so the relative path argument was appended to current working
directory path and not to the path provided by descriptor, thus
wrong paths were stored in audit logs.
Now that we use directory descriptor in vfs_lookup, move
AUDIT_ARG_UPATH1() and AUDIT_ARG_UPATH2() calls to the place where
we hold file descriptors table lock, so we are sure paths will
be resolved according to the same directory in audit record and
in actual operation.
Jack F Vogel [Fri, 30 Nov 2012 23:13:56 +0000 (23:13 +0000)]
Patch #6 Whitespace cleanup, and removal of some very old
defines (at Gleb's request). Also, change the defines around
the old transmit code to IXGBE_LEGACY_TX, I do this to make
it possible to define this regardless of the OS level (it is
not defined by default). There are also a couple changed
comments for clarity.
Jack F Vogel [Fri, 30 Nov 2012 23:06:27 +0000 (23:06 +0000)]
Patch #5 Cleanup unused IEEE1588 code fragments, the day may
come when this feature gets implemented, but its not here yet
and I see no reason to leave this laying around.
Currently when we discover that trail file is greater than configured
limit we send AUDIT_TRIGGER_ROTATE_KERNEL trigger to the auditd daemon
once. If for some reason auditd didn't rotate trail file it will never
be rotated.
Change it by sending the trigger when trail file size grows by the
configured limit. For example if the limit is 1MB, we will send trigger
on 1MB, 2MB, 3MB, etc.
This is also needed for the auditd change that will be committed soon
where auditd may ignore the trigger - it might be ignored if kernel
requests the trail file to be rotated too quickly (often than once a second)
which would result in overwriting previous trail file.
Sponsored by: FreeBSD Foundation (auditdistd)
MFC after: 2 weeks
Currently on each record write we call VFS_STATFS() to get available space
on the file system as well as VOP_GETATTR() to get trail file size.
We can assume that trail file is only updated by the audit worker, so instead
of asking for file size on every write, get file size on trail switch only
(it should be zero, but it's not expensive) and use global variable audit_size
protected by the audit worker lock to keep track of trail file's size.
This eliminates VOP_GETATTR() call for every write. VFS_STATFS() is satisfied
from in-memory data (mount->mnt_stat), so shouldn't be expensive.
Sponsored by: FreeBSD Foundation (auditdistd)
MFC after: 2 weeks
Jack F Vogel [Fri, 30 Nov 2012 22:54:14 +0000 (22:54 +0000)]
Patch #4 - this does two things, it removes a number of statistics,
these are FCOE stats (fiber channel over ethernet), something that
FreeBSD does not yet have, they were mistaken for flow control by
the implementor I believe. Secondly, the real flow control stats
are oddly named with a 'link' tag on the front, it was requested
by my validation engineer to make these stats have the same name as
the igb driver for clarity and that seemed reasonable to me.
Jack F Vogel [Fri, 30 Nov 2012 22:33:21 +0000 (22:33 +0000)]
Patch #2 - remove OACTIVE and DEPLETED notions from the
multiqueue code, this functionality has proven to be more
trouble than it was worth. Thanks to Gleb for a second
critical look over my code and help in the patches!
Allow OpenSSL to use arc4random(3) on FreeBSD. arc4random(3) was modified
some time ago to use sysctl instead of /dev/random to get random data,
so is now much better choice, especially for sandboxed processes that have
no direct access to /dev/random.
* Global IPFW_DYN_LOCK() is changed to per-bucket mutex.
* State expiration is done in ipfw_tick every second.
* No expiration is done on forwarding path.
* hash table resize is done automatically and does not flush all states.
* Dynamic UMA zone is now allocated per each VNET
* State limiting is now done via UMA(9) api.
- Implement "fdt mres" sub-command that prints reserved memory regions
- Add "fdt addr" subcommand that lets you specify preloaded blob address
- Do not pre-initialize blob for "fdt addr"
- Do not try to load dtb every time fdt subcommand is issued,
do it only once
- Change the way DTB is passed to kernel. With introduction of "fdt addr"
actual blob address can be not virtual but physical or reside in
area higher then 64Mb. ubldr should create copy of it in kernel area
and pass pointer to this newly allocated buffer which is guaranteed to work
in kernel after switching on MMU.
- Convert memreserv FDT info to "memreserv" property of root node
FDT uses /memreserve/ data to notify OS about reserved memory areas.
Technically it's not real property, it's just data blob, sequence
of <start, size> pairs where both start and size are 64-bit integers.
It doesn't fit nicely with OF API we use in kernel, so in order to unify
thing ubldr converts this data to "memreserve" property using the same
format for addresses and sizes as /memory node.
Add fdt_get_reserved_regions function. API is simmilar to fdt_get_mem_regions
It returns memory regions restricted from being used by kernel. These
regions are dfined in "memreserve" property of root node in the same
format as "reg" property of /memory node
Navdeep Parhar [Thu, 29 Nov 2012 19:39:27 +0000 (19:39 +0000)]
cxgbe/tom: Handle the case where the chip falls out of DDP mode by
itself. The hole in the receive sequence space corresponds to the
number of bytes placed directly up to that point.
Navdeep Parhar [Thu, 29 Nov 2012 19:10:04 +0000 (19:10 +0000)]
cxgbe/tom: Add a flag to indicate that the L2 table entry for an
embryonic connection has been setup and never attempt to abort a tid
before this is done. This fixes a bad race where a listening socket is
closed when the driver is in the middle of step (b) here. The symptom
of this were "ARP miss" errors from the driver followed by tid leaks.
A hardware-offloaded passive open works this way:
a) A SYN "hits" the TCAM entry for a server tid and the chip delivers it
to the queue associated with the server tid (say, queue A). It waits
for a response from the driver telling it what to do.
b) The driver decides it is ok to proceed. It adds the new tid to the
list of embryonic connections associated with the server tid and then
hands off the SYN to the kernel's syncache to make sure that the kernel
okays it too. If it does then the driver provides an L2 table entry,
queue id (say, queue B), etc. and instructs the chip to send the SYN/ACK
response.
c) The chip delivers a status to queue B depending on how the third step
of the 3-way handshake goes. The driver removes the tid from its list
of embryonic connections and either expands the syncache entry or
destroys the tid. In any case all subsequent messages for the new tid
will be delivered to queue B, not queue A. Anything running in queue B
knows that the L2 entry has long been setup and the new flag is of no
interest from here on. If the listener is closed it will deal with
so_comp as normal.
- Use more appropriate loop (do { } while()) when generating ethernet address
for bridge interface.
- If we found a collision we can break the loop - only one collision is
possible and one is exactly enough to need to renegerate.
Marcel Moolenaar [Thu, 29 Nov 2012 03:48:39 +0000 (03:48 +0000)]
Fix LINT build for arm: NOTES defines LDFLAGS by way of a make option
but LDFLAGS is not (yet) passed on to the linker (via SYSTEM_LD et al).
Do so now. As such, any kernel configuration can now define linker
flags by setting LDFLAGS as normal and not have to revert to hacks
like setting DEBUG for flags that do not relate to debugging (see
sys/powerpc/conf/MPC85XX).
Devin Teske [Wed, 28 Nov 2012 18:35:46 +0000 (18:35 +0000)]
Discussed at-length on -arch.
Make the following interface changes to my beastie boot menu:
+ Move boot options to a submenu
+ Add a new "Boot Single" menu item
+ Make "Boot" item and new "Boot Single" item reverse when boot_single is set
+ Add new "Load Defaults" item (in new "Boot Options" submenu) for overridding
loader.conf(5) provided values with system defaults.
Reviewed by: adrian (co-mentor)
Approved by: adrian (co-mentor)
Alan Cox [Wed, 28 Nov 2012 18:29:34 +0000 (18:29 +0000)]
Add support for the (relatively) new object type OBJT_MGTDEVICE to
vm_object_set_memattr(). Also, add a "safety belt" so that
vm_object_set_memattr() doesn't silently modify undefined object types.
Pedro F. Giffuni [Wed, 28 Nov 2012 00:36:40 +0000 (00:36 +0000)]
Partially bring r242520 to ext2fs.
When a file is first being written, the dynamic block reallocation
(implemented by ext2_reallocblks) relocates the file's blocks
so as to cluster them together into a contiguous set of blocks on
the disk.
When the cluster crosses the boundary into the first indirect block,
the first indirect block is initially allocated in a position
immediately following the last direct block. Block reallocation
would usually destroy locality by moving the indirect block out of
the way to keep the data blocks contiguous.
The issue was diagnosed long ago by Bruce Evans on ffs and surfaced
on ext2fs when block reallocaton was ported. This is only a partial
solution based on the similarities with FFS. We still require more
review of the allocation details that vary in ext2fs.
Devin Teske [Tue, 27 Nov 2012 22:11:53 +0000 (22:11 +0000)]
Change self-initialization to occur when loaded versus the previous behavior
which was to self-initialize during the first function-call. This didn't work
so well because the first call was may or may-not be within a sub-shell
(which prevented proper setup of the pass-thru file descriptor, resulting in
dialogs that would not display).
Andre Oppermann [Tue, 27 Nov 2012 21:19:58 +0000 (21:19 +0000)]
Base the mbuf related limits on the available physical memory or
kernel memory, whichever is lower. The overall mbuf related memory
limit must be set so that mbufs (and clusters of various sizes)
can't exhaust physical RAM or KVM.
The limit is set to half of the physical RAM or KVM (whichever is
lower) as the baseline. In any normal scenario we want to leave
at least half of the physmem/kvm for other kernel functions and
userspace to prevent it from swapping too easily. Via a tunable
kern.maxmbufmem the limit can be upped to at most 3/4 of physmem/kvm.
At the same time divorce maxfiles from maxusers and set maxfiles to
physpages / 8 with a floor based on maxusers. This way busy servers
can make use of the significantly increased mbuf limits with a much
larger number of open sockets.
Tidy up ordering in init_param2() and check up on some users of
those values calculated here.
Out of the overall mbuf memory limit 2K clusters and 4K (page size)
clusters to get 1/4 each because these are the most heavily used mbuf
sizes. 2K clusters are used for MTU 1500 ethernet inbound packets.
4K clusters are used whenever possible for sends on sockets and thus
outbound packets. The larger cluster sizes of 9K and 16K are limited
to 1/6 of the overall mbuf memory limit. When jumbo MTU's are used
these large clusters will end up only on the inbound path. They are
not used on outbound, there it's still 4K. Yes, that will stay that
way because otherwise we run into lots of complications in the
stack. And it really isn't a problem, so don't make a scene.
Normal mbufs (256B) weren't limited at all previously. This was
problematic as there are certain places in the kernel that on
allocation failure of clusters try to piece together their packet
from smaller mbufs.
The mbuf limit is the number of all other mbuf sizes together plus
some more to allow for standalone mbufs (ACK for example) and to
send off a copy of a cluster. Unfortunately there isn't a way to
set an overall limit for all mbuf memory together as UMA doesn't
support such a limiting.
NB: Every cluster also has an mbuf associated with it.
Andre Oppermann [Tue, 27 Nov 2012 20:04:52 +0000 (20:04 +0000)]
Fix a race on listen socket teardown where while draining the
accept queues a new socket/connection may be added to the queue
due to a race on the ACCEPT_LOCK.
The submitted patch is slightly changed in comments, teardown
and locking order and extended with KASSERT's.
Submitted by: Vijay Singh <vijju.singh-at-gmail-dot-com>
Found by: His team.
MFC after: 1 week