ed [Sat, 20 Jun 2009 14:50:32 +0000 (14:50 +0000)]
Improve nested jail awareness of devfs by handling credentials.
Now that we start to use credentials on character devices more often
(because of MPSAFE TTY), move the prison-checks that are in place in the
TTY code into devfs.
Instead of strictly comparing the prisons, use the more common
prison_check() function to compare credentials. This means that
pseudo-terminals are only visible in devfs by processes within the same
jail and parent jails.
Even though regular users in parent jails can now interact with
pseudo-terminals from child jails, this seems to be the right approach.
These processes are also capable of interacting with the jailed
processes anyway, through signals for example.
kan [Sat, 20 Jun 2009 14:16:41 +0000 (14:16 +0000)]
Allow order of initialization of loaded shared objects to be
altered through their .init code. This might happen if init
vector calls dlopen on its own and that dlopen causes some not
yet initialized object to be initialized earlier as part of that
dlopened DAG.
Do not reset module reference counts to zero on final fini vector
run when process is exiting. Just add an additional parameter to
force fini vector invocation regardless of current reference count
value if object was not destructed yet. This allows dlclose called
from fini vector to proceed normally instead of failing with handle
validation error.
marcel [Sat, 20 Jun 2009 05:36:53 +0000 (05:36 +0000)]
Drop the high FP state of an exiting thread in cpu_thread_exit() and
not in cpu_exit(). The latter is called after td_md.md_highfp_mtx
has been destroyed, which results in a race condition when another
thread wants to use the high FP registers on the CPU that still has
the high FP registers in question.
rmacklem [Sat, 20 Jun 2009 00:54:57 +0000 (00:54 +0000)]
Change the size of the nfsc_groups[] array in the experimental nfs
client to RPCAUTH_UNIXGIDS + 1 (17), since that is what can go on
the wire for AUTH_SYS authentication.
kmacy [Fri, 19 Jun 2009 23:34:32 +0000 (23:34 +0000)]
Greatly simplify cxgb by removing almost all of the custom mbuf management logic
- remove mbuf iovec - useful, but adds too much complexity when isolated to
the driver
- remove driver private caching - insufficient benefit over UMA to justify
the added complexity and maintenance overhead
- remove separate logic for managing multiple transmit queues, with the
new drbr routines the control flow can be made to much more closely resemble
legacy drivers
- remove dedicated service threads, with per-cpu callouts one can get the same
benefit much more simply by registering a callout 1 tick in the future if there
are still buffered packets
- remove embedded mbuf usage - Jeffr's changes will (I hope) soon be integrated
greatly reducing the overhead of using kernel APIs for reference counting
clusters
- add hysteresis to descriptor coalescing logic
- add coalesce threshold sysctls to allow users to decide at run-time
between optimizing for forwarding / UDP or optimizing for TCP
- add once per second watchdog to effectively close the very rare races
occurring from coalescing
- incorporate Navdeep's changes to the initialization path required to
convert port and adapter locks back to ordinary mutexes (silencing BPF
LOR complaints)
jilles [Fri, 19 Jun 2009 22:09:55 +0000 (22:09 +0000)]
Fix some issues with quoted output and shorten it in some cases.
Output quoted suitable for re-input to the shell occurs in
various cases such as 'set', 'trap'.
Bugfix: *, ? and [ must be quoted (except sole [)
Bugfix: ~ and # must be quoted (really only sometimes, but keep it simple)
Bugfix: space, tab and newline must always be quoted
Shortening: other IFS characters do not need quoting
Bugfix: send to correct output file, not hard-coded stdout
Shortening: avoid unnecessary '' with \'
bz [Fri, 19 Jun 2009 21:01:55 +0000 (21:01 +0000)]
Move setting of ports from NAT-T below key_getsah() and actually
below key_setsaval().
Without that, the lookup for the SA had failed as we were looking for
a SA with the new, updated port numbers instead of the old ones and
were comparing the ports in key_cmpsaidx().
This makes updating the remote -> local SA on the initiator work again.
csjp [Fri, 19 Jun 2009 20:31:44 +0000 (20:31 +0000)]
Implement the -z (zero counters) option for the various bpf counters.
Add necessary changes to the kernel for this (basically introduce a
bpf_zero_counters() function). As well, update the man page.
brooks [Fri, 19 Jun 2009 17:10:35 +0000 (17:10 +0000)]
Rework the credential code to support larger values of NGROUPS and
NGROUPS_MAX, eliminate ABI dependencies on them, and raise the to 1024
and 1023 respectively. (Previously they were equal, but under a close
reading of POSIX, NGROUPS_MAX was defined to be too large by 1 since it
is the number of supplemental groups, not total number of groups.)
The bulk of the change consists of converting the struct ucred member
cr_groups from a static array to a pointer. Do the equivalent in
kinfo_proc.
Introduce new interfaces crcopysafe() and crsetgroups() for duplicating
a process credential before modifying it and for setting group lists
respectively. Both interfaces take care for the details of allocating
groups array. crsetgroups() takes care of truncating the group list
to the current maximum (NGROUPS) if necessary. In the future,
crsetgroups() may be responsible for insuring invariants such as sorting
the supplemental groups to allow groupmember() to be implemented as a
binary search.
Because we can not change struct xucred without breaking application
ABIs, we leave it alone and introduce a new XU_NGROUPS value which is
always 16 and is to be used or NGRPS as appropriate for things such as
NFS which need to use no more than 16 groups. When feasible, truncate
the group list rather than generating an error.
Minor changes:
- Reduce the number of hand rolled versions of groupmember().
- Do not assign to both cr_gid and cr_groups[0].
- Modify ipfw to cache ucreds instead of part of their contents since
they are immutable once referenced by more than one entity.
Submitted by: Isilon Systems (initial implementation)
X-MFC after: never
PR: bin/113398 kern/133867
brian [Fri, 19 Jun 2009 17:07:38 +0000 (17:07 +0000)]
When running pkg_add -r, check & install our dependencies for each
package rather than expecting our top level package to get all of
the dependencies correct.
Previously, the code depended on the top level package having all
of the pkgdep lines in +CONTENTS correct and in the right order,
but that doesn't always happen due to code such as this (in
security/gnutls/Makefile):
gnutls doesn't depend on lzo2 initially, but lzo2 is dragged into the
mix via other dependencies and is built by the initial 'make'. The
subsequent package generation for gnutls adds a pkgdep line for lzo2
to gnutls' +CONTENTS but the pkgdeps in sophox-packages' +CONTENTS
has gnutls *before* lzo2.
As a result, sophox-packages cannot install; gnutls fails because lzo2
is missing, 82 more packages fail because gnutls is missing and the
whole thing spirals into a super-confusing mess!
brooks [Fri, 19 Jun 2009 15:58:24 +0000 (15:58 +0000)]
In preparation for raising NGROUPS and NGROUPS_MAX, change base
system callers of getgroups(), getgrouplist(), and setgroups() to
allocate buffers dynamically. Specifically, allocate a buffer of size
sysconf(_SC_NGROUPS_MAX)+1 (+2 in a few cases to allow for overflow).
This (or similar gymnastics) is required for the code to actually follow
the POSIX.1-2008 specification where {NGROUPS_MAX} may differ at runtime
and where getgroups may return {NGROUPS_MAX}+1 results on systems like
FreeBSD which include the primary group.
In id(1), don't pointlessly add the primary group to the list of all
groups, it is always the first result from getgroups(). In principle
the old code was more portable, but this was only done in one of the two
places where getgroups() was called to the overall effect was pointless.
Document the actual POSIX requirements in the getgroups(2) and
setgroups(2) manpages. We do not yet support a dynamic NGROUPS, but we
may in the future.
brooks [Fri, 19 Jun 2009 15:52:35 +0000 (15:52 +0000)]
When checking if we can write to a file, use access() instead of a
manual permission check based on stat output. Also, get rid of the
executability check since it is not used.
edwin [Fri, 19 Jun 2009 07:18:45 +0000 (07:18 +0000)]
The "original" PR said that there were two issues with the motd
(Eyes of the daemon not synced and the motd not displayed properly
on black-on-white screens): The first one was not valid anymore
since the text and logo were swapped already, the second one is
fixed by resetting the whole colourscheme instead of only the
background colour.
(also removed svn:keywords from motd since it doesn't have the
string $FreeBSD$ in it)
jhb [Thu, 18 Jun 2009 20:56:22 +0000 (20:56 +0000)]
Fix a deadlock in the getpeername() method for UNIX domain sockets.
Instead of locking the local unp followed by the remote unp, use the same
locking model as accept() and read lock the global link lock followed by
the remote unp while fetching the remote sockaddr.
thompsa [Thu, 18 Jun 2009 20:42:37 +0000 (20:42 +0000)]
Track the kernel mapping of a physical page by a new entry in vm_page
structure. When the page is shared, the kernel mapping becomes a special
type of managed page to force the cache off the page mappings. This is
needed to avoid stale entries on all ARM VIVT caches, and VIPT caches
with cache color issue.
Submitted by: Mark Tinguely
Reviewed by: alc
Tested by: Grzegorz Bernacki, thompsa
alc [Thu, 18 Jun 2009 17:59:04 +0000 (17:59 +0000)]
Utilize the new function kmem_alloc_contig() to implement the UMA back-end
allocator for the jumbo frames zones. This change has two benefits: (1) a
custom back-end deallocator is no longer required. UMA's standard
deallocator suffices. (2) It eliminates a potentially confusing artifact
of using contigmalloc(): The malloc(9) statistics contain bogus information
about the usage of jumbo frames. Specifically, the malloc(9) statistics
report all jumbo frames in use whereas the UMA zone statistics report the
"truth" about the number in use vs. the number free.
kan [Thu, 18 Jun 2009 17:10:43 +0000 (17:10 +0000)]
Re-do r192913 in less intrusive way. Only do IP_RECVDSTADDR/IP_SENDSRCADDR
dace for UPDv4 sockets bound to INADDR_ANY. Move the code to set
IP_RECVDSTADDR/IP_SENDSRCADDR into svc_dg.c, so that both TLI and non-TLI
users will be using it.
Back out my previous commit to mountd. Turns out the problem was affecting
more than one binary so it needs to me addressed in generic rpc code in
libc in order to fix them all.
n_hibma [Thu, 18 Jun 2009 11:35:29 +0000 (11:35 +0000)]
Reverse some stuff I accidentally committed in the previous commit:
- creation of sparse files to speed up the build process. This was
discussed with phk 2 years ago and he disagreed with this change.
- handling of negative data partition sizes.
n_hibma [Thu, 18 Jun 2009 10:39:08 +0000 (10:39 +0000)]
Allow building world into a separate dir (for reuse in multiple images):
- buildworld and buildkernel are built into MAKEOBJDIRPREFIX
- installworld and installkernel are performed on NANO_OBJ.
No change of functionality if MAKEOBJDIRPREFIX is not set. If it is sea,t
clean_world deletes NANO_OBJ instead of NANO_WORLDDIR. By starting nanobsd.sh
with the -b option the existing world can be reused to build a new world
reducing time and disk space considerably.
While there:
- Fix two cases where (in comments) MAKEOBJDIRPREFIX should have been
NANO_DISKIMGDIR.
- Simplify an 'if (not wrong); then true; else action; fi' into
'if wrong; then action; fi'. 'if ! false; then echo hello; fi' produces hello.
Note: Make sure you use NANO_OBJ were you use MAKEOBJDIRPREFIX now in your
nanobsd.conf files if you want to split out.
rmacklem [Wed, 17 Jun 2009 22:50:26 +0000 (22:50 +0000)]
Since svc_[dg|vc|tli|tp]_create() did not hold a reference count on the
SVCXPTR structure returned by them, it was possible for the structure
to be free'd before svc_reg() had been completed using the structure.
This patch acquires a reference count on the newly created structure
that is returned by svc_[dg|vc|tli|tp]_create(). It also
adds the appropriate SVC_RELEASE() calls to the callers, except the
experimental nfs subsystem. The latter will be committed separately.
jilles [Wed, 17 Jun 2009 21:58:32 +0000 (21:58 +0000)]
Properly flush input after an error in backquotes in interactive mode.
For parsing an old-style backquote substitution (`...`),
a string "file" is used to store the contents of the
substitution (with the special backslash processing done).
If an error occurs, the shell cleans up all these files
(returning to the top level) and flush the top level
file. Erroneously, it first flushed the current file and
then cleaned up all extra files, so that the top level
file (i.e. the terminal) was not flushed.
Example (in interactive mode):
echo `for` echo This should not be printed
Also noticeable in (in interactive mode):
echo `(`
The old version prints an extraneous prompt.
jhb [Wed, 17 Jun 2009 19:50:38 +0000 (19:50 +0000)]
- Add the ability to mix multiple flags seperated by pipe ('|') characters
in the type field of system call tables. Specifically, one can now use
the 'NO*' types as flags in addition to the 'COMPAT*' types. For example,
to tag 'COMPAT*' system calls as living in a KLD via NOSTD. The COMPAT*
type is required to be listed first in this case.
- Add new functions 'type()' and 'flag()' to the embedded awk script in
makesyscalls.sh that return true if a requested flag is found in the
type field ($3). The flag() function checks all of the flags in the
field, but type() only checks the first flag. type() is meant to be
used in the top-level "switch" statement and flag() should be used
otherwise.
- Retire the CPT_NOA type, it is now replaced with "COMPAT|NOARGS" using
the flags approach.
- Tweak the comment descriptions of COMPAT[46] system calls so that they
say "freebsd[46] foo" rather than "old foo".
- Document the COMPAT6 type.
- Sync comments in compat32 syscall table with the master table.
snb [Wed, 17 Jun 2009 18:55:29 +0000 (18:55 +0000)]
Keep dirhash tailq locked throughout the entirety of ufsdirhash_destroy() to fix
a potential race pointed out by pjd. Also use TAILQ_FOREACH_SAFE to iterate over
dirhashes in ufsdirhash_lowmem(), so that we can continue iterating even after a
dirhash is destroyed.
jhb [Wed, 17 Jun 2009 18:44:15 +0000 (18:44 +0000)]
- NOSTD results in lkmressys being used instead of lkmssys.
- Mark nfsclnt as UNIMPL. It should have been NOSTD instead of NOIMPL back
when it lived in nfsclient.ko, but it was removed from that a long time
ago.
sam [Wed, 17 Jun 2009 17:57:52 +0000 (17:57 +0000)]
Add workaround to get IXP435 NPE-A working: reseting NPE-A after NPE-C
causes both to become inoperative; this apparently was done by the original
IAL code as a workaround for IMEM parity errors which we've not seen so
just disable the reset.
Note this problem does not occur on IXP425 boards. The linux driver does
fuse-resets on each NPE but in the order NPE-A < NPE-B < NPE-C (when probing
for which NPE's are present/operational); we may want to switch to a similar
scheme but for now disable the resets until we see an issue.
alc [Wed, 17 Jun 2009 17:19:48 +0000 (17:19 +0000)]
Refactor contigmalloc() into two functions: a simple front-end that deals
with the malloc tag and calls a new back-end, kmem_alloc_contig(), that
allocates the pages and maps them.
The motivations for this change are two-fold: (1) A cache mode parameter
will be added to kmem_alloc_contig(). In other words, kmem_alloc_contig()
will be extended to support the allocation of memory with caller-specified
caching. (2) The UMA allocation function that is used by the two jumbo
frames zones can use kmem_alloc_contig() in place of contigmalloc() and
thereby avoid having free jumbo frames held by the zone counted as live
malloc()ed memory.
nwhitehorn [Wed, 17 Jun 2009 16:34:40 +0000 (16:34 +0000)]
Teach cpu_est_clockrate() about the G5's slightly different PMC. This
allows the boot messages to include the CPU speed and makes possible
the forthcoming cpufreq support for the PPC 970.
kib [Wed, 17 Jun 2009 12:47:27 +0000 (12:47 +0000)]
For dotdot lookup in nfs_lookup, inline the vn_vget_ino() to prevent
operating on the unmounted mount point and freed mount data in case of
forced unmount performed while dvp is unlocked to nget the target vnode.
Add missed calls to m_freem(mrep) there on error exits [1].
rrs [Wed, 17 Jun 2009 12:34:56 +0000 (12:34 +0000)]
Changes to the NR-Sack code so that:
1) All bit disappears
2) The two sets of gaps (nr and non-nr) are
disjointed, you don't have gaps struck in
both places.
This adjusts us to coorespond to the new draft. Still
to-do, cleanup the code so that there are only one set
of sack routines (original NR-Sack done by E cloned all
sack code).
alc [Wed, 17 Jun 2009 04:57:32 +0000 (04:57 +0000)]
Make the maintenance of a page's valid bits by contigmalloc() more like
kmem_alloc() and kmem_malloc(). Specifically, defer the setting of the
page's valid bits until contigmapping() when the mapping is known to be
successful.
weongyo [Wed, 17 Jun 2009 04:15:19 +0000 (04:15 +0000)]
reorders the sequence when the device is detached. After detaching the
interface is completed then it'll process other parts to avoid a race
condition.
sam [Wed, 17 Jun 2009 02:55:53 +0000 (02:55 +0000)]
remove IAL vestige for defining the max data/instruction memory size;
instead of defining them according to ixp46x add new defines so we can
do this at run time
sam [Wed, 17 Jun 2009 02:53:05 +0000 (02:53 +0000)]
o correct default miibase for NPE-B and NPE-C; these values are
normally taken from the hints file so this should have no effect
o set the port address "just in case"
o add NPE-A support to the tx done qmgr callback
attilio [Wed, 17 Jun 2009 01:55:42 +0000 (01:55 +0000)]
Introduce support for adaptive spinning in lockmgr.
Actually, as it did receive few tuning, the support is disabled by
default, but it can opt-in with the option ADAPTIVE_LOCKMGRS.
Due to the nature of lockmgrs, adaptive spinning needs to be
selectively enabled for any interested lockmgr.
The support is bi-directional, or, in other ways, it will work in both
cases if the lock is held in read or write way. In particular, the
read path is passible of further tunning using the sysctls
debug.lockmgr.retries and debug.lockmgr.loops . Ideally, such sysctls
should be axed or compiled out before release.
Addictionally note that adaptive spinning doesn't cope well with
LK_SLEEPFAIL. The reason is that many (and probabilly all) consumers
of LK_SLEEPFAIL are mainly interested in knowing if the interlock was
dropped or not in order to reacquire it and re-test initial conditions.
This directly interacts with adaptive spinning because lockmgr needs
to drop the interlock while spinning in order to avoid a deadlock
(further details in the comments inside the patch).
Final note: finding someone willing to help on tuning this with
relevant workloads would be either very important and appreciated.
jhb [Tue, 16 Jun 2009 18:58:50 +0000 (18:58 +0000)]
- Change members of tcpcb that cache values of ticks from int to u_int:
t_rcvtime, t_starttime, t_rtttime, t_bw_rtttime, ts_recent_age,
t_badrxtwin.
- Change t_recent in struct timewait from u_long to u_int32_t to match
the type of the field it shadows from tcpcb: ts_recent.
- Change t_starttime in struct timewait from u_long to u_int to match
the t_starttime field in tcpcb.
kan [Tue, 16 Jun 2009 16:38:54 +0000 (16:38 +0000)]
FreeBSD returns main object handle from dlopen(NULL, ...) calls.
dlsym seaches using this handle are expected to look for symbol
definitions in all objects loaded at the program start time along
with all objects currently in RTLD_GLOBAL scope.