MFC r233809:
When process exists, not only the children shall be reparented to
init, but also the orphans shall be removed from the orphan list,
because the list header is destroyed.
Major pmap performance, concurrency, and correctness improvements, mostly
for the 64-bit PMAP module (64-bit-capable CPUs with either a 32-bit or
64-bit kernel). Thanks to alc for his help and prodding.
marius [Wed, 4 Apr 2012 21:19:19 +0000 (21:19 +0000)]
MFC: r233747, r233748
- Fix panic on kernel traps having a mapping in trap_sig b0rked in r206086.
Reported by: David E. Cross
- Remove checks that are redundant due to tf_type being unsigned.
MFC r233692:
Reenable unsolicited responses on CODEC if hdaa_sense_init() called again.
This fixes jack connection events handling after suspend/resume.
229988: Fix prototype formatting (indentation, long lines, and
continued lines).
230011: More prototype formatting fixes, struct member formatting fixes,
and namespace fix for property_find() prototype.
230037: Move struct pidfh definition into pidfile.c, and leave a forward
declaration for pidfh in libutil.h in its place.
This allows us to hide the contents of the pidfh structure, and also
allowed removal of the "#ifdef _SYS_PARAM_H" guard from around the
pidfile_* function prototypes.
230233: Fix more disorder in prototypes and constants.
Fix header comments for each section of constants.
Fix whitespace in #define lines.
Fix unnecessary parenthesis in constants.
230599: Restore the parenthesis that are necessary around the
constant values.
230600: Make the comments consistent (capitalization, punctuation, and
format).
230601: Consensus between bde and pjd seemed to be that if the function
names are lined up, then any * after a long type should appear after
the type instead of being in front of the function name on the
following line.
Prevent tmpfs_rename() deadlock in a way similar to UFS. Unlock
vnodes and try to lock them one by one. Relookup fvp and tvp.
Don't enforce LK_RETRY to get existing vnode in tmpfs_alloc_vp().
Doomed vnode is hardly of any use here, besides all callers handle
error case. vfs_hash_get() does the same. Don't mess with vnode
holdcount, vget() takes care of it already.
MFC r233231:
Fix several problems with our ELF filters implementation.
Do not relocate twice an object which happens to be needed by loaded
binary (or dso) and some filtee opened due to symbol resolution when
relocating need objects. Record the state of the relocation
processing in Obj_Entry and short-circuit relocate_objects() if
current object already processed.
Do not call constructors for filtees loaded during the early
relocation processing before image is initialized enough to run
user-provided code. Filtees are loaded using dlopen_object(), which
normally performs relocation and initialization. If filtee is
lazy-loaded during the relocation of dso needed by the main object,
dlopen_object() runs too earlier, when most runtime services are not
yet ready.
Postpone the constructors call to the time when main binary and
depended libraries constructors are run, passing the new flag
RTLD_LO_EARLY to dlopen_object(). Symbol lookups callers inform
symlook_* functions about early stage of initialization with
SYMLOOK_EARLY. Pass flags through all functions participating in
object relocation.
Use the opportunity and fix flags argument to find_symdef() in
arch-specific reloc.c to use proper name SYMLOOK_IN_PLT instead of
true, which happen to have the same numeric value.
MFC r233777 (by kan):
Do not try to adjust stacks if dlopen_object is called too early.
MFC r233778 (by kan):
Remove extra blank line from revious commit.
MFC note: the ARM and MIPS TLS support is not merged back, so the chunks
from r233231 which fix misuse of flags in calls to find_symdef() in
the corresponding relocation type handlers were not applied. When TLS
support is merged, the rest of r233231 should be applied too.
MFC r233746:
Be more conservative in using READ CAPACITY(16) command. Previous code
checked PROTECT bit in INQUIRY data for all SPC devices, while it is defined
only since SPC-3. But there are some SPC-2 USB devices were reported, that
have PROTECT bit set, return no error for READ CAPACITY(16) command, but
return wrong sector count value in response.
MFC 232700:
Add a new sched_clear_name() method to the scheduler interface to clear
the cached name used for KTR_SCHED traces when a thread's name changes.
This way KTR_SCHED traces (and thus schedgraph) will notice when a thread's
name changes, most commonly via execve().
marius [Mon, 2 Apr 2012 20:14:32 +0000 (20:14 +0000)]
MFC: r233701
- Remove erroneous trailing semicolon. [1]
- Correctly determine the maximum payload size for setting the TX link
frequent NACK latency and replay timer thresholds.
Properly mask off bits that are not supported in the IAP counters.
This fixes a bug where users would see massively large counts, near
to 2**64 -1, due to the bits not being cleared.
When detaching a unix domain socket, uipc_detach() checks
unp->unp_vnode pointer to detect if there is a vnode associated with
(binded to) this socket and does necessary cleanup if there is.
The issue is that after forced unmount this check may be too late as
the unp_vnode is reclaimed and the reference is stale.
To fix this provide a helper function that is called on a socket vnode
reclamation to do necessary cleanup.
MFC r233168:
If we ever allow for managed fictitious pages, the pages shall be
excluded from superpage promotions. At least one of the reason is
that pv_table is sized for non-fictitious pages only.
Consistently check for the page to be non-fictitious before accesing
superpage pv list.
MFC note: erronous chunks from r233168 which were fixed in r233185
are not included into the merge.
Improve the way to calculate available pages in tmpfs:
- Don't deduct wired pages from total usable counts because it does not
make any sense. To make things worse, on systems where swap size is
smaller than physical memory and use a lot of wired pages (e.g. ZFS),
tmpfs can suddenly have free space of 0 because of this;
- Count cached pages as available; [1]
- Don't count inactive pages as available, technically we could but that
might be too aggressive; [1]
Change the notes about the pidfile to include Doug's preference
for pre-creating the pidfile with appropriate owner and permissions.
Requested by dougb
r231909:
The pidfile_open(3) is going to be fixed to set close-on-exec in order
not to leak the descriptor after exec(3). This raises the issue for
daemon(3) of the pidfile lock to be lost when the child process
executes.
To solve this and also to have the pidfile cleaned up when the program
exits, if a pidfile is specified, spawn a child to exec the command
and wait in the parent keeping the pidfile locked until the child
process exits and remove the file.
If the supervising process receives SIGTERM, forward it to the spawned
process. Normally it will cause the child to exit followed by the
termination of the supervisor after removing the pidfile.
This looks like desirable behavior, because termination of a
supervisor usually supposes termination of its charge. Also it will
fix the issue with stale pid files after reboot due to init kills a
supervisor before its child exits.
r231911:
Add -r option to restart the program if it has been terminated.
Suggested by: Andrey Zonov <andrey zonov org>
r231912:
If permitted protect the supervisor against pageout kill.
marius [Sat, 31 Mar 2012 10:47:34 +0000 (10:47 +0000)]
MFC: r233427
Add cas(4), gem(4) and hme(4) to x86 GENERICs as suggested by netchild@ in
<20120222095239.Horde.0hpYHJjmRSRPRKzXsoFRbYk@webmail.leidinger.net>.
According to some private emails received, it apparently is not unpopular
to use at least Quad GigaSwift cards driven by cas(4) in x86 machines.
kib [Sat, 31 Mar 2012 06:50:27 +0000 (06:50 +0000)]
MFC r233101:
Add sysctl vfs.nfs.nfs_keep_dirty_on_error to switch the nfs client
behaviour on error from write RPC back to behaviour of old nfs client.
When set to not zero, the pages for which write failed are kept dirty.
kib [Sat, 31 Mar 2012 06:44:48 +0000 (06:44 +0000)]
MFC r233100:
In vm_object_page_clean(), do not clean OBJ_MIGHTBEDIRTY object flag
if the filesystem performed short write and we are skipping the page
due to this.
Propogate write error from the pager back to the callers of
vm_pageout_flush(). Report the failure to write a page from the
requested range as the FALSE return value from vm_object_page_clean(),
and propagate it back to msync(2) to return EIO to usermode.
While there, convert the clearobjflags variable in the
vm_object_page_clean() and arguments of the helper functions to
boolean.
marius [Sat, 31 Mar 2012 01:21:46 +0000 (01:21 +0000)]
MFC: r233423
Initialize the mutexes used for the NVM and the swflag as MTX_DUPOK in
order to avoid otherwise harmless witness warnings when these are acquired
at the same time and due to both using MTX_NETWORK_LOCK as their type.
The right fix actually would be to use different, descriptive types for
these. However, the latter would require undesirable changes to the shared
code base. Another approach would be to just supply NULL as the type, which
was deemed as less desirable though as it would cause the unique but cryptic
name also to be used for the type and to diverge from the type used by other
network device drivers.
marius [Sat, 31 Mar 2012 00:13:41 +0000 (00:13 +0000)]
MFC: r233421
Given that this is a host-PCI-Express bridge driver, create the parent
DMA tag with a 4 GB boundary as required by PCI-Express. With r232403
(MFC'ed to stable/9 in r233393) in place this actually is redundant.
However, the host-PCI-Express bridge driver is the more appropriate
place for implementing this restriction.
marius [Sat, 31 Mar 2012 00:10:16 +0000 (00:10 +0000)]
MFC: r233403, r233404
- Use the PCI ID macros from mpi_cnfg.h rather than duplicating them here.
Note that this driver additionally probes some device IDs for the most
part not know to other MPT drivers, if at all. So rename the macros not
present in mpi_cnfg.h to match the naming scheme in the latter and but
suffix them with a _FB in order to not cause conflicts.
- Like mpt_set_config_regs(), comment out mpt_read_config_regs() as the
content of the registers read isn't actually used and both functions
aren't exactly up to date regarding the possible layouts of the BARs
(these function might be helpful for debugging though, so don't remove
them completely).
- Use DEVMETHOD_END.
- Use NULL rather than 0 for pointers.
- Remove an unusual check for the softc being NULL.
- Remove redundant zeroing of the softc.
- Remove an overly banal and actually partly incorrect as well as partly
outdated comment regarding the allocation of the memory resource.
marius [Fri, 30 Mar 2012 23:50:16 +0000 (23:50 +0000)]
MFC: r233274
Remove remnants of ATA_LOCKING uses in the ATA_CAM case and wrap it
along with functions, SYSCTLs and tunables that are not used with
ATA_CAM in #ifndef ATA_CAM, similar to the existing #ifdef'ed ATA_CAM
code for the other way around. This makes it easier to understand
which parts of ata(4) actually are used in the new world order and
to later on remove the !ATA_CAM bits. It also makes it obvious that
there is something fishy with the C-bus front-end as well as in the
ATP850 support, as these used ATA_LOCKING which is defunct in the
ATA_CAM case. When fixing the former, ATA_LOCKING probably needs to
be brought back in some form or other.
kib [Fri, 30 Mar 2012 09:38:35 +0000 (09:38 +0000)]
MFC r232974:
Stop using strerror(3) in rtld, which brings in msgcat and stdio.
Directly access sys_errlist array of errno messages with private
rtld_strerror() function.
kib [Fri, 30 Mar 2012 09:36:12 +0000 (09:36 +0000)]
MFC r232861:
Provide rtld-private implementations of __stack_chk_guard,
__stack_chk_fail() and __chk_fail() symbols, to be used by functions
linked from libc_pic.a.
kib [Fri, 30 Mar 2012 09:34:19 +0000 (09:34 +0000)]
MFC r232831:
Add support for preinit, init and fini arrays to rtld.
Only binaries marked with proper ABI note gets array ctr/dtrs called.
MFC r232856:
When iterating over the dso program headers, the object is not initialized
yet, and object segments are not yet mapped. Only parse the notes that
appear in the first page of the dso (as it should be anyway), and use
the preloaded page content.
MFC r232857 (by dim):
Fix a warning/error with clang.
MFC r232859 (by dim):
Amend r232857, now dropping the casts entirely, as they were not
necessary at all.
rmh [Thu, 29 Mar 2012 13:01:29 +0000 (13:01 +0000)]
MFC r233096:
Hide a few declarations from userland (including `struct inpcbgroup'). This
removes the dependency on <machine/param.h> which was introduced with SVN
rev 222748 (due to CACHE_LINE_SIZE).
kib [Thu, 29 Mar 2012 09:16:10 +0000 (09:16 +0000)]
MFC r232835:
Do not fall back to slow synchronous i/o when low on memory or buffers.
The bawrite() schedules the write to happen immediately, and its use
frees the current thread to do more cleanups.
mckusick [Wed, 28 Mar 2012 21:34:55 +0000 (21:34 +0000)]
MFC of 232351, 233438, and 233629
MFC reviewed by: kib
MFC 232351:
This change avoids a kernel deadlock on "snaplk" when using
snapshots on UFS filesystems running with journaled soft updates.
This is the first of several bugs that need to be fixed before
removing the restriction added in -r230250 to prevent the use
of snapshots on filesystems running with journaled soft updates.
The deadlock occurs when holding the snapshot lock (snaplk)
and then trying to flush an inode via ffs_update(). We become
blocked by another process trying to flush a different inode
contained in the same inode block that we need. It holds the
inode block for which we are waiting locked. When it tries to
write the inode block, it gets blocked waiting for the our
snaplk when it calls ffs_copyonwrite() to see if the inode
block needs to be copied in our snapshot.
The most obvious place that this deadlock arises is in the
ffs_copyonwrite() routine when it updates critical metadata
in a snapshot and tries to write it out before proceeding.
The fix here is to write the data and indirect block pointer
for the snapshot, but to skip the call to ffs_update() to
write the snapshot inode. To ensure that we will never have
to update a pointer in the inode itself, the ffs_snapshot()
routine that creates the snapshot has to ensure that all the
direct blocks are allocated as part of the creation of the
snapshot.
A less obvious place that this deadlock occurs is when we hold
the snaplk because we are deleting a snapshot. In the course of
doing the deletion, we need to allocate various soft update
dependency structures and allocate some journal space. If we
hit a resource limit while doing this we decrease the resources
in use by flushing out an existing dirty file to get it to give
up the soft dependency resources that it holds. The flush can
cause an ffs_update() to be done on the inode for the file that
we have selected to flush resulting in the same deadlock as
described above when the inode that we have chosen to flush
resides in the same inode block as the snapshot inode that we hold.
The fix is to defer cleaning up any time that the inode on which
we are operating is a snapshot.
Help and review by: Jeff Roberson
Tested by: Peter Holm
MFC 233438:
Add a third flags argument to ffs_syncvnode to avoid a possible conflict
with MNT_WAIT flags that passed in its second argument.
Discussed with: kib
MFC 233629:
A refinement of change 232351 to avoid a race with a forcible unmount.
While we have a snapshot vnode unlocked to avoid a deadlock with another
inode in the same inode block being updated, the filesystem containing
it may be forcibly unmounted. When that happens the snapshot vnode is
revoked. We need to check for that condition and fail appropriately.
gibbs [Wed, 28 Mar 2012 19:40:58 +0000 (19:40 +0000)]
MFC Revision 233465
Correct failure to attach the PV block front device on Citrix
XenServer configurations that advertise the multi-page ring extension,
but only allow a single page of ring space.
sys/dev/xen/blkfront/blkfront.c:
If only one page of ring space is being used, do not publish
in the XenStore the number of pages in use (1), via either
of the supported multi-page ring extension schemes.
Single page operation is the same with or without the
ring-page extension being negotiated. Relying on the
legacy behavior avoids an incompatible difference in how
the two ring-page extension schemes that are out in the
wild, deal with the base case of a single page. The
Amazon/Red Hat drivers use the same XenStore variable as
if the extension was not negotiated. The Citrix drivers
assume the new ring reference XenStore variables will be
available
Reported by: Oliver Schonefeld <schonefeld@ids-mannheim.de>
zec [Wed, 28 Mar 2012 12:45:35 +0000 (12:45 +0000)]
MFC r232517:
Change SYSINIT priorities so that ip_mroute_modevent() is executed
before vnet_mroute_init(), since vnet_mroute_init() depends on mfchashsize
tunable to be set, and that is done in in ip_mroute_modevent().
Apparently I broke that ordering with r208744 almost 2 years ago...
mav [Wed, 28 Mar 2012 11:37:06 +0000 (11:37 +0000)]
MFC r232207, r232454:
Rework CPU load balancing in SCHED_ULE:
- In sched_pickcpu() be more careful taking previous CPU on SMT systems.
Do it only if all other logical CPUs of that physical one are idle to avoid
extra resource sharing.
- In sched_pickcpu() change general logic of CPU selection. First
look for idle CPU, sharing last level cache with previously used one,
skipping SMT CPU groups. If none found, search all CPUs for the least loaded
one, where the thread with its priority can run now. If none found, search
just for the least loaded CPU.
- Make cpu_search() compare lowest/highest CPU load when comparing CPU
groups with equal load. That allows to differentiate 1+1 and 2+0 loads.
- Make cpu_search() to prefer specified (previous) CPU or group if load
is equal. This improves cache affinity for more complicated topologies.
- Randomize CPU selection if above factors are equal. Previous code tend
to prefer CPUs with lower IDs, causing unneeded collisions.
- Rework periodic balancer in sched_balance_group(). With cpu_search()
more intelligent now, make balansing process flat, removing recursion
over the topology tree. That fixes double swap problem and makes load
distribution more even and predictable.
All together this gives 10-15% performance improvement in many tests on
CPUs with SMT, such as Core i7, for number of threads is less then number
of logical CPUs. In some tests it also gives positive effect to systems
without SMT.
mav [Wed, 28 Mar 2012 10:15:42 +0000 (10:15 +0000)]
MFC r232852:
Tune cpuset macros to optimize cases when CPU_SETSIZE fits into single
machine word. For example, it turns CPU_SET() into expected shift and OR,
removing extra shift, AND and additional index on memory access.
Generated code checked for kernel (optimized) and user-level (unoptimized)
cases with GCC and CLANG.
jkim [Tue, 27 Mar 2012 23:59:48 +0000 (23:59 +0000)]
MFC: r233042, r233054, r233056, r233187
- Do not unnecessarily clear display memory when switching modes.
- Remove unnecessary static variable initializations and duplicate codes.
- Consistently use bcopy(9) over memcpy(9).
- Save and restore linear frame buffer between suspend and resume.
yongari [Mon, 26 Mar 2012 05:14:04 +0000 (05:14 +0000)]
MFC r232951,232953,233158:
r232951:
fxp(4) does not handle deferred dma map loading. Tell
bus_dmamap_load(9) that it should return immediately with error
when there are insufficient mapping resources.
r232953:
Fix white space nits.
r233158:
Do not change current media when driver is already running. If
driver is running driver would have already completed flow control
configuration. This change removes unnecessary media changes in
controller reconfiguration cases such that it does not trigger link
reestablishment for configuration change requests like promiscuous
mode change.
Reported by: Many
Tested by: Mike Tancsa <mike <> sentex dot net>
yongari [Mon, 26 Mar 2012 04:47:06 +0000 (04:47 +0000)]
MFC r232849-232850:
r232849:
Show PCI bus speed and width as well as running mode of PCI-X
device in device attach. This would help to narrow down issue to a
specific controller and operating mode of the controller.
While I'm here rename BGE_MISCCFG_BOARD_ID with
BGE_MISCCFG_BOARD_ID_MASK.
r232850:
Make if_ierrors updated whenever any of the following counters are
updated.
o Number of times NIC ran out of RX buffer descriptors
o Number of inbound packet errors
o Number of inbound packets that were chosen to be discarded
Previously only the discarded packet counter was used to update
if_ierrors. This change fixes wrong if_ierrors counter on
BCM570[0-4] controllers. For BCM5705 and later controllers bge(4)
already correctly counted it.
yongari [Mon, 26 Mar 2012 04:36:22 +0000 (04:36 +0000)]
MFC r232848:
Add workaround for PCI-X BCM5704 controller that live behind
AMD-8131 PCI-X bridge. The bridge seems to reorder write access to
mailbox registers such that it caused watchdog timeouts by
out-of-order TX completions.
Tested by: Michael L. Squires <mikes <> siralan dot org >
yongari [Mon, 26 Mar 2012 04:27:01 +0000 (04:27 +0000)]
MFC r232246:
Prefer RL_GMEDIASTAT register to RGEPHY_MII_SSR register to
extract a link status of PHY when parent driver is re(4).
RGEPHY_MII_SSR register does not seem to report correct PHY status
on some integrated PHYs used with re(4).
Unfortunately, RealTek PHYs have no additional information to
differentiate integrated PHYs from external ones so relying on PHY
model number is not enough to know that. However, it seems
RGEPHY_MII_SSR register exists for external RealTek PHYs so
checking parent driver would be good indication to know which PHY
was used. In other words, for non-re(4) controllers, the PHY is
external one and its revision number is greater than or equal to 2.
This change fixes intermittent link UP/DOWN messages reported on
RTL8169 controller.
Also, mii_attach(9) is tried after setting interface name since
rgephy(4) have to know parent driver name.
yongari [Mon, 26 Mar 2012 03:54:19 +0000 (03:54 +0000)]
MFC r232145:
Use correct Config registers for RTL8139 family. Unlike RTL8168 and
RTL810x family , RTL8139 has different register map for Config
registers.
While here, follow the lead of re(4) in WOL configuration.
- Disable WOL_UCAST and WOL_MCAST capabilities by default.
- Config5 register write does not need to unlock EEPROM access
on RTL8139 family but unlocking EEPROM access does not affect
its operation and make it consistent with re(4).
Reported by: Matt Renzelmann mjr <> cs dot wisc dot edu
yongari [Mon, 26 Mar 2012 03:45:46 +0000 (03:45 +0000)]
MFC r232019,232021,232025,232027,232029,232031,232040:
r232019:
Give hardware chance to drain active DMA cycles.
r232021:
If there are not enough RX buffers, release partially allocated RX
buffers.
r232025:
Introduce sf_ifmedia_upd_locked() and have driver reset PHY before
switching to selected media. While here, set if_drv_flags before
switching to selected media.
r232027:
No need to reprogram hardware RX filter when driver is not running.
r232029:
Remove taskqueue based MII stat change handler.
Driver does not need deferred link state change processing.
While I'm here, do not report current link status if interface is
not UP.
r232031:
With r232015, sf(4) gets correct speed/duplex of established link.
Add more strict speed check in sf_miibus_statchg() and do not touch
MAC config registers when driver lost a link.
r232040:
Add check for IFF_DRV_RUNNING flag after serving an interrupt and
don't give RX path more priority than TX path.
Also remove infinite loop in interrupt handler and limit number of
iteration to 32. This change addresses system load fluctuations
under high network load.