davidxu [Thu, 3 May 2012 03:05:18 +0000 (03:05 +0000)]
Merge 233103, 233912 from head:
233103:
Some software think a mutex can be destroyed after it owned it, for
example, it uses a serialization point like following:
pthread_mutex_lock(&mutex);
pthread_mutex_unlock(&mutex);
pthread_mutex_destroy(&muetx);
They think a previous lock holder should have already left the mutex and
is no longer referencing it, so they destroy it. To be maximum compatible
with such code, we use IA64 version to unlock the mutex in kernel, remove
the two steps unlocking code.
233912:
umtx operation UMTX_OP_MUTEX_WAKE has a side-effect that it accesses
a mutex after a thread has unlocked it, it event writes data to the mutex
memory to clear contention bit, there is a race that other threads
can lock it and unlock it, then destroy it, so it should not write
data to the mutex memory if there isn't any waiter.
The new operation UMTX_OP_MUTEX_WAKE2 try to fix the problem. It
requires thread library to clear the lock word entirely, then
call the WAKE2 operation to check if there is any waiter in kernel,
and try to wake up a thread, if necessary, the contention bit is set again
by the operation. This also mitgates the chance that other threads find
the contention bit and try to enter kernel to compete with each other
to wake up sleeping thread, this is unnecessary. With this change, the
mutex owner is no longer holding the mutex until it reaches a point
where kernel umtx queue is locked, it releases the mutex as soon as
possible.
Performance is improved when the mutex is contensted heavily. On Intel
i3-2310M, the runtime of a benchmark program is reduced from 26.87 seconds
to 2.39 seconds, it even is better than UMTX_OP_MUTEX_WAKE which is
deprecated now. http://people.freebsd.org/~davidxu/bench/mutex_perf.c
Special code for stable/8:
And add code to detect if the UMTX_OP_MUTEX_WAKE2 is available.
mav [Wed, 2 May 2012 07:22:58 +0000 (07:22 +0000)]
MFC r234415:
Some improvements to GEOM MULTIPATH:
- Implement "configure" command to allow switching operation mode of
running device on-fly without destroying and recreation.
- Implement Active/Read mode as hybrid of Active/Active and Active/Passive.
In this mode all paths not marked FAIL may handle reads same time,
but unlike Active/Active only one path handles write requests at any
point in time. It allows to closer follow original write request order
if above layers need it for data consistency (not waiting for requisite
write completion before sending dependent write).
- Hide duplicate messages about device status change.
- Remove periodic thread wake up with 10Hz rate.
mav [Wed, 2 May 2012 07:05:20 +0000 (07:05 +0000)]
MFC r222643:
When possible, join ranges of subsequest BIO_DELETE requests to handle more
(up to 2048 instead of 256 or even 64) of them with single TRIM request.
OCZ Vertex2/Vertex3 SSDs can handle no more then 64 ranges per TRIM request.
Due to lack of BIO_DELETE clustering now, it means that we could delete no
more then 2MB per request (on FS with 32K block) with limited request rate.
This change increases delete rate on Vertex2 from 250MB/s to 950MB/s.
MFC r222643:
Increase maximum supported number of ranges per TRIM command from 256 to 512
to use full potential of Intel X25-M SSDs. On synthetic test with 32K ranges
it gives about 20% speedup, which probably costs more then 2K of RAM.
mav [Wed, 2 May 2012 06:52:00 +0000 (06:52 +0000)]
Merge ATA_CAM compatibility shims. While 8-STABLE doesn't have ATA_CAM
enabled by default, this should make migration easier for users enabling
it manually.
r221071:
Add shim to simplify migration to the CAM-based ATA. For each new adaX
device in /dev/ create symbolic link with adY name, trying to mimic old ATA
numbering. Imitation is not complete, but should be enough in most cases to
mount file systems without touching /etc/fstab.
r221384:
Do not report legacy unit numbers (do not create legacy aliases) for disks
on port multiplier ports above first two. They don't fit into ATA_STATIC_ID
scheme and so can't be mapped properly. No need to pollute dev.
delphij [Wed, 2 May 2012 00:31:09 +0000 (00:31 +0000)]
MFC r233770:
Eliminate two cases of unwanted strncpy(). The name is not required
by the current code, and the results would get overwritten anyway
by subsequent memset().
kib [Tue, 1 May 2012 11:45:16 +0000 (11:45 +0000)]
MFC r234657:
Take the spinlock around clearing of the fp->_flags in fclose(3), which
indicates the avaliability of FILE, to prevent possible reordering of
the writes as seen by other CPUs.
MFC r233507:
Use program exit status as pam_exec return code (optional)
pam_exec(8) now accepts a new option "return_prog_exit_status". When
set, the program exit status is used as the pam_exec return code. It
allows the program to tell why the step failed (eg. user unknown).
However, if it exits with a code not allowed by the calling PAM service
module function (see $PAM_SM_FUNC below), a warning is logged and
PAM_SERVICE_ERR is returned.
The following changes are related to this new feature but they apply no
matter if the "return_prog_exit_status" option is set or not.
The environment passed to the program is extended:
o $PAM_SM_FUNC contains the name of the PAM service module function
(eg. pam_sm_authenticate).
o All valid PAM return codes' numerical values are available
through variables named after the return code name. For instance,
$PAM_SUCCESS, $PAM_USER_UNKNOWN or $PAM_PERM_DENIED.
pam_exec return code better reflects what went on:
o If the program exits with !0, the return code is now
PAM_PERM_DENIED, not PAM_SYSTEM_ERR.
o If the program fails because of a signal (WIFSIGNALED) or doesn't
terminate normally (!WIFEXITED), the return code is now
PAM_SERVICE_ERR, not PAM_SYSTEM_ERR.
o If a syscall in pam_exec fails, the return code remains
PAM_SYSTEM_ERR.
waitpid(2) is called in a loop. If it returns because of EINTR, do it
again. Before, it would return PAM_SYSTEM_ERR without waiting for the
child to exit.
Several log messages now include the PAM service module function name.
MFC r234459:
Fix a bug where we copy out more data from a mbuf chain that are
ctually in it. This happens when SCTP receives an unknown chunk, which
requires the sending of an ERROR chunk, and there is no final padding but
the chunk is not 4-byte aligned.
Reported by yueting via rwatson@
MFC r234556:
When MAP_STACK mapping is created, the map entry is created only to
cover the initial stack size. For MCL_WIREFUTURE maps, the subsequent
call to vm_map_wire() to wire the whole stack region fails due to
VM_MAP_WIRE_NOHOLES flag.
Use the VM_MAP_WIRE_HOLESOK to only wire mapped part of the stack.
Export the udp_cksum sysctl for upcoming SCTP work. Rather than always,
SCTP will only do IPv4 UDP checksum calculation as defined by the host
policy. When tunneling SCTP always calculates the inner checksum already
so not doing the outer UDP can save cycles.
MFC r234038
If a page belonging a reservation is cached, then mark the reservation so
that it will be freed to the cache pool rather than the default pool.
Otherwise, the cached pages within the reservation may be recycled sooner
than necessary.
MFC r234692:
Read backup GPT header from the last LBA only when primary GPT header and
table aren't valid. If they are ok, use hdr_lba_alt value to read backup
header. This will make gptboot happy when GPT used atop of some GEOM
provider, e.g. GEOM_MIRROR.
Free mbuf in case when protocol in unknown in ng_ipfw_rcvdata().
This change fixes (theoretically) possible mbuf leak introduced in
r225586. Reorder code a bit and change return codes to be more specific
- Add ipfw eXtended tables permitting radix to be used for any kind of keys.
- Add support for IPv6 and interface extended tables
- Make number of tables to be changed in runtime in range 0..65534.
- Use IP_FW3 opcode for all new extended table cmds
No ABI changes are introduced. Old userland will see valid tables for
IPv4 tables and no entries otherwise. Flush works for any table.
IP_FW3 socket option is used to encapsulate all new opcodes:
/* IP_FW3 header/opcodes */
typedef struct _ip_fw3_opheader {
uint16_t opcode; /* Operation opcode */
uint16_t reserved[3]; /* Align to 64-bit boundary */
} ip_fw3_opheader;
New opcodes added:
IP_FW_TABLE_XADD, IP_FW_TABLE_XDEL, IP_FW_TABLE_XGETSIZE, IP_FW_TABLE_XLIST
ipfw(8) table argument parsing behavior is changed:
'ipfw table 999 add some-unqualified-host' now assumes
'some-unqualified-host' to be interface name instead of hostname.
New tunable:
net.inet.ip.fw.tables_max controls number of table supported by ipfw in given
VNET instance. 128 is still the default value.
Sysctl change:
net.inet.ip.fw.tables_max is now read-write.
New syntax:
ipfw add skipto tablearg ip from any to any via table(42) in
ipfw add skipto tablearg ip from any to any via table(4242) out
This is a bit hackish, special interface name '\1' is used to signal interface
table number is passed in p.glob field.
MFC r234121:
Back out r228476.
r228476 fixed superfluous link UP/DOWN messages but broke IPMI
access during boot. It's not clear why r228476 breaks IPMI and
should be revisited.
Reported by: Paul Guyot <paulguyot <> ieee dot org >
MFC r233387:
Use suspend/resume methods provided by net80211. This ensures that the
appropriate state handling takes place, not doing so results in the
device doing nothing until manual intervention.
Improve the way to calculate available pages in tmpfs:
- Don't deduct wired pages from total usable counts because it does not
make any sense. To make things worse, on systems where swap size is
smaller than physical memory and use a lot of wired pages (e.g. ZFS),
tmpfs can suddenly have free space of 0 because of this;
- Count cached pages as available; [1]
- Don't count inactive pages as available, technically we could but that
might be too aggressive; [1]
Don astbestos garment and remove the warning about TMPFS being experimental
-- highly experimental even. So far the closest to a bug in TMPFS that people
have gotten to relates to how ZFS can take away from the memory that TMPFS
needs. One can argue that such is not a bug in TMPFS. Irrespective, even if
there is a bug here and there in TMPFS, it's not in our own advantage to
scare people away from using TMPFS. I for one have been using it, even with
ZFS, very successfully.
Change SIGUSR1 to SIGTHR to properly wake up a process that is being
traced. The use of SIGUSR1 caused traced processes (those attached to
with dtrace -p) to exit when dtrace exited.
marius [Fri, 20 Apr 2012 14:45:57 +0000 (14:45 +0000)]
MFC: r222475
Fix read_ivar implementation for MMC and SD.
1. Both mmc_read_ivar() and sdhci_read_ivar() use the expression
'*(int *)result = val' to assign to result which is uintptr_t *.
This does not work on big-endian 64 bit systems.
2. The media_size ivar is declared as 'off_t' which does not fit
into uintptr_t in 32bit systems, change this to long.
Submitted by: kanthms at netlogicmicro com (initial version)
Export vinactive() from kern/vfs_subr.c (e.g., make it no longer
static and declare its prototype in sys/vnode.h) so that it can be
called from process_deferred_inactive() (in ufs/ffs/ffs_snapshot.c)
instead of the body of vinactive() being cut and pasted into
process_deferred_inactive().
MFC r234172:
Add thread-private flag to indicate that error value is already placed
in td_errno. Flag is supposed to be used by syscalls returning
EJUSTRETURN because errno was already placed into the usermode frame
by a call to set_syscall_retval(9). Both ktrace and dtrace get errno
value from td_errno if the flag is set.
Use the flag to fix sigsuspend(2) error return ktrace records.
MFC r233176:
Add new GEOM_PART_LDM module that implements the Logical Disk Manager
scheme. The LDM is a logical volume manager for MS Windows NT and it
is also known as dynamic volumes. It supports about 2000 partitions
and also provides the capability for software RAID implementations.
This version implements only partitioning scheme capability and based
on the linux-ntfs project documentation and several publications across
the Web. NOTE: JBOD, RAID0 and RAID5 volumes aren't supported.
An access to the LDM metadata is read-only. When LDM is on the disk
partitioned with MBR we can also destroy metadata. For the GPT
partitioned disks destroy action is not supported.
MFC r233177:
Connect geom_part_ldm module to the build.
MFC r233178:
Connect geom_part_ldm to the kernel build.
MFC r233181:
Add CTLFLAG_TUN to sysctls.
MFC r233651:
Do proper cleanup for the GPT case when an error occurs.
MFC r233652:
VMDB offset should be greater than logical volume size only for MBR.
Some software think a mutex can be destroyed after it owned it, for
example, it uses a serialization point like following:
pthread_mutex_lock(&mutex);
pthread_mutex_unlock(&mutex);
pthread_mutex_destroy(&muetx);
They think a previous lock holder should have already left the mutex and
is no longer referencing it, so they destroy it. To be maximum compatible
with such code, we use IA64 version to unlock the mutex in kernel, remove
the two steps unlocking code.
MFC r232799:
- add comments to syscalls.master and linux(32)_dummy about which linux
kernel version introduced the sysctl (based upon a linux man-page)
- add comments to syscalls.master regarding some names of sysctls which are
different than the linux-names (based upon the linux unistd.h)
- add some dummy sysctls
- name an unimplemented sysctl
If hastd is invoked with "-P pidfile" option always create pidfile
regardless of whether -F (foreground) option is set or not.
Also, if -P option is specified, ignore pidfile setting from configuration
not only on start but on reload too. This fixes the issue when for hastd
run with -P option reload caused the pidfile change.
MFC r233000:
Add MODULE_DEPEND() to geom_part modules.
MFC r233342:
Check that scheme is not already registered. This may happens when a
KLD is preloaded with loader(8) and leads to infinity loop.
Also do not return EEXIST error code from MOD_LOAD handler, because
we have undocumented(?) ability replace kernel's module with preloaded one.
And if we have so, then preloaded module will be initialized first.
Thus error in MOD_LOAD handler will be triggered for the kernel.
MFC 233670,233671:
- Use VM_MEMATTR_UNCACHEABLE for the constant for UC memory rather than
VM_MEMATTR_UNCACHED on mips.
- Rename VM_MEMATTR_UNCACHED to VM_MEMATTR_WEAK_UNCACHEABLE on x86 to
be less ambiguous and more clearly identify what it means. An alias
from VM_MEMATTR_WEAK_UNCACHEABLE to VM_MEMATTR_WEAK_UNCACHED remains
on x86 to preserve the KPI.
- Remove the VM_MEMATTR_UNCACHED alias from powerpc.
MFC 219737,219740,233676:
Improve handling of MSI interrupts with HyperTransport devices:
- Always enable the HyperTransport MSI mapping window for HyperTransport
to PCI bridges (these show up as HyperTransport slave devices) on
powerpc.
- Enable the HyperTransport MSI mapping window for HyperTransport to PCI
bridges when an interrupt is configured via the PCIB_MAP_MSI() call
on x86.
MFC 232228,233613:
Move the DTrace return IDT vector back up from 0x20 to 0x92. The 0x20
vector is currently dedicated to servicing IRQ 0 from the 8259A's, so
it shouldn't be overloaded for DTrace.
MFC 232742:
MFamd64:
- Return failure for a suspend attempt if we have no wake address.
- Use intr_disable()/intr_restore() instead of ACPI_DISABLE_IRQS().
- Invoke intr_suspend() earlier and call intr_resume() if suspend
fails.
- Use pause in the loop waiting for CPU to suspend.
- Restore PAT MSR, switchtime, switchticks, and MTRRs on resume.
MFC r233688-233689:
r233688:
Remove task queue based link state change handler. Driver no longer
needs to defer link state handling.
While I'm here, mark IFF_DRV_RUNNING before changing media. If
link is established without any delay, that link state change
handling could be lost.
r233689:
Do not report current link status if driver is not running.
This change also workarounds dhclient's link state handling bug by
not giving current link status.
Unlike other controllers, ale(4)'s PHY hibernation perfectly works
such that driver does not see a valid link if the controller is not
brought up. If dhclient(8) runs on ale(4) it will blindly waits
until link UP and then gives up after 10 seconds. Because
dhclient(8) still thinks interface got a valid link when IFM_AVALID
is not set for selected media, this change makes dhclient initiate
DHCP without waiting for link UP.
MFC r233585-233587:
r233585:
Partially revert r223608 and selectively allow microcode loading
for 82550C. For 82550 controllers this change restores CPUSaver
microcode loading. Due to silicon bug on 82550 and 82550C with
server extension, these controllers seem to require CPUSaver
microcode to receive fragmented UDP datagrams. However the
microcode shouldn't be used on client featured 82550C as it locks
up the controller. In addition, client featured 82550C does not
have the silicon bug. Also clear temporary memory used for
microcode loading since the same memory area is used for other
commands.
While I'm here use 82550C in probe message instead of generic
82550.
Reported by: Andreas Longwitz <longwitz <> incore de>
Tested by: Andreas Longwitz <longwitz <> incore de>
r233586:
Load entire EEPROM contents in device attach time and verify
whether the checksum of EEPROM is valid or not. Because driver
heavily relies on EEPROM information when it selectively enables
features/workarounds, it would be helpful to know whether driver
sees valid EEPROM.
While I'm here remove all other EEPROM accesses since the entire
EEPROM is loaded at device attach time.
r233587:
Remove unnecessary #if as the software workaround for PCI protocol
violation should be activated unless the system is cold-booted
after updating EEPROM.
The PCI protocol violation happens only when established link is
10Mbps so the workaround should be updated whenever link state
change is detected. Previously the workaround was activated only
when user checks current media status with ifconfig(8).