neel [Sat, 17 Aug 2013 19:49:08 +0000 (19:49 +0000)]
Bump up the maximum addressable memory on amd64 systems from 1TB to 4TB.
Bump up the KVA size proportionally from 512GB to 2TB.
The number of page table pages used by the direct map is now calculated at
run time based on 'Maxmem'. This means that small-memory systems will not
pay any additional tax in page table pages for the direct map. However,
all amd64 systems, regardless of memory size, will use 3 more pages to
accommodate the bump in the KVA size.
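A hedged sketch of the run-time sizing (loosely following the amd64
create_pagetables() logic; variable names are assumptions):

    /*
     * Size the direct map's page table pages from Maxmem instead of
     * a fixed maximum.  NBPDP is the bytes mapped by one 1GB PDP
     * entry; ptoa() converts pages to bytes.
     */
    ndmpdp = howmany(ptoa(Maxmem), NBPDP);
    if (ndmpdp < 4)
            ndmpdp = 4;             /* always map at least 4GB */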
More details available here:
http://lists.freebsd.org/pipermail/freebsd-hackers/2013-June/043015.html
http://lists.freebsd.org/pipermail/freebsd-current/2013-July/043143.html
Tested with the following configurations:
- Sandybridge server with 64GB of memory.
- bhyve VM with 64MB of memory.
- bhyve VM with 8GB of memory, with the memory segment above 4GB cuddling
right up against the 4TB maximum memory limit.
Discussed on: hackers@, current@
Submitted by: Chris Torek (torek@torek.net)
jilles [Sat, 17 Aug 2013 19:24:58 +0000 (19:24 +0000)]
libc: Access _logname_valid more efficiently.
The variable _logname_valid is not exported via the version script;
therefore, change C and i386/amd64 assembler code to remove indirection
(which allowed interposition). This makes the code slightly smaller and
faster.
Also, remove #define PIC_GOT from i386/amd64 in !PIC mode. Without PIC,
there is no place containing the address of each variable, so there is no
possible definition for PIC_GOT.
andrew [Sat, 17 Aug 2013 18:51:38 +0000 (18:51 +0000)]
Rename device vfp to option VFP and retire the ARM_VFP_SUPPORT option. This
simplifies configuration: previously both options had to be enabled, now
only a single option is needed.
bryanv [Sat, 17 Aug 2013 17:02:43 +0000 (17:02 +0000)]
Do not use potentially stale thread in kthread_add()
When an existing process is provided, the thread selected to initialize
the new thread from could have exited and been reaped. Acquire the proc
lock earlier to ensure the thread remains valid.
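A minimal sketch of the change (names follow kern_kthread.c, but the
surrounding code is simplified and partly assumed):

    /*
     * Take the proc lock before selecting the template thread, so it
     * cannot exit and be reaped between selection and use.
     */
    PROC_LOCK(p);
    oldtd = FIRST_THREAD_IN_PROC(p);        /* now chosen under the lock */
    bcopy(&oldtd->td_startcopy, &newtd->td_startcopy,
        __rangeof(struct thread, td_startcopy, td_endcopy));
    thread_link(newtd, p);
    PROC_UNLOCK(p);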
Reviewed by: jhb, julian (previous version)
MFC after: 3 days
andrew [Sat, 17 Aug 2013 14:52:19 +0000 (14:52 +0000)]
Remove unused FPE code. It is not enabled anywhere; this is the only
file I can find containing FAST_FPE. It also appears it would not work,
as want_resched is not defined anywhere.
andrew [Sat, 17 Aug 2013 14:36:32 +0000 (14:36 +0000)]
Silence a warning that is incorrect on ARMv6 and later. In the smull,
umull, smlal, and umlal instructions, the output registers are allowed to
be the same as either input register, whereas on ARMv4 and ARMv5 they
could only be the same as the last input register.
hrs [Sat, 17 Aug 2013 07:12:52 +0000 (07:12 +0000)]
Unbreak rwhod(8):
- It did not work with the GENERIC kernel after r250603 because
'options PROCDESC' was required for pdfork(2). It now falls back to
fork(2) when that syscall is not available.
- Fix verify(). This function was broken in r250602 because the outermost
"()" was removed from the condition !(isalnum() || ispunct()), which
rejected hostnames containing "-", for example (see the sketch below).
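An illustrative reduction of the precedence bug (simplified; not the
actual rwhod source):

    /* Broken (r250602): '!' binds only to isalnum(), so '-' is rejected. */
    if (!isalnum(c) || ispunct(c))
            return (0);

    /* Fixed: the negation applies to the whole condition, as intended. */
    if (!(isalnum(c) || ispunct(c)))
            return (0);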
ian [Fri, 16 Aug 2013 23:05:34 +0000 (23:05 +0000)]
Handle command retries for commands originating at the mmc layer, and
ensure that all such commands have a non-zero retry count except for those
that are expected to fail (for example, because they are used to probe for
feature support).
While it is possible to pass a retry count down to the hardware driver in
the command request structure, no hardware driver currently implements any
retry logic. The hardware doesn't know much about the context of a single
request, so it makes more sense to handle retries at a layer that does.
This adds retry loops to the mmc_wait_for_cmd() and mmc_wait_for_app_cmd()
functions. These functions are the gateway from other code within mmc.c
to the hardware. App commands are a sequence of two commands and a retry
has to rerun both of them in order, so it needs its own retry loop.
Retry looping is specifically NOT implemented in mmc_wait_for_request()
because it is the gateway for children on the bus, and they have to
implement their own retry logic depending on what makes sense for them.
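A hedged sketch of the retry loop added to mmc_wait_for_cmd() (simplified;
exact structure fields are assumptions):

    static int
    mmc_wait_for_cmd(struct mmc_softc *sc, struct mmc_command *cmd,
        int retries)
    {
            struct mmc_request mreq;
            int err;

            do {
                    memset(&mreq, 0, sizeof(mreq));
                    memset(cmd->resp, 0, sizeof(cmd->resp));
                    cmd->retries = 0;       /* retries handled here, not below */
                    mreq.cmd = cmd;
                    err = mmc_wait_for_request(sc, &mreq);
            } while (err != MMC_ERR_NONE && retries-- > 0);

            return (err);
    }

mmc_wait_for_app_cmd() gets a similar loop, except each attempt reissues
both the CMD55 prefix and the command itself.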
jhb [Fri, 16 Aug 2013 21:13:55 +0000 (21:13 +0000)]
Add new mmap(2) flags to permit applications to request specific virtual
address alignment of mappings.
- MAP_ALIGNED(n) requests a mapping aligned on a boundary of (1 << n).
Requests where n >= the number of bits in a pointer, or where the
alignment is less than the size of a page, fail with EINVAL. This
matches the API provided by NetBSD (usage sketch after this list).
- MAP_ALIGNED_SUPER is a special case of MAP_ALIGNED. It can be used
to optimize the chances of using large pages. By default it will align
the mapping on a large page boundary (the system is free to choose any
large page size to align to that seems best for the mapping request).
However, if the object being mapped is already using large pages, then
it will align the virtual mapping to match the existing large pages in
the object instead.
- Internally, VMFS_ALIGNED_SPACE is now renamed to VMFS_SUPER_SPACE, and
VMFS_ALIGNED_SPACE(n) is repurposed for specifying a specific alignment.
MAP_ALIGNED(n) maps to using VMFS_ALIGNED_SPACE(n), while
MAP_ALIGNED_SUPER maps to VMFS_SUPER_SPACE.
- mmap() of a device object now uses VMFS_OPTIMAL_SPACE rather than
explicitly using VMFS_SUPER_SPACE. All device objects are forced to
use a specific color on creation, so VMFS_OPTIMAL_SPACE is effectively
equivalent.
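A usage sketch of the new flags (illustrative; the helper function is
hypothetical):

    #include <stddef.h>
    #include <sys/mman.h>

    /*
     * Request a 2MB-aligned anonymous mapping (1 << 21 = 2MB), e.g. to
     * improve the odds of superpage use; MAP_ALIGNED_SUPER could be
     * passed instead to let the system pick a large page size.
     */
    void *
    map_aligned_2m(size_t len)
    {
            return (mmap(NULL, len, PROT_READ | PROT_WRITE,
                MAP_ANON | MAP_PRIVATE | MAP_ALIGNED(21), -1, 0));
    }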
ian [Fri, 16 Aug 2013 20:32:56 +0000 (20:32 +0000)]
During card identification, run the bus at 400KHz, not the minimum
speed the bus claims to be capable of. The 400KHz speed is dictated
by the SD and MMC standards.
ian [Fri, 16 Aug 2013 19:44:49 +0000 (19:44 +0000)]
Add named constants for 8-bit bus support. The sdhci and mmc drivers
don't have support for this yet, but some low-level hardware is ready
for it when the higher layers catch up.
ian [Fri, 16 Aug 2013 19:40:00 +0000 (19:40 +0000)]
When the timeout clock is based on the SD clock, the timeout counter
has to be recalculated every time the SD clock frequency changes.
Also, tidy up the counter calculation... it makes no sense to calculate
a value one larger than the limit, then whine that it's too large and
truncate it to the limit. If the BROKEN_TIMEOUT quirk is set, don't
calculate the counter at all, just set it to the limit value.
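A hedged sketch of the tidied selection (SDHCI encodes the timeout as
2^(13 + div) clocks with div at most 14; names and shape are assumptions):

    static int
    sdhci_timeout_div(uint64_t timeout_ticks, int broken_quirk)
    {
            int div;

            if (broken_quirk)
                    return (14);    /* skip the math, use the limit */
            for (div = 0; div < 14; div++)
                    if ((1ULL << (13 + div)) >= timeout_ticks)
                            break;
            return (div);           /* clamped, no overshoot-and-whine */
    }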
sjg [Fri, 16 Aug 2013 16:26:23 +0000 (16:26 +0000)]
When we need to build using the in-tree make,
switch at the earliest opportunity.
In the case of fmake vs bmake, this helps ensure correct load handling.
ken [Fri, 16 Aug 2013 16:14:32 +0000 (16:14 +0000)]
Add unmapped I/O and larger I/O support to the sa(4) driver.
We now pay attention to the maxio field in the XPT_PATH_INQ CCB,
and if it is set, propagate it up to physio via the si_iosize_max
field in the cdev structure.
We also now pay attention to the PIM_UNMAPPED capability bit in the
XPT_PATH_INQ CCB, and set the new SI_UNMAPPED cdev flag when the
underlying SIM supports unmapped I/O.
scsi_sa.c: Add unmapped I/O support and propagate the SIM's
maximum I/O size up.
Adjust scsi_tape_read_write() in the same way that
scsi_read_write() was changed to support unmapped
I/O. We overload the readop parameter with bits
that tell us whether it's an unmapped I/O, in which
case we need to set the CAM_DATA_BIO CCB flag. This
change should be backwards compatible in source and
binary forms.
ken [Thu, 15 Aug 2013 22:52:39 +0000 (22:52 +0000)]
Change the way that unmapped I/O capability is advertised.
The previous method was to set the D_UNMAPPED_IO flag in the cdevsw
for the driver. The problem with this is that in many cases (e.g.
sa(4)) there may be some instances of the driver that can handle
unmapped I/O and some that can't. The isp(4) driver can handle
unmapped I/O, but the esp(4) driver currently cannot. The cdevsw
is shared among all driver instances.
So instead of setting a flag on the cdevsw, set a flag on the cdev.
This allows drivers to indicate support for unmapped I/O on a
per-instance basis.
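A hedged sketch of both sides of the new flag (simplified shapes; the
real changes are in the files noted below):

    /* In a driver instance that can handle unmapped I/O (attach path): */
    dev->si_flags |= SI_UNMAPPED;

    /* In kern_physio.c: consult the cdev, not the shared cdevsw. */
    int can_unmapped = (dev->si_flags & SI_UNMAPPED) != 0;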
sys/conf.h: Remove the D_UNMAPPED_IO cdevsw flag and replace it
with an SI_UNMAPPED cdev flag.
kern_physio.c: Look at the cdev SI_UNMAPPED flag to determine
whether or not a particular driver can handle
unmapped I/O.
geom_dev.c: Set the SI_UNMAPPED flag for all GEOM cdevs.
Since GEOM will create a temporary mapping when
needed, setting SI_UNMAPPED unconditionally will
work.
Remove the D_UNMAPPED_IO flag.
nvme_ns.c: Set the SI_UNMAPPED flag on cdevs created here
if NVME_UNMAPPED_BIO_SUPPORT is enabled.
vfs_aio.c: In aio_qphysio(), check the SI_UNMAPPED flag on a
cdev instead of the D_UNMAPPED_IO flag on the cdevsw.
sys/param.h: Bump __FreeBSD_version to 1000045 for the switch from
setting the D_UNMAPPED_IO flag in the cdevsw to setting
SI_UNMAPPED in the cdev.
cperciva [Thu, 15 Aug 2013 20:19:17 +0000 (20:19 +0000)]
Change the queue of locks in kern_rangelock.c from holding lock requests in
the order that they arrive, to holding
(a) granted write lock requests, followed by
(b) granted read lock requests, followed by
(c) ungranted requests, in order of arrival.
This changes the stopping condition for iterating through granted locks to
see if a new request can be granted: When considering a read lock request,
we can stop iterating as soon as we see a read lock request, since anything
after that point is either a granted read lock request or a request which
has not yet been granted. (For write lock requests, we must still compare
against all granted lock requests.)
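A hedged sketch of the grant check for a read request under the new
ordering (names loosely follow kern_rangelock.c; details are assumptions):

    TAILQ_FOREACH(e, &lock->rl_waiters, rl_q_link) {
            if (e == entry)
                    break;          /* reached our own request: grantable */
            if ((e->rl_q_flags & (RL_LOCK_GRANTED | RL_LOCK_READ)) ==
                (RL_LOCK_GRANTED | RL_LOCK_READ))
                    break;          /* first granted read: nothing after
                                       this point can conflict */
            if (e->rl_q_start < entry->rl_q_end &&
                entry->rl_q_start < e->rl_q_end)
                    return (0);     /* overlapping granted write */
    }
    return (1);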
For workloads with R parallel reads and W parallel writes, this improves
the time spent from O((R+W)^2) to O(W*(R+W)); i.e., heavy parallel-read
workloads become significantly more scalable.
No statistically significant change in buildworld time has been measured,
but synthetic tests of parallel 'dd > /dev/null' and 'openssl enc >/dev/null'
with the input file cached yield dramatic (up to 10x) improvement with high
(up to 128 processes) levels of parallelism.
ken [Thu, 15 Aug 2013 16:41:27 +0000 (16:41 +0000)]
Export the maxio field in the CAM XPT_PATH_INQ CCB in the isp(4)
driver.
This tells consumers up the stack the maximum I/O size that the
controller can handle.
The I/O size is bounded by the number of scatter/gather segments
the controller can handle and the page size. For an amd64 system,
it works out to around 5MB.
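For illustration (segment count assumed): with 4KB pages, a limit of
about 1280 scatter/gather segments gives 1280 x 4KB = 5MB per request.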
attilio [Thu, 15 Aug 2013 11:01:25 +0000 (11:01 +0000)]
On the recovery path for vm_page_alloc(), if a page had been requested
wired, unwind the wiring bits; otherwise we can end up freeing a page
that is considered wired.
markj [Thu, 15 Aug 2013 04:08:55 +0000 (04:08 +0000)]
Specify SDT probe argument types in the probe definition itself rather than
using SDT_PROBE_ARGTYPE(). This will make it easy to extend the SDT(9) API
to allow probes with dynamically-translated types.
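An illustrative before/after (macro signatures of that era are partly
assumed, per SDT(9)):

    /* Before: type-less definition plus separate argtype declarations. */
    SDT_PROBE_DEFINE(vfs, , , lookup, lookup);
    SDT_PROBE_ARGTYPE(vfs, , , lookup, 0, "struct vnode *");
    SDT_PROBE_ARGTYPE(vfs, , , lookup, 1, "char *");

    /* After: types given in the probe definition itself. */
    SDT_PROBE_DEFINE2(vfs, , , lookup, lookup,
        "struct vnode *", "char *");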
tuexen [Wed, 14 Aug 2013 21:51:32 +0000 (21:51 +0000)]
Don't send uninitialized memory (two instances of 4 bytes) in
every cookie on the wire. This bug was reported in
https://bugzilla.mozilla.org/show_bug.cgi?id=905080
rmacklem [Wed, 14 Aug 2013 21:11:26 +0000 (21:11 +0000)]
Fix several performance related issues in the new NFS server's
DRC for NFS over TCP.
- Increase the size of the hash tables.
- Create a separate mutex for each hash list of the TCP hash table.
- Single thread the code that deletes stale cache entries.
- Add a tunable called vfs.nfsd.tcphighwater, which can be increased
to allow the cache to grow larger, avoiding the overhead of frequent
scans to delete stale cache entries.
(The default value will result in frequent scans to delete stale cache
entries, analogous to what the pre-patched code does.)
- Add a tunable called vfs.nfsd.cachetcp that can be used to disable
DRC caching for NFS over TCP, since the old NFS server didn't use a DRC
for TCP.
The patch also adjusts the size of nfsrc_floodlevel dynamically, so that
it is always greater than vfs.nfsd.tcphighwater.
For UDP the algorithm remains the same as the pre-patched code, but the
tunable vfs.nfsd.udphighwater can be used to allow the cache to grow
larger and reduce the overhead caused by frequent scans for stale entries.
UDP also uses a larger hash table size than the pre-patched code.
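A hedged sketch of how such a tunable is typically exposed (declaration
shape follows common FreeBSD practice; the variable name is an assumption):

    SYSCTL_DECL(_vfs_nfsd);

    static u_int nfsrc_tcphighwater = 0;
    SYSCTL_UINT(_vfs_nfsd, OID_AUTO, tcphighwater, CTLFLAG_RW,
        &nfsrc_tcphighwater, 0,
        "High water mark for the number of TCP DRC entries");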
Reported by: wollman
Tested by: wollman (earlier version of patch)
Submitted by: ivoras (earlier patch)
Reviewed by: jhb (earlier version of patch)
MFC after: 1 month
jeff [Tue, 13 Aug 2013 22:40:43 +0000 (22:40 +0000)]
- Add a statically allocated memguard arena since it is needed very early
on.
- Pass the appropriate flags to vmem_xalloc() when allocating space for
the arena from kmem_arena.
jeff [Tue, 13 Aug 2013 21:56:16 +0000 (21:56 +0000)]
Improve pageout flow control to wake up more frequently and do less work
while maintaining better LRU of active pages.
- Change v_free_target to include the quantity previously represented by
v_cache_min so we don't need to add them together everywhere we use them.
- Add a pageout_wakeup_thresh that sets the free page count trigger for
waking the page daemon. Set this 10% above v_free_min so we wake up before
any phase transitions in vm users.
- Adjust down v_free_target now that we're willing to accept more pagedaemon
wakeups. This means we process fewer pages in one iteration as well,
leading to shorter lock hold times and less overall disruption.
- Eliminate vm_pageout_page_stats(). This was a minor variation on the
PQ_ACTIVE segment of the normal pageout daemon. Instead we now process
1 / vm_pageout_update_period of the active pages every second, so we
visit the whole active list every 60 seconds (worked example below).
Previously we would only maintain the active LRU when we were short on
pages, which meant it could be woefully out of date.
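Worked example (numbers illustrative): with vm_pageout_update_period at
its default of 60 and 600,000 pages on the active queue, the daemon scans
about 10,000 pages per second and completes a full pass every 60 seconds.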
peter [Tue, 13 Aug 2013 20:38:55 +0000 (20:38 +0000)]
vfork(2) was listed as deprecated in 1994 (r1573), and the false
reports of its impending demise were removed in 2009 (r199257).
However, in 1996 (r16117) system(3) was switched from vfork(2) to
fork(2) based partly on this. Switch back to vfork(2). This has a
dramatic effect in cases of extreme mmap use - such as excessive
abuse (500+) of shared libraries.
popen(3) has used vfork(2) for a while. vfork(2) isn't going anywhere.
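A minimal sketch of the vfork(2) pattern system(3) returns to (signal
blocking and error handling omitted; the helper name is hypothetical):

    #include <sys/wait.h>
    #include <paths.h>
    #include <unistd.h>

    int
    simple_system(const char *command)
    {
            pid_t pid;
            int status;

            switch (pid = vfork()) {
            case -1:
                    return (-1);
            case 0:
                    /* child: borrows the parent's address space until
                     * exec, so nothing is copied */
                    execl(_PATH_BSHELL, "sh", "-c", command, (char *)NULL);
                    _exit(127);     /* exec failed */
            }
            if (waitpid(pid, &status, 0) == -1)
                    return (-1);
            return (status);
    }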