dteske [Mon, 16 Dec 2013 19:37:15 +0000 (19:37 +0000)]
Improve default ZFS disk layout (tested):
+ For GPT, always provision zfs# partition after swap [for resizability]
+ For MBR, always use a boot pool to relialy place root vdevs at EOD
NB: Fixes edge-cases where MBR combination failed boot (e.g. swap-less)
+ Generalize boot pool logic so it can be used for any scheme (namely MBR)
+ Update existing comments and some whitespace fixes
+ Change some variable names to make reading/debugging the code easier
in zfs_create_boot() (namely prepend zroot_ or bootpool_ to property)
+ Because zroot vdevs are at EOD, no longer need to calculate partsize
(vdev consumes remaining space after allocating swap)
+ Optimize processing of disks -- no reason to loop over the disks 3-4
separate times when we can logically use a single loop to do everything
dteske [Mon, 16 Dec 2013 18:53:09 +0000 (18:53 +0000)]
Bug-fixes and debugging improvments:
+ De-obfuscate debugging to show actual values
+ Change graid(8) syntax; s/destroy/delete/ [destroy is not invalid syntax]
+ Log commands that were previously quiet
+ Added some new comemnts and updated some existing ones
+ Add missing local for `disk' used in zfs_create_boot()
+ Use $disks instead of multiply-expanding $* in zfs_create_boot()
+ Pedantically unset variable holding geli(8) passphrase after use
+ Pedantically add double-quotes around zpool names and zfs datasets
+ Fix quotation expansion for zpool_cache entries of loader.conf(5)
+ Some limited whitespace changes
dteske [Mon, 16 Dec 2013 15:50:59 +0000 (15:50 +0000)]
Add a fix for Long-standing problem with VMware. Described in below links:
https://communities.vmware.com/thread/107230
https://communities.vmware.com/docs/DOC-11677
Basically, ignore the ``function 62'' and ``function 63'' interpretations
of the left/right command key when we're in the lengthiest portion of the
installation (initiated by the `auto' module).
The net effect is that you can now (once you've started the installer from
the media) escape the VM without prematurely terminating the current action
due to spurious escape sequence.
bjk [Mon, 16 Dec 2013 01:58:12 +0000 (01:58 +0000)]
tzfile.5: catch up to r204333
The stdtime sources were moved from lib/libc to contrib/tzcode, and tzfile.h
is not an installed header, so the man page refers to its location in the
source tree.
The documentation could be more clear about the internal nature of the
header, but those changes should go through upstream tzcode.
marcel [Mon, 16 Dec 2013 00:50:14 +0000 (00:50 +0000)]
Properly drain the TTY when both revoke(2) and close(2) end up closing
the TTY. In such a case, ttydev_close() is called multiple times and
each time, t_revokecnt is incremented and cv_broadcast() is called for
both the t_outwait and t_inwait condition variables.
Let's say revoke(2) comes in first and gets to call tty_drain() from
ttydev_leave(). Let's say that the revoke comes from init(8) as the
result of running "shutdown -r now". Since shutdown prints various
messages to the console before announing that the machine will reboot
immediately, let's also say that the output queue is not empty and
that tty_drain() has something to do. Let's assume this all happens
on a 9600 baud serial console, so it takes a time to drain.
The shutdown command will exit(2) and as such will end up closing
stdout. Let's say this close will come in second, bump t_revokecnt
and call tty_wakeup(). This has tty_wait() return prematurely and
the next thing that will happen is that the thread doing revoke(2)
will flush the TTY. Since the drain wasn't complete, the flush will
effectively drop whatever is left in t_outq.
This change takes into account that tty_drain() will return ERESTART
due to the fact that t_revokecnt was bumped and in that case simply
call tty_drain() again. The thread in question is already performing
the close so it can safely finish draining the TTY before destroying
the TTY structure.
Now all messages from shutdown will be printed on the serial console.
pjd [Sun, 15 Dec 2013 23:09:05 +0000 (23:09 +0000)]
Make use of Casper's system.pwd and system.grp services when the -r option
is given to convert uids and gids to user names and group names even when
running in capability mode sandbox.
While here log on stderr when we successfully enter the sandbox.
Get rid of the msg_peek() function, which has a problem. If there was less
data in the socket buffer than requested by the caller, the function would busy
loop, as select(2) will always return immediately.
We can just receive nvlhdr now, because some time ago we splitted receive of
data from the receive of descriptors.
gjb [Sun, 15 Dec 2013 20:47:27 +0000 (20:47 +0000)]
Export 'REPOS_DIR' when the selected source medium for package
installation is cdrom. This enables bsdconfig(8) to make use
of the on-disc pkg(8) repository configuration, which fixes
package selection and installation from the dvd installer.
MFC after: 3 days
M-MFC-With: r259426
X-MFC-Before: -RC3
Sponsored by: The FreeBSD Foundation
jhibbits [Sun, 15 Dec 2013 18:07:25 +0000 (18:07 +0000)]
Save r3 before using it for the trap check, else we end up saving the new r3,
containing the trap instruction encoding (0x7c810808), and restoring it back
with the frame on return. This caused it to panic on my ppc32 machine, but
somehow my ppc64 machine overlooked it, because I was using such a simple
dtrace probe.
nwhitehorn [Sun, 15 Dec 2013 16:58:23 +0000 (16:58 +0000)]
Set max_lun to zero. This field is ignored unless we are manually probing
LUNs anyway, and we certainly don't want to probe 2^32 values by hand in
that case.
hrs [Sun, 15 Dec 2013 16:17:00 +0000 (16:17 +0000)]
Replace Sun RPC license for TI-RPC library with a 3-clause BSD license,
with the explicit permission of Sun Microsystems in 2009.
The code in question in this file was copied from lib/libc/rpc/pmap_getport.c.
alfred [Sun, 15 Dec 2013 07:07:13 +0000 (07:07 +0000)]
Defer start/stop port to workqueues.
We need to do this because the Linux compat layer uses sx(9) for
mutex, however the lagg code uses rmlocks and calls into the mellanox
driver. This causes deadlock due to sleeping while holding a rmlock.
Submitted by: Shahar Klein (shahark mellanox.com)
MFC After: 3 days.
nwhitehorn [Sat, 14 Dec 2013 22:07:40 +0000 (22:07 +0000)]
Widen lun_id_t to 64 bits. This is a follow-on to r257345 to let the kernel
support all valid SAM-5 LUN IDs. CAM_VERSION is bumped, as the CAM ABI
(though not API) is changed. No behavior is changed relative to r257345
except that LUNs with non-zero high 32 bits will no longer be ignored
during device enumeration for SIMs that have set PIM_EXTLUNS.
gavin [Sat, 14 Dec 2013 18:49:59 +0000 (18:49 +0000)]
Fix several panics when initialization of an ISA or PC-CARD device fails:
o Assign sc->an_dev in an_probe() (which isn't really a probe function in
the standard newbus sense) as we may need it for printing errors.
o Use device_printf() rather than if_printf() in an_reset() - this is
called from an_probe() long before the ifp structure is initialised
in an_attach().
o Initialize the ifp structure early in an_attach() as we use if_printf()
in cases where allocation of descriptors etc fails.
np [Sat, 14 Dec 2013 03:08:03 +0000 (03:08 +0000)]
Read card capabilities after firmware initialization, instead of setting
them up as part of firmware initialization (which the driver gets to do
only if it's the master driver).
Read the range of tids available for the ETHOFLD functionality if it's
enabled.
New is_ftid() and is_etid() functions to test whether a tid falls within
the range of filter tids or ETHOFLD tids respectively.
asomers [Fri, 13 Dec 2013 22:58:57 +0000 (22:58 +0000)]
sbin/devd/devd.cc
Promoting the SIGINFO handler's log message from LOG_INFO to
LOG_NOTICE, and promoting the "Processing event ..." message from
LOG_DEBUG to LOG_INFO. Setting the logfile to LOG_NOTICE with this
change will have the same result as setting it to LOG_INFO without
this change. Setting it to LOG_INFO with this change will include
the useful "Processing event ..." messages that were previously at
LOG_DEBUG, without including useless messages like "Pushing table".
The intent of this change is that one can log "Processing event ..."
without logging "Pushing table" and related messages that are sent
for every event. The number of lines actually logged is reduced by
about 75% by making this change and setting syslog to LOG_INFO vs
setting syslog to LOG_DEBUG.
etc/syslog.conf
Changing the recommended loglevel to notice instead of info.
asomers [Fri, 13 Dec 2013 21:49:41 +0000 (21:49 +0000)]
sbin/devd/devd.cc
Increase the size of devd's client socket's send buffer from the
default (8k) to 128k. This prevents clients from getting
POLLHUPped during event storms. For example, during zpool creation,
the kernel emits a resource.fs.zfs.statechange event for every vdev
in the pool. A 128k buffer is large enough to hold the statechange
events for a pool with nearly 800 drives.
bjk [Fri, 13 Dec 2013 03:09:29 +0000 (03:09 +0000)]
Apply patch from upstream Heimdal for encoding fix
RFC 4402 specifies the implementation of the gss_pseudo_random()
function for the krb5 mechanism (and the C bindings therein).
The implementation uses a PRF+ function that concatenates the output
of individual krb5 pseudo-random operations produced with a counter
and seed. The original implementation of this function in Heimdal
incorrectly encoded the counter as a little-endian integer, but the
RFC specifies the counter encoding as big-endian. The implementation
initializes the counter to zero, so the first block of output (16 octets,
for the modern AES enctypes 17 and 18) is unchanged. (RFC 4402 specifies
that the counter should begin at 1, but both existing implementations
begin with zero and it looks like the standard will be re-issued, with
test vectors, to begin at zero.)
This is upstream's commit f85652af868e64811f2b32b815d4198e7f9017f6,
from 13 October, 2013:
% Fix krb5's gss_pseudo_random() (n is big-endian)
%
% The first enctype RFC3961 prf output length's bytes are correct because
% the little- and big-endian representations of unsigned zero are the
% same. The second block of output was wrong because the counter was not
% being encoded as big-endian.
%
% This change could break applications. But those applications would not
% have been interoperating with other implementations anyways (in
% particular: MIT's).
Approved by: hrs (mentor, src committer)
MFC after: 3 days
dteske [Thu, 12 Dec 2013 20:47:18 +0000 (20:47 +0000)]
I caught the following snippet at the end of my /var/log/bsdinstall_log:
===
DEBUG: Running installation step: services
local: Not in a function
/usr/libexec/bsdinstall/services: cannot create : Read-only file system
/usr/libexec/bsdinstall/services: /tmp/bsdinstall/etc/rc.conf.services: \
Permission denied
===
The `local: Not in a function' is obvious, and was introduced by myself in
SVN revision 256348.
The latter two are caused by the attempt to use "\" to continue the line
after using the ">>" redirect. This appears to attempt to write a file with
the name " " in the current directory and subsequently attempts to execute
the file that was originally intended for writing (which is not executable;
hence the `Permission denied'). That was introduced in SVN r228192 about
2 years ago, apparently unnoticed until I started going over the debug
outputs very carefully.
bdrewery [Thu, 12 Dec 2013 17:59:09 +0000 (17:59 +0000)]
Fix multi-repository support by properly respecting 'enabled' flag.
This will read the REPOS_DIR env/config setting (default is /etc/pkg
and /usr/local/etc/pkg/repos) and use the last enabled repository.
This can be changed in the environment using a comma-separated list,
or in /usr/local/etc/pkg.conf with JSON array syntax of:
REPOS_DIR: ["/etc/pkg", "/usr/local/etc/pkg/repos"]
mav [Thu, 12 Dec 2013 11:05:48 +0000 (11:05 +0000)]
Fix long known bug with handling device aliases residing not in devfs root.
Historically creation of device aliases created symbolic links using only
name of target device as a link target, not considering current directory.
Fix that by adding number of "../" chunks to the terget device name,
required to get out of the current directory to devfs root first.
asomers [Thu, 12 Dec 2013 00:27:22 +0000 (00:27 +0000)]
sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev_geom.c
When a da or ada device dissappears, outstanding IOs fail with
ENXIO, not EIO. The check for EIO was probably copied from Illumos,
where that is indeed the correct errno.
Without this change, pulling a busy drive from a zpool would usually
turn it into UNAVAIL, even though pulling an idle drive would turn
it into REMOVED. With this change, it is REMOVED every time.
Also, vdev_geom_io_intr shouldn't do zfs_post_remove, because that
results in devd getting two resource.fs.zfs.removed events. The
comment said that the event had to be sent directly instead of
through the async removal thread because "the DE engine is using
this information to discard prevoius I/O errors". However, the fact
that vdev_geom_io_intr was never actually sending the events until
now, and that vdev_geom_orphan never sent them at all, and that
vdev_geom_orphan usually gets called about 2 seconds after the
actual removal, means that FreeBSD's userland can cope with a late
event just fine.
mav [Wed, 11 Dec 2013 21:48:04 +0000 (21:48 +0000)]
Create own free list for each of the first 32 possible allocation sizes.
In case of 4K allocation quantum that means for allocations up to 128K.
With growth of memory fragmentation these lists may grow to quite a large
sizes (tenths and hundreds of thousands items). Having in one list items
of different sizes in worst case may require full linear list traversal,
that may be very expensive. Having lists for items of single size means
that unless user specify some alignment or border requirements (that are
very rare cases) first item found on the list should satisfy the request.
While running SPEC NFS benchmark on top of ZFS on 24-core machine with
84GB RAM this change reduces CPU time spent in vmem_xalloc() from 8%
and lock congestion spinning around it from 20% to invisible levels.
And that all is by the cost of just 26 more pointers per vmem instance.
If at some point our kernel will start to actively use KVA allocations
with odd sizes above 128K, something may need to be done to bigger lists
also.
jhb [Wed, 11 Dec 2013 21:21:03 +0000 (21:21 +0000)]
- Use <x86/mptable.h> instead of duplicating its definitions.
- Switch to mmaping the table from RAM instead of reading it out of
/dev/mem via read(2).
jhb [Wed, 11 Dec 2013 21:19:04 +0000 (21:19 +0000)]
Use fixed-width types for all fields in MP Table structures and pack
all the structures. While here, move a helper struct only used in
the kernel parser out of this header since it is not part of the MP
specification itself.
gnn [Wed, 11 Dec 2013 17:18:10 +0000 (17:18 +0000)]
Fix a panic when booting with kernels that have FREEBBSD_COMPAT
4, 5, 6 or 43 by only thunking the data parameter for old ioctls
compatability ioctls instead of doing it for all of them.
imp [Wed, 11 Dec 2013 05:32:29 +0000 (05:32 +0000)]
Fix one race and one fence post error. When the TX buffer was
completely full, we'd not complete any of the mbufs due to the fence
post error (this creates a large leak). When this is fixed, we still
leak, but at a much smaller rate due to a race between ateintr and
atestart_locked as well as an asymmetry where atestart_locked is
called from elsewhere. Ensure that we free in-flight packets that
have completed there as well. Also remove needless check for NULL on
mb, checked earlier in the loop and simplify a redundant if.
jmmv [Wed, 11 Dec 2013 04:09:17 +0000 (04:09 +0000)]
Migrate tools/regression/bin/ tests to the new layout.
This change is a proof of concept on how to easily integrate existing
tests from the tools/regression/ hierarchy into the /usr/tests/ test
suite and on how to adapt them to the new layout for src.
To achieve these goals, this change:
- Moves tests from tools/regression/bin/<tool>/ to bin/<tool>/tests/.
- Renames the previous regress.sh files to legacy_test.sh.
- Adds Makefiles to build and install the tests and all their supporting
data files into /usr/tests/bin/.
- Plugs the legacy_test test programs into the test suite using the new
TAP backend for Kyua (appearing in 0.8) so that the code of the test
programs does not have to change.
- Registers the new directories in the BSD.test.dist mtree file.
jmmv [Wed, 11 Dec 2013 03:41:07 +0000 (03:41 +0000)]
Make bsd.progs.mk work in directories with SCRIPTS but no PROGS.
This change fixes some cases where bsd.progs.mk would fail to handle
directories with SCRIPTS but no PROGS. In particular, "install" did
not handle such scripts nor dependent files when bsd.subdir.mk was
added to the mix.
jmmv [Wed, 11 Dec 2013 03:39:50 +0000 (03:39 +0000)]
Add tap.test.mk.
This file provides support to build test programs that comply with the
Test Anything Protocol. Its main goal is to support the painless
integration of existing tests from tools/regression/ into the Kyua-based
test suite.
neel [Tue, 10 Dec 2013 22:56:51 +0000 (22:56 +0000)]
Fix x2apic support in bhyve.
When the guest is bringing up the APs in the x2APIC mode a write to the
ICR register will now trigger a return to userspace with an exitcode of
VM_EXITCODE_SPINUP_AP. This gets SMP guests working again with x2APIC.
Change the vlapic timer lock to be a spinlock because the vlapic can be
accessed from within a critical section (vm run loop) when guest is using
x2apic mode.
kib [Tue, 10 Dec 2013 22:33:02 +0000 (22:33 +0000)]
The opt_*.h headers must be included before any system header, except
sys/cdefs.h. In particular, in case of COMPAT_43, param.h includes
sys/types.h, which includes sys/select.h, which includes
sys/_sigset.h. The _sigset.h customizes the provided definions based
on COMPAT_43, eliminating osigset_t if symbol is not defined. The
sys/proc.h is included after opt_compat.h and needs osigset_t.
kib [Tue, 10 Dec 2013 21:15:18 +0000 (21:15 +0000)]
Fix detection of EOF in kern_physio(). If bio_length was clipped by
the excess code in g_io_check(), bio_resid is also truncated by
g_io_deliver(). As result, bufdonebio() assigns truncated value to
the buffer b_resid field.
Use the residual bio_completed to calculate buffer b_resid from
b_bcount in bufdonebio(), instead of bio_resid, calculated from
bio_length in g_io_deliver().
The issue is seemingly caused by the code rearrange into g_io_check(),
which is not present in stable/10. The change still looks as the
useful change to have in 10 nevertheless.
Reported by: Stefan Hegnauer <stefan.hegnauer@gmx.ch>
Tested by: pho, Stefan Hegnauer <stefan.hegnauer@gmx.ch>
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
kib [Tue, 10 Dec 2013 20:52:31 +0000 (20:52 +0000)]
Only assert the length of the passed bio in the mdstart_vnode() when
the bio is unmapped, so we must map the bio pages into pbuf. This
works around the geom classes which do not follow the MAXPHYS limit on
the i/o size, since such classes do not know about unmapped bios
either.
Reported by: Paolo Pinto <paolo.pinto@netasq.com>
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
mav [Tue, 10 Dec 2013 20:25:43 +0000 (20:25 +0000)]
Do not DELAY() for P-state transition unless we want to see the result.
Intel manual says: "If a transition is already in progress, transition to
a new value will subsequently take effect. Reads of IA32_PERF_CTL determine
the last targeted operating point." So seems it should be fine to just
trigger wanted transition and go. Linux does the same.
trociny [Tue, 10 Dec 2013 20:05:07 +0000 (20:05 +0000)]
In remote_send_thread, if sending a request fails don't take the
request back from the receive queue -- it might already be processed
by remote_recv_thread, which lead to crashes like below:
(primary) Unable to receive reply header: Connection reset by peer.
(primary) Unable to send request (Connection reset by peer):
WRITE(954662912, 131072).
(primary) Disconnected from kopusha:7772.
(primary) Increasing localcnt to 1.
(primary) Assertion failed: (old > 0), function refcnt_release,
file refcnt.h, line 62.
Taking the request back was not necessary (it would properly be
processed by the remote_recv_thread) and only complicated things.