royger [Thu, 15 Jan 2015 16:27:20 +0000 (16:27 +0000)]
loader: implement multiboot support for Xen Dom0
Implement a subset of the multiboot specification in order to boot Xen
and a FreeBSD Dom0 from the FreeBSD bootloader. This multiboot
implementation is tailored to boot Xen and FreeBSD Dom0, and it will
most surely fail to boot any other multiboot compilant kernel.
In order to detect and boot the Xen microkernel, two new file formats
are added to the bootloader, multiboot and multiboot_obj. Multiboot
support must be tested before regular ELF support, since Xen is a
multiboot kernel that also uses ELF. After a multiboot kernel is
detected, all the other loaded kernels/modules are parsed by the
multiboot_obj format.
The layout of the loaded objects in memory is the following; first the
Xen kernel is loaded as a 32bit ELF into memory (Xen will switch to
long mode by itself), after that the FreeBSD kernel is loaded as a RAW
file (Xen will parse and load it using it's internal ELF loader), and
finally the metadata and the modules are loaded using the native
FreeBSD way. After everything is loaded we jump into Xen's entry point
using a small trampoline. The order of the multiboot modules passed to
Xen is the following, the first module is the RAW FreeBSD kernel, and
the second module is the metadata and the FreeBSD modules.
Since Xen will relocate the memory position of the second
multiboot module (the one that contains the metadata and native
FreeBSD modules), we need to stash the original modulep address inside
of the metadata itself in order to recalculate its position once
booted. This also means the metadata must come before the loaded
modules, so after loading the FreeBSD kernel a portion of memory is
reserved in order to place the metadata before booting.
In order to tell the loader to boot Xen and then the FreeBSD kernel the
following has to be added to the /boot/loader.conf file:
The first argument contains the command line that will be passed to the Xen
kernel, while the second argument is the path to the Xen kernel itself. This
can also be done manually from the loader command line, by for example
typing the following set of commands:
OK unload
OK load /boot/xen dom0_mem=1024M dom0_max_vcpus=2 dom0pvh=1 console=com1,vga
OK load kernel
OK load zfs
OK load if_tap
OK load ...
OK boot
hselasky [Thu, 15 Jan 2015 15:32:30 +0000 (15:32 +0000)]
Major callout subsystem cleanup and rewrite:
- Close a migration race where callout_reset() failed to set the
CALLOUT_ACTIVE flag.
- Callout callback functions are now allowed to be protected by
spinlocks.
- Switching the callout CPU number cannot always be done on a
per-callout basis. See the updated timeout(9) manual page for more
information.
- The timeout(9) manual page has been updated to reflect how all the
functions inside the callout API are working. The manual page has
been made function oriented to make it easier to deduce how each of
the functions making up the callout API are working without having
to first read the whole manual page. Group all functions into a
handful of sections which should give a quick top-level overview
when the different functions should be used.
- The CALLOUT_SHAREDLOCK flag and its functionality has been removed
to reduce the complexity in the callout code and to avoid problems
about atomically stopping callouts via callout_stop(). If someone
needs it, it can be re-added. From my quick grep there are no
CALLOUT_SHAREDLOCK clients in the kernel.
- A new callout API function named "callout_drain_async()" has been
added. See the updated timeout(9) manual page for a complete
description.
- Update the callout clients in the "kern/" folder to use the callout
API properly, like cv_timedwait(). Previously there was some custom
sleepqueue code in the callout subsystem, which has been removed,
because we now allow callouts to be protected by spinlocks. This
allows us to tear down the callout like done with regular mutexes,
and a "td_slpmutex" has been added to "struct thread" to atomically
teardown the "td_slpcallout". Further the "TDF_TIMOFAIL" and
"SWT_SLEEPQTIMO" states can now be completely removed. Currently
they are marked as available and will be cleaned up in a follow up
commit.
- Bump the __FreeBSD_version to indicate kernel modules need
recompilation.
- There has been several reports that this patch "seems to squash a
serious bug leading to a callout timeout and panic".
Kernel build testing: all architectures were built
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D1438
Sponsored by: Mellanox Technologies
Reviewed by: jhb, adrian, sbruno and emaste
bz [Thu, 15 Jan 2015 02:22:52 +0000 (02:22 +0000)]
Fix cpsw(4) after r277203 which folded 'struct m_hdr' into 'struct mbuf'.
While in theory this should have been a transparent change (and was for all
other drivers), cpsw(4) never used the proper accessor macros in a few
places but spelt the indirect m_hdr.mh_* out itself. Convert those to
use m_len and m_data and unbreak the driver build.
imp [Thu, 15 Jan 2015 00:46:30 +0000 (00:46 +0000)]
Reserve and ignore the a new module metadata type MDT_PNP_INFO for
associating an optional PNP hint table with this module. In the
future, when these are added, these changes will silently ignore the
new type they would otherwise warn about. It will always be safe to
ignore this data. Get this into the builds today for some future
proofing.
imp [Thu, 15 Jan 2015 00:42:06 +0000 (00:42 +0000)]
New MINIMAL kernel config. The goal with this configuration is to
only compile in those options in GENERIC that cannot be loaded as
modules. ufs is still included because many of its options aren't
present in the kernel module. There's some other exceptions documented
in the file. This is part of some work to get more things
automatically loading in the hopes of obsoleting GENERIC one day.
rwatson [Wed, 14 Jan 2015 23:44:00 +0000 (23:44 +0000)]
In order to support ongoing work to implement variable-size mbufs, and
more generally make it easier to extend 'struct mbuf in the future', make
a number of changes to the data structure:
- As we anticipate embedding mbufs headers within variable-size regions of
memory in the future, change the definitions of byte arrays embedded in
mbufs to be of size [0] rather than [MLEN] and [MHLEN]. In fact, the
cxgbe driver already uses 'struct mbuf' on the front of other storage
sizes, but we would like the global mbuf allocator do be able to do this
as well.
- Fold 'struct m_hdr' into 'struct mbuf' itself, eliminating a set of
macros that aliased 'mh_foo' field names to 'm_foo' names such as
'm_next'. These present a particular problem as we would like to add
new mbuf-header fields -- e.g., 'm_size' -- that, if similarly named via
macros, would introduce collisions with many other variable names in the
kernel.
- Rename 'struct m_ext' to 'struct struct_m_ext' so that we can add
compile-time assertions without bumping into the still-extant 'm_ext'
macro.
- Remove the MSIZE compile-time assertion for 'struct mbuf', but add new
assertions for alignment of embedded data arrays (64-bit alignment even
on 32-bit platforms), and for the sizes the mbuf header, packet header,
and m_ext structure.
- Document that these assertions exist in comments in mbuf.h.
This change is not intended to cause (non-trivial) behavioural
differences, but is a precursor to further mbuf-allocator work.
dim [Wed, 14 Jan 2015 22:37:11 +0000 (22:37 +0000)]
Remove the <netinet/ip_compat.h> include from one of the newly added
sanitizer sources. It is apparently unnecessary, and causes trouble for
people using WITHOUT_IPFILTER.
Reported by: Pawel Biernacki <pawel.biernacki@gmail.com>, Kurt Lidl <lidl@pix.net>
hselasky [Wed, 14 Jan 2015 22:07:13 +0000 (22:07 +0000)]
Avoid race with "dev_rel()" when using the recently added
"delist_dev()" function. Make sure the character device structure
doesn't go away until the end of the "destroy_dev()" function due to
concurrently running cleanup code inside "devfs_populate()".
hselasky [Wed, 14 Jan 2015 14:04:29 +0000 (14:04 +0000)]
Add a kernel function to delist our kernel character devices, so that
the device name can be re-used right away in case we are destroying
the character devices in the background.
ed [Wed, 14 Jan 2015 13:03:03 +0000 (13:03 +0000)]
Make sure CAP_BINDAT and CAP_CONNECTAT are part of CAP_ALL0.
This makes sure that file descriptors of opened directories will
actually get these capabilities. Without this change, bindat() and
connectat() don't seem to work for me.
rrs [Wed, 14 Jan 2015 12:46:58 +0000 (12:46 +0000)]
Update the hwpmc driver to have the new type HASWELL_XEON. Also
go back through HASWELL, IVY_BRIDGE, IVY_BRIDGE_XEON and SANDY_BRIDGE
to straighten out all the missing PMCs. We also add a new pmc tool
pmcstudy, this allows one to run the various formulas from
the documents "Using Intel Vtune Amplifier XE on XXX Generation platforms" for
IB/SB and Haswell. The tool also allows one to postulate your own
formulas with any of the various PMC's. At some point I will enahance
this to work with Brendan Gregg's flame-graphs so we can flamegraph
various PMC interactions. Note the manual page also needs some
work (lots of work) but gnn has committed to help me with that ;-)
Reviewed by: gnn
MFC after:1 month
Sponsored by: Netflix Inc.
mav [Wed, 14 Jan 2015 09:39:57 +0000 (09:39 +0000)]
Reimplement TRIM throttling added in r248577.
Previous throttling implementation approached problem from the wrong side.
It significantly limited useful delaying of TRIM requests and aggregation
potential, while not so much controlled TRIM burstiness under heavy load.
With this change random 4K write benchmarks (probably the worst case for
TRIM) show me IOPS increase by 20%, average latency reduction by 30%, peak
TRIM bursts reduction by 3 times and same peak TRIM map size (memory usage).
Also the new logic does not force map size down so heavily, really allowing
to keep deleted data for 32 TXG or 30 seconds under moderate load. It was
practically impossible with old throttling logic, which pushed map down to
only 64 segments.
imp [Wed, 14 Jan 2015 05:41:33 +0000 (05:41 +0000)]
Various interrelated fixes to make suspend / resume work better. We now
can suspend / resume and unload / load cbb and cardbus without errors
on my Lenovo T400, which wasn't possible before. Cards suspending
and resuming in the CardBus slot not yet tested.
o Enable memory cycles to the bridge early (as part of the new
cbb_pci_bridge_init). This fixes the Bad VCC errors which were
caused by the code accessing the device registers with this
cleared. The suspend / resume process clears it.
o Refactor suspend / resume into bus specific code (though the ISA
code is just stubbed). This isn't strictly necessary, but makes
the initializaiton code more uniform and should be more bullet
proof in the face of variant behavior among cardbus bridges.
o Fixup comments in the power-up sequence to reflect reality. These
comments were written for one regime of power-up, but not updated
as things were revised.
o Add a paranoid small delay (100ms) to cover noisy cards powering
down.
o Fix some debugging prints to be easier to grep from dmesg.
imp [Wed, 14 Jan 2015 05:41:28 +0000 (05:41 +0000)]
On x86 force NEW_PCIB, since that's the default. While this option
would be picked up for kernel builds, it isn't picked up for
old-fashioned builds. Without this option, PCI bus numbers are busted
for modules build iteratively.
neel [Tue, 13 Jan 2015 22:00:47 +0000 (22:00 +0000)]
'struct vm_exception' was intended to be used only as the collateral for the
VM_INJECT_EXCEPTION ioctl. However it morphed into other uses like keeping
track pending exceptions for a vcpu. This in turn causes confusion because
some fields in 'struct vm_exception' like 'vcpuid' make sense only in the
ioctl context. It also makes it harder to add or remove structure fields.
Fix this by using 'struct vm_exception' only to communicate information
from userspace to vmm.ko when injecting an exception.
Also, add a field 'restart_instruction' to 'struct vm_exception'. This
field is set to '1' for exceptions where the faulting instruction is
restarted after the exception is handled.
dim [Tue, 13 Jan 2015 20:37:57 +0000 (20:37 +0000)]
Since the merge of file 5.21 in r276415 and r276416, stable/9 and
stable/10 cannot be built from FreeBSD 8.x. This is because the
build-tools stage requires libmagic, but lib/libmagic/config.h was
generated on head, and it now enables using the xlocale.h APIs, which
are not supported on 8.x (and on 9.x before __FreeBSD_version 900506).
See also the start of this thread on -stable:
https://lists.freebsd.org/pipermail/freebsd-stable/2015-January/081521.html
To fix this, conditionalize the use of xlocale.h APIs to make
bootstrapping from older FreeBSD versions work correctly.
Reviewed by: delphij
MFC after: 3 days
Differential Revision: https://reviews.freebsd.org/D1518
dim [Tue, 13 Jan 2015 19:54:47 +0000 (19:54 +0000)]
Connect libclang_rt to the build, for specific architectures. This
contains the libraries for Address Sanitizer (asan), Undefined Behavior
Sanitizer (ubsan) and Profile Guided Optimization.
ASan is a fast memory error detector. It can detect the following types
of bugs:
Out-of-bounds accesses to heap, stack and globals
Use-after-free
Use-after-return (to some extent)
Double-free, invalid free
Memory leaks (experimental)
Typical slowdown introduced by AddressSanitizer is 2x.
UBSan is a fast and compatible undefined behavior checker. It enables a
number of undefined behavior checks that have small runtime cost and no
impact on address space layout or ABI.
PLEASE NOTE: the sanitizers still have some rough edges on FreeBSD,
particularly on i386. These will hopefully be smoothed out in the
coming time.
hselasky [Tue, 13 Jan 2015 16:37:43 +0000 (16:37 +0000)]
Resolve a special case deadlock: When two or more threads are
simultaneously detaching kernel drivers on the same USB device we can
get stuck in the "usb_wait_pending_ref_locked()" function because the
conditions needed for allowing detach are not met. The "destroy_dev()"
function waits for all system calls involving the given character
device to return. Character device system calls may lock the USB
enumeration lock, which is also held when "destroy_dev()" is
called. This can sometimes lead to a deadlock not noticed by
WITNESS. The current solution is to ensure the calling thread is the
only one holding the USB enumeration lock and prevent other threads
from getting refs while a USB device detach is ongoing. This turned
out not to be sufficient. To solve this deadlock we could use
"destroy_dev_sched()" to schedule the device destruction in the
background, but then we don't know when it is safe to free() the
private data of the character device. Instead a callback function is
executed by the USB explore process to kill off any leftover USB
character devices synchronously after the USB device explore code is
finished and the USB enumeration lock is no longer locked. This makes
porting easier and also ensures us that character devices must
eventually go away after a USB device detach.
While at it ensure that "flag_iserror" is only written when "priv_mtx"
is locked, which is protecting it.
hselasky [Tue, 13 Jan 2015 13:32:18 +0000 (13:32 +0000)]
Don't use POLLNVAL as a return value from the client side poll
function. Many existing clients don't understand POLLNVAL and instead
relies on an error code from the read(), write() or ioctl() system
call. Also make sure we wakeup any client pollers before the cuse
server is closing, so they don't wait forever for an event.
delphij [Tue, 13 Jan 2015 05:32:51 +0000 (05:32 +0000)]
Use the common codepath to handle SIOCGIFADDR.
Before this change, the current code handles SIOCGIFADDR the same
way with SIOCSIFADDR, which involves full arp_ifinit, et al. They
should be unnecessary for SIOCGIFADDR case.
imp [Tue, 13 Jan 2015 00:20:35 +0000 (00:20 +0000)]
Explain a bit of tricky code dealing with trims and how it prevents
starvation. These side effects aren't obvious without extremely
careful study, and are important to do just so.
zbb [Tue, 13 Jan 2015 00:00:09 +0000 (00:00 +0000)]
Introduce ofw_bus_reg_to_rl() to replace part of common bus code
Instead of reusing the same reg parsing code, create one, common function
that puts reg contents to the resource list. Address cells and size cells
are passed rather than acquired here so that any bus can have different
default values.
Obtained from: Semihalf
Reviewed by: andrew, ian, nwhitehorn
Sponsored by: The FreeBSD Foundation
glebius [Mon, 12 Jan 2015 22:27:38 +0000 (22:27 +0000)]
In miibus(4) drivers provide functions that allow to get NIC
driver name and NIC driver softc via the device(9) tree,
instead of going dirty through the ifnet(9) layer.
jfv [Mon, 12 Jan 2015 18:43:34 +0000 (18:43 +0000)]
Intel I40E driver updates:
if_ixl to version 1.3.0, if_ixlv to version 1.2.0
- Major change in both drivers is to add RSS support
- In ixl fix some interface speed related issues, dual
speed was not changing correctly, KR/X media was not
displaying correctly (this has a workaround until a
more robust media handling is in place)
- Add a warning when using Dell NPAR and the speed is
less than 10G
- Wrap a queue hung message in IXL_DEBUG, as it is non-fatal,
and without tuning can display excessively
emaste [Mon, 12 Jan 2015 18:13:38 +0000 (18:13 +0000)]
Remove duplicate copies of trivial getcontextx.c
Only i386 and amd64 provide a non-trivial __getcontextx(). Use a common
trivial implementation in gen/ for other architectures, rather than
copying the file to each MD subdirectory.
Reviewed by: kib
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D1472
glebius [Mon, 12 Jan 2015 18:06:22 +0000 (18:06 +0000)]
Do not go one layer down to check ifqueue length. First, not all drivers
use ifqueue at all. Second, there is no point in this lockless check.
Either positive or negative result of the check could be incorrect after
a tick.
glebius [Mon, 12 Jan 2015 14:52:43 +0000 (14:52 +0000)]
Do not go one layer down to check ifqueue length. First, not all drivers
use ifqueue at all. Second, there is no point in this lockless check.
Either positive or negative result of the check could be incorrect after
a tick.
glebius [Mon, 12 Jan 2015 09:48:45 +0000 (09:48 +0000)]
Remove the support for NGM_CISCO_GET_IPADDR message from ng_iface(4). The
legitimacy of removal is proved by the fact that implementation contained
a critical bug: the response allocated was sizeof(pointer), while should
had been 2*sizeof(struct ng_cisco_ipaddr). The reason for ng_iface(4) to
support ng_cisco(4) message isn't explained anywhere, and code comes from
original Whistle import.
glebius [Mon, 12 Jan 2015 09:41:12 +0000 (09:41 +0000)]
Remove incorrect layering violating code that:
a) assumed that ifqueue length is measured in bytes, instead of packets
b) assumed that any interface has working ifqueue
c) incremented global counter instead of ifi_oqdrops
hiren [Mon, 12 Jan 2015 08:33:04 +0000 (08:33 +0000)]
DCTCP (Data Center TCP) implementation.
DCTCP congestion control algorithm aims to maximise throughput and minimise
latency in data center networks by utilising the proportion of Explicit
Congestion Notification (ECN) marked packets received from capable hardware as a
congestion signal.
Highlights:
Implemented as a mod_cc(4) module.
ECN (Explicit congestion notification) processing is done differently from
RFC3168.
Takes one-sided DCTCP into consideration where only one of the sides is using
DCTCP and other is using standard ECN.
IETF draft: http://tools.ietf.org/html/draft-bensley-tcpm-dctcp-00
Thesis report by Midori Kato: https://eggert.org/students/kato-thesis.pdf
Submitted by: Midori Kato <katoon@sfc.wide.ad.jp> and
Lars Eggert <lars@netapp.com>
with help and modifications from
hiren
Differential Revision: https://reviews.freebsd.org/D604
Reviewed by: gnn
kib [Mon, 12 Jan 2015 07:48:22 +0000 (07:48 +0000)]
Fix several issues with /dev/mem and /dev/kmem devices on amd64.
For /dev/mem, when requested physical address is not accessible by the
direct map, do temporal remaping with the caching attribute
'uncached'. Limit the accessible addresses by MAXPHYADDR, since the
architecture disallowes writing non-zero into reserved bits of ptes
(or setting garbage into NX).
For /dev/kmem, only access existing kernel mappings for direct map
region. For all other addresses, obtain a physical address of the
mapping and fall back to the /dev/mem mechanism. This ensures that
/dev/kmem i/o does not fault even if the accessed region is changed in
parallel, by using either direct map or temporal mapping.
For both devices, operate on one page by iteration. Do not return
error if any bytes were moved around, return the (partial) bytes count
to userspace.
Reviewed by: alc
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
yongari [Mon, 12 Jan 2015 07:43:19 +0000 (07:43 +0000)]
Receive filter configuration is done in nge_rxfilter(). Remove
unnecessary filter configuration code in nge_init_locked().
While I'm here add a check for driver running state for multicast
filter handling. Also remove unnecessary assignment to error
variable since it is cleared in the function entry.
kib [Mon, 12 Jan 2015 07:36:25 +0000 (07:36 +0000)]
For x86, read MAXPHYADDR, defined in SDM vol 3 4.1.4 Enumeration of Paging
Features by CPUID as CPUID.80000008H:EAX[7:0], into variable cpu_maxphyaddr.
Reviewed by: alc
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
hselasky [Mon, 12 Jan 2015 06:34:23 +0000 (06:34 +0000)]
Increase the maximum number of dynamic USB quirks. USB memory stick
devices which don't support the synchronize cache SCSI command are
likely to also not support the prevent-allow medium removal SCSI
command.
loos [Mon, 12 Jan 2015 03:23:16 +0000 (03:23 +0000)]
Add support to turn off Beaglebone with poweroff(8) or shutdown(8) -p.
To cut off the power we need to start the shutdown sequence by writing
the OFF bit on PMIC.
Once the PMIC is programmed the SoC needs to toggle the PMIC_PWR_ENABLE
pin when it is ready for the PMIC to cut off the power. This is done by
triggering the ALARM2 interrupt on SoC RTC.
The RTC driver only works in power management mode which means it won't
provide any kind of time keeping functionality. It only implements a way
to trigger the ALARM2 interrupt when requested.
ian [Mon, 12 Jan 2015 02:42:33 +0000 (02:42 +0000)]
Handle dma mappings with more than one segment for rpi sdhci.
The driver inherently does dma in 512 byte chunks, but it's possible that
such a buffer can span two physically discontiguous pages (such as when
a userland program does IO on the raw /dev/mmcsdN devices). Now the driver
can handle a buffer that's split across two pages.
It could in theory handle any number of segments now, but as long as IO is
being done in 512 byte blocks it will never need more than two.