gjb [Fri, 26 May 2017 19:02:46 +0000 (19:02 +0000)]
MFC r314935 (thompsa):
Change ec2.conf to use the pkg tool from a chroot rather than trying
to bootstrap it and failing on the livecd read-only filesystem.
jhb [Fri, 26 May 2017 17:11:27 +0000 (17:11 +0000)]
MFC 315335,315336,315496,315497,315500,315502,315504,315509,315523,315524,
315525: Decode more system call arguments in truss.
315335:
Remove duplicate argument from linux_stat64() decoding.
315336:
Automate the handling of QUAD_ALIGN and QUAD_SLOTS.
Previously, the offset in a system call description specified the
array index of the start of a system call argument. For most system
call arguments this was the same as the index of the argument in the
function signature. 64-bit arguments (off_t and id_t values) passed
on 32-bit platforms use two slots in the array however. This was
handled by adding (QUAD_SLOTS - 1) to the slot indices of any
subsequent arguments after a 64-bit argument (though written as "{
Quad, 1 }, { Int, 1 + QUAD_SLOTS }" rather than "{ Quad, 1 }, { Int, 2
+ QUAD_SLOTS - 1 }"). If a system call contained multiple 64-bit
arguments (such as posix_fadvise()), then additional arguments would
need to use 'QUAD_SLOTS * 2' but remember to subtract 2 from the
initial number, etc. In addition, 32-bit powerpc requires 64-bit
arguments to be 64-bit aligned, so if the effective index in the array
of a 64-bit argument is odd, it needs QUAD_ALIGN added to the current
and any subsequent slots. However, if the effective index in the
array of a 64-bit argument was even, QUAD_ALIGN was omitted.
This approach was messy and error prone. This commit replaces it with
automated pre-processing of the system call table to do fixups for
64-bit argument offsets. The offset in a system call description now
indicates the index of an argument in the associated function call's
signature. A fixup function is run against each decoded system call
description during startup on 32-bit platforms. The fixup function
maintains an 'offset' value which holds an offset to be added to each
remaining system call argument's index. Initially offset is 0. When
a 64-bit system call argument is encountered, the offset is first
aligned to a 64-bit boundary (only on powerpc) and then incremented to
account for the second argument slot used by the argument. This
modified 'offset' is then applied to any remaining arguments. This
approach does require a few things that were not previously required:
1) Each system call description must now list arguments in ascending
order (existing ones all do) without using duplicate slots in the
register array. A new assert() should catch any future
descriptions which violate this rule.
2) A system call description is still permitted to omit arguments
(though none currently do), but if the call accepts 64-bit
arguments those cannot be omitted or incorrect results will be
displayed on 32-bit systems.
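The fixup pass described above can be sketched roughly as follows. This is
an illustration only, not the code from truss: the struct arg_desc type,
the fixup_quad_offsets() name and the align_quads flag are hypothetical
stand-ins for the real system call table and the powerpc-only alignment rule.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified argument descriptor; truss's real one differs. */
struct arg_desc {
	bool is_quad;	/* 64-bit argument (e.g. off_t) */
	int  offset;	/* in: index in signature; out: slot in register array */
};

/*
 * Rewrite signature indices into register-array slots for a 32-bit ABI.
 * 'align_quads' models the rule that 64-bit arguments must start on an
 * even slot (32-bit powerpc in the description above).
 */
static void
fixup_quad_offsets(struct arg_desc *args, size_t nargs, bool align_quads)
{
	int offset = 0;	/* extra slots consumed so far */
	int prev = -1;	/* last signature index seen */

	for (size_t i = 0; i < nargs; i++) {
		/* Arguments must be listed in strictly ascending order. */
		assert(args[i].offset > prev);
		prev = args[i].offset;

		if (args[i].is_quad) {
			/* Pad to an even slot first, if the ABI requires it. */
			if (align_quads && ((args[i].offset + offset) & 1) != 0)
				offset++;
			args[i].offset += offset;
			/* The 64-bit value occupies a second slot. */
			offset++;
		} else {
			args[i].offset += offset;
		}
	}
}
```

For a call shaped like posix_fadvise(fd, offset, len, advice), the signature
indices 0, 1, 2, 3 become slots 0, 1, 3, 5 without alignment and 0, 2, 4, 6
with it. Omitting one of the 64-bit arguments from the description would
leave the later slots wrong, which is the restriction noted in item 2.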
315496:
Decode the arguments passed to cap_fcntls_get() and cap_fcntls_limit().
315497:
Decode arguments passed to posix_fadvise().
315500:
Decode file flags passed to *chflags*().
While here, decode arguments passed to fchflags() and chflagsat().
315502:
Decode flock() operation.
315504:
Decode arguments passed to getfsstat().
Note that this does not yet decode the statfs structures returned by
getfsstat().
315509:
Decode arguments passed to kldsym() and kldunloadf().
This does not currently decode the kld_sym_lookup structure passed to
kldsym().
315523:
Add a Sizet type for 'size_t' values and use it instead of Int.
Various size_t arguments were previously decoded as Int values instead
which would have truncated values above 2^31 on 64-bit systems.
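As a rough illustration of the truncation (not the truss code itself), the
two hypothetical helpers below format the same size_t value once through an
int, as an Int decoder would, and once at full width.

```c
#include <stddef.h>
#include <stdio.h>

/* Format a size_t the way an 'Int' decoder would: narrowed to int first. */
static int
format_as_int(char *buf, size_t buflen, size_t value)
{
	return (snprintf(buf, buflen, "%d", (int)value));
}

/* Format a size_t at full width, as a dedicated size type would. */
static int
format_as_sizet(char *buf, size_t buflen, size_t value)
{
	return (snprintf(buf, buflen, "%zu", value));
}
```

Both helpers agree for small values. On a system with a 64-bit size_t, a
value such as (size_t)1 << 32 no longer fits in an int, so the first helper
loses the high bits (the exact result of that conversion is
implementation-defined) while the second prints the full value.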
315524:
Decode arguments to madvise().
315525:
Improve decoding of last arguments to ioctl() and sendto().
Decode the last argument to ioctl() as a pointer rather than an int.
Eventually this could use 'int' for the _IOWINT() case and pointers for
all others.
The last argument to sendto() is a socklen_t value, not a pointer.
lidl [Fri, 26 May 2017 15:13:46 +0000 (15:13 +0000)]
MFC r318755: Extend libblacklist support with new action types
The original blacklist library supported two notification types:
- failed auth attempt, which incremented the failed login count
by one for the remote address
- successful auth attempt, which reset the failed login count
to zero for that remote address
When the failed login count reached the limit in the configuration
file, the remote address would be blocked by a packet filter.
This patch implements a new notification type, "abusive behavior",
and accepts, but does not act on an additional type, "bad username".
It is envisioned that a system administrator will configure a small
list of "known bad usernames" that should be blocked immediately.
truckman [Thu, 25 May 2017 22:39:48 +0000 (22:39 +0000)]
MFC r318527
Fix the queue delay estimation in PIE/FQ-PIE when the timestamp
(TS) method is used. When packet timestamp is used, the "current_qdelay"
keeps storing the last queue delay value calculated in the dequeue
function. Therefore, when a burst of packets arrives followed by
a pause, the "current_qdelay" will store a high value caused by the
burst and stick to that value during the pause because the queue
delay measurement is done inside the dequeue function. This causes
the drop probability calculation function to calculate high drop
probability value instead of zero and prevents the burst allowance
mechanism from working properly. Fix this problem by resetting
"current_qdelay" inside the drop probability calculation function
when the queue length is zero and TS option is used.
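A minimal sketch of the reset, assuming a reduced state structure; the
struct pie_state layout and the pie_effective_qdelay() name are
illustrative and do not match the dummynet sources.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, reduced PIE state; field names are illustrative. */
struct pie_state {
	uint32_t qlen;			/* packets currently queued */
	uint32_t current_qdelay;	/* last delay measured at dequeue */
	bool	 use_timestamps;	/* TS method is in use */
};

/*
 * Return the queue delay that the drop probability update should see.
 * With timestamps the measurement is only refreshed at dequeue, so an
 * empty queue has to be reported as zero delay explicitly.
 */
static uint32_t
pie_effective_qdelay(struct pie_state *st)
{
	if (st->use_timestamps && st->qlen == 0)
		st->current_qdelay = 0;
	return (st->current_qdelay);
}
```

Without the reset, the stale value left behind by a burst would keep
feeding a non-zero delay into the probability calculation during the pause.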
truckman [Thu, 25 May 2017 17:22:13 +0000 (17:22 +0000)]
MFC r318511
The result of right shifting a negative signed value is implementation
defined. On machines without arithmetic shift instructions, zero bits
may be shifted in from the left, giving a large positive result instead
of the desired divide-by power-of-2. Fix this by operating on the
absolute value and compensating for the possible negation later.
Reverse the order of the underflow/overflow tests and the exponential
decay calculation to avoid the possibility of an erroneous overflow
detection if p is a sufficiently small non-negative value. Also
check for negative values of prob before doing the exponential decay
to avoid another instance of right shifting a negative value.
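The absolute-value technique can be sketched as below. The
shift_right_signed() helper is hypothetical and is not the code from the
commit; it only shows the idea of shifting the magnitude and restoring the
sign afterwards.

```c
#include <stdint.h>

/*
 * Divide a signed value by 2^shift without ever right-shifting a
 * negative number. 'shift' is assumed to be in the range 1..63.
 */
static int64_t
shift_right_signed(int64_t v, unsigned shift)
{
	/* Negate in unsigned arithmetic so INT64_MIN is handled too. */
	uint64_t mag = v < 0 ? 0 - (uint64_t)v : (uint64_t)v;

	mag >>= shift;
	/* Restore the sign that was stripped before the shift. */
	return (v < 0 ? -(int64_t)mag : (int64_t)mag);
}
```

Note that this rounds toward zero, so -7 shifted by 1 yields -3, whereas an
arithmetic shift on a two's complement machine would yield -4.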
dim [Thu, 25 May 2017 16:15:19 +0000 (16:15 +0000)]
MFC r318655:
Pull in r302416 from upstream llvm trunk (by Martin Storsjö):
[ARM] Clear the constant pool cache on explicit .ltorg directives
Multiple ldr pseudoinstructions with the same constant value will
reuse the same constant pool entry. However, if the constant pool is
explicitly flushed with a .ltorg directive, we should not try to
reference constants in the previous pool any longer, since they may
be out of range.
This fixes assembling hand-written assembler source which repeatedly
loads the same constant value, across a binary size larger than the
pc-relative fixup range for ldr instructions (4096 bytes). Such
assembler source already uses explicit .ltorg instructions to emit
constant pools with regular intervals. However if we try to reuse
constants emitted in earlier pools, they end up out of range.
This makes the output of the testcase match what binutils gas does
(prior to this patch, it would fail to assemble).
np [Thu, 25 May 2017 01:59:58 +0000 (01:59 +0000)]
MFC r318014, r318091, r318125, and r318263.
r318014:
cxgbe(4): Fixes related to the knob that controls link autonegotiation.
- Do not leak the adapter lock in sysctl_autoneg.
- Accept only 0 or 1 as valid settings for autonegotiation.
- A fixed speed must be requested by the driver when autonegotiation is
disabled otherwise the firmware will reject the l1cfg command. Use
the top speed supported by the port for now.
r318091:
cxgbe(4): Do not assume that if_qflush is always followed by interface-down.
r318125:
Adjust whitespace and fix a comment. No functional change.
r318263:
cxgbe(4): netmap-only interrupts for a VI do not have an associated rxq
or ofld_rxq and should be ignored by vi_intr_iq.
np [Thu, 25 May 2017 01:40:40 +0000 (01:40 +0000)]
MFC r317702, r317847, r318307
r317702:
cxgbe(4): Support routines for Tx traffic scheduling.
- Create a new file, t4_sched.c, and move all of the code related to
traffic management from t4_main.c and t4_sge.c to this file.
- Track both Channel Rate Limiter (ch_rl) and Class Rate Limiter (cl_rl)
parameters in the PF driver.
- Initialize all the cl_rl limiters with somewhat arbitrary default
rates and provide routines to update them on the fly.
- Provide routines to reserve and release traffic classes.
r317847:
cxgbe(4): The Tx scheduler initialization either works or doesn't. It
doesn't need a refresh in either case.
r318307:
cxgbe(4): Avoid an out of bounds access when an attempt to unbind a tx
queue from a traffic class fails.
gjb [Thu, 25 May 2017 01:31:12 +0000 (01:31 +0000)]
MFC r308737, r308779:
r308737:
Pass SWAPSIZE in env(1) when invoking mk-vmimage.sh, otherwise
mkimg(1) does not create the second partition after r307008.
r308779:
Pass SWAPSIZE in env(1) when invoking mk-vmimage.sh for the
vm-image target, missed in r308737.
np [Thu, 25 May 2017 00:43:56 +0000 (00:43 +0000)]
MFC r317041:
cxgbe: Add tunables to control the number of LRO entries and the number
of rx mbufs that should be presorted before LRO. There is no change in
default behavior.
np [Thu, 25 May 2017 00:16:01 +0000 (00:16 +0000)]
MFC r316971:
cxgbe: Add a tunable to configure the SGE time scaler, which is
available starting with T6. The values in the timer holdoff registers
are multiplied by the scaling factor before use.
dev.<nexus>.<n>.holdoff_timers shows the final values of the
timers in microseconds.
asomers [Wed, 24 May 2017 20:52:47 +0000 (20:52 +0000)]
MFC r317755, r317758
r317755:
Various Coverity fixes in ifconfig(8)
* Exit early if kldload(2) fails (1011259). This is the only change that
affects ifconfig's behavior.
* Close memory and resource leaks (1305624, 1305205, 1007100)
* Mark usage() as _Noreturn (1305806, 1305750)
* Fix some dereference after null checks (1011474, 270774)
np [Wed, 24 May 2017 20:28:48 +0000 (20:28 +0000)]
MFC r313318:
cxgbe(4): Allow tunables that control the number of queues to be set to
'-n' to tell the driver to create _up to_ 'n' queues if enough cores are
available. For example, setting hw.cxgbe.nrxq10g="-32" will result in
16 queues if the system has 16 cores, 32 if it has 32.
There is no change in the default number of queues of any type.
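The '-n' convention amounts to a clamp against the core count. A minimal
sketch, assuming a hypothetical resolve_queue_count() helper (the driver's
own function and its handling of other limits are not shown here):

```c
/*
 * Resolve a queue-count tunable. A non-negative value is used as is
 * (an assumption of this sketch); a negative value -n means "up to n
 * queues, limited by the number of cores".
 */
static int
resolve_queue_count(int tunable, int ncores)
{
	int limit;

	if (tunable >= 0)
		return (tunable);
	limit = -tunable;
	return (limit < ncores ? limit : ncores);
}
```

With a tunable of -32 this gives 16 on a 16-core system and 32 on a 32-core
system, matching the example in the commit message.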
np [Wed, 24 May 2017 19:57:22 +0000 (19:57 +0000)]
MFC r313346:
cxgbe/t4_tom: Fix CLIP entry refcounting on the passive side. Every
IPv6 connection being handled by the TOE should have a reference on its
CLIP entry.
r311880:
The iw_cxgb and iw_cxgbe drivers should not use a FreeBSD device_t where
a linuxkpi style device is expected. If OFED/linuxkpi actually starts
using this field then we'll have to figure out whether to create fake
devices for these drivers or have linuxkpi deal with NULL device.
This mismatch was first reported as part of D6585.
r314167:
cxgbe/iw_cxgbe: Minor changes for T6.
r316118:
cxgbe/iw_cxgbe: T6 has no limit on the amount of memory that can be
registered in one ib_reg_phys_mr.
r316571:
cxgbe/iw_cxgbe: Remove bad cast that resulted in incorrect length for
memory regions larger than 4GB.
r316573:
cxgbe/iw_cxgbe: Replace a magic constant with something more readable
(and accurate).
T4 and later have an extra bit for page shift so the maximum page size
is 8TB (shift of 12 + 31) instead of 128MB (12 + 15). This saves space
in the chip's PBL (physical buffer list) when registering very large
memory regions.
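The two limits follow directly from the shift widths. A small sketch of the
arithmetic, with a hypothetical max_page_size() helper:

```c
#include <stdint.h>

/*
 * Largest page size expressible with a base shift plus the largest
 * value of the extra page-shift field.
 */
static uint64_t
max_page_size(unsigned base_shift, unsigned max_extra_shift)
{
	return ((uint64_t)1 << (base_shift + max_extra_shift));
}
```

max_page_size(12, 31) is 2^43 bytes (8TB) and max_page_size(12, 15) is
2^27 bytes (128MB).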
r316580:
cxgbe/iw_cxgbe: Remove another bad cast. This should have been
included in r316571.
ae [Wed, 24 May 2017 09:03:46 +0000 (09:03 +0000)]
MFC r318399:
Set M_BCAST and M_MCAST flags on mbuf sent via divert socket.
r290383 has changed how mbufs sent by divert socket are handled.
Previously they were always handled by slow path processing in ip_input().
Now ip_tryforward() is invoked from ip_input() before the in_broadcast() check.
Since a diverted packet has lost all mbuf flags, it passes the broadcast check
in ip_tryforward() due to the missing M_BCAST flag. As a result, the broadcast
packet is forwarded to the wire instead of being consumed by the network stack.
Add an in_broadcast() check to the div_output() function and restore the
M_BCAST flag if the destination address is broadcast for the given network
interface.
mav [Tue, 23 May 2017 17:00:56 +0000 (17:00 +0000)]
MFC r309321:
Add `gmirror create` subcommand, similar to gstripe, gconcat, etc.
It is a quite specific mode of operation that does not store on-disk metadata.
It can be useful in some cases in combination with external control
tools handling mirror creation and disk hot-plug.
badger [Tue, 23 May 2017 12:40:50 +0000 (12:40 +0000)]
move p_sigqueue to the end of struct proc
In order to preserve KBI in stable branches, replace the existing
p_sigqueue slot with padding and move the expanded (as of r315949)
p_sigqueue to the end of the struct.
This is a repeat of r317529 (which concerned td_sigqueue in struct
thread) for p_sigqueue in struct proc.
Virtualbox modules (and possibly others) are affected without this fix.
mmel [Tue, 23 May 2017 12:03:59 +0000 (12:03 +0000)]
MFC r318021,r318251:
r318021:
Introduce pmap_remap_vm_attr(), which allows remapping one VM memattr class
to another.
r318251:
Clarify usage rules for pmap_remap_vm_attr(). Not a functional change.
trasz [Tue, 23 May 2017 08:09:44 +0000 (08:09 +0000)]
MFC r318138:
Revert to pre-r318116 wording to not give the false impression
that setting the kernel's idea of terminal size is somehow an
alternative to environment variables.
trasz [Tue, 23 May 2017 08:07:39 +0000 (08:07 +0000)]
MFC r317934:
Add resizewin -z. It makes resizewin not do anything if the terminal
size is already set to something other than zero. It's supposed to be
called from e.g. /etc/profile - it's not necessary to query terminal
size when logging in over the network, because the protocol used already
takes care of this, but it's necessary when logging in over a serial line.
trasz [Tue, 23 May 2017 08:04:36 +0000 (08:04 +0000)]
MFC r317909:
Make resizewin(1) discard the terminal queues, to lower the chance
of the "unable to parse response" error which happens when you're typing
too fast for the machine you're running it on.
rmacklem [Mon, 22 May 2017 21:41:34 +0000 (21:41 +0000)]
MFC: r317931
Fix mount_nfs so that it doesn't create mounttab entries for NFSv4 mounts.
The NFSv4 protocol doesn't use the Mount protocol, so it doesn't make sense
to add an entry for an NFSv4 mount to /var/db/mounttab. Also, r308871
modified umount so that it doesn't remove any entry created by mount_nfs.
rmacklem [Mon, 22 May 2017 19:34:37 +0000 (19:34 +0000)]
MFC: r317906
Stop the client side krpc from doing TCP reconnects for ERESTART from sosend().
When sosend() replies ERESTART in the client side krpc, it indicates that
the RPC message hasn't yet been sent and that the send queue is full or
locked while a signal is posted for the process.
Without this patch, this would result in an RPC_CANTSEND reply from
clnt_vc_call(), which would cause clnt_reconnect_call() to create a new
TCP transport connection. For most NFS servers, this wasn't a serious problem,
although it did imply retries of outstanding RPCs, which could possibly
have missed the DRC.
For an NFSv4.1 mount to AmazonEFS, this caused a serious problem, since
AmazonEFS often didn't retain the NFSv4.1 session and would reply with
NFS4ERR_BAD_SESSION. This implies to the client a crash/reboot which
requires open/lock state recovery.
Three options were considered to fix this:
- Return the ERESTART all the way up to the system call boundary and then
have the system call redone. This is fraught with risk, due to convoluted
code paths, asynchronous I/O RPCs etc. cperciva@ worked on this, but it
is still a work in progress and may not be feasible.
- Set SB_NOINTR for the socket buffer. This fixes the problem, but makes
the sosend() completely non interruptible, which kib@ considered
inappropriate. It also would break forced dismount when a thread
was blocked in sosend().
- Modify the retry loop in clnt_vc_call(), so that it loops for this case
for up to 15sec. Testing showed that the sosend() usually succeeded by
the 2nd retry. The extreme case observed was 111 loop iterations, or
about 100msec of delay.
This third alternative is what is implemented in this patch, since the
change is:
- localized
- straightforward
- forced dismount is not broken by it.
This patch has been tested by cperciva@ extensively against AmazonEFS.
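The retry decision of the third alternative can be sketched as a simple
predicate. This is not the clnt_vc_call() code: should_retry_send() and
RETRY_BUDGET_MS are hypothetical names, and the fallback ERESTART
definition exists only so the sketch builds outside the kernel.

```c
#include <errno.h>
#include <stdbool.h>

#ifndef ERESTART
#define	ERESTART	(-1)	/* kernel-internal on FreeBSD; placeholder here */
#endif

#define	RETRY_BUDGET_MS	15000	/* loop for up to 15 seconds */

/*
 * Decide whether a failed send should be retried on the same
 * connection instead of being reported as a failure, which would
 * trigger a reconnect.
 */
static bool
should_retry_send(int error, long elapsed_ms)
{
	return (error == ERESTART && elapsed_ms < RETRY_BUDGET_MS);
}
```

A caller would keep invoking the send routine while this returns true and
fall back to the existing error path once the budget is exhausted or a
different error is returned.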
davidcs [Mon, 22 May 2017 19:22:06 +0000 (19:22 +0000)]
MFC r318382
1. Move Rx Processing to fp_taskqueue(). With this, CPU utilization for
processing interrupts drops to around 1% for 100G and under 1% for
other speeds.
2. Use sysctls for TRACE_LRO_CNT and TRACE_TSO_PKT_LEN
3. remove unused mtx tx_lock
4. bind taskqueue kernel thread to the appropriate cpu core
5. when tx_ring is full, stop further transmits till at least 1/16th of
the Tx Ring is empty. In our case 1K entries. Also if there are
rx_pkts to process, put the taskqueue thread to sleep for 100ms,
before enabling interrupts.
6. Use rx_pkt_threshold of 128.
gjb [Mon, 22 May 2017 16:07:17 +0000 (16:07 +0000)]
MFC r307469 (imp):
Allow root_rw_mount to be both lower and upper case. Before, if it was
upper case, you could wind up with a read-only filesystem when it should
have been mounted read-write.
MSDOS and Windows GNU grep uses -u to mean "print byte offsets as if
running on an UNIX system." The option has no effect on systems that
do not use CRLF line endings.
asomers [Mon, 22 May 2017 15:12:49 +0000 (15:12 +0000)]
MFC r318189:
vdev_geom may associate multiple vdevs per g_consumer
vdev_geom.c currently uses the g_consumer's private field to point to a
vdev_t. That way, a GEOM event can cause a change to a ZFS vdev. For
example, when you remove a disk, the vdev's status will change to REMOVED.
However, vdev_geom will sometimes attach multiple vdevs to the same GEOM
consumer. If this happens, then geom events will only be propagated to one
of the vdevs.
Fix this by storing a linked list of vdevs in g_consumer's private field.
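A minimal sketch of the idea, using simplified stand-in types rather than
the real g_consumer and vdev_t: the private field holds the head of a list,
and an event walks every element instead of touching a single vdev.

```c
#include <stddef.h>

/* Simplified stand-ins for vdev_t and g_consumer; names are illustrative. */
struct vdev {
	const char *name;
	int removed;
};

struct vdev_elem {
	struct vdev *vd;
	struct vdev_elem *next;
};

struct consumer {
	struct vdev_elem *private;	/* head of the list of attached vdevs */
};

/* Attach one more vdev to the consumer by pushing onto the list. */
static void
consumer_attach(struct consumer *cp, struct vdev_elem *elem, struct vdev *vd)
{
	elem->vd = vd;
	elem->next = cp->private;
	cp->private = elem;
}

/* Propagate a removal event to every attached vdev; returns the count. */
static int
consumer_mark_removed(struct consumer *cp)
{
	int n = 0;

	for (struct vdev_elem *e = cp->private; e != NULL; e = e->next) {
		e->vd->removed = 1;
		n++;
	}
	return (n);
}
```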
Fix the output of very large rebind, renew and lease time options in
lease file.
Some routers set very large values for rebind time (Netgear) and these
are erroneously reported as negative in the leasefile. This was due to a
wrong printf format specification of %ld for an unsigned long on 32-bit
platforms.
They would overflow a signed 32-bit time_t on 32 bit architectures. This
was taken care of, but a compiler optimisation makes this behave
erratically. This could be resolved by adding a -fwrapv flag, but
instead we can check the value before adding the current timestamp to
it.
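Checking the value before the addition might look like the sketch below.
The lease_expiry32() helper is hypothetical, and saturating at INT32_MAX is
an assumption of this sketch; the commit message does not say what value
the real code substitutes.

```c
#include <stdint.h>

/*
 * Add a lease interval in seconds to the current time, where time is a
 * signed 32-bit quantity, without relying on signed wraparound.
 */
static int32_t
lease_expiry32(int32_t now, uint32_t interval)
{
	/* A pre-epoch clock is not expected; saturate defensively. */
	if (now < 0)
		return (INT32_MAX);
	/* Test the headroom first, so the addition itself cannot overflow. */
	if (interval > (uint32_t)(INT32_MAX - now))
		return (INT32_MAX);
	return (now + (int32_t)interval);
}
```

Because the comparison happens before the addition, no signed overflow
occurs and the result does not depend on -fwrapv or on how the compiler
optimises overflow checks.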
hselasky [Mon, 22 May 2017 08:17:07 +0000 (08:17 +0000)]
MFC r318531:
mlx4: Use the CQ quota for SRIOV when creating completion EQs
When creating EQs to handle CQ completion events for the PF or for
VFs, we create enough EQE entries to handle completions for the max
number of CQs that can use that EQ.
When SRIOV is activated, the max number of CQs a VF (or the PF) can
obtain is its CQ quota (determined by the Hypervisor resource
tracker). Therefore, when creating an EQ, the number of EQE entries
that the VF should request for that EQ is the CQ quota value (and not
the total number of CQs available in the firmware).
Under SRIOV, the PF must also use its CQ quota, because the resource
tracker also controls how many CQs the PF can obtain.
Using the firmware total CQs instead of the CQ quota when creating EQs
resulted in wasting MTT entries, due to allocating more EQEs than were
needed.
ngie [Mon, 22 May 2017 06:24:43 +0000 (06:24 +0000)]
MFC r317594:
usb(4): manpage cleanup
1. Wrap at <80 columns for readability when editing. Rewrap some lines
prematurely wrapped to better fit in <80 columns and not waste
vertical space.
2. Fix SEE ALSO sorting (sort by section first, then manpage name).
3. Tweak the compound device description slightly by adding soft stops
via commas.