sbruno [Wed, 6 Jun 2018 15:45:57 +0000 (15:45 +0000)]
Load balance sockets with new SO_REUSEPORT_LB option.
This patch adds a new socket option, SO_REUSEPORT_LB, which allow multiple
programs or threads to bind to the same port and incoming connections will be
load balanced using a hash function.
Most of the code was copied from a similar patch for DragonflyBSD.
However, in DragonflyBSD, load balancing is a global on/off setting and can not
be set per socket. This patch allows for simultaneous use of both the current
SO_REUSEPORT and the new SO_REUSEPORT_LB options on the same system.
Required changes to structures:
Globally change so_options from 16 to 32 bit value to allow for more options.
Add hashtable in pcbinfo to hold all SO_REUSEPORT_LB sockets.
Limitations:
As DragonflyBSD, a load balance group is limited to 256 pcbs (256 programs or
threads sharing the same socket).
This is a substantially different contribution as compared to its original
incarnation at svn r332894 and reverted at svn r332967. Thanks to rwatson@
for the substantive feedback that is included in this commit.
hselasky [Wed, 6 Jun 2018 15:06:21 +0000 (15:06 +0000)]
Rename two structure field members while keeping backwards compatibility in
the LinuxKPI. Add a comment saying in which Linux version this change was made.
Make in_delayed_cksum() be similar to IPv6 implementation.
Use m_copyback() function to write checksum when it isn't located
in the first mbuf of the chain. Handmade analog doesn't handle the
case when parts of checksum are located in different mbufs.
Also in case when mbuf is too short, m_copyback() will allocate new
mbuf in the chain instead of making out of bounds write.
Also wrap long line and remove now useless KASSERTs.
thj [Wed, 6 Jun 2018 07:04:40 +0000 (07:04 +0000)]
Use UDP len when calculating UDP checksums
The length of the IP payload is normally equal to the UDP length, UDP Options
(draft-ietf-tsvwg-udp-options-02) suggests using the difference between IP
length and UDP length to create space for trailing data.
Correct checksum length calculation to use the UDP length rather than the IP
length when not offloading UDP checksums.
ian [Tue, 5 Jun 2018 22:13:45 +0000 (22:13 +0000)]
Remove comments and assertions that are no longer valid after r330809.
r330809 replaced duplication of devdesc struct fields with an embedded copy
of the devdesc struct, to avoid fragility. That means all the scattered
comments indicating that structs must match are no longer valid. Likewise
asserts that attempted to mitigate some of the old fragility.
Rework if_gif(4) to use new encap_lookup_t method to speedup lookup
of needed interface when many gif interfaces are present.
Remove rmlock from gif_softc, use epoch(9) and CK_LIST instead.
Move more AF-related code into AF-related locations.
Use hash table to speedup lookup of needed softc. Interfaces
with GIF_IGNORE_SOURCE flag are stored in plain CK_LIST.
Sysctl net.link.gif.parallel_tunnels is removed. The removal was planed
16 years ago, and actually it could work only for outbound direction.
Each protocol, that can be handled by if_gif(4) interface is registered
by separate encap handler, this helps avoid invoking the handler
for unrelated protocols (GRE, PIM, etc.).
This change allows dramatically improve performance when many gif(4)
interfaces are used.
Currently it has several disadvantages:
- it uses single mutex to protect internal structures. It is used by
data- and control- path, thus there are no parallelism at all.
- it uses single list to keep encap handlers for both INET and INET6
families.
- struct encaptab keeps unneeded information (src, dst, masks, protosw),
that isn't used by code in the source tree.
- matches are prioritized and when many tunneling interfaces are
registered, encapcheck handler of each interface is invoked for each
packet. The search takes O(n) for n interfaces. All this work is done
with exclusive lock held.
What this patch includes:
- the datapath is converted to be lockless using epoch(9) KPI.
- struct encaptab now linked using CK_LIST.
- all unused fields removed from struct encaptab. Several new fields
addedr: min_length is the minimum packet length, that encapsulation
handler expects to see; exact_match is maximum number of bits, that
can return an encapsulation handler, when it wants to consume a packet.
- IPv6 and IPv4 handlers are stored in separate lists;
- added new "encap_lookup_t" method, that will be used later. It is
targeted to speedup lookup of needed interface, when gif(4)/gre(4) have
many interfaces.
- the need to use protosw structure is eliminated. The only pr_input
method was used from this structure, so I don't see the need to keep
using it.
- encap_input_t method changed to avoid using mbuf tags to store softc
pointer. Now it is passed directly trough encap_input_t method.
encap_getarg() funtions is removed.
- all sockaddr structures and code that uses them removed. We don't have
any code in the tree that uses them. All consumers use encap_attach_func()
method, that relies on invoking of encapcheck() to determine the needed
handler.
- introduced struct encap_config, it contains parameters of encap handler
that is going to be registered by encap_attach() function.
- encap handlers are stored in lists ordered by exact_match value, thus
handlers that need more bits to match will be checked first, and if
encapcheck method returns exact_match value, the search will be stopped.
- all current consumers changed to use new KPI.
ian [Tue, 5 Jun 2018 17:18:10 +0000 (17:18 +0000)]
Make the v*printf() functions in libsa return int instead of void.
This makes them compatible with the C standard signatures, avoiding
spurious mismatch errors in the places where the oddball requirements
of standalone code end up putting two declarations of the same function
in play.
hselasky [Tue, 5 Jun 2018 15:37:28 +0000 (15:37 +0000)]
Add "access" function pointer to the "vm_operations_struct" structure
in the LinuxKPI. While at it document when to use the "virtual_address" or
the "address" field in the "vm_fault" structure.
ram [Tue, 5 Jun 2018 15:05:26 +0000 (15:05 +0000)]
Issue:
Utility hangs when OCS_IOCTL_CMD_MGMT_GET_ALL called in parallel on port 0 and port 1.
Fix: Using static structure for results is corrupting the second ioctl request. Removed static for results structure.
Approved by: ken
MFC after: 3 days
kevlo [Tue, 5 Jun 2018 05:24:42 +0000 (05:24 +0000)]
Since we don't enable BUF_TRACKING and FULL_BUF_TRACKING buffer debugging
options in GENERIC kernels on arm and arm64, there's no need to disable
them.
mmacy [Tue, 5 Jun 2018 04:26:40 +0000 (04:26 +0000)]
hwpmc: log name->pid, name->tid mappings
By logging all threads and processes 'pmc filter'
can now filter on process or thread name, relieving
the user of the burden of determining which tid or
pid was which when the sample was taken.
pstef [Mon, 4 Jun 2018 21:05:56 +0000 (21:05 +0000)]
indent(1): add --version option
There exist multi-platform programs that check indent's version in order to
know what they can expect from it. GNU indent provides that via --version,
so implement the same option here.
markj [Mon, 4 Jun 2018 19:35:15 +0000 (19:35 +0000)]
Reimplement brk() and sbrk() to avoid the use of _end.
Previously, libc.so would initialize its notion of the break address
using _end, a special symbol emitted by the static linker following
the bss section. Compatibility issues between lld and ld.bfd could
cause the wrong definition of _end (libc.so's definition rather than
that of the executable) to be used, breaking the brk()/sbrk()
interface.
Avoid this problem and future interoperability issues by simply not
relying on _end. Instead, modify the break() system call to return
the kernel's view of the current break address, and have libc
initialize its state using an extra syscall upon the first use of the
interface. As a side effect, this appears to fix brk()/sbrk() usage
in executables run with rtld direct exec, since the kernel and libc.so
no longer maintain separate views of the process' break address.
cem [Mon, 4 Jun 2018 18:51:06 +0000 (18:51 +0000)]
Correctly handle the padding for IPv6-AH, as specified by RFC4302
The RFC specifies that under IPv6 the complete AH header must be 64 bit
aligned, and under IPv4, 32 bit aligned. Prior to this change, we (along
with other BSDs and MacOS) had violated this requirement.
This makes it possible to set up IPv6-AH between Linux and BSD, and also
probably between Windows and BSD.
PR: 222684
Reported and tested by: Jason Mader <jasonmader AT gmail.com>
Obtained from: NetBSD xform_ah.c 1.105
(b939fe2483972eb43d71bf990cfb7f26dece7839 NetBSD/src on GH)
by Maxime Villard
MFC after: 35.2731 hours
Relnotes: probably (breaks ipv6 compat with older FreeBSD/NetBSD/MacOS)
Sponsored by: Dell EMC Isilon
cem [Mon, 4 Jun 2018 18:47:14 +0000 (18:47 +0000)]
str(r)chr: Replace union abuse with __DECONST
Writing one union member and reading another is technically illegal C,
although we do it in many places in the tree. Use the __DECONST macro
instead, which is (technically) a valid C construct.
Trivial style(9) cleanups to touched lines while here.
alc [Mon, 4 Jun 2018 16:28:06 +0000 (16:28 +0000)]
Use a single, consistent approach to returning success versus failure in
vm_map_madvise(). Previously, vm_map_madvise() used a traditional Unix-
style "return (0);" to indicate success in the common case, but Mach-
style return values in the edge cases. Since KERN_SUCCESS equals zero,
the only problem with this inconsistency was stylistic. vm_map_madvise()
has exactly two callers in the entire source tree, and only one of them
cares about the return value. That caller, kern_madvise(), can be
simplified if vm_map_madvise() consistently uses Unix-style return
values.
Since vm_map_madvise() uses the variable modify_map as a Boolean, make it
one.
Eliminate a redundant error check from kern_madvise(). Add a comment
explaining where the check is performed.
Explicitly note that exec_release_args_kva() doesn't care about
vm_map_madvise()'s return value. Since MADV_FREE is passed as the
behavior, the return value will always be zero.
Fix build: ignore a GCC 7.2.0 warning which says that third argument of
memset(3) should contain the number of elements multiplied by the element
size.
jhibbits [Mon, 4 Jun 2018 14:42:13 +0000 (14:42 +0000)]
Set kernelname in bootconfig to the kernel file
Summary:
The kernel reads 'kernelname' to set the kern.bootfile sysctl. By setting this,
'make installkernel' will backup the running kernel as appropriate.
mmacy [Mon, 4 Jun 2018 02:05:48 +0000 (02:05 +0000)]
hwpmc: ABI fixes
- increase pmc cpuid field from 8 to 12 bits
- add cpuid version string to initialize entry in the log
so that filter can identify which counter index an
event name maps to
- GC unused config flags
- make fixed counter assignment more robust as well as the
changes needed to be properly identified for filter
mmacy [Mon, 4 Jun 2018 01:10:23 +0000 (01:10 +0000)]
hwpmc: support sampling both kernel and user stacks when interrupted in kernel
This adds the -U options to pmcstat which will attribute in-kernel samples
back to the user stack that invoked the system call. It is not the default,
because when looking at kernel profiles it is generally more desirable to
merge all instances of a given system call together.
Although heavily revised, this change is directly derived from D7350 by
Jonathan T. Looney.
eadler [Sun, 3 Jun 2018 23:40:54 +0000 (23:40 +0000)]
top(1): another pass of cleanup
- avoid the need to call a function to get size of known array. I'll
likely re-arrange some of the indirect in a later to avoid the magic
constants.
- use correct type
- add const
- replace caddr_t with void*. This corrects an alignment warning.
- remove duplicated include from immediately prior commit
Under base clang we're now down to:
- 3 warning in top.c, 1 warning in mahcine.c, 4 warning in display.c,
- 1 warning in utils.c
Tested with base clang, gcc7, gcc9, base gcc (mips)
eadler [Sun, 3 Jun 2018 22:42:54 +0000 (22:42 +0000)]
top(1): top warnings and cleanup
- Add const where helpful
- add missing 'static' for file-local functions
- use nitems where possible
- convert manual abort() to assert
- use strndup instead of homegrown version
pstef [Sun, 3 Jun 2018 21:40:38 +0000 (21:40 +0000)]
indent(1): new option -lpl
With -lpl, code surrounded by parentheses in continuation lines is lined up
even if it would extend past the right margin.
With -nlpl (the default), such a line that would extend past the right
margin is moved left to keep it within the margin, if that does not require
placing it to the left of the prevailing indentation level.
These switches have no effect if -nlp is selected.
pstef [Sun, 3 Jun 2018 20:59:59 +0000 (20:59 +0000)]
indent(1): new option -lpl (always line up to parenthesis)
With -lp, if a line has an opening paren which is not closed on that line,
then continuation lines will be lined up to start at the character position
just after the opening paren.
rmacklem [Sun, 3 Jun 2018 19:46:44 +0000 (19:46 +0000)]
Fix a gcc8 warning about a write only variable.
gcc8 warns that "verf" was set but not used. This was because the code
that uses it is disabled via a "#if 0".
This patch adds a "#if 0" to the variable's declaration and assignment
to get rid of the warning.
This way the code could be re-enabled without difficulty.
pstef [Sun, 3 Jun 2018 18:19:41 +0000 (18:19 +0000)]
indent(1): improve CHECK_SIZE_ macros
Rewrite the macros so that they take a parameter. Consumers use it to signal
how much room in the buffer they need; this lets them do that once when
required space is known instead of doing the check once every loop step.
Also take the parameter value into consideration when resizing the buffer;
the requested space may be larger than the constant 400 bytes that the
previous version used - now it's the sum of those two values.
On the consumer side, don't copy strings byte by byte - use memcpy().
Deduplicate code that copied base 2, base 8 and base 16 literals.
Don't advance the e_token pointer once the token has been copied into
s_token. This allows easy calculation of the token's length.
pstef [Sun, 3 Jun 2018 17:55:50 +0000 (17:55 +0000)]
indent(1): remove troff output support
The troff output in indent was invented at Sun and the online documentation
for some post-SunOS operating system includes this:
The usual way to get a troffed listing is with the command
indent -troff program.c | troff -mindent
The indent manual page in FreeBSD 1.0 already lacks that information and
troff -mindent complains about not being able to find the macro file.
It seems that the file did exist on SunOS and was supposed to be imported
into 4.3BSD together with the feature, but that has never happened.
Removal of troff output support simplifies a lot of indent's code.