1 .\" Copyright (c) 1983, 1991, 1993
2 .\" The Regents of the University of California.
3 .\" Copyright (c) 2010-2011 The FreeBSD Foundation
4 .\" All rights reserved.
6 .\" Portions of this documentation were written at the Centre for Advanced
7 .\" Internet Architectures, Swinburne University of Technology, Melbourne,
8 .\" Australia by David Hayes under sponsorship from the FreeBSD Foundation.
10 .\" Redistribution and use in source and binary forms, with or without
11 .\" modification, are permitted provided that the following conditions
13 .\" 1. Redistributions of source code must retain the above copyright
14 .\" notice, this list of conditions and the following disclaimer.
15 .\" 2. Redistributions in binary form must reproduce the above copyright
16 .\" notice, this list of conditions and the following disclaimer in the
17 .\" documentation and/or other materials provided with the distribution.
18 .\" 3. Neither the name of the University nor the names of its contributors
19 .\" may be used to endorse or promote products derived from this software
20 .\" without specific prior written permission.
22 .\" THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND
23 .\" ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
24 .\" IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
25 .\" ARE DISCLAIMED. IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE
26 .\" FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
27 .\" DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
28 .\" OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
29 .\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
30 .\" LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
31 .\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
34 .\" From: @(#)tcp.4 8.1 (Berkeley) 6/5/93
42 .Nd Internet Transmission Control Protocol
49 .Fn socket AF_INET SOCK_STREAM 0
53 protocol provides reliable, flow-controlled, two-way
55 It is a byte-stream protocol used to
61 Internet address format and, in addition, provides a per-host
63 .Dq "port addresses" .
64 Thus, each address is composed
65 of an Internet address specifying the host and network,
68 port on the host identifying the peer entity.
76 Active sockets initiate connections to passive
80 sockets are created active; to create a
83 system call must be used
84 after binding the socket with the
87 Only passive sockets may use the
89 call to accept incoming connections.
90 Only active sockets may use the
92 call to initiate connections.
96 their location to match
97 incoming connection requests from multiple networks.
98 This technique, termed
99 .Dq "wildcard addressing" ,
101 server to provide service to clients on multiple networks.
102 To create a socket which listens on all networks, the Internet
108 port may still be specified
109 at this time; if the port is not specified, the system will assign one.
110 Once a connection has been established, the socket's address is
111 fixed by the peer entity's location.
112 The address assigned to the
113 socket is the address associated with the network interface
114 through which packets are being transmitted and received.
115 Normally, this address corresponds to the peer entity's network.
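.Pp
The following is a minimal sketch of wildcard addressing, not a definitive implementation.
It binds a passive socket to INADDR_ANY, lets the system assign the port, and then connects to it over the loopback interface to show that the accepted socket's address is fixed by the interface the connection arrived on.
The helper names wildcard_listen and accepted_local_addr are hypothetical and chosen only for illustration.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/*
 * Create a passive socket bound to the wildcard address.  Passing *port
 * as 0 lets the system assign one; the chosen port is stored in *port.
 * Returns the listening descriptor, or -1 on error.
 */
static int
wildcard_listen(in_port_t *port)
{
	struct sockaddr_in sin;
	socklen_t len = sizeof(sin);
	int s;

	if ((s = socket(AF_INET, SOCK_STREAM, 0)) == -1)
		return (-1);
	memset(&sin, 0, sizeof(sin));
	sin.sin_family = AF_INET;
	sin.sin_addr.s_addr = htonl(INADDR_ANY);
	sin.sin_port = htons(*port);
	if (bind(s, (struct sockaddr *)&sin, sizeof(sin)) == -1 ||
	    listen(s, 5) == -1 ||
	    getsockname(s, (struct sockaddr *)&sin, &len) == -1) {
		close(s);
		return (-1);
	}
	*port = ntohs(sin.sin_port);
	return (s);
}

/*
 * Connect an active socket to 127.0.0.1:port, accept it on the passive
 * socket, and return the local address (network byte order) assigned to
 * the accepted socket, or INADDR_NONE on error.
 */
static in_addr_t
accepted_local_addr(int lsock, in_port_t port)
{
	struct sockaddr_in sin;
	socklen_t len = sizeof(sin);
	in_addr_t result = INADDR_NONE;
	int c, a = -1;

	if ((c = socket(AF_INET, SOCK_STREAM, 0)) == -1)
		return (result);
	memset(&sin, 0, sizeof(sin));
	sin.sin_family = AF_INET;
	sin.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
	sin.sin_port = htons(port);
	if (connect(c, (struct sockaddr *)&sin, sizeof(sin)) == 0 &&
	    (a = accept(lsock, NULL, NULL)) != -1 &&
	    getsockname(a, (struct sockaddr *)&sin, &len) == 0)
		result = sin.sin_addr.s_addr;
	if (a != -1)
		close(a);
	close(c);
	return (result);
}
```

Although the listener was bound to the wildcard address, the accepted socket reports the loopback address, because that is the interface the connection used.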
118 supports a number of socket options which can be set with
122 .Bl -tag -width ".Dv TCP_FUNCTION_BLK"
124 Information about a socket's underlying TCP session may be retrieved
125 by passing the read-only option
129 It accepts a single argument: a pointer to an instance of
130 .Vt "struct tcp_info" .
132 This API is subject to change; consult the source to determine
133 which fields are currently filled out by this option.
135 specific additions include
139 bandwidth-controlled window space.
141 Set or query congestion control algorithm specific parameters.
145 .It Dv TCP_CONGESTION
146 Select or query the congestion control algorithm that TCP will use for the
152 Enable or disable TCP Fast Open (TFO).
153 To use this option, the kernel must be built with the
157 This option can be set on the socket either before or after the
160 Clearing this option on a listen socket after it has been set has no effect on
161 existing TFO connections or TFO connections in progress; it only prevents new
162 TFO connections from being established.
164 For passively-created sockets, the
166 socket option can be queried to determine whether the connection was established
168 Note that connections that are established via a TFO
170 but that fall back to using a non-TFO
176 In addition to the facilities defined in RFC 7413, this implementation supports a
177 pre-shared key (PSK) mode of operation in which the TFO server requires the
178 client to be in possession of a shared secret in order for the client to be able
179 to successfully open TFO connections with the server.
180 This is useful, for example, in environments where TFO servers are exposed to
181 both internal and external clients and only wish to allow TFO connections from
184 In the PSK mode of operation, the server generates and sends TFO cookies to
185 requesting clients as usual.
186 However, when validating cookies received in TFO SYNs from clients, the server
187 requires the client-supplied cookie to equal
188 .Bd -literal -offset left
189 SipHash24(key=\fI16-byte-psk\fP, msg=\fIcookie-sent-to-client\fP)
192 Multiple concurrent valid pre-shared keys are supported so that time-based
193 rolling PSK invalidation policies can be implemented in the system.
194 The default number of concurrent pre-shared keys is 2.
196 This can be adjusted with the
197 .Dv TCP_RFC7413_MAX_PSKS
199 .It Dv TCP_FUNCTION_BLK
200 Select or query the set of functions that TCP will use for this connection.
201 This allows a user to select an alternate TCP stack.
202 The alternate TCP stack must already be loaded in the kernel.
203 To list the available TCP stacks, see
204 .Va functions_available
207 section further down.
208 To list the default TCP stack, see
209 .Va functions_default
216 option accepts a per-socket timeout argument of
218 in seconds, for new, non-established
221 For the global default in milliseconds see
225 section further down.
229 option accepts an argument of
231 for the amount of time, in seconds, that the connection must be idle
232 before keepalive probes (if enabled) are sent for the connection of this
234 If set on a listening socket, the value is inherited by the newly created
237 For the global default in milliseconds see
241 section further down.
245 option accepts an argument of
247 to set the per-socket interval, in seconds, between keepalive probes sent
249 If set on a listening socket, the value is inherited by the newly created
252 For the global default in milliseconds see
256 section further down.
260 option accepts an argument of
262 and allows a per-socket tuning of the number of probes sent, with no response,
263 before the connection will be dropped.
264 If set on a listening socket, the value is inherited by the newly created
267 For the global default see the
271 section further down.
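.Pp
The per-socket keepalive settings described above might be applied together as in the following sketch.
It assumes the options are named TCP_KEEPIDLE, TCP_KEEPINTVL and TCP_KEEPCNT and that keepalive itself is switched on with SO_KEEPALIVE; the helper name tune_keepalive is hypothetical.

```c
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

/*
 * Enable keepalive on socket s and override the global defaults.
 * idle and intvl are in seconds (the sysctl defaults are in
 * milliseconds); cnt is a number of unanswered probes.
 * Returns 0 on success, -1 as soon as one setsockopt() call fails.
 */
static int
tune_keepalive(int s, unsigned int idle, unsigned int intvl, unsigned int cnt)
{
	int on = 1;

	if (setsockopt(s, SOL_SOCKET, SO_KEEPALIVE, &on, sizeof(on)) == -1 ||
	    setsockopt(s, IPPROTO_TCP, TCP_KEEPIDLE, &idle, sizeof(idle)) == -1 ||
	    setsockopt(s, IPPROTO_TCP, TCP_KEEPINTVL, &intvl, sizeof(intvl)) == -1 ||
	    setsockopt(s, IPPROTO_TCP, TCP_KEEPCNT, &cnt, sizeof(cnt)) == -1)
		return (-1);
	return (0);
}
```

Calling tune_keepalive(s, 60, 10, 5) would start probing after one minute of idle time and give up after five unanswered probes sent ten seconds apart.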
273 Under most circumstances,
275 sends data when it is presented;
276 when outstanding data has not yet been acknowledged, it gathers
277 small amounts of output to be sent in a single packet once
278 an acknowledgement is received.
279 For a small number of clients, such as window systems
280 that send a stream of mouse events which receive no replies,
281 this packetization may cause significant delays.
284 defeats this algorithm.
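.Pp
A latency-sensitive client could turn the gathering behaviour off as sketched below.
The option name TCP_NODELAY is assumed here, and set_nodelay is a hypothetical helper.

```c
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

/*
 * Disable (enable != 0) or restore (enable == 0) the coalescing of
 * small writes on socket s.  Returns the setsockopt() result.
 */
static int
set_nodelay(int s, int enable)
{
	return (setsockopt(s, IPPROTO_TCP, TCP_NODELAY, &enable,
	    sizeof(enable)));
}
```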
286 By default, a sender- and
287 .No receiver- Ns Tn TCP
288 will negotiate among themselves to determine the maximum segment size
289 to be used for each connection.
292 option allows the user to determine the result of this negotiation,
293 and to reduce it if desired.
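.Pp
Querying and lowering the segment size could look like the sketch below.
It assumes the option is named TCP_MAXSEG and takes an int; the helpers get_mss and reduce_mss are hypothetical.
The value reported on a socket that is not yet connected is a system default rather than a negotiated result.

```c
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

/* Return the maximum segment size currently in effect for s, or -1. */
static int
get_mss(int s)
{
	int mss;
	socklen_t len = sizeof(mss);

	if (getsockopt(s, IPPROTO_TCP, TCP_MAXSEG, &mss, &len) == -1)
		return (-1);
	return (mss);
}

/*
 * Ask for a smaller segment size.  The option is meant for lowering
 * the value; a request above the current one may be rejected.
 */
static int
reduce_mss(int s, int mss)
{
	return (setsockopt(s, IPPROTO_TCP, TCP_MAXSEG, &mss, sizeof(mss)));
}
```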
296 usually sends a number of options in each packet, corresponding to
299 extensions which are provided in this implementation.
302 is provided to disable
304 option use on a per-connection basis.
307 .No sender- Ns Tn TCP
310 bit, and begin transmission immediately (if permitted) at the end of
315 When this option is set to a non-zero value,
317 will delay sending any data at all until either the socket is closed,
318 or the internal send buffer is filled.
320 This option enables the use of MD5 digests (also known as TCP-MD5)
321 on writes to the specified socket.
322 Outgoing traffic is digested;
323 digests on incoming traffic are verified.
324 When this option is enabled on a socket, all inbound and outgoing
325 TCP segments must be signed with MD5 digests.
327 One common use for this in a
329 router deployment is to enable
330 based routers to interwork with Cisco equipment at peering points.
331 Support for this feature conforms to RFC 2385.
333 In order for this option to function correctly, it is necessary for the
334 administrator to add a tcp-md5 key entry to the system's security
335 associations database (SADB) using the
338 This entry can only be specified on a per-host basis at this time.
340 If an SADB entry cannot be found for the destination,
341 the system does not send any outgoing segments and drops any inbound segments.
343 Each dropped segment is taken into account in the TCP protocol statistics.
346 The option level for the
348 call is the protocol number for
351 .Xr getprotobyname 3 ,
354 All options are declared in
359 transport level may be used with
363 Incoming connection requests that are source-routed are noted,
364 and the reverse source route is used in responding.
366 The default congestion control algorithm for
370 Other congestion control algorithms can be made available using the
376 protocol implements a number of variables in the
381 .Bl -tag -width ".Va TCPCTL_DO_RFC1323"
382 .It Dv TCPCTL_DO_RFC1323
384 Implement the window scaling and timestamp options of RFC 1323/RFC 7323
386 .It Va tolerate_missing_ts
387 Tolerate missing timestamps (RFC 1323/RFC 7323) for
389 segments belonging to
391 connections for which support of
393 timestamps has been negotiated.
394 As of June 2021, several TCP stacks are known to violate RFC 7323, including
395 modern widely deployed ones.
396 Therefore the default is 1, i.e., missing timestamps are tolerated.
397 .It Dv TCPCTL_MSSDFLT
399 The default value used for the maximum segment size
401 when no advice to the contrary is received from MSS negotiation.
402 .It Dv TCPCTL_SENDSPACE
407 .It Dv TCPCTL_RECVSPACE
413 Log any connection attempts to ports where there is not a socket
414 accepting connections.
415 A value of 1 limits the logging to
417 (connection establishment) packets only.
418 A value of 2 results in any
420 packets to closed ports being logged.
421 Any other value disables the logging
422 (default is 0, i.e., the logging is disabled).
424 The Maximum Segment Lifetime, in milliseconds, for a packet.
426 Timeout, in milliseconds, for new, non-established
429 The default is 75000 msec.
431 Amount of time, in milliseconds, that the connection must be idle
432 before keepalive probes (if enabled) are sent.
433 The default is 7200000 msec (2 hours).
435 The interval, in milliseconds, between keepalive probes sent to remote
436 machines, when no response is received on a
439 The default is 75000 msec.
441 Number of probes sent, with no response, before a connection
443 The default is 8 packets.
444 .It Va always_keepalive
449 connections, the kernel will
450 periodically send a packet to the remote host to verify the connection
455 unreachable messages may abort connections in
461 reassembly queue if the system is low on mbufs.
463 If enabled, disable sending of RST when a connection is attempted
464 to a port where there is not a socket accepting connections.
468 Delay the ACK to try to piggyback it onto a data packet.
470 Maximum amount of time, in milliseconds, before a delayed ACK is sent.
471 .It Va path_mtu_discovery
472 Enable Path MTU Discovery.
476 control-block hash table
478 This may be tuned using the kernel option
481 .Va net.inet.tcp.tcbhashsize
485 Number of active process control blocks
488 Determines whether or not
490 cookies should be generated for outbound
494 cookies are a great help during
496 flood attacks, and are enabled by default.
499 .It Va isn_reseed_interval
500 The interval (in seconds) specifying how often the secret data used in
501 RFC 1948 initial sequence number calculations should be reseeded.
502 By default, this variable is set to zero, indicating that
503 no reseeding will occur.
504 Reseeding should not be necessary, and will break
506 recycling for a few minutes.
507 .It Va reass.cursegments
508 The current total number of segments present in all reassembly queues.
509 .It Va reass.maxsegments
510 The maximum limit on the total number of segments across all reassembly
512 The limit can be adjusted as a tunable.
513 .It Va reass.maxqueuelen
514 The maximum number of segments allowed in each reassembly queue.
515 By default, the system chooses a limit based on each TCP connection's
516 receive buffer size and maximum segment size (MSS).
517 The actual limit applied to a session's reassembly queue will be the lower of
518 the system-calculated automatic limit and the user-specified
519 .Va reass.maxqueuelen
521 .It Va rexmit_initial , rexmit_min , rexmit_slop
522 Adjust the retransmit timer calculation for
525 typically added to the raw calculation to take into account
526 occasional variances that the
528 (smoothed round-trip time)
529 is unable to accommodate, while the minimum specifies an
534 second minimum, these RFCs tend to focus on streaming behavior,
535 and fail to deal with the fact that a 1 second minimum has severe
536 detrimental effects over lossy interactive connections, such
537 as an 802.11b wireless link, and over very fast but lossy
538 connections for those cases not covered by the fast retransmit
540 For this reason, we use 200ms of slop and a near-0
541 minimum, which gives us an effective minimum of 200ms (similar to
543 The initial value is used before an RTT measurement has been performed.
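.Pp
How the minimum and the slop interact can be sketched as follows.
This is a simplified model under stated assumptions, not the kernel's code: it takes the conventional raw estimate of srtt plus four times the RTT variance, floors it at the minimum, and then adds the slop.
The function name and its parameters are hypothetical, and the upper bound that a real stack also applies is omitted.

```c
/*
 * Simplified retransmit timeout in milliseconds (an assumption-laden
 * model).  The raw estimate srtt + 4 * rttvar is floored at rexmit_min,
 * and rexmit_slop is then added on top.
 */
static int
rexmit_timeout_ms(int srtt, int rttvar, int rexmit_min, int rexmit_slop)
{
	int raw = srtt + 4 * rttvar;

	if (raw < rexmit_min)
		raw = rexmit_min;
	return (raw + rexmit_slop);
}
```

With a minimum of 0 and 200 ms of slop, a connection with a negligible round-trip time still waits about 200 ms, which is the effective minimum the text refers to.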
544 .It Va initcwnd_segments
545 Enable the ability to specify initial congestion window in number of segments.
546 The default value is 10 as suggested by RFC 6928.
547 Changing the value on the fly does not affect connections using congestion window
550 This regulates the burst of packets allowed to be sent in the first RTT.
551 The value should be relative to the link capacity.
552 Start with small values for lower-capacity links.
553 Large bursts can cause buffer overruns and packet drops if routers have small
554 buffers or the link is experiencing congestion.
556 Calculate the bytes in flight using the algorithm described in RFC 6675.
557 This is also a prerequisite for enabling Proportional Rate Reduction.
559 Enable the Limited Transmit algorithm as described in RFC 3042.
560 It helps avoid timeouts on lossy links and also when the congestion window
561 is small, as happens on short transfers.
563 Enable support for RFC 3390, which allows for a variable-sized
564 starting congestion window on new connections, depending on the
565 maximum segment size.
566 This helps throughput in general, but
567 particularly affects short transfers and high-bandwidth large
568 propagation-delay connections.
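.Pp
The dependence on the maximum segment size can be illustrated with the upper bound given in RFC 3390 itself, min(4*MSS, max(2*MSS, 4380 bytes)).
The sketch below only evaluates that formula; the function name is hypothetical, and whether the kernel ends up using this value may depend on other settings.

```c
/*
 * Upper bound on the initial congestion window in bytes, as given in
 * RFC 3390: min(4 * MSS, max(2 * MSS, 4380)).
 */
static unsigned int
rfc3390_initial_window(unsigned int mss)
{
	unsigned int lower = (2 * mss > 4380) ? 2 * mss : 4380;

	return ((4 * mss < lower) ? 4 * mss : lower);
}
```

For a typical Ethernet MSS of 1460 bytes the formula gives 4380 bytes, that is, three full segments.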
570 Enable support for RFC 2018, TCP Selective Acknowledgment option,
571 which allows the receiver to inform the sender about all successfully
572 arrived segments, allowing the sender to retransmit the missing segments
575 Maximum number of SACK holes per connection.
577 .It Va sack.globalmaxholes
578 Maximum number of SACK holes per system, across all connections.
581 When a TCP connection enters the
583 state, its associated socket structure is freed, since it is of
584 negligible size and use, and a new structure is allocated to contain a
585 minimal amount of information necessary for sustaining a connection in
586 this state, called the compressed TCP TIME_WAIT state.
587 Since this structure is smaller than a socket structure, it can save
588 a significant amount of system memory.
590 .Va net.inet.tcp.maxtcptw
591 MIB variable controls the maximum number of these structures allocated.
592 By default, it is initialized to
593 .Va kern.ipc.maxsockets
595 .It Va nolocaltimewait
596 Suppress creation of compressed TCP TIME_WAIT states for connections in
597 which both endpoints are local.
598 .It Va fast_finwait2_recycle
602 connections faster when the socket is marked as
604 (no user process has the socket open, data received on
605 the socket cannot be read).
606 The timeout used here is
607 .Va finwait2_timeout .
608 .It Va finwait2_timeout
609 Timeout to use for fast recycling of
613 Defaults to 60 seconds.
615 Enable support for TCP Explicit Congestion Notification (ECN).
616 ECN allows a TCP sender to reduce the transmission rate in order to
623 Allow incoming connections to request ECN.
624 Outgoing connections will request ECN.
626 Allow incoming connections to request ECN.
627 Outgoing connections will not request ECN.
629 .It Va ecn.maxretries
630 Number of retries (SYN or SYN/ACK retransmits) before disabling ECN on a
632 This is needed to help with connection establishment
633 when a broken firewall is in the network path.
634 .It Va pmtud_blackhole_detection
635 Enable automatic path MTU blackhole detection.
636 When MSS-sized segments are retransmitted,
637 the OS will lower the MSS to check whether the MTU is the problem.
638 If the current MSS is greater than the configured value to try
639 .Po Va net.inet.tcp.pmtud_blackhole_mss
641 .Va net.inet.tcp.v6pmtud_blackhole_mss
643 it will be set to this value, otherwise,
644 the MSS will be set to the default values
645 .Po Va net.inet.tcp.mssdflt
647 .Va net.inet.tcp.v6mssdflt
652 Disable path MTU blackhole detection.
654 Enable path MTU blackhole detection for IPv4 and IPv6.
656 Enable path MTU blackhole detection only for IPv4.
658 Enable path MTU blackhole detection only for IPv6.
660 .It Va pmtud_blackhole_mss
661 MSS to try for IPv4 if PMTU blackhole detection is turned on.
662 .It Va v6pmtud_blackhole_mss
663 MSS to try for IPv6 if PMTU blackhole detection is turned on.
664 .It Va fastopen.acceptany
665 When non-zero, all client-supplied TFO cookies will be considered to be valid.
667 .It Va fastopen.autokey
669 .Va net.inet.tcp.fastopen.server_enable
670 are non-zero, a new key will be automatically generated after this specified
673 .It Va fastopen.ccache_bucket_limit
674 The maximum number of entries in a client cookie cache bucket.
675 The default value can be tuned with the
676 .Dv TCP_FASTOPEN_CCACHE_BUCKET_LIMIT_DEFAULT
677 kernel option or by setting
678 .Va net.inet.tcp.fastopen_ccache_bucket_limit
681 .It Va fastopen.ccache_buckets
682 The number of client cookie cache buckets.
684 The value can be tuned with the
685 .Dv TCP_FASTOPEN_CCACHE_BUCKETS_DEFAULT
686 kernel option or by setting
687 .Va fastopen.ccache_buckets
690 .It Va fastopen.ccache_list
691 Print the client cookie cache.
693 .It Va fastopen.client_enable
694 When zero, no new active (i.e., client) TFO connections can be created.
695 On the transition from enabled to disabled, the client cookie cache is cleared
697 The transition from enabled to disabled does not affect any active TFO
698 connections in progress; it only prevents new ones from being established.
700 .It Va fastopen.keylen
701 The key length in bytes.
703 .It Va fastopen.maxkeys
704 The maximum number of keys supported.
706 .It Va fastopen.maxpsks
707 The maximum number of pre-shared keys supported.
709 .It Va fastopen.numkeys
710 The current number of keys installed.
712 .It Va fastopen.numpsks
713 The current number of pre-shared keys installed.
715 .It Va fastopen.path_disable_time
716 When a failure occurs while trying to create a new active (i.e., client) TFO
717 connection, new active connections on the same path, as determined by the tuple
718 .Brq client_ip, server_ip, server_port ,
719 will be forced to be non-TFO for this many seconds.
720 Note that the path disable mechanism relies on state stored in client cookie
721 cache entries, so it is possible for the disable time for a given path to be
722 reduced if the corresponding client cookie cache entry is reused due to resource
723 pressure before the disable period has elapsed.
725 .Dv TCP_FASTOPEN_PATH_DISABLE_TIME_DEFAULT .
726 .It Va fastopen.psk_enable
727 When non-zero, pre-shared key (PSK) mode is enabled for all TFO servers.
728 On the transition from enabled to disabled, all installed pre-shared keys are
731 .It Va fastopen.server_enable
732 When zero, no new passive (i.e., server) TFO connections can be created.
733 On the transition from enabled to disabled, all installed keys and pre-shared
735 On the transition from disabled to enabled, if
737 is non-zero and there are no keys installed, a new key will be generated
739 The transition from enabled to disabled does not affect any passive TFO
740 connections in progress; it only prevents new ones from being established.
742 .It Va fastopen.setkey
743 Install a new key by writing
744 .Va net.inet.tcp.fastopen.keylen
745 bytes to this sysctl.
746 .It Va fastopen.setpsk
747 Install a new pre-shared key by writing
748 .Va net.inet.tcp.fastopen.keylen
749 bytes to this sysctl.
750 .It Va functions_available
751 List of available TCP function blocks (TCP stacks).
752 .It Va functions_default
753 The default TCP function block (TCP stack).
754 .It Va functions_inherit_listen_socket_stack
755 Determines whether to inherit the listening socket's TCP stack or use the current
756 system default TCP stack, as defined by
757 .Va functions_default
761 Use criteria defined in RFC 793 instead of RFC 5961 for accepting RST segments.
764 Use criteria defined in RFC 793 instead of RFC 5961 for accepting SYN segments.
766 .It Va ts_offset_per_conn
767 When initializing the TCP timestamps, use a per-connection offset instead of a
768 per-host-pair offset.
769 The default is to use per-connection offsets, as recommended in RFC 7323.
772 A socket operation may fail with one of the following errors returned:
775 when trying to establish a connection on a socket which
777 .It Bo Er ENOBUFS Bc or Bo Er ENOMEM Bc
778 when the system runs out of memory for
779 an internal data structure;
781 when a connection was dropped
782 due to excessive retransmissions;
785 forces the connection to be closed;
786 .It Bq Er ECONNREFUSED
788 peer actively refuses connection establishment (usually because
789 no process is listening to the port);
792 is made to create a socket with a port which has already been
794 .It Bq Er EADDRNOTAVAIL
795 when an attempt is made to create a
796 socket with a network address for which no network interface
798 .It Bq Er EAFNOSUPPORT
799 when an attempt is made to bind or connect a socket to a multicast
802 when trying to change TCP function blocks at an invalid point in the session;
804 when trying to use a TCP function block that is not available;
823 .%T "TCP Extensions for High Performance"
830 .%A "R. Scheffenegger"
831 .%T "TCP Extensions for High Performance"
836 .%T "Protection of BGP Sessions via the TCP MD5 Signature Option"
840 .%A "K. Ramakrishnan"
843 .%T "The Addition of Explicit Congestion Notification (ECN) to IP"
851 The RFC 1323 extensions for window scaling and timestamps were added
856 option was introduced in
859 .Em subject to change .