.\" Copyright (c) 1983, 1991, 1993
.\" The Regents of the University of California.
.\" Copyright (c) 2010-2011 The FreeBSD Foundation
.\" All rights reserved.
.\" Portions of this documentation were written at the Centre for Advanced
.\" Internet Architectures, Swinburne University of Technology, Melbourne,
.\" Australia by David Hayes under sponsorship from the FreeBSD Foundation.
.\" Redistribution and use in source and binary forms, with or without
.\" modification, are permitted provided that the following conditions
.\" 1. Redistributions of source code must retain the above copyright
.\" notice, this list of conditions and the following disclaimer.
.\" 2. Redistributions in binary form must reproduce the above copyright
.\" notice, this list of conditions and the following disclaimer in the
.\" documentation and/or other materials provided with the distribution.
.\" 3. Neither the name of the University nor the names of its contributors
.\" may be used to endorse or promote products derived from this software
.\" without specific prior written permission.
.\" THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND
.\" ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
.\" IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
.\" ARE DISCLAIMED. IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE
.\" FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
.\" DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
.\" OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
.\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
.\" LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
.\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
.\" From: @(#)tcp.4 8.1 (Berkeley) 6/5/93
.Nd Internet Transmission Control Protocol
.Fn socket AF_INET SOCK_STREAM 0
protocol provides reliable, flow-controlled, two-way
It is a byte-stream protocol used to
Internet address format and, in addition, provides a per-host
.Dq "port addresses" .
Thus, each address is composed
of an Internet address specifying the host and network,
port on the host identifying the peer entity.
Active sockets initiate connections to passive
sockets are created active; to create a
system call must be used
after binding the socket with the
Only passive sockets may use the
call to accept incoming connections.
Only active sockets may use the
call to initiate connections.
their location to match
incoming connection requests from multiple networks.
This technique, termed
.Dq "wildcard addressing" ,
server to provide service to clients on multiple networks.
To create a socket which listens on all networks, the Internet
port may still be specified
at this time; if the port is not specified, the system will assign one.
Once a connection has been established, the socket's address is
fixed by the peer entity's location.
The address assigned to the
socket is the address associated with the network interface
through which packets are being transmitted and received.
Normally, this address corresponds to the peer entity's network.
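.Pp
The passive, wildcard-addressed socket described above can be sketched as follows.
This is a minimal illustration, not code taken from the system sources.
The helper names
.Fn open_wildcard_listener
and
.Fn bound_port
are hypothetical, and the listen backlog of 5 is an arbitrary assumption.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Create a passive TCP socket bound to the wildcard address.
 * Passing port 0 asks the system to assign a port.
 * Returns the listening descriptor, or -1 on error.
 * (open_wildcard_listener is a hypothetical helper name.)
 */
int
open_wildcard_listener(unsigned short port)
{
	struct sockaddr_in sin;
	int s;

	s = socket(AF_INET, SOCK_STREAM, 0);
	if (s == -1)
		return (-1);

	memset(&sin, 0, sizeof(sin));
	sin.sin_family = AF_INET;
	sin.sin_addr.s_addr = htonl(INADDR_ANY);	/* wildcard address */
	sin.sin_port = htons(port);			/* 0: system chooses */

	/* Bind first, then mark the socket passive with listen(2). */
	if (bind(s, (struct sockaddr *)&sin, sizeof(sin)) == -1 ||
	    listen(s, 5) == -1) {
		close(s);
		return (-1);
	}
	return (s);
}

/*
 * Return the local port of a bound socket in host byte order,
 * or -1 on error.
 * (bound_port is a hypothetical helper name.)
 */
int
bound_port(int s)
{
	struct sockaddr_in sin;
	socklen_t len = sizeof(sin);

	if (getsockname(s, (struct sockaddr *)&sin, &len) == -1)
		return (-1);
	return (ntohs(sin.sin_port));
}
```

A caller would typically pass the returned descriptor to
.Xr accept 2
in a loop and close it when the service shuts down.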
supports a number of socket options which can be set with
.Bl -tag -width ".Dv TCP_FUNCTION_BLK"
Information about a socket's underlying TCP session may be retrieved
by passing the read-only option
It accepts a single argument: a pointer to an instance of
.Vt "struct tcp_info" .
This API is subject to change; consult the source to determine
which fields are currently filled out by this option.
specific additions include
bandwidth-controlled window space.
Set or query congestion control algorithm specific parameters.
.It Dv TCP_CONGESTION
Select or query the congestion control algorithm that TCP will use for the
.It Dv TCP_FUNCTION_BLK
Select or query the set of functions that TCP will use for this connection.
This allows a user to select an alternate TCP stack.
The alternate TCP stack must already be loaded in the kernel.
To list the available TCP stacks, see
.Va functions_available
section further down.
To list the default TCP stack, see
.Va functions_default
option accepts a per-socket timeout argument of
in seconds, for new, non-established
For the global default in milliseconds see
section further down.
option accepts an argument of
for the amount of time, in seconds, that the connection must be idle
before keepalive probes (if enabled) are sent for the connection of this
If set on a listening socket, the value is inherited by the newly created
For the global default in milliseconds see
section further down.
option accepts an argument of
to set the per-socket interval, in seconds, between keepalive probes sent
If set on a listening socket, the value is inherited by the newly created
For the global default in milliseconds see
section further down.
option accepts an argument of
and allows a per-socket tuning of the number of probes sent, with no response,
before the connection will be dropped.
If set on a listening socket, the value is inherited by the newly created
For the global default see the
section further down.
Under most circumstances,
sends data when it is presented;
when outstanding data has not yet been acknowledged, it gathers
small amounts of output to be sent in a single packet once
an acknowledgement is received.
For a small number of clients, such as window systems
that send a stream of mouse events which receive no replies,
this packetization may cause significant delays.
defeats this algorithm.
By default, a sender- and
.No receiver- Ns Tn TCP
will negotiate among themselves to determine the maximum segment size
to be used for each connection.
option allows the user to determine the result of this negotiation,
and to reduce it if desired.
usually sends a number of options in each packet, corresponding to
extensions which are provided in this implementation.
is provided to disable
option use on a per-connection basis.
.No sender- Ns Tn TCP
bit, and begin transmission immediately (if permitted) at the end of
When this option is set to a non-zero value,
will delay sending any data at all until either the socket is closed,
or the internal send buffer is filled.
This option enables the use of MD5 digests (also known as TCP-MD5)
on writes to the specified socket.
Outgoing traffic is digested;
digests on incoming traffic are verified.
When this option is enabled on a socket, all inbound and outgoing
TCP segments must be signed with MD5 digests.
One common use for this in a
router deployment is to enable
based routers to interwork with Cisco equipment at peering points.
Support for this feature conforms to RFC 2385.
In order for this option to function correctly, it is necessary for the
administrator to add a tcp-md5 key entry to the system's security
associations database (SADB) using the
This entry can only be specified on a per-host basis at this time.
If an SADB entry cannot be found for the destination,
the system does not send any outgoing segments and drops any inbound segments.
Manage collection of connection level statistics using the
Each dropped segment is taken into account in the TCP protocol statistics.
.It Dv TCP_TXTLS_ENABLE
Enable in-kernel Transport Layer Security (TLS) for data written to this
.Vt struct tls_so_enable
argument defines the encryption and authentication algorithms and keys
used to encrypt the socket data as well as the maximum TLS record
All data written to this socket will be encapsulated in TLS records
and subsequently encrypted.
By default all data written to this socket is treated as application data.
Individual TLS records with a type other than application data
(for example, handshake messages),
may be transmitted by invoking
with a custom TLS record type set in a
.Dv TLS_SET_RECORD_TYPE
The payload of this control message is a single byte holding the desired
At present, only a single transmit key may be set on a socket.
As such, users of this option must disable rekeying.
.It Dv TCP_TXTLS_MODE
The integer argument can be used to get or set the current TLS transmit mode
Setting the mode can only be used to toggle between software and NIC TLS after
TLS has been initially enabled via the
The available modes are:
.Bl -tag -width "Dv TCP_TLS_MODE_IFNET"
.It Dv TCP_TLS_MODE_NONE
In-kernel TLS framing and encryption is not enabled for this socket.
.It Dv TCP_TLS_MODE_SW
TLS records are encrypted by the kernel prior to placing the data in the
Typically this encryption is performed in software.
.It Dv TCP_TLS_MODE_IFNET
TLS records are encrypted by the network interface card (NIC).
.It Dv TCP_TLS_MODE_TOE
TLS records are encrypted by the NIC using a TCP offload engine (TOE).
.It Dv TCP_RXTLS_ENABLE
Enable in-kernel TLS for data read from this socket.
.Vt struct tls_so_enable
argument defines the encryption and authentication algorithms and keys
used to decrypt the socket data.
Each received TLS record must be read from the socket using
Each received TLS record will contain a
control message along with the decrypted payload.
The control message contains a
.Vt struct tls_get_record
which includes fields from the TLS record header.
If an invalid or corrupted TLS record is received,
will fail with one of the following errors:
The version fields in a TLS record's header did not match the version required
.Vt struct tls_so_enable
structure used to enable in-kernel TLS.
A TLS record's length was either too small or too large.
The connection was closed after sending a truncated TLS record.
The TLS record failed to match the included authentication tag.
At present, only a single receive key may be set on a socket.
As such, users of this option must disable rekeying.
.It Dv TCP_RXTLS_MODE
The integer argument can be used to get the current TLS receive mode
The available modes are the same as for
The option level for the
call is the protocol number for
.Xr getprotobyname 3 ,
All options are declared in
transport level may be used with
Incoming connection requests that are source-routed are noted,
and the reverse source route is used in responding.
The default congestion control algorithm for
Other congestion control algorithms can be made available using the
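.Pp
As an illustration of the option level and the per-socket options described above, the following sketch disables the small-packet gathering algorithm and sets the keepalive timing for one socket.
It is only a sketch under stated assumptions: the helper names
.Fn tune_tcp_socket
and
.Fn get_tcp_int_option
are hypothetical, the keepalive arguments are assumed to be integer-sized values in seconds, and
.Dv SO_KEEPALIVE
is enabled because probes are sent only when keepalives are on.

```c
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>
#include <sys/types.h>

/*
 * Set TCP_NODELAY and per-socket keepalive timing on socket s.
 * idle_secs and intvl_secs are in seconds; probes is a count.
 * Returns 0 on success, -1 on the first failing setsockopt(2).
 * (tune_tcp_socket and its parameter names are hypothetical.)
 */
int
tune_tcp_socket(int s, unsigned int idle_secs, unsigned int intvl_secs,
    unsigned int probes)
{
	int on = 1;

	/* TCP-level options use IPPROTO_TCP as the option level. */
	if (setsockopt(s, IPPROTO_TCP, TCP_NODELAY, &on, sizeof(on)) == -1)
		return (-1);

	/* Keepalive probes are only sent if SO_KEEPALIVE is enabled. */
	if (setsockopt(s, SOL_SOCKET, SO_KEEPALIVE, &on, sizeof(on)) == -1)
		return (-1);

	if (setsockopt(s, IPPROTO_TCP, TCP_KEEPIDLE, &idle_secs,
	    sizeof(idle_secs)) == -1)
		return (-1);
	if (setsockopt(s, IPPROTO_TCP, TCP_KEEPINTVL, &intvl_secs,
	    sizeof(intvl_secs)) == -1)
		return (-1);
	if (setsockopt(s, IPPROTO_TCP, TCP_KEEPCNT, &probes,
	    sizeof(probes)) == -1)
		return (-1);
	return (0);
}

/*
 * Read back an integer-valued TCP option.
 * Returns 0 on success, -1 on error.
 * (get_tcp_int_option is a hypothetical helper name.)
 */
int
get_tcp_int_option(int s, int name, int *value)
{
	socklen_t len = sizeof(*value);

	return (getsockopt(s, IPPROTO_TCP, name, value, &len));
}
```

Setting these values on a listening socket before calling
.Xr accept 2
lets accepted connections inherit the keepalive settings, as noted in the option descriptions.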
protocol implements a number of variables in the
.Bl -tag -width ".Va TCPCTL_DO_RFC1323"
.It Dv TCPCTL_DO_RFC1323
Implement the window scaling and timestamp options of RFC 1323
.It Dv TCPCTL_MSSDFLT
The default value used for the maximum segment size
when no advice to the contrary is received from MSS negotiation.
.It Dv TCPCTL_SENDSPACE
.It Dv TCPCTL_RECVSPACE
Log any connection attempts to ports where there is not a socket
accepting connections.
The value of 1 limits the logging to
(connection establishment) packets only.
That of 2 results in any
packets to closed ports being logged.
Any value unlisted above disables the logging
(default is 0, i.e., the logging is disabled).
The Maximum Segment Lifetime, in milliseconds, for a packet.
Timeout, in milliseconds, for new, non-established
The default is 75000 msec.
Amount of time, in milliseconds, that the connection must be idle
before keepalive probes (if enabled) are sent.
The default is 7200000 msec (2 hours).
The interval, in milliseconds, between keepalive probes sent to remote
machines, when no response is received on a
The default is 75000 msec.
Number of probes sent, with no response, before a connection
The default is 8 packets.
.It Va always_keepalive
connections, the kernel will
periodically send a packet to the remote host to verify the connection
unreachable messages may abort connections in
reassembly queue if the system is low on mbufs.
If enabled, disable sending of RST when a connection is attempted
to a port where there is not a socket accepting connections.
Delay ACK to try and piggyback it onto a data packet.
Maximum amount of time, in milliseconds, before a delayed ACK is sent.
.It Va path_mtu_discovery
Enable Path MTU Discovery.
control-block hash table
This may be tuned using the kernel option
.Va net.inet.tcp.tcbhashsize
Number of active process control blocks
Determines whether or not
cookies should be generated for outbound
cookies are a great help during
flood attacks, and are enabled by default.
.It Va isn_reseed_interval
The interval (in seconds) specifying how often the secret data used in
RFC 1948 initial sequence number calculations should be reseeded.
By default, this variable is set to zero, indicating that
no reseeding will occur.
Reseeding should not be necessary, and will break
recycling for a few minutes.
.It Va reass.cursegments
The current total number of segments present in all reassembly queues.
.It Va reass.maxsegments
The maximum limit on the total number of segments across all reassembly
The limit can be adjusted as a tunable.
.It Va reass.maxqueuelen
The maximum number of segments allowed in each reassembly queue.
By default, the system chooses a limit based on each TCP connection's
receive buffer size and maximum segment size (MSS).
The actual limit applied to a session's reassembly queue will be the lower of
the system-calculated automatic limit and the user-specified
.Va reass.maxqueuelen
.It Va rexmit_initial , rexmit_min , rexmit_slop
Adjust the retransmit timer calculation for
typically added to the raw calculation to take into account
occasional variances that the
(smoothed round-trip time)
is unable to accommodate, while the minimum specifies an
second minimum, these RFCs tend to focus on streaming behavior,
and fail to deal with the fact that a 1 second minimum has severe
detrimental effects over lossy interactive connections, such
as an 802.11b wireless link, and over very fast but lossy
connections for those cases not covered by the fast retransmit
For this reason, we use 200ms of slop and a near-0
minimum, which gives us an effective minimum of 200ms (similar to
The initial value is used before an RTT measurement has been performed.
.It Va initcwnd_segments
Enable the ability to specify the initial congestion window in number of segments.
The default value is 10 as suggested by RFC 6928.
Changing the value on the fly would not affect connections using congestion window
This regulates the burst of packets allowed to be sent in the first RTT.
The value should be relative to the link capacity.
Start with small values for lower-capacity links.
Large bursts can cause buffer overruns and packet drops if routers have small
buffers or the link is experiencing congestion.
Enable the New Congestion Window Validation mechanism as described in RFC 7661.
This gently reduces the congestion window during periods where TCP is
application limited and the network bandwidth is not utilized completely.
That prevents self-inflicted packet losses once the application starts to
transmit data at a higher speed.
Calculate the bytes in flight using the algorithm described in RFC 6675.
This is also a prerequisite for enabling Proportional Rate Reduction.
Enable the Limited Transmit algorithm as described in RFC 3042.
It helps avoid timeouts on lossy links and also when the congestion window
is small, as happens on short transfers.
Enable support for RFC 3390, which allows for a variable-sized
starting congestion window on new connections, depending on the
maximum segment size.
This helps throughput in general, but
particularly affects short transfers and high-bandwidth large
propagation-delay connections.
Enable support for RFC 2018, TCP Selective Acknowledgment option,
which allows the receiver to inform the sender about all successfully
arrived segments, allowing the sender to retransmit the missing segments
Maximum number of SACK holes per connection.
.It Va sack.globalmaxholes
Maximum number of SACK holes per system, across all connections.
When a TCP connection enters the
state, its associated socket structure is freed, since it is of
negligible size and use, and a new structure is allocated to contain a
minimal amount of information necessary for sustaining a connection in
this state, called the compressed TCP TIME_WAIT state.
Since this structure is smaller than a socket structure, it can save
a significant amount of system memory.
.Va net.inet.tcp.maxtcptw
MIB variable controls the maximum number of these structures allocated.
By default, it is initialized to
.Va kern.ipc.maxsockets
.It Va nolocaltimewait
Suppress creation of compressed TCP TIME_WAIT states for connections in
which both endpoints are local.
.It Va fast_finwait2_recycle
connections faster when the socket is marked as
(no user process has the socket open, data received on
the socket cannot be read).
The timeout used here is
.Va finwait2_timeout .
.It Va finwait2_timeout
Timeout to use for fast recycling of
Defaults to 60 seconds.
Enable support for TCP Explicit Congestion Notification (ECN).
ECN allows a TCP sender to reduce the transmission rate in order to
Allow incoming connections to request ECN.
Outgoing connections will request ECN.
Allow incoming connections to request ECN.
Outgoing connections will not request ECN.
.It Va ecn.maxretries
Number of retries (SYN or SYN/ACK retransmits) before disabling ECN on a
This is needed to help with connection establishment
when a broken firewall is in the network path.
.It Va pmtud_blackhole_detection
Enable automatic path MTU blackhole detection.
In case of retransmits of MSS sized segments,
the OS will lower the MSS to check if it's an MTU problem.
If the current MSS is greater than the configured value to try
.Po Va net.inet.tcp.pmtud_blackhole_mss
.Va net.inet.tcp.v6pmtud_blackhole_mss
it will be set to this value, otherwise,
the MSS will be set to the default values
.Po Va net.inet.tcp.mssdflt
.Va net.inet.tcp.v6mssdflt
Disable path MTU blackhole detection.
Enable path MTU blackhole detection for IPv4 and IPv6.
Enable path MTU blackhole detection only for IPv4.
Enable path MTU blackhole detection only for IPv6.
.It Va pmtud_blackhole_mss
MSS to try for IPv4 if PMTU blackhole detection is turned on.
.It Va v6pmtud_blackhole_mss
MSS to try for IPv6 if PMTU blackhole detection is turned on.
.It Va functions_available
List of available TCP function blocks (TCP stacks).
.It Va functions_default
The default TCP function block (TCP stack).
.It Va functions_inherit_listen_socket_stack
Determines whether to inherit the listening socket's TCP stack or use the current
system default TCP stack, as defined by
.Va functions_default .
Use criteria defined in RFC 793 instead of RFC 5961 for accepting RST segments.
Use criteria defined in RFC 793 instead of RFC 5961 for accepting SYN segments.
.It Va ts_offset_per_conn
When initializing the TCP timestamps, use a per connection offset instead of a
per host pair offset.
Default is to use per connection offsets as recommended in RFC 7323.
.It Va perconn_stats_enable
Controls the default collection of statistics for all connections using the
0 disables, 1 enables, 2 enables random sampling across log id connection
groups with all connections in a group receiving the same setting.
.It Va perconn_stats_sample_rates
A CSV list of template_spec=percent key-value pairs which control the per
template sampling rates when
A socket operation may fail with one of the following errors returned:
when trying to establish a connection on a socket which
.It Bo Er ENOBUFS Bc or Bo Er ENOMEM Bc
when the system runs out of memory for
an internal data structure;
when a connection was dropped
due to excessive retransmissions;
forces the connection to be closed;
.It Bq Er ECONNREFUSED
peer actively refuses connection establishment (usually because
no process is listening to the port);
is made to create a socket with a port which has already been
.It Bq Er EADDRNOTAVAIL
when an attempt is made to create a
socket with a network address for which no network interface
.It Bq Er EAFNOSUPPORT
when an attempt is made to bind or connect a socket to a multicast
when trying to change TCP function blocks at an invalid point in the session;
when trying to use a TCP function block that is not available;
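.Pp
The errors listed above are reported through
.Va errno
by the failing system call.
The following sketch shows one way a caller might capture the error from a connection attempt, for example to detect
.Er ECONNREFUSED
when no process is listening on the port.
The helper name
.Fn try_connect_loopback
is hypothetical, and the loopback address is used only to keep the example self-contained.

```c
#include <arpa/inet.h>
#include <errno.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Attempt an active TCP connection to 127.0.0.1 on the given port.
 * Returns 0 if the connection was established, otherwise the errno
 * value set by the failing socket(2) or connect(2) call.
 * (try_connect_loopback is a hypothetical helper name.)
 */
int
try_connect_loopback(unsigned short port)
{
	struct sockaddr_in sin;
	int error, s;

	s = socket(AF_INET, SOCK_STREAM, 0);
	if (s == -1)
		return (errno);

	memset(&sin, 0, sizeof(sin));
	sin.sin_family = AF_INET;
	sin.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
	sin.sin_port = htons(port);

	/* Save errno before close(2), which may overwrite it. */
	error = 0;
	if (connect(s, (struct sockaddr *)&sin, sizeof(sin)) == -1)
		error = errno;
	close(s);
	return (error);
}
```

A caller can compare the return value against the error names above, or pass it to
.Xr strerror 3
for a readable message.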
.%T "TCP Extensions for High Performance"
.%T "Protection of BGP Sessions via the TCP MD5 Signature Option"
.%A "K. Ramakrishnan"
.%T "The Addition of Explicit Congestion Notification (ECN) to IP"
The RFC 1323 extensions for window scaling and timestamps were added
option was introduced in
.Em subject to change .