.\" Copyright (c) 1983, 1991, 1993
.\" The Regents of the University of California.
.\" Copyright (c) 2010-2011 The FreeBSD Foundation
.\" All rights reserved.
.\" Portions of this documentation were written at the Centre for Advanced
.\" Internet Architectures, Swinburne University of Technology, Melbourne,
.\" Australia by David Hayes under sponsorship from the FreeBSD Foundation.
.\" Redistribution and use in source and binary forms, with or without
.\" modification, are permitted provided that the following conditions
.\" 1. Redistributions of source code must retain the above copyright
.\" notice, this list of conditions and the following disclaimer.
.\" 2. Redistributions in binary form must reproduce the above copyright
.\" notice, this list of conditions and the following disclaimer in the
.\" documentation and/or other materials provided with the distribution.
.\" 3. All advertising materials mentioning features or use of this software
.\" must display the following acknowledgement:
.\" This product includes software developed by the University of
.\" California, Berkeley and its contributors.
.\" 4. Neither the name of the University nor the names of its contributors
.\" may be used to endorse or promote products derived from this software
.\" without specific prior written permission.
.\" THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND
.\" ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
.\" IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
.\" ARE DISCLAIMED. IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE
.\" FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
.\" DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
.\" OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
.\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
.\" LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
.\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
.\" From: @(#)tcp.4 8.1 (Berkeley) 6/5/93
.Nd Internet Transmission Control Protocol
.Fn socket AF_INET SOCK_STREAM 0
protocol provides reliable, flow-controlled, two-way
It is a byte-stream protocol used to
Internet address format and, in addition, provides a per-host
.Dq "port addresses" .
Thus, each address is composed
of an Internet address specifying the host and network,
port on the host identifying the peer entity.
Active sockets initiate connections to passive
sockets are created active; to create a
system call must be used
after binding the socket with the
Only passive sockets may use the
call to accept incoming connections.
Only active sockets may use the
call to initiate connections.
their location to match
incoming connection requests from multiple networks.
This technique, termed
.Dq "wildcard addressing" ,
server to provide service to clients on multiple networks.
To create a socket which listens on all networks, the Internet
port may still be specified
at this time; if the port is not specified, the system will assign one.
Once a connection has been established, the socket's address is
fixed by the peer entity's location.
The address assigned to the
socket is the address associated with the network interface
through which packets are being transmitted and received.
Normally, this address corresponds to the peer entity's network.
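.Pp
The wildcard binding described above can be sketched in C as follows.
This is a minimal illustration under stated assumptions, not a definitive
implementation:
.Fn listen_wildcard
is a hypothetical helper name, and the backlog of 5 is an arbitrary choice.
Passing port 0 leaves the port unspecified, so the system assigns one, and
.Xr getsockname 2
is used to learn which port was chosen.
.Bd -literal -offset indent
```c
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper (not a system API): create a passive TCP socket
 * bound to the wildcard address INADDR_ANY.  A port of 0 asks the system
 * to assign one.  Returns the listening descriptor, or -1 on error.  On
 * success, *assigned_port holds the bound port in host byte order.
 */
int
listen_wildcard(unsigned short port, unsigned short *assigned_port)
{
	struct sockaddr_in sin;
	socklen_t len = sizeof(sin);
	int s;

	s = socket(AF_INET, SOCK_STREAM, 0);
	if (s == -1)
		return (-1);

	memset(&sin, 0, sizeof(sin));
	sin.sin_family = AF_INET;
	sin.sin_addr.s_addr = htonl(INADDR_ANY);	/* all local networks */
	sin.sin_port = htons(port);			/* 0: system assigns */

	/* Bind first, then make the socket passive with listen(). */
	if (bind(s, (struct sockaddr *)&sin, sizeof(sin)) == -1 ||
	    listen(s, 5) == -1 ||
	    getsockname(s, (struct sockaddr *)&sin, &len) == -1) {
		close(s);
		return (-1);
	}
	*assigned_port = ntohs(sin.sin_port);
	return (s);
}
```
.Ed
.Pp
A caller would typically pass the returned descriptor to
.Xr accept 2
and close it when the service shuts down.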
supports a number of socket options which can be set with
.Bl -tag -width ".Dv TCP_CONGESTION"
Information about a socket's underlying TCP session may be retrieved
by passing the read-only option
It accepts a single argument: a pointer to an instance of
.Vt "struct tcp_info" .
This API is subject to change; consult the source to determine
which fields are currently filled out by this option.
specific additions include
bandwidth-controlled window space.
.It Dv TCP_CONGESTION
Select or query the congestion control algorithm that TCP will use for the
option accepts a per-socket timeout argument of
in seconds, for new, non-established
For the global default in milliseconds see
section further down.
option accepts an argument of
for the amount of time, in seconds, that the connection must be idle
before keepalive probes (if enabled) are sent for the connection of this
If set on a listening socket, the value is inherited by the newly created
For the global default in milliseconds see
section further down.
option accepts an argument of
to set the per-socket interval, in seconds, between keepalive probes sent
If set on a listening socket, the value is inherited by the newly created
For the global default in milliseconds see
section further down.
option accepts an argument of
and allows a per-socket tuning of the number of probes sent, with no response,
before the connection will be dropped.
If set on a listening socket, the value is inherited by the newly created
For the global default see the
section further down.
Under most circumstances,
sends data when it is presented;
when outstanding data has not yet been acknowledged, it gathers
small amounts of output to be sent in a single packet once
an acknowledgement is received.
For a small number of clients, such as window systems
that send a stream of mouse events which receive no replies,
this packetization may cause significant delays.
defeats this algorithm.
By default, a sender- and
.No receiver- Ns Tn TCP
will negotiate among themselves to determine the maximum segment size
to be used for each connection.
option allows the user to determine the result of this negotiation,
and to reduce it if desired.
usually sends a number of options in each packet, corresponding to
extensions which are provided in this implementation.
is provided to disable
option use on a per-connection basis.
.No sender- Ns Tn TCP
bit, and begin transmission immediately (if permitted) at the end of
When this option is set to a non-zero value,
will delay sending any data at all until either the socket is closed,
or the internal send buffer is filled.
This option enables the use of MD5 digests (also known as TCP-MD5)
on writes to the specified socket.
Outgoing traffic is digested;
digests on incoming traffic are verified if the
.Va net.inet.tcp.signature_verify_input
The current default behavior for the system is to respond to a system
advertising this option with TCP-MD5; this may change.
One common use for this in a
router deployment is to enable
based routers to interwork with Cisco equipment at peering points.
Support for this feature conforms to RFC 2385.
sessions are supported.
In order for this option to function correctly, it is necessary for the
administrator to add a tcp-md5 key entry to the system's security
associations database (SADB) using the
This entry must have an SPI of 0x1000 and can therefore only be specified
on a per-host basis at this time.
If an SADB entry cannot be found for the destination, the outgoing traffic
will have an invalid digest option prepended, and the following error message
will be visible on the system console:
.Em "tcp_signature_compute: SADB lookup failed for %d.%d.%d.%d" .
The option level for the
call is the protocol number for
.Xr getprotobyname 3 ,
All options are declared in
transport level may be used with
Incoming connection requests that are source-routed are noted,
and the reverse source route is used in responding.
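.Pp
As an illustration of the socket options and option level described above,
the following C sketch disables the small-packet gathering algorithm with
.Dv TCP_NODELAY
and tunes the per-socket keepalive values.
The helper names are hypothetical and not part of any system API.
The sketch assumes that
.Dv TCP_KEEPIDLE ,
.Dv TCP_KEEPINTVL
and
.Dv TCP_KEEPCNT
are declared by the system's TCP header, and that keepalive probes must first
be enabled with the socket-level
.Dv SO_KEEPALIVE
option.
.Bd -literal -offset indent
```c
#include <netdb.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <stddef.h>
#include <sys/socket.h>
#include <sys/types.h>

/* Hypothetical helper: option level to use for TCP-level options. */
int
tcp_option_level(void)
{
	struct protoent *pe = getprotobyname("tcp");

	/* Fall back to the constant if the protocols database is missing. */
	return (pe != NULL ? pe->p_proto : IPPROTO_TCP);
}

/* Turn TCP_NODELAY on (non-zero) or off (zero); 0 on success, -1 on error. */
int
set_nodelay(int s, int on)
{
	return (setsockopt(s, tcp_option_level(), TCP_NODELAY,
	    &on, sizeof(on)));
}

/* Query TCP_NODELAY; returns 1 if set, 0 if clear, -1 on error. */
int
get_nodelay(int s)
{
	int val = 0;
	socklen_t len = sizeof(val);

	if (getsockopt(s, tcp_option_level(), TCP_NODELAY, &val, &len) == -1)
		return (-1);
	return (val != 0);
}

/*
 * Enable keepalives and set the idle time and probe interval (both in
 * seconds) and the probe count for this socket only.
 */
int
set_keepalive(int s, int idle, int intvl, int cnt)
{
	int on = 1;
	int level = tcp_option_level();

	if (setsockopt(s, SOL_SOCKET, SO_KEEPALIVE, &on, sizeof(on)) == -1 ||
	    setsockopt(s, level, TCP_KEEPIDLE, &idle, sizeof(idle)) == -1 ||
	    setsockopt(s, level, TCP_KEEPINTVL, &intvl, sizeof(intvl)) == -1 ||
	    setsockopt(s, level, TCP_KEEPCNT, &cnt, sizeof(cnt)) == -1)
		return (-1);
	return (0);
}
```
.Ed
.Pp
The query helper normalizes the result because the value returned for a set
boolean option is only guaranteed to be non-zero.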
The default congestion control algorithm for
Other congestion control algorithms can be made available using the
protocol implements a number of variables in the
.Bl -tag -width ".Va TCPCTL_DO_RFC1323"
.It Dv TCPCTL_DO_RFC1323
Implement the window scaling and timestamp options of RFC 1323
.It Dv TCPCTL_MSSDFLT
The default value used for the maximum segment size
when no advice to the contrary is received from MSS negotiation.
.It Dv TCPCTL_SENDSPACE
.It Dv TCPCTL_RECVSPACE
Log any connection attempts to ports where there is not a socket
accepting connections.
The value of 1 limits the logging to
(connection establishment) packets only.
That of 2 results in any
packets to closed ports being logged.
Any value unlisted above disables the logging
(default is 0, i.e., the logging is disabled).
The Maximum Segment Lifetime, in milliseconds, for a packet.
Timeout, in milliseconds, for new, non-established
The default is 75000 msec.
Amount of time, in milliseconds, that the connection must be idle
before keepalive probes (if enabled) are sent.
The default is 7200000 msec (2 hours).
The interval, in milliseconds, between keepalive probes sent to remote
machines, when no response is received on a
The default is 75000 msec.
Number of probes sent, with no response, before a connection
The default is 8 packets.
.It Va always_keepalive
connections, the kernel will
periodically send a packet to the remote host to verify the connection
unreachable messages may abort connections in
reassembly queue if the system is low on mbufs.
If enabled, disable sending of RST when a connection is attempted
to a port where there is not a socket accepting connections.
Delay ACK to try to piggyback it onto a data packet.
Maximum amount of time, in milliseconds, before a delayed ACK is sent.
.It Va path_mtu_discovery
Enable Path MTU Discovery.
control-block hash table
This may be tuned using the kernel option
.Va net.inet.tcp.tcbhashsize
Number of active process control blocks
Determines whether or not
cookies should be generated for outbound
cookies are a great help during
flood attacks, and are enabled by default.
.It Va isn_reseed_interval
The interval (in seconds) specifying how often the secret data used in
RFC 1948 initial sequence number calculations should be reseeded.
By default, this variable is set to zero, indicating that
no reseeding will occur.
Reseeding should not be necessary, and will break
recycling for a few minutes.
.It Va rexmit_min , rexmit_slop
Adjust the retransmit timer calculation for
typically added to the raw calculation to take into account
occasional variances that the
(smoothed round-trip time)
is unable to accommodate, while the minimum specifies an
second minimum, these RFCs tend to focus on streaming behavior,
and fail to deal with the fact that a 1 second minimum has severe
detrimental effects over lossy interactive connections, such
as an 802.11b wireless link, and over very fast but lossy
connections for those cases not covered by the fast retransmit
For this reason, we use 200ms of slop and a near-0
minimum, which gives us an effective minimum of 200ms (similar to
Enable the Limited Transmit algorithm as described in RFC 3042.
It helps avoid timeouts on lossy links and also when the congestion window
is small, as happens on short transfers.
Enable support for RFC 3390, which allows for a variable-sized
starting congestion window on new connections, depending on the
maximum segment size.
This helps throughput in general, but
particularly affects short transfers and high-bandwidth large
propagation-delay connections.
Enable support for RFC 2018, TCP Selective Acknowledgment option,
which allows the receiver to inform the sender about all successfully
arrived segments, allowing the sender to retransmit the missing segments
Maximum number of SACK holes per connection.
.It Va sack.globalmaxholes
Maximum number of SACK holes per system, across all connections.
When a TCP connection enters the
state, its associated socket structure is freed, since it is of
negligible size and use, and a new structure is allocated to contain a
minimal amount of information necessary for sustaining a connection in
this state, called the compressed TCP TIME_WAIT state.
Since this structure is smaller than a socket structure, it can save
a significant amount of system memory.
.Va net.inet.tcp.maxtcptw
MIB variable controls the maximum number of these structures allocated.
By default, it is initialized to
.Va kern.ipc.maxsockets
.It Va nolocaltimewait
Suppress creation of compressed TCP TIME_WAIT states for connections in
which both endpoints are local.
.It Va fast_finwait2_recycle
connections faster when the socket is marked as
(no user process has the socket open, data received on
the socket cannot be read).
The timeout used here is
.Va finwait2_timeout .
.It Va finwait2_timeout
Timeout to use for fast recycling of
Defaults to 60 seconds.
Enable support for TCP Explicit Congestion Notification (ECN).
ECN allows a TCP sender to reduce the transmission rate in order to
.It Va ecn.maxretries
Number of retries (SYN or SYN/ACK retransmits) before disabling ECN on a
specific connection.
This is needed to help with connection establishment
when a broken firewall is in the network path.
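.Pp
As a worked example for the keepalive variables described earlier in this
list, the following C sketch estimates how long an idle connection to an
unresponsive peer survives before it is dropped.
It is a rough estimate, not a statement of the exact timer behavior: it
assumes that no probe is ever answered and that the drop happens once the
idle time plus one probe interval per unanswered probe has elapsed.
The function name is hypothetical.
.Bd -literal -offset indent
```c
#include <stdint.h>

/*
 * Hypothetical helper: estimated time, in milliseconds, from the last
 * received segment until an unresponsive connection is dropped.
 * Assumes every keepalive probe goes unanswered.
 */
uint64_t
keepalive_drop_ms(uint64_t keepidle_ms, uint64_t keepintvl_ms,
    unsigned int keepcnt)
{
	/* Idle period first, then one interval per unanswered probe. */
	return (keepidle_ms + keepintvl_ms * keepcnt);
}
```
.Ed
.Pp
With the defaults listed above (7200000 msec idle, 75000 msec interval,
8 probes) this gives 7200000 + 8 * 75000 = 7800000 msec, or 2 hours and
10 minutes.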
A socket operation may fail with one of the following errors returned:
when trying to establish a connection on a socket which
when the system runs out of memory for
an internal data structure;
when a connection was dropped
due to excessive retransmissions;
forces the connection to be closed;
.It Bq Er ECONNREFUSED
peer actively refuses connection establishment (usually because
no process is listening to the port);
is made to create a socket with a port which has already been
.It Bq Er EADDRNOTAVAIL
when an attempt is made to create a
socket with a network address for which no network interface
.It Bq Er EAFNOSUPPORT
when an attempt is made to bind or connect a socket to a multicast address.
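.Pp
Two of the errors listed above can be reproduced on the loopback interface,
as in the following C sketch.
The helper names are hypothetical.
The sketch assumes that a port which is bound but not listening refuses
connections, which holds unless the system is configured to drop such
attempts silently.
.Bd -literal -offset indent
```c
#include <arpa/inet.h>
#include <errno.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Bind a TCP socket to a system-assigned loopback port; fill in *sin. */
static int
bound_loopback(struct sockaddr_in *sin)
{
	socklen_t len = sizeof(*sin);
	int s = socket(AF_INET, SOCK_STREAM, 0);

	if (s == -1)
		return (-1);
	memset(sin, 0, sizeof(*sin));
	sin->sin_family = AF_INET;
	sin->sin_addr.s_addr = htonl(INADDR_LOOPBACK);
	sin->sin_port = 0;
	if (bind(s, (struct sockaddr *)sin, sizeof(*sin)) == -1 ||
	    getsockname(s, (struct sockaddr *)sin, &len) == -1) {
		close(s);
		return (-1);
	}
	return (s);
}

/*
 * Hypothetical helper: with one socket already bound (but not listening),
 * either connect a second socket to that address (try_connect != 0) or
 * bind a second socket to it.  Returns the resulting errno value, 0 if
 * the call unexpectedly succeeded, or -1 if the setup failed.
 */
int
second_socket_errno(int try_connect)
{
	struct sockaddr_in sin;
	int first, second, rc, error = -1;

	first = bound_loopback(&sin);
	if (first == -1)
		return (-1);
	second = socket(AF_INET, SOCK_STREAM, 0);
	if (second != -1) {
		if (try_connect)
			rc = connect(second, (struct sockaddr *)&sin,
			    sizeof(sin));
		else
			rc = bind(second, (struct sockaddr *)&sin,
			    sizeof(sin));
		/* Save errno before close() can overwrite it. */
		error = (rc == -1) ? errno : 0;
		close(second);
	}
	close(first);
	return (error);
}
```
.Ed
.Pp
Binding the second socket is expected to yield
.Er EADDRINUSE ,
and connecting it is expected to yield
.Er ECONNREFUSED .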
.%T "TCP Extensions for High Performance"
.%T "Protection of BGP Sessions via the TCP MD5 Signature Option"
.%A "K. Ramakrishnan"
.%T "The Addition of Explicit Congestion Notification (ECN) to IP"
The RFC 1323 extensions for window scaling and timestamps were added
option was introduced in
.Em subject to change .