.\" Copyright (c) 1983, 1991, 1993
.\" The Regents of the University of California. All rights reserved.
.\" Redistribution and use in source and binary forms, with or without
.\" modification, are permitted provided that the following conditions
.\" 1. Redistributions of source code must retain the above copyright
.\" notice, this list of conditions and the following disclaimer.
.\" 2. Redistributions in binary form must reproduce the above copyright
.\" notice, this list of conditions and the following disclaimer in the
.\" documentation and/or other materials provided with the distribution.
.\" 3. All advertising materials mentioning features or use of this software
.\" must display the following acknowledgement:
.\" This product includes software developed by the University of
.\" California, Berkeley and its contributors.
.\" 4. Neither the name of the University nor the names of its contributors
.\" may be used to endorse or promote products derived from this software
.\" without specific prior written permission.
.\" THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND
.\" ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
.\" IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
.\" ARE DISCLAIMED. IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE
.\" FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
.\" DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
.\" OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
.\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
.\" LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
.\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
.\" From: @(#)tcp.4 8.1 (Berkeley) 6/5/93
.Nd Internet Transmission Control Protocol
.Fn socket AF_INET SOCK_STREAM 0
protocol provides reliable, flow-controlled, two-way
It is a byte-stream protocol used to
Internet address format and, in addition, provides a per-host
.Dq "port addresses" .
Thus, each address is composed
of an Internet address specifying the host and network,
port on the host identifying the peer entity.
Active sockets initiate connections to passive
sockets are created active; to create a
system call must be used
after binding the socket with the
Only passive sockets may use the
call to accept incoming connections.
Only active sockets may use the
call to initiate connections.
their location to match
incoming connection requests from multiple networks.
This technique, termed
.Dq "wildcard addressing" ,
server to provide service to clients on multiple networks.
To create a socket which listens on all networks, the Internet
port may still be specified
at this time; if the port is not specified, the system will assign one.
Once a connection has been established, the socket's address is
fixed by the peer entity's location.
The address assigned to the
socket is the address associated with the network interface
through which packets are being transmitted and received.
Normally, this address corresponds to the peer entity's network.
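.Pp
The wildcard addressing described above can be sketched as follows.
This is a minimal illustration, not code taken from the system sources;
the function name
listen_wildcard
and the backlog of 5 are hypothetical choices made for the example.
The socket is bound to
INADDR_ANY
and, when the requested port is 0, the port chosen by the system is read back.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/*
 * Create a passive TCP socket bound to the wildcard address.
 * Passing port 0 lets the system assign a port; the port actually
 * bound is returned in host byte order through *port_out.
 * Returns the listening descriptor, or -1 on error.
 * (Function name and backlog are illustrative, not part of any API.)
 */
static int
listen_wildcard(unsigned short port, unsigned short *port_out)
{
	struct sockaddr_in sin;
	socklen_t len = sizeof(sin);
	int s;

	s = socket(AF_INET, SOCK_STREAM, 0);
	if (s == -1)
		return (-1);

	memset(&sin, 0, sizeof(sin));
	sin.sin_family = AF_INET;
	sin.sin_addr.s_addr = htonl(INADDR_ANY);	/* all networks */
	sin.sin_port = htons(port);			/* 0: system assigns */

	if (bind(s, (struct sockaddr *)&sin, sizeof(sin)) == -1 ||
	    listen(s, 5) == -1 ||
	    getsockname(s, (struct sockaddr *)&sin, &len) == -1) {
		close(s);
		return (-1);
	}
	*port_out = ntohs(sin.sin_port);
	return (s);
}
```

A caller would typically pass a fixed, well-known port; passing 0 is useful mainly when the port is advertised to clients by some other means.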
supports a number of socket options which can be set with
.Bl -tag -width ".Dv TCP_NODELAY"
Information about a socket's underlying TCP session may be retrieved
by passing the read-only option
It accepts a single argument: a pointer to an instance of
.Vt "struct tcp_info" .
This API is subject to change; consult the source to determine
which fields are currently filled out by this option.
specific additions include
bandwidth-controlled window space.
Under most circumstances,
sends data when it is presented;
when outstanding data has not yet been acknowledged, it gathers
small amounts of output to be sent in a single packet once
an acknowledgement is received.
For a small number of clients, such as window systems
that send a stream of mouse events which receive no replies,
this packetization may cause significant delays.
defeats this algorithm.
By default, a sender- and
.No receiver- Ns Tn TCP
will negotiate between themselves to determine the maximum segment size
to be used for each connection.
option allows the user to determine the result of this negotiation,
and to reduce it if desired.
usually sends a number of options in each packet, corresponding to
extensions which are provided in this implementation.
is provided to disable
option use on a per-connection basis.
.No sender- Ns Tn TCP
bit, and begin transmission immediately (if permitted) at the end of
When this option is set to a non-zero value,
will delay sending any data at all until either the socket is closed,
or the internal send buffer is filled.
This option enables the use of MD5 digests (also known as TCP-MD5)
on writes to the specified socket.
In the current release, only outgoing traffic is digested;
digests on incoming traffic are not verified.
The current default behavior for the system is to respond to a system
advertising this option with TCP-MD5; this may change.
One common use for this in a
router deployment is to enable
based routers to interwork with Cisco equipment at peering points.
Support for this feature conforms to RFC 2385.
sessions are supported.
In order for this option to function correctly, it is necessary for the
administrator to add a tcp-md5 key entry to the system's security
associations database (SADB) using the
This entry must have an SPI of 0x1000 and can therefore only be specified
on a per-host basis at this time.
If an SADB entry cannot be found for the destination, the outgoing traffic
will have an invalid digest option prepended, and the following error message
will be visible on the system console:
.Em "tcp_signature_compute: SADB lookup failed for %d.%d.%d.%d" .
The option level for the
call is the protocol number for
.Xr getprotobyname 3 ,
All options are declared in
transport level may be used with
Incoming connection requests that are source-routed are noted,
and the reverse source route is used in responding.
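.Pp
As a hedged illustration of the socket options described above,
the following sketch sets and reads back
TCP_NODELAY
and queries
TCP_MAXSEG
at the
IPPROTO_TCP
level.
The helper names are hypothetical and exist only for this example;
the value reported for
TCP_MAXSEG
on an unconnected socket is system dependent.

```c
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

/* Turn on TCP_NODELAY so small writes are not gathered. 0 on success. */
static int
set_nodelay(int s)
{
	int on = 1;

	return (setsockopt(s, IPPROTO_TCP, TCP_NODELAY, &on, sizeof(on)));
}

/* Read TCP_NODELAY back: 1 if set, 0 if clear, -1 on error. */
static int
get_nodelay(int s)
{
	int val = 0;
	socklen_t len = sizeof(val);

	if (getsockopt(s, IPPROTO_TCP, TCP_NODELAY, &val, &len) == -1)
		return (-1);
	return (val != 0);
}

/* Fetch the maximum segment size in bytes, or -1 on error. */
static int
get_maxseg(int s)
{
	int mss = 0;
	socklen_t len = sizeof(mss);

	if (getsockopt(s, IPPROTO_TCP, TCP_MAXSEG, &mss, &len) == -1)
		return (-1);
	return (mss);
}
```

Querying the segment size after the connection is established reports the negotiated result rather than a default.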
protocol implements a number of variables in the
.Bl -tag -width ".Va TCPCTL_DO_RFC1323"
.It Dv TCPCTL_DO_RFC1323
Implement the window scaling and timestamp options of RFC 1323
.It Dv TCPCTL_MSSDFLT
The default value used for the maximum segment size
when no advice to the contrary is received from MSS negotiation.
.It Dv TCPCTL_SENDSPACE
.It Dv TCPCTL_RECVSPACE
Log any connection attempts to ports where there is not a socket
accepting connections.
A value of 1 limits the logging to
(connection establishment) packets only.
A value of 2 results in any
packets to closed ports being logged.
Any value not listed above disables the logging
(default is 0, i.e., the logging is disabled).
.It Va slowstart_flightsize
The number of packets allowed to be in-flight during the
slow-start phase on a non-local network.
.It Va local_slowstart_flightsize
The number of packets allowed to be in-flight during the
slow-start phase to local machines in the same subnet.
The Maximum Segment Lifetime, in milliseconds, for a packet.
Timeout, in milliseconds, for new, non-established
Amount of time, in milliseconds, that the connection must be idle
before keepalive probes (if enabled) are sent.
The interval, in milliseconds, between keepalive probes sent to remote
(default 8) probes are sent, with no response, the connection is dropped.
.It Va always_keepalive
connections, the kernel will
periodically send a packet to the remote host to verify the connection
unreachable messages may abort connections in
reassembly queue if the system is low on mbufs.
If enabled, disable sending of RST when a connection is attempted
to a port where there is not a socket accepting connections.
Delay ACK to try to piggyback it onto a data packet.
Maximum amount of time, in milliseconds, before a delayed ACK is sent.
NewReno Fast Recovery algorithm,
as described in RFC 2582.
.It Va path_mtu_discovery
Enable Path MTU Discovery.
control-block hash table
This may be tuned using the kernel option
.Va net.inet.tcp.tcbhashsize
Number of active process control blocks
Determines whether or not
cookies should be generated for outbound
cookies are a great help during
flood attacks, and are enabled by default.
.It Va isn_reseed_interval
The interval (in seconds) specifying how often the secret data used in
RFC 1948 initial sequence number calculations should be reseeded.
By default, this variable is set to zero, indicating that
no reseeding will occur.
Reseeding should not be necessary, and will break
recycling for a few minutes.
.It Va rexmit_min , rexmit_slop
Adjust the retransmit timer calculation for
typically added to the raw calculation to take into account
occasional variances that the
(smoothed round-trip time)
is unable to accommodate, while the minimum specifies an
second minimum, these RFCs tend to focus on streaming behavior,
and fail to deal with the fact that a 1 second minimum has severe
detrimental effects over lossy interactive connections, such
as an 802.11b wireless link, and over very fast but lossy
connections for those cases not covered by the fast retransmit
For this reason, we use 200ms of slop and a near-0
minimum, which gives us an effective minimum of 200ms (similar to
.It Va inflight.enable
bandwidth-delay product limiting.
An attempt will be made to calculate
the bandwidth-delay product for each individual
connection, and limit
the amount of inflight data being transmitted, to avoid building up
unnecessary packets in the network.
This option is recommended if you
are serving a lot of data over connections with high bandwidth-delay
products, such as modems, GigE links, and fast long-haul WANs, and/or
you have configured your machine to accommodate large
situations, without this option, you may experience high interactive
latencies or packet loss due to the overloading of intermediate routers
Note that bandwidth-delay product limiting only affects
the transmit side of a
.It Va inflight.debug
Enable debugging for the bandwidth-delay product algorithm.
This puts a lower bound on the bandwidth-delay product window, in bytes.
A value of 1024 is typically used for debugging.
A value between 6000 and 16000 is more typical in a production installation.
Setting this value too low may result in
slow ramp-up times for bursty connections.
Setting this value too high effectively disables the algorithm.
This puts an upper bound on the bandwidth-delay product window, in bytes.
This value should not generally be modified, but may be used to set a
global per-connection limit on queued data, potentially allowing you to
intentionally set a less-than-optimal limit, to smooth data flow over a
network while still being able to specify huge internal
The bandwidth-delay product algorithm requires a slightly larger window
than it otherwise calculates for stability.
This parameter determines the extra window in maximal packets / 10.
The default value of 20 represents 2 maximal packets.
Reducing this value is not recommended, but you may
come across a situation with very slow links where the
reduction of the default inflight code is not sufficient.
If this case occurs, you should first try reducing
and, if that does not
15, 10, or 5 for the latter.
Never use a value less than 5.
can lead to upwards of 20% underutilization of the link
as well as reduce the algorithm's ability to adapt to changing
situations, and should only be done as a last resort.
Enable the Limited Transmit algorithm as described in RFC 3042.
It helps avoid timeouts on lossy links and also when the congestion window
is small, as happens on short transfers.
Enable support for RFC 3390, which allows for a variable-sized
starting congestion window on new connections, depending on the
maximum segment size.
This helps throughput in general, but
particularly affects short transfers and high-bandwidth large
propagation-delay connections.
When this feature is enabled, the
.Va slowstart_flightsize
.Va local_slowstart_flightsize
settings are not observed for new
connection slow starts, but they are still used for slow starts
that occur when the connection has been idle and starts sending
Enable support for RFC 2018, TCP Selective Acknowledgment option,
which allows the receiver to inform the sender about all successfully
arrived segments, allowing the sender to retransmit the missing segments
Maximum number of SACK holes per connection.
.It Va sack.globalmaxholes
Maximum number of SACK holes per system, across all connections.
When a TCP connection enters the
state, its associated socket structure is freed, since it is of
negligible size and use, and a new structure is allocated to contain a
minimal amount of information necessary for sustaining a connection in
this state, called the compressed TCP TIME_WAIT state.
Since this structure is smaller than a socket structure, it can save
a significant amount of system memory.
.Va net.inet.tcp.maxtcptw
MIB variable controls the maximum number of these structures allocated.
By default, it is initialized to
.Va kern.ipc.maxsockets
.It Va nolocaltimewait
Suppress creation of compressed TCP TIME_WAIT states for connections in
which both endpoints are local.
.It Va fast_finwait2_recycle
connections faster when the socket is marked as
(no user process has the socket open, data received on
the socket cannot be read).
The timeout used here is
.Va finwait2_timeout .
.It Va finwait2_timeout
Timeout to use for fast recycling of
Defaults to 60 seconds.
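.Pp
The stability parameter of the bandwidth-delay product algorithm,
described earlier in this list, is expressed in tenths of a maximal packet.
The arithmetic can be sketched as below.
This is a worked example under stated assumptions, not the kernel's code:
the function name is hypothetical, and the maximal packet size of
1460 bytes used in the comments is an assumed value chosen for illustration.

```c
/*
 * Extra window, in bytes, implied by a stability setting given in
 * units of (maximal packets / 10).  maxseg is the maximal packet
 * size in bytes; it is a parameter of this sketch, not a sysctl.
 *
 * With the default setting of 20 and an assumed maxseg of 1460:
 *     20 * 1460 / 10 = 2920 bytes, i.e. 2 maximal packets.
 * With the lowest recommended setting of 5:
 *     5 * 1460 / 10 = 730 bytes, i.e. half a maximal packet.
 */
static unsigned long
inflight_extra_window(unsigned int stab, unsigned int maxseg)
{
	return ((unsigned long)stab * maxseg / 10);
}
```

The second case shows why very small settings are discouraged: the extra window shrinks below a single maximal packet.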
A socket operation may fail with one of the following errors returned:
when trying to establish a connection on a socket which
when the system runs out of memory for
an internal data structure;
when a connection was dropped
due to excessive retransmissions;
forces the connection to be closed;
.It Bq Er ECONNREFUSED
peer actively refuses connection establishment (usually because
no process is listening to the port);
is made to create a socket with a port which has already been
.It Bq Er EADDRNOTAVAIL
when an attempt is made to create a
socket with a network address for which no network interface
.It Bq Er EAFNOSUPPORT
when an attempt is made to bind or connect a socket to a multicast
.%T "TCP Extensions for High Performance"
.%T "Protection of BGP Sessions via the TCP MD5 Signature Option"
The RFC 1323 extensions for window scaling and timestamps were added
option was introduced in
.Em subject to change .