.\" Copyright (c) 1983, 1991, 1993
.\" The Regents of the University of California. All rights reserved.
.\" Redistribution and use in source and binary forms, with or without
.\" modification, are permitted provided that the following conditions
.\" 1. Redistributions of source code must retain the above copyright
.\" notice, this list of conditions and the following disclaimer.
.\" 2. Redistributions in binary form must reproduce the above copyright
.\" notice, this list of conditions and the following disclaimer in the
.\" documentation and/or other materials provided with the distribution.
.\" 3. All advertising materials mentioning features or use of this software
.\" must display the following acknowledgement:
.\" This product includes software developed by the University of
.\" California, Berkeley and its contributors.
.\" 4. Neither the name of the University nor the names of its contributors
.\" may be used to endorse or promote products derived from this software
.\" without specific prior written permission.
.\" THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND
.\" ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
.\" IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
.\" ARE DISCLAIMED. IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE
.\" FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
.\" DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
.\" OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
.\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
.\" LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
.\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
.\" From: @(#)tcp.4 8.1 (Berkeley) 6/5/93
.Nd Internet Transmission Control Protocol
.Fn socket AF_INET SOCK_STREAM 0
protocol provides reliable, flow-controlled, two-way
It is a byte-stream protocol used to
Internet address format and, in addition, provides a per-host
.Dq "port addresses" .
Thus, each address is composed
of an Internet address specifying the host and network,
port on the host identifying the peer entity.
Active sockets initiate connections to passive
sockets are created active; to create a
system call must be used
after binding the socket with the
Only passive sockets may use the
call to accept incoming connections.
Only active sockets may use the
call to initiate connections.
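The active and passive roles described above can be sketched in C.
This is a minimal illustration, not part of the implementation: the helper names
make_passive() and make_active() are hypothetical, and the loopback address
is an assumption chosen so that the example needs no external network.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/*
 * Hypothetical helper: create a passive socket bound to the loopback
 * address. Port 0 asks the system to pick a port; the chosen address
 * is written back into *addr. Returns the descriptor, or -1 on error.
 */
static int
make_passive(struct sockaddr_in *addr)
{
    socklen_t len = sizeof(*addr);
    int fd = socket(AF_INET, SOCK_STREAM, 0);

    if (fd < 0)
        return (-1);
    memset(addr, 0, sizeof(*addr));
    addr->sin_family = AF_INET;
    addr->sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    addr->sin_port = 0;
    /* bind first, then listen: listen is what makes the socket passive. */
    if (bind(fd, (struct sockaddr *)addr, sizeof(*addr)) < 0 ||
        listen(fd, 1) < 0 ||
        getsockname(fd, (struct sockaddr *)addr, &len) < 0) {
        close(fd);
        return (-1);
    }
    return (fd);
}

/*
 * Hypothetical helper: a newly created socket is active, so it can
 * call connect directly. Returns the descriptor, or -1 on error.
 */
static int
make_active(const struct sockaddr_in *addr)
{
    int fd = socket(AF_INET, SOCK_STREAM, 0);

    if (fd < 0)
        return (-1);
    if (connect(fd, (const struct sockaddr *)addr, sizeof(*addr)) < 0) {
        close(fd);
        return (-1);
    }
    return (fd);
}
```

A caller would typically create the passive socket first, connect the active
socket to the returned address, and then call accept on the passive socket
to obtain the descriptor for the established connection.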
their location to match
incoming connection requests from multiple networks.
This technique, termed
.Dq "wildcard addressing" ,
server to provide service to clients on multiple networks.
To create a socket which listens on all networks, the Internet
port may still be specified
at this time; if the port is not specified, the system will assign one.
Once a connection has been established, the socket's address is
fixed by the peer entity's location.
The address assigned to the
socket is the address associated with the network interface
through which packets are being transmitted and received.
Normally, this address corresponds to the peer entity's network.
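As an illustration of wildcard addressing, the following sketch binds a
listening socket to the wildcard address and reports the port the system
assigned.
It assumes the standard INADDR_ANY constant is the wildcard address meant
here; the helper name listen_any() is hypothetical.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/*
 * Hypothetical helper: listen on all local networks.
 * "port" is in host byte order; 0 lets the system assign one.
 * The port actually in use is stored in *assigned (host byte order).
 * Returns the listening descriptor, or -1 on error.
 */
static int
listen_any(unsigned short port, unsigned short *assigned)
{
    struct sockaddr_in sin;
    socklen_t len = sizeof(sin);
    int fd = socket(AF_INET, SOCK_STREAM, 0);

    if (fd < 0)
        return (-1);
    memset(&sin, 0, sizeof(sin));
    sin.sin_family = AF_INET;
    sin.sin_addr.s_addr = htonl(INADDR_ANY);    /* wildcard address */
    sin.sin_port = htons(port);
    if (bind(fd, (struct sockaddr *)&sin, sizeof(sin)) < 0 ||
        listen(fd, 5) < 0 ||
        getsockname(fd, (struct sockaddr *)&sin, &len) < 0) {
        close(fd);
        return (-1);
    }
    *assigned = ntohs(sin.sin_port);
    return (fd);
}
```

Calling getsockname on a descriptor returned by accept would then show the
fixed local address chosen for that particular connection.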
supports a number of socket options which can be set with
.Bl -tag -width ".Dv TCP_NODELAY"
Under most circumstances,
sends data when it is presented;
when outstanding data has not yet been acknowledged, it gathers
small amounts of output to be sent in a single packet once
an acknowledgement is received.
For a small number of clients, such as window systems
that send a stream of mouse events which receive no replies,
this packetization may cause significant delays.
defeats this algorithm.
By default, a sender- and
.No receiver- Ns Tn TCP
will negotiate between themselves to determine the maximum segment size
to be used for each connection.
option allows the user to determine the result of this negotiation,
and to reduce it if desired.
usually sends a number of options in each packet, corresponding to
extensions which are provided in this implementation.
is provided to disable
option use on a per-connection basis.
.No sender- Ns Tn TCP
bit, and begin transmission immediately (if permitted) at the end of
When this option is set to a non-zero value,
will delay sending any data at all until either the socket is closed,
or the internal send buffer is filled.
This option enables the use of MD5 digests (also known as TCP-MD5)
on writes to the specified socket.
In the current release, only outgoing traffic is digested;
digests on incoming traffic are not verified.
The current default behavior for the system is to respond to a system
advertising this option with TCP-MD5; this may change.
One common use for this in a
router deployment is to enable
based routers to interwork with Cisco equipment at peering points.
Support for this feature conforms to RFC 2385.
sessions are supported.
In order for this option to function correctly, it is necessary for the
administrator to add a tcp-md5 key entry to the system's security
associations database (SADB) using the
This entry must have an SPI of 0x1000 and can therefore only be specified
on a per-host basis at this time.
If an SADB entry cannot be found for the destination, the outgoing traffic
will have an invalid digest option prepended, and the following error message
will be visible on the system console:
.Em "tcp_signature_compute: SADB lookup failed for %d.%d.%d.%d" .
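A short sketch of how two of the socket options above might be used from C.
It assumes the option names TCP_NODELAY and TCP_MAXSEG from
<netinet/tcp.h> and the IPPROTO_TCP option level; the helper names are
hypothetical and error handling is reduced to return codes.

```c
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

/* Hypothetical helper: turn the small-packet gathering off (on != 0) or on. */
static int
set_nodelay(int fd, int on)
{
    return (setsockopt(fd, IPPROTO_TCP, TCP_NODELAY, &on, sizeof(on)));
}

/* Hypothetical helper: 1 if TCP_NODELAY is set, 0 if not, -1 on error. */
static int
get_nodelay(int fd)
{
    int val = 0;
    socklen_t len = sizeof(val);

    if (getsockopt(fd, IPPROTO_TCP, TCP_NODELAY, &val, &len) < 0)
        return (-1);
    /* Systems differ in the exact non-zero value; normalize it. */
    return (val != 0);
}

/* Hypothetical helper: current maximum segment size, or -1 on error. */
static int
get_maxseg(int fd)
{
    int mss = 0;
    socklen_t len = sizeof(mss);

    if (getsockopt(fd, IPPROTO_TCP, TCP_MAXSEG, &mss, &len) < 0)
        return (-1);
    return (mss);
}
```

On an unconnected socket the segment size reported is only a default;
the negotiated value is available once the connection is established.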
The option level for the
call is the protocol number for
.Xr getprotobyname 3 ,
All options are declared in
transport level may be used with
Incoming connection requests that are source-routed are noted,
and the reverse source route is used in responding.
protocol implements a number of variables in the
.Bl -tag -width ".Va TCPCTL_DO_RFC1323"
.It Dv TCPCTL_DO_RFC1323
Implement the window scaling and timestamp options of RFC 1323
.It Dv TCPCTL_MSSDFLT
The default value used for the maximum segment size
when no advice to the contrary is received from MSS negotiation.
.It Dv TCPCTL_SENDSPACE
.It Dv TCPCTL_RECVSPACE
Log any connection attempts to ports where no socket is
accepting connections.
A value of 1 limits the logging to
(connection establishment) packets only.
A value of 2 results in any
packets to closed ports being logged.
Any other value disables the logging
(default is 0, i.e., the logging is disabled).
.It Va slowstart_flightsize
The number of packets allowed to be in-flight during the
slow-start phase on a non-local network.
.It Va local_slowstart_flightsize
The number of packets allowed to be in-flight during the
slow-start phase to local machines in the same subnet.
The Maximum Segment Lifetime, in milliseconds, for a packet.
Timeout, in milliseconds, for new, non-established
Amount of time, in milliseconds, that the connection must be idle
before keepalive probes (if enabled) are sent.
The interval, in milliseconds, between keepalive probes sent to remote
(default 8) probes are sent, with no response, the connection is dropped.
.It Va always_keepalive
connections, the kernel will
periodically send a packet to the remote host to verify the connection
unreachable messages may abort connections in
reassembly queue if the system is low on mbufs.
If enabled, disable sending of RST when a connection is attempted
to a port where no socket is accepting connections.
Delay ACK to try to piggyback it onto a data packet.
Maximum amount of time, in milliseconds, before a delayed ACK is sent.
NewReno Fast Recovery algorithm,
as described in RFC 2582.
.It Va path_mtu_discovery
Enable Path MTU Discovery.
control-block hash table
This may be tuned using the kernel option
.Va net.inet.tcp.tcbhashsize
Number of active process control blocks
Determines whether or not
cookies should be generated for outbound
cookies are a great help during
flood attacks, and are enabled by default.
.It Va isn_reseed_interval
The interval (in seconds) specifying how often the secret data used in
RFC 1948 initial sequence number calculations should be reseeded.
By default, this variable is set to zero, indicating that
no reseeding will occur.
Reseeding should not be necessary, and will break
recycling for a few minutes.
.It Va rexmit_min , rexmit_slop
Adjust the retransmit timer calculation for
typically added to the raw calculation to take into account
occasional variances that the
(smoothed round-trip time)
is unable to accommodate, while the minimum specifies an
second minimum, these RFCs tend to focus on streaming behavior,
and fail to deal with the fact that a 1 second minimum has severe
detrimental effects over lossy interactive connections, such
as an 802.11b wireless link, and over very fast but lossy
connections for those cases not covered by the fast retransmit
For this reason, we use 200ms of slop and a near-0
minimum, which gives us an effective minimum of 200ms (similar to
.It Va inflight.enable
bandwidth-delay product limiting.
An attempt will be made to calculate
the bandwidth-delay product for each individual
connection, and limit
the amount of inflight data being transmitted, to avoid building up
unnecessary packets in the network.
This option is recommended if you
are serving a lot of data over connections with high bandwidth-delay
products, such as modems, GigE links, and fast long-haul WANs, and/or
you have configured your machine to accommodate large
situations, without this option, you may experience high interactive
latencies or packet loss due to the overloading of intermediate routers
Note that bandwidth-delay product limiting only affects
the transmit side of a
.It Va inflight.debug
Enable debugging for the bandwidth-delay product algorithm.
This puts a lower bound on the bandwidth-delay product window, in bytes.
A value of 1024 is typically used for debugging.
A value between 6000 and 16000 is more typical in a production installation.
Setting this value too low may result in
slow ramp-up times for bursty connections.
Setting this value too high effectively disables the algorithm.
This puts an upper bound on the bandwidth-delay product window, in bytes.
This value should not generally be modified, but may be used to set a
global per-connection limit on queued data, potentially allowing you to
intentionally set a less than optimum limit, to smooth data flow over a
network while still being able to specify huge internal
The bandwidth-delay product algorithm requires a slightly larger window
than it otherwise calculates for stability.
This parameter determines the extra window in maximal packets / 10.
The default value of 20 represents 2 maximal packets.
Reducing this value is not recommended, but you may
come across a situation with very slow links where the
reduction of the default inflight code is not sufficient.
If this case occurs, you should first try reducing
and, if that does not
15, 10, or 5 for the latter.
Never use a value less than 5.
can lead to upwards of a 20% underutilization of the link
as well as reducing the algorithm's ability to adapt to changing
situations and should only be done as a last resort.
Enable the Limited Transmit algorithm as described in RFC 3042.
It helps avoid timeouts on lossy links and also when the congestion window
is small, as happens on short transfers.
Enable support for RFC 3390, which allows for a variable-sized
starting congestion window on new connections, depending on the
maximum segment size.
This helps throughput in general, but
particularly affects short transfers and high-bandwidth large
propagation-delay connections.
When this feature is enabled, the
.Va slowstart_flightsize
.Va local_slowstart_flightsize
settings are not observed for new
connection slow starts, but they are still used for slow starts
that occur when the connection has been idle and starts sending
Enable support for RFC 2018, TCP Selective Acknowledgment option,
which allows the receiver to inform the sender about all successfully
arrived segments, allowing the sender to retransmit the missing segments
.It Va sack.initburst
Control the number of SACK retransmissions done upon initiation of SACK
recovery.
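Two of the variables described above reduce to simple arithmetic, which the
following sketch works through.
The function names are hypothetical and the code only illustrates the
formulas; it is not taken from the kernel.
The first function applies the "maximal packets / 10" rule for the extra
stability window, and the second applies the initial window bound given in
RFC 3390, min(4*MSS, max(2*MSS, 4380 bytes)).

```c
#include <stdint.h>

/*
 * Extra window, in bytes, for a stability setting expressed in
 * tenths of a maximal packet (hypothetical helper).
 */
static uint32_t
stab_extra_bytes(uint32_t stab, uint32_t mss)
{
    return (stab * mss / 10);
}

/*
 * Initial congestion window, in bytes, per the RFC 3390 formula
 * min(4 * MSS, max(2 * MSS, 4380)) (hypothetical helper).
 */
static uint32_t
rfc3390_initial_window(uint32_t mss)
{
    uint32_t lower = (2 * mss > 4380) ? 2 * mss : 4380;
    uint32_t upper = 4 * mss;

    return ((upper < lower) ? upper : lower);
}
```

With a maximum segment size of 1460 bytes, the default stability setting of 20
gives 2920 extra bytes, that is, 2 maximal packets, and the RFC 3390 formula
gives an initial window of 4380 bytes.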
A socket operation may fail with one of the following errors returned:
when trying to establish a connection on a socket which
when the system runs out of memory for
an internal data structure;
when a connection was dropped
due to excessive retransmissions;
forces the connection to be closed;
.It Bq Er ECONNREFUSED
peer actively refuses connection establishment (usually because
no process is listening to the port);
is made to create a socket with a port which has already been
.It Bq Er EADDRNOTAVAIL
when an attempt is made to create a
socket with a network address for which no network interface
.It Bq Er EAFNOSUPPORT
when an attempt is made to bind or connect a socket to a multicast
.%T "TCP Extensions for High Performance"
.%T "Protection of BGP Sessions via the TCP MD5 Signature Option"
The RFC 1323 extensions for window scaling and timestamps were added