Transmission Control Protocol

Transmission Control Protocol (TCP) provides a connection-based, reliable byte-stream service to applications. Microsoft networking relies upon TCP for the logon process, file and print sharing, replication of information between domain controllers, transfer of browse lists, and other common functions. TCP can be used only for one-to-one communications. Windows 2000 TCP is compliant with RFC 793 and section 4.2 of RFC 1122.

TCP uses a checksum that covers both the TCP header and the payload of each segment to reduce the chance of network corruption going undetected. NDIS 5.0 provides support for task offloading, and Windows 2000 TCP takes advantage of this by allowing the network adapter to perform the TCP checksum calculations if the network adapter driver offers support for this function. Offloading the checksum calculations to hardware can result in performance improvements in very high throughput environments. Windows 2000 TCP has also been made more robust and has undergone an internal security review intended to reduce susceptibility to future attacks.

TCP Receive Window Size Calculation and Window Scaling

The TCP receive window size is the amount of receive data (in bytes) that can be buffered at one time on a connection. The sending host can send only that amount of data before waiting for acknowledgments for data sent and window updates from the receiving host. Windows 2000 TCP/IP is designed to tune itself in most environments, and uses larger default window sizes than earlier versions.

Instead of using a hard-coded default receive window size, TCP adjusts to even increments of the maximum segment size (MSS) negotiated during connection setup. Matching the receive window to even increments of the MSS increases the percentage of full-sized TCP segments used during bulk data transmission.

The receive window size defaults to a value calculated as follows:

  1. The first connection request sent to a remote host advertises a receive window size of 16 kilobytes (KB) or 16,384 bytes.

  2. Upon establishing the connection, the receive window size is rounded up to an integral multiple of the TCP maximum segment size (MSS) that was negotiated during connection setup.

  3. If the rounded-up value is not at least four times the MSS, then it is adjusted to 4 * MSS, with a maximum size of 64 KB, unless a window scaling option (RFC 1323) is in effect.

For Ethernet-based TCP connections, the window is normally set to 17,520 bytes, or 16 KB rounded up to twelve 1,460-byte segments. In previous versions of Microsoft® Windows NT® TCP/IP, the Ethernet window used was 8,760 bytes, or six MSS-sized segments.
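The three steps above can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not the stack's actual code: the function name and parameters are hypothetical, and the cap of 65,535 bytes for the unscaled case is an assumption based on the 16-bit Window field.

```python
def default_receive_window(mss, base=16 * 1024, max_unscaled=65535):
    """Apply the three steps described above for one connection (no window scaling)."""
    # Step 2: round the initial 16 KB value up to a whole number of segments.
    segments = -(-base // mss)  # ceiling division
    window = segments * mss
    # Step 3: make sure at least four full-sized segments fit.
    window = max(window, 4 * mss)
    # Assumption: without window scaling, the 16-bit Window field caps the value.
    return min(window, max_unscaled)


if __name__ == "__main__":
    # Ethernet MSS of 1,460 bytes, as in the example above.
    print(default_receive_window(1460))
```

For an MSS of 1,460 bytes the sketch yields twelve segments, which matches the Ethernet figure quoted above.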

There are two methods for setting the receive window size to specific values:

  • The TcpWindowSize registry entry (HKLM\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters\Interfaces\<interface>).

  • On a per-socket basis with the setsockopt( ) Windows Sockets function.
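As a rough illustration of the per-socket method, the following Python sketch requests a receive buffer size through setsockopt. The text does not name the socket option involved; SO_RCVBUF is assumed here, and the function name is hypothetical. The operating system may adjust the requested value.

```python
import socket


def tcp_socket_with_receive_buffer(size_bytes):
    """Create a TCP socket and request a receive buffer of size_bytes."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # Set the option before connect() so that it is in place when the
    # connection request is sent. SO_RCVBUF is an assumption (see above).
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, size_bytes)
    return sock


if __name__ == "__main__":
    s = tcp_socket_with_receive_buffer(64 * 1024)
    print(s.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF))
    s.close()
```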

To improve performance on high-bandwidth, high-delay networks, Windows 2000 TCP supports TCP window scaling described in RFC 1323. TCP window scaling supports TCP receive window sizes larger than 64 KB by negotiating a window scaling factor during the TCP three-way handshake. This allows for a receive window of up to 1 gigabyte (GB).

When you read captures of a connection that was established by two computers that support scalable windows, keep in mind that the window sizes advertised in the segments must be scaled by the negotiated scale factor. The window scale option appears only in the first two segments of the TCP three-way handshake. The multiplier is 2^s, where s is the negotiated scale factor. For example, for an advertised window size of 65535 and a scale factor of 3, the actual receive window size is 524280, or 2^3 * 65535.
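When reading a capture by hand, the scaling arithmetic can be checked with a small helper such as the one below. It is a minimal sketch: the function name is hypothetical, and the upper limit of 14 for the scale factor comes from RFC 1323.

```python
def scaled_window(advertised, shift):
    """Return the actual window for a 16-bit advertised value and a scale factor."""
    if not 0 <= advertised <= 0xFFFF:
        raise ValueError("the Window field is 16 bits wide")
    if not 0 <= shift <= 14:
        raise ValueError("RFC 1323 limits the scale factor to 14")
    # Multiplying by 2^shift is the same as shifting left by shift bits.
    return advertised << shift


if __name__ == "__main__":
    print(scaled_window(65535, 3))   # the example from the text
    print(scaled_window(65535, 14))  # largest possible window, just under 1 GB
```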

The following Network Monitor capture shows the window scale option in the TCP SYN segment:

Src Addr Dst Addr Protocol Description

HOST100 CORPSRVR TCP ....S., len:0, seq:725163-725163, ack:0, win:65535, src:1217 dst:139

+ FRAME: Base frame properties

+ ETHERNET: ETYPE = 0x0800 : Protocol = IP: DOD Internet Protocol

+ IP: ID = 0xB908; Proto = TCP; Len: 64

TCP: ....S., len:0, seq:725163-725163, ack:0, win:65535, src:1217 dst:139 (NBT Session)

TCP: Source Port = 0x04C1

TCP: Destination Port = NETBIOS Session Service

TCP: Sequence Number = 725163 (0xB10AB)

TCP: Acknowledgement Number = 0 (0x0)

TCP: Data Offset = 44 (0x2C)

TCP: Reserved = 0 (0x0000)

+ TCP: Flags = 0x02 : ....S.

TCP: Window = 65535 (0xFFFF)

TCP: Checksum = 0x8565

TCP: Urgent Pointer = 0 (0x0)

TCP: Options

+ TCP: Maximum Segment Size Option

TCP: Option Nop = 1 (0x1)

TCP: Window Scale Option

TCP: Option Type = Window Scale

TCP: Option Length = 3 (0x3)

TCP: Window Scale = 5 (0x5)

TCP: Option Nop = 1 (0x1)

TCP: Option Nop = 1 (0x1)

+ TCP: Timestamps Option

TCP: Option Nop = 1 (0x1)

TCP: Option Nop = 1 (0x1)

+ TCP: SACK Permitted Option

TCP window scaling is enabled by default and used automatically whenever the TCP window size for the connection is set to a value greater than 64 kilobytes (KB), either through the TcpWindowSize registry entry (HKLM\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters\Interfaces\<interface>) or through the setsockopt( ) Windows Sockets function. TCP window scaling can be enabled or disabled through the Tcp1323Opts registry entry (HKLM\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters).

Delayed Acknowledgments

As specified in RFC 1122, TCP uses delayed acknowledgments (ACKs) to reduce the number of packets sent on the media. Rather than sending an acknowledgment for each TCP segment received, Windows 2000 TCP takes a common approach to implementing delayed ACKs. As data is received by TCP on a given connection, it only sends an acknowledgment back if one of the following conditions is met:

  • No ACK was sent for the previous segment received.

  • A segment is received, but no other segment arrives within 200 milliseconds for that connection.

Normally an ACK is sent for every other TCP segment received on a connection, unless the delayed ACK timer (200 milliseconds) expires. The delayed ACK timer for each interface can be adjusted by setting the value of the TcpDelAckTicks registry entry (HKLM\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters\Interfaces\<interface>), which was first introduced in Microsoft® Windows NT® version 4.0, Service Pack 4.

TCP Selective Acknowledgment

Windows 2000 introduces support for an important performance feature known as Selective Acknowledgment (SACK), described in RFC 2018. SACK is very important for connections using large TCP window sizes. Prior to SACK, a receiver could only acknowledge the latest sequence number of contiguous data that had been received, or the left edge of the receive window. With SACK enabled, the receiver continues to use the ACK number to acknowledge the left edge of the receive window, but the receiver can also individually acknowledge other non-contiguous blocks of received data.

SACK uses TCP header options to negotiate the use of SACK during the TCP connection establishment and to indicate the left edge and right edge of blocks of data received. Multiple blocks received can be indicated. For more details, see RFC 2018. By default, SACK is enabled.

When a segment or series of segments arrive in a non-contiguous fashion, the receiver is able to inform the sender of exactly which data has been received, implicitly indicating which data did not arrive. The sender can selectively retransmit the missing data without needing to retransmit blocks of data that have been successfully received. SACK is enabled by default through the value of the SackOpts registry entry (HKLM\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters).

The following Network Monitor capture shows a host acknowledging all data up to sequence number 54857340, plus the data from sequence number 54858789-54861684.

+ FRAME: Base frame properties

+ ETHERNET: ETYPE = 0x0800 : Protocol = IP: DOD Internet Protocol

+ IP: ID = 0x1A0D; Proto = TCP; Len: 64

TCP: .A...., len:0, seq:925104-925104, ack:54857341, win:32722, src:1242 dst:139

TCP: Source Port = 0x04DA

TCP: Destination Port = NETBIOS Session Service

TCP: Sequence Number = 925104 (0xE1DB0)

TCP: Acknowledgement Number = 54857341 (0x3450E7D)

TCP: Data Offset = 44 (0x2C)

TCP: Reserved = 0 (0x0000)

+ TCP: Flags = 0x10 : .A....

TCP: Window = 32722 (0x7FD2)

TCP: Checksum = 0x4A72

TCP: Urgent Pointer = 0 (0x0)

TCP: Options

TCP: Option Nop = 1 (0x1)

TCP: Option Nop = 1 (0x1)

+ TCP: Timestamps Option

TCP: Option Nop = 1 (0x1)

TCP: Option Nop = 1 (0x1)

TCP: SACK Option

TCP: Option Type = 0x05

TCP: Option Length = 10 (0xA)

TCP: Left Edge of Block = 54858789 (0x3451425)

TCP: Right Edge of Block = 54861685 (0x3451F75)
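To see what the sender can infer from this segment, the following Python sketch derives the byte ranges that are still missing from the acknowledgment number and the SACK blocks. It is an illustration only: the function name is hypothetical, ranges are half-open (the right edge is the first byte not included), and sequence number wraparound is ignored for simplicity.

```python
def missing_ranges(ack, sack_blocks):
    """Return half-open [start, end) ranges not yet reported as received."""
    holes = []
    next_expected = ack
    for left, right in sorted(sack_blocks):
        if left > next_expected:
            holes.append((next_expected, left))
        next_expected = max(next_expected, right)
    return holes


if __name__ == "__main__":
    # Values taken from the capture above.
    for start, end in missing_ranges(54857341, [(54858789, 54861685)]):
        print(start, end, end - start)
```

With the values from the capture, the only gap is the range that starts at the acknowledgment number and ends just before the left edge of the SACK block, so that is the only data the sender needs to retransmit.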

TCP Timestamps

In previous versions of Microsoft TCP/IP, TCP calculated the round trip time (RTT) for only one sample per window of data sent to adjust the retransmission time-out (RTO). To calculate the RTT, TCP recorded the time that the segment was sent and the time that an acknowledgement for the segment was received. For example, if the window size was 8760 (six full segments), a common value for Ethernet, one in six segments was used to recalculate the round trip time. This is an adequate sampling rate for such a small window size. However, with support for TCP window scaling, sampling one segment for the entire window size is not sufficient. For example, with the maximum window size using window scaling of 1 GB over an Ethernet network, there would only be one sample for every 735,440 segments.

TCP timestamps are implemented as TCP header options that record the time a segment was sent. The timestamp of the sent TCP segment is echoed in the acknowledgement. For more details, see RFC 1323.

The following Network Monitor capture shows the TCP timestamps option:

+ FRAME: Base frame properties

+ ETHERNET: ETYPE = 0x0800 : Protocol = IP: DOD Internet Protocol

+ IP: ID = 0x1A0D; Proto = TCP; Len: 64

TCP: .A...., len:0, seq:925104-925104, ack:54857341, win:32722, src:1242 dst:139

TCP: Source Port = 0x04DA

TCP: Destination Port = NETBIOS Session Service

TCP: Sequence Number = 925104 (0xE1DB0)

TCP: Acknowledgement Number = 54857341 (0x3450E7D)

TCP: Data Offset = 44 (0x2C)

TCP: Reserved = 0 (0x0000)

+ TCP: Flags = 0x10 : .A....

TCP: Window = 32722 (0x7FD2)

TCP: Checksum = 0x4A72

TCP: Urgent Pointer = 0 (0x0)

TCP: Options

TCP: Option Nop = 1 (0x1)

TCP: Option Nop = 1 (0x1)

TCP: Timestamps Option

TCP: Option Type = Timestamps

TCP: Option Length = 10 (0xA)

TCP: Timestamp = 2525186 (0x268802)

TCP: Reply Timestamp = 1823192 (0x1BD1D8)

TCP: Option Nop = 1 (0x1)

TCP: Option Nop = 1 (0x1)

+ TCP: SACK Option

TCP timestamps are disabled by default. You can enable TCP timestamps by changing the value of the Tcp1323Opts registry entry (HKLM\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters).
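The layout of these options on the wire can be illustrated with a short Python parser. The option bytes below are not copied from a hexadecimal dump (the captures do not show one); they are rebuilt from the field values that Network Monitor displays for this frame, using the option kinds and lengths defined in RFC 1323 and RFC 2018. The function name is hypothetical, and the parser is a sketch that handles only the option kinds needed here.

```python
import struct

NOP, SACK, TIMESTAMPS = 1, 5, 8  # option kinds from RFC 793, 2018, and 1323


def parse_options(data):
    """Decode a TCP options field into a list of (name, value) tuples."""
    options, i = [], 0
    while i < len(data):
        kind = data[i]
        if kind == 0:          # end of option list
            break
        if kind == NOP:        # one-byte padding
            i += 1
            continue
        length = data[i + 1]
        if length < 2:
            raise ValueError("malformed option length")
        body = data[i + 2:i + length]
        if kind == TIMESTAMPS:
            options.append(("timestamps", struct.unpack("!II", body)))
        elif kind == SACK:
            edges = struct.unpack("!%dI" % (len(body) // 4), body)
            options.append(("sack", list(zip(edges[0::2], edges[1::2]))))
        else:
            options.append((kind, body))
        i += length
    return options


# Rebuilt from the displayed fields: Nop, Nop, Timestamps, Nop, Nop, SACK.
captured = (bytes([NOP, NOP, TIMESTAMPS, 10]) + struct.pack("!II", 2525186, 1823192)
            + bytes([NOP, NOP, SACK, 10]) + struct.pack("!II", 54858789, 54861685))

if __name__ == "__main__":
    print(len(captured), parse_options(captured))
```

The rebuilt field is 24 bytes long, which together with the 20-byte base header is consistent with the Data Offset of 44 shown in the capture.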

Protection Against Wrapped Sequence Numbers

Using TCP timestamps provides protection against wrapped sequence numbers (PAWS). The TCP sequence number is a 32-bit value that indicates the first byte of data in the segment. With 32 bits in the sequence number, only 4 GB of data can be in transit between the sender and the receiver before the TCP sequence number begins to wrap around and become ambiguous. While this is not likely in typical Ethernet and Token Ring environments, high capacity networks using gigabit per second (Gbps) or terabit per second (Tbps) technologies can wrap the TCP sequence number in a matter of seconds. If a segment is dropped or delayed, a different segment could exist with the same sequence number. Data could be corrupted if the receiver accepts a delayed old segment because it carries the sequence number that the receiver is currently expecting.

To avoid confusion in the event of duplicate sequence numbers, the TCP timestamp is used as an extension to the sequence number. Current packets have current and progressing timestamps. An old packet has an older timestamp and is discarded.
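The following Python sketch puts rough numbers on the wrap time and shows the kind of comparison that PAWS relies on. The function names are hypothetical. The timestamp test follows the general rule in RFC 1323 (a segment whose timestamp is older than the most recent one accepted is discarded) and is not a description of the Windows 2000 implementation.

```python
SEQUENCE_SPACE = 2 ** 32  # number of distinct 32-bit sequence numbers


def seconds_to_wrap(bits_per_second):
    """Time needed to send 2^32 bytes at the given line rate."""
    return SEQUENCE_SPACE * 8 / bits_per_second


def paws_accepts(segment_timestamp, last_timestamp):
    """True unless the segment's timestamp is older than the last one accepted."""
    # Timestamps are 32-bit counters, so compare them modulo 2^32.
    delta = (segment_timestamp - last_timestamp) % SEQUENCE_SPACE
    return delta < 2 ** 31


if __name__ == "__main__":
    for rate in (10e6, 1e9, 1e12):
        print(rate, seconds_to_wrap(rate))
    print(paws_accepts(2525186, 2525100), paws_accepts(2525100, 2525186))
```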

Dead Gateway Detection

Dead gateway detection is used by TCP traffic to detect the failure of the default gateway and to make an adjustment to the IP routing table to use another default gateway. Windows 2000 TCP/IP uses the triggered reselection method described in RFC 816, with slight modifications.

When any TCP connection that is routed through the default gateway has attempted to send a TCP packet to the destination a number of times equal to one-half of the value of the registry entry TcpMaxDataRetransmissions (default value of 5) without receiving a response, the forwarding IP address for that destination IP address is changed to use the next default gateway in the list. When 25 percent of the TCP connections have moved to the next default gateway, TCP informs IP to update the default route to use the IP address of the next default gateway, the one that the changed connections are now using.

For example, assume that for a host:

  • There are TCP connections to 11 different IP addresses that are routed through the default gateway.

  • The host has multiple default gateways configured.

  • TcpMaxDataRetransmissions is set at the default value of 5.

When the default gateway fails, the following process switches the default gateway to the next one in the list:

  1. When the first TCP connection tries to send data, it does not receive any acknowledgments. After the third retransmission, the forwarding IP address for that remote IP address is switched to use the next default gateway in the list. At this point, any TCP connections to that remote IP address are switched over to the new default gateway, but the remaining connections will still try to use the original default gateway.

  2. When the second TCP connection tries to send data, the same thing happens. Now, two of the 11 connections are using the new gateway.

  3. When the third TCP connection tries to send data, its default gateway is changed to the next default gateway in the list. Three of 11 connections have been switched to the second default gateway. Because over 25 percent of the connections have been changed, the Gateway IP address for the default route in the routing table is updated with the IP address of the new gateway.

  4. The new default gateway remains the primary one for the computer until it experiences problems, causing dead gateway detection to switch to the next gateway in the list again, or until the computer is restarted.

When the search reaches the last default gateway, it returns to the beginning of the list.
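The switching logic in the example above can be modeled with a small Python class. This is a simplified sketch under stated assumptions, not the stack's implementation: the class and method names are hypothetical, the gateway addresses are placeholders, one-half of TcpMaxDataRetransmissions is rounded up (which gives the third retransmission used in the example), and the default route changes once more than 25 percent of the destinations have moved.

```python
import math


class DeadGatewayDetector:
    def __init__(self, gateways, remote_hosts, max_data_retransmissions=5):
        self.gateways = list(gateways)
        self.default_index = 0
        # Gateway currently used to reach each remote IP address.
        self.route_index = {host: 0 for host in remote_hosts}
        # Assumption: half of the registry value, rounded up.
        self.threshold = math.ceil(max_data_retransmissions / 2)

    @property
    def default_gateway(self):
        return self.gateways[self.default_index]

    def report_unanswered(self, host, retransmissions):
        """Record that traffic to host went unanswered after this many retransmissions."""
        if retransmissions < self.threshold:
            return
        if self.route_index[host] != self.default_index:
            return  # this destination has already been moved
        # The search wraps to the start of the list after the last gateway.
        nxt = (self.default_index + 1) % len(self.gateways)
        self.route_index[host] = nxt
        moved = sum(1 for i in self.route_index.values() if i == nxt)
        if moved / len(self.route_index) > 0.25:
            self.default_index = nxt
            for h in self.route_index:
                self.route_index[h] = nxt


if __name__ == "__main__":
    d = DeadGatewayDetector(["192.0.2.1", "192.0.2.2"], range(11))
    for host in range(3):
        d.report_unanswered(host, 3)
        print(host, d.default_gateway)
```

With 11 destinations, the default route in this model changes when the third destination is moved, as in the walk-through above.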

TCP Retransmission Behavior

TCP starts a retransmission timer when each outbound segment is handed down to IP. If no acknowledgment is received for the data in a given segment before the timer expires, then the segment is retransmitted. For new connection requests, the retransmission timer is initialized to three seconds, and the TCP connection request segment is resent up to the number of times specified by the value of the TcpMaxConnectRetransmissions registry entry (2 by default) (HKLM\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters). On existing connections, the number of retransmissions is controlled by the value of the TcpMaxDataRetransmissions registry entry (5 by default) (HKLM\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters).

The retransmission time-out (RTO) is adjusted on an ongoing basis to match the characteristics of the connection using Smoothed Round Trip Time (SRTT) calculations and Karn's algorithm. For more information about Karn's algorithm, see "Additional Resources" later in this chapter.

The timer for a given segment is doubled after each retransmission of that segment. Using this algorithm, TCP tunes itself to the "normal" delay of a connection. TCP connections over high-delay links take much longer to time out than those over low-delay links.

The following Network Monitor capture shows the retransmission algorithm for two hosts connected over Ethernet on the same subnet. An FTP file transfer was in progress when the receiving host was disconnected from the network. Because the SRTT for this connection was very small, the first retransmission was sent after about one-half second. The timer was then doubled for each of the retransmissions that followed. After the fifth retransmission, the timer was once again doubled, and since no acknowledgment was received before it expired, the connection was aborted.

time source ip dest ip pro flags description

0.000 10.57.10.32 10.57.9.138 TCP .A.., len: 1460, seq: 8043781, ack: 8153124, win: 8760

0.521 10.57.10.32 10.57.9.138 TCP .A.., len: 1460, seq: 8043781, ack: 8153124, win: 8760

1.001 10.57.10.32 10.57.9.138 TCP .A.., len: 1460, seq: 8043781, ack: 8153124, win: 8760

2.003 10.57.10.32 10.57.9.138 TCP .A.., len: 1460, seq: 8043781, ack: 8153124, win: 8760

4.007 10.57.10.32 10.57.9.138 TCP .A.., len: 1460, seq: 8043781, ack: 8153124, win: 8760

8.130 10.57.10.32 10.57.9.138 TCP .A.., len: 1460, seq: 8043781, ack: 8153124, win: 8760
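The doubling pattern visible in the capture can be reproduced with a few lines of Python. This is a sketch of the backoff arithmetic only; the function name is hypothetical, and the initial time-out of 0.5 seconds is rounded from the capture rather than computed from an SRTT.

```python
def retransmission_schedule(initial_rto, max_retransmissions):
    """Return (waits before each retransmission, final wait before giving up)."""
    waits = [initial_rto * 2 ** n for n in range(max_retransmissions)]
    final_wait = initial_rto * 2 ** max_retransmissions
    return waits, final_wait


if __name__ == "__main__":
    # Existing connection, as in the capture: TcpMaxDataRetransmissions = 5.
    print(retransmission_schedule(0.5, 5))
    # New connection request: 3-second timer, TcpMaxConnectRetransmissions = 2.
    print(retransmission_schedule(3, 2))
```

Under the same doubling rule, a connection request with the default settings would be abandoned roughly 21 seconds (3 + 6 + 12) after the first attempt.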

Fast Retransmit

There are some circumstances under which TCP retransmits data prior to the retransmission timer expiring. The most common circumstance occurs due to a feature known as fast retransmit. When a receiver that supports fast retransmit receives data with a sequence number beyond the current expected one, then it is likely that some data was dropped. To help make the sender aware of this event, the receiver immediately sends an ACK with the acknowledgment number set to the sequence number that it was expecting. It continues to do this for each additional TCP segment that arrives containing data subsequent to the missing data in the incoming stream.

When the sender starts to receive a stream of ACKs that are acknowledging the same sequence number, and that sequence number is earlier than the current sequence number being sent, it can infer that a segment (or segments) has been dropped. Senders that support the fast retransmit algorithm immediately resend the segment that the receiver is expecting to fill in the gap in data, without waiting for the retransmission timer to expire for that segment. This optimization greatly improves performance in a high-loss network environment.

By default, Windows 2000 resends a segment if it receives three ACKs for the same sequence number, and that sequence number lags the current one. The maximum number of duplicate ACKs that triggers a resend is determined by the value of the TcpMaxDupAcks registry entry (HKLM\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters).
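The sender-side rule can be sketched as a simple counter over incoming acknowledgment numbers. The function name is hypothetical. The sketch counts ACKs that carry the same number, following the wording above literally, and ignores sequence number wraparound.

```python
def fast_retransmit_points(acks, highest_sent, max_dup_acks=3):
    """Yield each sequence number that would be resent before its timer expires."""
    last_ack, count = None, 0
    for ack in acks:
        if ack == last_ack:
            count += 1
        else:
            last_ack, count = ack, 1
        # Resend only when the repeated number lags what has already been sent.
        if count == max_dup_acks and ack < highest_sent:
            yield ack


if __name__ == "__main__":
    stream = [1000, 2000, 2000, 2000, 2000, 5000]
    print(list(fast_retransmit_points(stream, highest_sent=6000)))
```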

TCP Keep-Alive Messages

A TCP keep-alive packet is simply an ACK with the sequence number set to one less than the current sequence number for the connection. A host receiving one of these ACKs responds with an ACK for the current sequence number. Keep-alives can be used to verify that the computer at the remote end of a connection is still available. Windows 2000 TCP keep-alive behavior can be modified by changing the values of the KeepAliveTime and KeepAliveInterval registry entries (HKLM\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters). A TCP keep-alive can be sent once every interval specified by the value of KeepAliveTime (7,200,000 milliseconds, or two hours, by default) if no other data or higher-level keep-alives have been carried over the TCP connection. If there is no response to a keep-alive, it is repeated once every interval specified by the value of KeepAliveInterval. By default, the KeepAliveInterval entry is set to one second.

NetBT connections, such as those used by many Microsoft networking components, send NetBIOS keep-alives more frequently, so normally no TCP keep-alives are sent on a NetBIOS connection. TCP keep-alives are disabled by default, but Windows Sockets applications can use the Windows Sockets setsockopt( ) function to enable them.
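A minimal Python sketch of enabling keep-alives on one socket follows. The text does not name the socket option; the standard SO_KEEPALIVE option is assumed, and the function name is hypothetical. The timing of the keep-alives is still governed by the system-wide settings described above.

```python
import socket


def tcp_socket_with_keepalive():
    """Create a TCP socket with keep-alives turned on."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # SO_KEEPALIVE is an assumption; the source only mentions setsockopt().
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    return sock
```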

Slow Start Algorithm and Congestion Avoidance

Windows 2000 TCP is compliant with the slow start and congestion avoidance algorithms. For more information about slow start and congestion avoidance algorithms, see "Additional Resources" later in this chapter.

When a connection is established, TCP sends data slowly at first to assess the bandwidth of the connection and to avoid overwhelming the receiving host or any other devices or links in the path. The send window is set to two TCP segments, and when both segments are acknowledged, the send window is increased to three segments. If those are acknowledged, then the send window is increased again, and so on until the amount of data being sent per burst reaches the size of the receive window advertised by the remote host. At that point, the slow start algorithm is no longer in use and flow control is governed by the advertised receive window.

At any time during transmission, congestion can occur. Congestion is detected when a retransmission timer expires, or when a host receives an ICMP Source Quench message for a TCP segment that was discarded by a router. If this happens, the send window is reduced, and the slow start algorithm is used to grow it back to half the size that the send window had when the congestion occurred. From that point, the congestion avoidance algorithm grows the send window more gradually, up to the size of the receive window of the receiving host.

Silly Window Syndrome

Silly Window Syndrome (SWS) is the advertising of receive window sizes that are less than a full TCP segment. Silly Window Syndrome can cause very small TCP segments to be sent, resulting in an inefficient use of the network. Windows 2000 TCP/IP implements sender and receiver SWS avoidance as specified in RFC 1122. Receiver-side SWS avoidance is implemented by not opening the receive window in increments of less than a TCP segment. Sender-side SWS avoidance is implemented by not sending more data until there is a sufficient window size advertised by the receiving end to send a full TCP segment. There are exceptions to this rule for sender-side SWS avoidance, as described in RFC 1122.

Nagle Algorithm

Windows 2000 TCP/IP implements the Nagle algorithm, described in RFC 896. The purpose of this algorithm is to reduce the number of small segments sent, especially on high-delay (remote) links. A small segment is a segment that is smaller than the MSS. The Nagle algorithm allows only one small segment to be outstanding at a time without acknowledgment.

If more small segments are generated while awaiting the ACK for the first one, then these segments are accumulated into one larger segment. Any full-sized segment is transmitted immediately, assuming there is a sufficient receive window available. The Nagle algorithm is effective in reducing the number of packets sent by interactive applications, such as Telnet, especially over slow links.

The Nagle algorithm can be observed in the following Network Monitor capture. A Telnet (character mode) session was established, then the Y key was held down on the Windows 2000 workstation. At any given time, only one segment was outstanding, and further Y characters were held by the stack until an acknowledgment was received for the previous segment. In this example, three to four Y characters were buffered each time and sent together in one segment. Due to the Nagle algorithm, the number of segments sent was reduced by a factor of about three.

Time Source IP Dest IP Prot Description

0.644 204.182.66.83 199.181.164.4 TELNET To Server Port = 1901

0.144 199.181.164.4 204.182.66.83 TELNET To Client Port = 1901

0.000 204.182.66.83 199.181.164.4 TELNET To Server Port = 1901

0.145 199.181.164.4 204.182.66.83 TELNET To Client Port = 1901

0.000 204.182.66.83 199.181.164.4 TELNET To Server Port = 1901

0.144 199.181.164.4 204.182.66.83 TELNET To Client Port = 1901

. . .

Each segment contained several of the Y characters. The first segment is shown more fully parsed below, and the data portion (the three bytes 79 79 79, displayed as yyy) is at the end of the hexadecimal display at the bottom.

Time Source IP Dest IP Prot Description

0.644 204.182.66.83 199.181.164.4 TELNET To Server Port = 1901

+ FRAME: Base frame properties

+ ETHERNET: ETYPE = 0x0800 : Protocol = IP: DOD Internet Protocol

+ IP: ID = 0xEA83; Proto = TCP; Len: 43

+ TCP: .AP..., len: 3, seq:1032660278, ack: 353339017, win: 7766, src: 1901 dst: 23 (TELNET)

TELNET: To Server From Port = 1901

TELNET: Telnet Data

D2 41 53 48 00 00 52 41 53 48 00 00 08 00 45 00 .ASH..RASH....E.

00 2B EA 83 40 00 20 06 F5 85 CC B6 42 53 C7 B5 .+..@. .....BS..

A4 04 07 6D 00 17 3D 8D 25 36 15 0F 86 89 50 18 ...m..=.%6....P.

1E 56 1E 56 00 00 79 79 79 .V.V.. yyy
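The hexadecimal display can be decoded by hand or with a short script. The following Python sketch parses the 57 bytes shown above, assuming a 14-byte Ethernet header followed by IP and TCP; the function and variable names are hypothetical.

```python
import struct

FRAME_HEX = """
D2 41 53 48 00 00 52 41 53 48 00 00 08 00 45 00
00 2B EA 83 40 00 20 06 F5 85 CC B6 42 53 C7 B5
A4 04 07 6D 00 17 3D 8D 25 36 15 0F 86 89 50 18
1E 56 1E 56 00 00 79 79 79
"""


def parse_frame(frame):
    """Extract the main TCP header fields and the payload from an Ethernet frame."""
    ip = frame[14:]                              # skip the Ethernet header
    ip_header_len = (ip[0] & 0x0F) * 4           # IHL is counted in 32-bit words
    total_len = struct.unpack("!H", ip[2:4])[0]  # IP header plus everything after it
    tcp = ip[ip_header_len:total_len]
    src_port, dst_port, seq, ack, off_flags, window = struct.unpack("!HHIIHH", tcp[:16])
    tcp_header_len = (off_flags >> 12) * 4       # data offset, also in 32-bit words
    return {
        "src_port": src_port, "dst_port": dst_port, "seq": seq, "ack": ack,
        "flags": off_flags & 0x3F, "window": window,
        "payload": tcp[tcp_header_len:],
    }


fields = parse_frame(bytes.fromhex("".join(FRAME_HEX.split())))

if __name__ == "__main__":
    print(fields)
```

The decoded ports, sequence and acknowledgment numbers, and window match the summary line of the capture, and the payload is the three buffered y characters.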

Windows Sockets applications can disable the Nagle algorithm for their connections by setting the TCP_NODELAY socket option. However, this practice should be avoided unless absolutely necessary as it increases network utilization. Some network applications might not perform well if their design does not take into account the effects of transmitting large numbers of small packets and the Nagle algorithm.
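For reference, setting the option looks like this in Python, whose socket module passes the call through to the underlying sockets API. The function name is hypothetical, and the caution above applies: this is a sketch of the mechanism, not a recommendation to use it.

```python
import socket


def tcp_socket_without_nagle():
    """Create a TCP socket with the Nagle algorithm disabled."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # TCP_NODELAY is a TCP-level option, so the level is IPPROTO_TCP.
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
    return sock
```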

The Nagle algorithm is not applied to loopback TCP connections for performance reasons. Windows 2000 NetBT disables the Nagle algorithm for NetBIOS over TCP connections as well as NetBIOS-less redirector/server connections, which can improve performance for applications issuing numerous small file manipulation commands. An example is an application that uses file locking/unlocking frequently.

TCP TIME-WAIT Delay

When a TCP connection is closed, the socket pair is placed into a state known as TIME-WAIT so that a new connection does not use the same protocol, source IP address, destination IP address, source port, and destination port until enough time has passed to ensure that any segments that have been misrouted or delayed will not be delivered unexpectedly. The length of time that the socket-pair should not be reused is specified by RFC 793 as two maximum segment lifetimes (2MSL) or 240 seconds (four minutes). This is the default setting for Windows 2000. However, with this default setting, some network applications that perform many outbound connections in a short time might use up all available ports before the ports can be recycled.

Windows 2000 offers two methods of controlling this behavior. First, the TcpTimedWaitDelay registry entry (HKLM\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters) can be used to alter this value. Windows 2000 allows this value to be set as low as 30 seconds, which should not cause problems in most environments. Second, the number of user-accessible ephemeral ports that can be used to source outbound connections is configurable with the MaxUserPort registry entry (HKLM\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters). By default, when an application requests any socket from the system to use for an outbound call, a port between the values of 1024 and 5000 is supplied. You can use the MaxUserPort registry entry to set the value of the highest port number to be used for outbound connections. For example, setting this value to 10000 would make approximately 9000 user ports available for outbound connections. For more details, see RFC 793. See also the MaxFreeTcbs and MaxHashTableSize registry settings (HKLM\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters).
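A back-of-the-envelope calculation shows how the two settings interact. The Python sketch below treats the port range as inclusive and assumes, as a simplification, that every closed outbound connection holds its local port for the full TIME-WAIT delay; the function names are hypothetical, and the result is a rough upper bound rather than a measured limit.

```python
FIRST_USER_PORT = 1024


def available_ports(max_user_port=5000):
    """Number of ports from 1024 through MaxUserPort, inclusive."""
    return max_user_port - FIRST_USER_PORT + 1


def sustainable_connection_rate(max_user_port=5000, time_wait_seconds=240):
    """Outbound connections per second before ports run out (rough bound)."""
    return available_ports(max_user_port) / time_wait_seconds


if __name__ == "__main__":
    print(available_ports(), sustainable_connection_rate())
    print(available_ports(10000), sustainable_connection_rate(10000, 30))
```

Under these assumptions the defaults support only about 16 new outbound connections per second, which is why busy applications can exhaust the range.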

TCP Connections To and From Multihomed Computers

When TCP connections are made to a multihomed host, both the WINS client and the Domain Name Resolver (DNR) attempt to determine whether any of the destination IP addresses provided by the name server are on the same subnet as any of the interfaces in the local computer. If so, these addresses are sorted to the top of the list so that the application can try them prior to trying addresses that are not on the same subnet. If none of the addresses are on a common subnet with the local computer, then behavior is different depending upon the namespace. The PrioritizeRecordData registry setting (HKLM\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters) can be used to prevent the DNR component from sorting local subnet addresses to the top of the list.
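The sorting step can be illustrated with Python's ipaddress module. This is a sketch of the idea only, not the resolver's code: the function name and the sample addresses are hypothetical, and local interfaces are given in address/prefix notation.

```python
import ipaddress


def prioritize_local_subnets(addresses, local_interfaces):
    """Move addresses that share a subnet with a local interface to the front."""
    networks = [ipaddress.ip_interface(i).network for i in local_interfaces]

    def is_remote(address):
        ip = ipaddress.ip_address(address)
        return not any(ip in net for net in networks)

    # sorted() is stable, so the name server's order is kept within each group.
    return sorted(addresses, key=is_remote)


if __name__ == "__main__":
    answers = ["192.0.2.10", "10.57.9.20", "198.51.100.7"]
    print(prioritize_local_subnets(answers, ["10.57.9.138/24"]))
```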

In the WINS namespace, the client is responsible for choosing a random address among the provided addresses. The WINS server always returns the list of addresses in the same order, and the WINS client randomly picks one of them for each connection.

In the DNS namespace, the DNS server is usually configured to provide the addresses in round-robin order. The DNR does not choose a random address. In some situations, it is desirable to connect to a specific interface on a multihomed computer. The best way to accomplish this is to provide the interface with its own DNS entry. For example, a computer named Computer could have two separate DNS records with the same name, one for each IP address, and then records in the DNS for Computer1 and Computer2, each associated with just one of the IP addresses assigned to the computer.

For TCP connections made from a multihomed host, if the connection is a Winsock connection using the DNS namespace, once the target IP address for the connection is known, TCP attempts to connect from the best source IP address available. The IP routing table is used to make this determination, and if there is an interface in the local computer that is on the same subnet as the target IP address, its IP address is used as the source in the connection request. If there is no best source IP address to use, then the system chooses one randomly.

If the connection is a NetBIOS-based connection using the redirector, little routing information is available at the application level. The NetBIOS interface supports connections over various protocols and has no knowledge of IP. Instead, the redirector places calls on all of the logical networks that are bound to it. If there are two interfaces in the computer and one protocol installed, then there are two logical networks available to the redirector. Calls are placed on both, and NetBIOS over TCP/IP (NetBT) will submit connection requests to the stack using an IP address from each interface. It is possible that both calls will succeed. If so, then the redirector will cancel one of them. The choice of which one to cancel depends upon the value of the ObeyBindingOrder registry entry. If the value of this entry is 0, the default value, then the primary logical network determined by binding order is the preferred one, and the redirector waits for the primary transport to time out before accepting the connection on the secondary transport. If this value is 1, the binding order is ignored and the redirector accepts the first connection that succeeds and cancels any other connections.

Throughput Considerations

Windows 2000 TCP/IP can adapt to most network conditions, and can dynamically provide the best throughput and reliability possible on a per-connection basis. Attempts at manual tuning are often counter-productive unless a qualified network engineer first performs a careful study of data flow.

TCP is designed to provide optimum performance over varying link conditions, and Windows 2000 contains improvements, such as those supporting RFC 1323. Actual throughput for a link depends on a number of variables, but the most important factors are:

  • Link speed (bits/second that can be transmitted).

  • Propagation delay.

  • Window size (amount of unacknowledged data that might be outstanding on a TCP connection).

  • Link reliability.

  • Network and intermediate device congestion.

Key considerations of throughput:

  • The capacity of a communications channel, also known as a pipe, is known as the bandwidth-delay product and is the bandwidth (the bit rate) multiplied by round-trip time. If the link has a low number of bit level errors, the window size for best performance should be greater than or equal to the bandwidth-delay product so that the sender can fill the pipe. Without window scaling, 65,535 bytes is the largest window size that can be specified due to the 16-bit Window field in the TCP header. Window scaling can be used for window sizes up to 1 GB.

  • Throughput can never exceed the window size divided by round-trip time.

  • If the link has a large number of bit-level errors or is badly congested and packets are being dropped, using a larger window size might not improve throughput. Windows 2000 supports SACK, for improved performance in high-loss environments, and TCP timestamps, for improved RTT estimation.

  • Propagation delay is dependent upon the speed of transmission of electrical or optical signals in various media and latencies in transmission equipment and intermediate systems.

  • Transmission delay depends on the speed of the media and the nature of the media access control scheme.

  • For a particular path, propagation delay is fixed, but transmission delay depends upon the packet size and congestion.

  • At low speeds, transmission delay is the limiting factor. At high speeds, propagation delay might become the limiting factor.
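The first two considerations above reduce to simple arithmetic, sketched here in Python. The function names and the sample link (10 Mbps with a 100-millisecond round-trip time) are hypothetical; the limit of 14 on the scale factor comes from RFC 1323.

```python
def bandwidth_delay_product_bytes(bits_per_second, rtt_seconds):
    """Capacity of the pipe: bit rate multiplied by round-trip time, in bytes."""
    return bits_per_second * rtt_seconds / 8


def max_throughput_bps(window_bytes, rtt_seconds):
    """Upper bound on throughput: one window per round trip."""
    return window_bytes * 8 / rtt_seconds


def window_scale_needed(bdp_bytes, max_shift=14):
    """Smallest scale factor whose largest window covers the pipe."""
    shift = 0
    while (0xFFFF << shift) < bdp_bytes and shift < max_shift:
        shift += 1
    return shift


if __name__ == "__main__":
    bdp = bandwidth_delay_product_bytes(10e6, 0.1)
    print(bdp, window_scale_needed(bdp))
    print(max_throughput_bps(65535, 0.1))
```

On the sample link, the largest unscaled window limits throughput to a little over 5 Mbps, so a scale factor of 1 would be needed to fill the pipe.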