Track Down Elusive Network Problems
At a Glance:
- The usual causes of network problems
- Looking beyond the obvious
- When troubleshooting tools won't help
- Why you need to configure connection limits
You've probably seen it happen many times: your machine can't communicate with other machines and you don't know why. Your management system sits on one segment of a routed network connected to other network segments by means of a router such as Microsoft Internet Security and Acceleration (ISA) Server or another hardware device. When you attempt to manage 10, 20, or even 100 systems, you don't encounter any problems. But when you attempt to manage 500 systems, your computer is unable to communicate on the network except with the machines with which it already has open connections. You cannot communicate with any other systems and you cannot get onto the Internet, yet no one else on the network, including on your segment, is experiencing this phenomenon. Where would you look first?
Diagnosing the Problem
The most common assumption in this situation is that the management software is faulty. Many proactive management tools connect to and manage your systems, but sometimes these tools themselves can cause the problem you're trying to track down. That's because a proactive management tool can spawn thousands of connections to your devices in the name of better management. Windows® will keep these connections open for two minutes by default even when the connections are idle, unless the tool, application, or service keeps them alive longer. This means that even though your management system has not spoken to any machines in two minutes, you may still have more than 1,000 connections open. (You can view open connections by running NETSTAT in a command prompt. The NETSTAT command will show you all open, pending, and closing connections to and from your system and give you their status. Descriptions of the status messages can be found within RFC 793 at tools.ietf.org/html/rfc793.)
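With more than a thousand connections, the raw NETSTAT listing is hard to read by eye, so it helps to tally connections by state. The following is a minimal Python sketch, assuming the text comes from Windows `netstat -an` with the state in the fourth column; the function name and the sample listing are made up for illustration.

```python
from collections import Counter

def count_tcp_states(netstat_output: str) -> Counter:
    """Tally TCP connections by state from `netstat -an` style text."""
    states = Counter()
    for line in netstat_output.splitlines():
        fields = line.split()
        # Assumed row layout: TCP  local_address  remote_address  STATE
        if len(fields) >= 4 and fields[0].upper() == "TCP":
            states[fields[3].upper()] += 1
    return states

# Hypothetical sample shaped like Windows `netstat -an` output.
SAMPLE = """\
Active Connections

  Proto  Local Address          Foreign Address        State
  TCP    10.0.0.5:1025          10.0.1.20:445          ESTABLISHED
  TCP    10.0.0.5:1026          10.0.1.21:445          TIME_WAIT
  TCP    10.0.0.5:1027          10.0.1.22:445          SYN_SENT
  TCP    10.0.0.5:1028          10.0.1.23:445          TIME_WAIT
  UDP    0.0.0.0:500            *:*
"""
```

On a live system you could feed it `subprocess.run(["netstat", "-an"], capture_output=True, text=True).stdout`. A large TIME_WAIT or SYN_SENT count is the kind of thing to look for here.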
To rule out malfunctioning management software, you can create a batch file that establishes connections to the remote systems. If the same problem occurs while running the batch file, you'll know the management software and its threads were not to blame. Here's an example of the required batch file's contents:
If the management program in question happened to implement its own networking and authentication stack, it might have been the culprit, but in agentless solutions like most of these management packages, the tool uses the operating system's networking and authentication stacks to perform network operations. Using a batch file that launches just as many network connections without causing the failure will show that the issue is not a result of the program's use of the operating system's networking and authentication stacks, as the batch file uses them as well.
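The same kind of test can be sketched in Python. This is not the batch file referred to above, only an illustration of the idea under stated assumptions: it opens a plain TCP connection to each target on port 445 and holds it open, so it exercises the operating system's networking stack but not authentication. The function names are hypothetical.

```python
import socket

def open_connections(targets, port=445, timeout=3.0):
    """Try one TCP connection per target and keep each successful socket open.

    Returns (open_sockets, failures):
      open_sockets is a list of (host, socket) pairs,
      failures is a list of (host, error_text) pairs.
    """
    open_sockets = []
    failures = []
    for host in targets:
        try:
            sock = socket.create_connection((host, port), timeout=timeout)
            open_sockets.append((host, sock))
        except OSError as exc:
            # Record the failure and keep going, so we can see at which
            # point in the list connections start to fail.
            failures.append((host, str(exc)))
    return open_sockets, failures

def close_all(open_sockets):
    """Close every socket returned by open_connections."""
    for _host, sock in open_sockets:
        sock.close()
```

If the first few hundred targets connect and every later one fails, while the management tool is not running at all, the tool itself is unlikely to be the cause.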
Logs and Error Messages that Don't Help
You may have noticed that as the connections began to fail, you received erroneous failure information: error 53 (network path not found), error 64 (the network name was deleted), and error 1203 (no network provider accepted the given network path). All of these might normally indicate name resolution issues, except that no other machines are having problems resolving names and connecting to the same systems. To verify that your machine settings are not at fault, simply run ipconfig to confirm that your settings are correct.
Next, as the phenomenon seems to be localized to your management system, you should look at the event logs. A search of the application logs bears no fruit, yet in the system log you find a warning event 4226 from event source TCP/IP stating that the maximum number of connections had been reached (see Figure 1).
Figure 1 TCP connection limit has been reached
Conducting a thorough search of the Microsoft® Knowledge Base for connection limits will reveal that there are limits imposed on incomplete connections, but no limits placed upon completed connections. You can adjust the following registry entries under HKLM\System\CurrentControlSet\Services\TCPIP\Parameters to control these factors: TcpNumConnections is used to set the maximum number of connections that TCP can have open at the same time (the default is 10). TCPTimedWaitDelay sets the time a connection stays in the TIME_WAIT state when it is closing. The default half-life is 120 seconds, which means a connection is essentially in use for 4 minutes. TCPTimedWaitDelay has a value range of 30-300 seconds (0x1E – 0x12C). Finally, MaxFreeTcbs also plays a role in the maximum number of connections. If all TCP control blocks are in use, TCP should release connections listed as being in a TIME_WAIT state to create more connections even though the TCPTimedWaitDelay has not yet expired.
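Because the registry stores these values as DWORDs and the range for TCPTimedWaitDelay is quoted in both decimal and hexadecimal, it is easy to enter a value that falls outside it. Here is a small Python sketch that checks a proposed value against the 30-300 second range given above and formats it as a DWORD; it does not touch the registry, and the helper names are hypothetical.

```python
TIMED_WAIT_MIN = 0x1E   # 30 seconds, lower bound quoted in the text
TIMED_WAIT_MAX = 0x12C  # 300 seconds, upper bound quoted in the text

def validate_timed_wait_delay(seconds: int) -> int:
    """Return seconds unchanged if it lies in the documented range."""
    if not TIMED_WAIT_MIN <= seconds <= TIMED_WAIT_MAX:
        raise ValueError(
            f"TCPTimedWaitDelay must be {TIMED_WAIT_MIN}-{TIMED_WAIT_MAX} "
            f"seconds, got {seconds}"
        )
    return seconds

def to_dword_hex(seconds: int) -> str:
    """Format a validated value as an 8-digit hexadecimal DWORD string."""
    return f"0x{validate_timed_wait_delay(seconds):08x}"
```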
Depending on your scenario, you may see a slight improvement in overall performance by making these registry changes.
After other unsuccessful attempts to solve the connection issues, a network capture from the machines involved seems promising. When I ran Microsoft Network Monitor (Netmon), it produced captures that displayed the exact results that I was seeing in the management tools and the test scripts: everything works and then it doesn't, without any further indication of error.
Figure 2 shows the result of running Netmon, which indicated successful communication between the first n systems. Note that I'm obtaining acknowledgments of RPC requests. This is what you want to see—successful two-way communication.
Figure 2 Successful communication in Netmon
Now you need to look at the captures from the management system and the remote machines that appear to exhibit the error 53/1203. As you might expect, there is nothing to see, as the machines are not communicating. In the network capture in Figure 3, the management system has resolved the IP address and attempts to connect to the system over port 445 (the Microsoft SMB port) but never receives a response.
Figure 3 Attempts to connect to the system over port 445 yield no response
The error you receive when you have more threads than your machine can currently connect to is not always consistent. In some cases, you may see from the source system an error 53, indicating that you received name resolution but that the IP address simply could not be found. This is indicative of DNS providing an address to which you cannot connect. You may receive an error 1203, indicating that no machine responded to the name or IP address that you requested. Error 1203, in this case, indicates that DNS is unavailable to you. You'll see that's the case if you run nslookup.
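The two situations described here, a name that resolves but cannot be reached and a name that cannot be resolved at all, can be told apart with a quick check. The Python sketch below is one way to do that, assuming a plain TCP connect to port 445 is an acceptable probe. The return strings are labels invented for this sketch, not Windows error codes, and the `resolve` parameter exists only so the resolver can be swapped out.

```python
import socket

def diagnose(host, port=445, timeout=3.0, resolve=socket.getaddrinfo):
    """Classify a target as 'dns-failure', 'connect-failure', or 'ok'."""
    try:
        infos = resolve(host, port, type=socket.SOCK_STREAM)
    except socket.gaierror:
        # The name could not be resolved to any address.
        return "dns-failure"
    for family, socktype, proto, _canonname, sockaddr in infos:
        try:
            with socket.socket(family, socktype, proto) as sock:
                sock.settimeout(timeout)
                sock.connect(sockaddr)
                return "ok"
        except OSError:
            # This address did not answer; try the next one, if any.
            continue
    # The name resolved, but no resolved address accepted the connection.
    return "connect-failure"
```

Running this against a target that fails and against one that already has an open connection should show whether name resolution is really the problem or only appears to be.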
At this point you're bound to be frustrated, but there are more options to consider. Most people will not even think to look at the connecting infrastructure because of the way in which this problem presents itself: your machine is the only one that cannot connect to the rest of the network and the event logs show your machine has reached the maximum number of allowed connections, so the problem would not seem to be architectural in nature.
While the thousands of connections your management solution spawns will not initiate at the same time, transmission keep-alives and connection timeouts can mean that you have more connections open at any one time than you might think. Therefore you must also examine those systems that connect the rest of your network.
Let me explain. As I mentioned earlier, as network traffic passes through your network, it will pass through switches, routers, and perhaps firewalls. At any one of these points, usually the router or firewall, you may encounter intrusion detection systems. Managed switches and routers may also employ traffic filtering. You, or whoever has control over these devices, will need to check the logs for errors or warnings. The communication problem may very well lie within these systems.
Because you are connecting from an internal system to other internal systems, you may find that there are no alerts being generated, because alerting is not configured on the device or because the problem does not qualify as an intrusion or denial of service (DoS) attack. Once again, you should begin with the logs. Using ISA Server as an example, these logs can be found in the ISA Server Management console at Arrays\<ArrayName>\Monitoring\Logging.
If it is a policy that is blocking you, you may (in the case of ISA Server) look for the following result codes where the source IP is your management computer:
- 0xc0040037 FWX_E_TCP_RATE_QUOTA_EXCEEDED_DROPPED
- 0xc004000d FWX_E_POLICY_RULES_DENIED
- 0xc0040017 FWX_E_TXP_SYN_PACKET_DROPPED
If you find such results, you have identified the root of your connectivity problems.
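If the log is large, filtering it by source IP and result code is quicker than scrolling through it. The following Python sketch assumes the log has been exported into records with `src_ip` and `result_code` fields; those field names and the sample addresses are hypothetical, and only the three result codes come from the list above.

```python
# Result codes from the list above, as integers.
BLOCKING_RESULT_CODES = {0xC0040037, 0xC004000D, 0xC0040017}

def blocked_entries(records, source_ip):
    """Return records from source_ip whose result code is in the list.

    Each record is assumed to be a dict with a 'src_ip' string and a
    'result_code' hexadecimal string such as '0xc0040037'.
    """
    hits = []
    for rec in records:
        code = int(rec["result_code"], 16)
        if rec["src_ip"] == source_ip and code in BLOCKING_RESULT_CODES:
            hits.append(rec)
    return hits

# Hypothetical exported log records.
SAMPLE_LOG = [
    {"src_ip": "10.0.0.5", "result_code": "0xc0040037"},
    {"src_ip": "10.0.0.5", "result_code": "0x0"},
    {"src_ip": "10.0.0.9", "result_code": "0xc004000d"},
    {"src_ip": "10.0.0.5", "result_code": "0xC0040017"},
]
```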
Implementing the Solution
Now that you've identified the problem, the solution can be simple, but department politics may make it difficult to implement. Before you make any changes, make sure you have the proper permissions to do so, as creating exclusions in the security configurations on your firewalls, routers, and/or intrusion detection systems is not always allowed.
Again, using ISA Server as the example, the following steps show how to increase the maximum number of connections for a given host or for all machines in the network (as shown in Figure 4). Open ISA Server Management console and navigate to Arrays\<ArrayName>\Configuration\General\Configure Flood Mitigation Settings.
Figure 4 Increase the maximum number of connections for one host or all machines using ISA Server
The two settings of interest are the maximum concurrent TCP connections per IP address and maximum TCP connect requests per minute per IP address. The maximum concurrent TCP connections per IP address is typically set to a value that would suffice where no active management is occurring, meaning no single computer is actively connecting to thousands of other machines quickly. In ISA Server, the default limit for maximum concurrent TCP connections is 160. The maximum TCP connect requests per minute per IP address is designed to limit the damage and visibility of network scans. The default limit is 600 connection requests per minute per IP.
As stated previously, Windows will keep a connection alive for a default period of two minutes provided nothing else is attempting to keep the connection alive, even if the connection is not in use. This means that even though you have already successfully managed a computer and no longer need to communicate with it, the connection will remain active. This open connection counts toward your total allotted connections. Repeat this process more than 160 times without removing connections and you will find that all connection attempts are being denied by your router. Even if your management program actively kills a session, Windows may well leave the connection in a TIME_WAIT state waiting for the target machine to agree to disconnect the session.
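A rough calculation shows how quickly the limit is reached. The Python sketch below uses an idealized model, assuming the tool opens connections at a steady rate and each one lingers for exactly two minutes; the 160-connection limit and the two-minute figure come from the text, while the function names and the rates in the usage note are illustrative.

```python
def steady_state_connections(new_per_minute, linger_minutes=2.0):
    """Connections open at once when each one lingers for linger_minutes."""
    return new_per_minute * linger_minutes

def minutes_until_limit(new_per_minute, limit=160, linger_minutes=2.0):
    """Minutes until the open-connection count reaches limit.

    Returns None if the steady-state count never exceeds the limit.
    """
    if steady_state_connections(new_per_minute, linger_minutes) <= limit:
        return None
    # Before any connection expires, the count grows by new_per_minute
    # every minute, so the limit is reached after limit / rate minutes.
    return limit / new_per_minute
```

Under these assumptions, a tool that contacts 100 machines per minute would hold about 200 connections open and would reach the 160 limit after roughly 1.6 minutes, whereas at 60 machines per minute it would level off at about 120 and never reach it.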
Start making adjustments to the most likely culprit: maximum concurrent TCP connections per IP address. You should set this so that your management machine will be able to create all the connections it needs to manage your systems without being denied access. Click the Edit button next to the setting and change the values.
The next prompt (see Figure 5) has a Limit box that represents the default limit, applicable to all clients. The Custom limit applies to all machines, networks, and so forth that are defined on the IP Exceptions tab of the flood mitigation dialog. If you want every machine to be permitted to create more connections, then change the value of Limit. To configure the exception for your management machine only, adjust the Custom limit and add your machine to the IP Exceptions tab. It is generally better to allow the exception for just your own machine.
Figure 5 Default connection limit and custom connection limit
If you wish to use the custom limit to modify this value for only specific systems, then change the value in the Custom limit field and click OK. Then add your machine to the IP Exceptions tab on the flood mitigation dialog. To add your single computer to the exceptions list, on the IP Exceptions tab click Add to bring up the Computer Sets dialog. Click New to create a new network set to contain your system(s) if a network set containing these machines does not already exist. Select this network set and click Add to add the Internal Networks network set and click Close. Select the Internal Networks network set again and click Edit. This will open the properties for Internal Networks (see Figure 6). Click Add on this dialog to display a submenu where you can choose to add a computer, address range, or subnet; choose computer. Type in a name to identify the entry, the machine's IP address, and a description so that anyone looking at this entry later will not feel compelled to remove your system (see Figure 7). Click OK to add your system, and click OK again to add the exception. Now that you have made these changes, you must apply the settings.
Figure 6 Internal networks properties settings
Figure 7 Enter computer name, IP address, and description to ensure your system is not removed
Try your proactive management tool again and you should find that the performance is much better and the connection does not fail, at least not as a result of network traffic. In the end, it turned out that the problem was not the number of connections the software initiated; it was the lack of proper planning for those connections that caused the disruption.
One of the biggest headaches in IT is experiencing and then resolving the really elusive problems. They're the problems the end user didn't create, the server team didn't cause, the help desk is unaware of, and, unfortunately, fixing them may be your responsibility. There are a slew of tools available to help you troubleshoot, isolate, and resolve problems, but sometimes the tools are not enough. Sometimes they're wrong. And sometimes you just need to be smarter than the tools.
Next time you find yourself unable to establish connections among a number of machines on your network with no obvious explanation, try the steps that I described here. There's a good chance that by following the steps, taking a close look at your management software, and setting allowable connections properly, you'll solve the problem successfully.
Christopher Stoneff is a product manager at Lieberman Software (liebsoft.com), a security and systems management software developer. Chris holds more technical certifications than any sane person should. His biggest drive is not just to know how something works a certain way, but why.
© 2008 Microsoft Corporation and CMP Media, LLC. All rights reserved; reproduction in part or in whole without permission is prohibited.