Troubleshooting Routing

 

This topic discusses some common situations that can disrupt routing in your Microsoft® Exchange organization. Topics addressed include:

  • Using WinRoute

    This section explains the value of the WinRoute tool in troubleshooting routing issues.

  • Common Link State Problems

    This section explains the problems that are created by disconnections between routing groups, routing group master conflicts, deleted routing groups, connectors that are not marked as available, and oscillating connections, and explains how to resolve these situations.

  • Broken Link State Propagation

    This section explains the problems that occur when you change a routing group bridgehead server from an Exchange Server version 5.5 server to an Exchange 2000 Server or Exchange Server 2003 bridgehead server, and then change the bridgehead server back to an Exchange 5.5 server.

Using WinRoute

WinRoute is an Exchange 2003 tool that is used to determine the routing topology and link state routing information that is known to the routing group master. This tool should be the first step in troubleshooting routing issues in an Exchange 2000 or Exchange 2003 messaging environment. The tool connects to the link state port, TCP port 691, on an Exchange 2000 server or Exchange 2003 server, and extracts the link state information for an organization. The information is a series of GUIDs that WinRoute matches to objects in the Microsoft Active Directory® directory service, such as connectors and bridgehead servers, and then presents in a readable format.

Note

The WinRoute tool and user documentation are available at the Downloads for Exchange Server 2003 Web site. It is recommended that you download and use this tool on all Exchange 2000 and Exchange 2003 servers in your organization. Use this tool rather than the WinRoute tool that shipped with Exchange 2000.

Common Link State Problems

Within a routing group, Exchange uses TCP port 691 to communicate link state and routing information updates between the routing group master and its routing group members. Between two routing groups, two routing group bridgehead servers use the X-LINK2STATE verb to exchange link state information by comparing the digest, an encrypted digital signature in the Orginfo packet, that contains link state information of the two routing group bridgeheads. A discrepancy between these digests causes an exchange of link state information between the two servers using SMTP port 25.
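
The digest comparison between bridgehead servers can be pictured with the following minimal sketch. It is an illustration only: the actual digest format in the Orginfo packet is not described in this topic, so the sketch substitutes a SHA-256 hash over a sorted table, and the names link_state_digest and needs_exchange are hypothetical.

```python
import hashlib


def link_state_digest(table):
    """Return a digest for a link state table (dict of object id -> state).

    Assumption: a SHA-256 hash over the sorted entries stands in for the
    real digest, whose format is not documented in this topic.
    """
    hasher = hashlib.sha256()
    for key in sorted(table):
        hasher.update(f"{key}={table[key]};".encode("utf-8"))
    return hasher.hexdigest()


def needs_exchange(local_table, remote_digest):
    """Return True when the digests differ, which is the condition that
    triggers an exchange of link state information."""
    return link_state_digest(local_table) != remote_digest
```

When the two digests match, no link state information needs to be transferred, which keeps traffic between routing groups low while the topology is stable.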

The routing group master coordinates changes to link state that are learned by servers within its routing group and retrieves updates from Active Directory. If the routing group master becomes unavailable, all servers in the routing group continue to operate on the same information that they had at the time that they lost contact with the routing group master.

When the routing group master becomes available again, it reconstructs its link state information, beginning with all servers and connectors marked as unavailable. Then, discovering any unavailable servers, the routing group master updates members within the routing group.

This section discusses the following link state problems and explains the recommended resolution:

  • Disconnection between routing group member and master

  • Conflicts between routing group masters

  • Problems that are caused by deleted routing groups

  • Connectors that are not marked as "down"

  • Oscillating connections

Disconnection Between Routing Group Member and Master

When a routing group member is unable to connect to the routing group master, WinRoute indicates the situation with a red X next to the routing group member.

Disconnection between routing group members and master


Take the following steps to resolve this issue:

  • Ensure that the Microsoft Exchange Routing Engine service (RESvc) is started and in a controlled state on all affected servers in the routing group. If the routing engine service is in an unstable state, routing group members may not be able to connect to the routing group master. Investigate the root cause of any unstable services first.

  • Verify that port 691 is not restricted by a firewall by initiating a telnet session to port 691 of the affected servers and the master node. You should see a Microsoft Routing Engine banner to indicate an active state.

  • From a command line, type the following:

    netstat -a -n 
    

    The output should reveal all routing group members and the master itself connecting to port 691 on the master node, similar to the following:

    TCP    127.0.0.1:691          127.0.0.1:691         ESTABLISHED
    
  • Check Event Viewer application logs for any events that indicate a failure to authenticate using the machine account (domain\server name). Monitor for the following transport events:

    • Event ID 961 is logged when a member server fails to authenticate with its routing group master.

    • Event ID 962 is logged after a client node fails to authenticate with the routing service (RESvc).

    • Event ID 996 is logged when a client routing node successfully authenticates with the routing engine service.

    • Event ID 995 is logged when a routing group member successfully authenticates with its routing group master.

  • Verify that the affected servers can generate a ServicePrincipalName (SPN) that is used in the authentication process by checking the ncacn_ip_tcp value in the network address attribute of the affected servers. This is done by using a directory access tool, such as LDP (ldp.exe) or ADSI Edit (adsiEdit.msc).

    Members in a routing group have to mutually authenticate with the routing group master to connect. To do this, they use the ncacn_ip_tcp value in the network address attribute of the Exchange server to generate the SPN for the master node by calling DsClientMakeSpnForTargetServer. The routing group members can then authenticate using Kerberos. Make sure this value is a fully qualified domain name (FQDN), and not a NetBIOS name or an Internet Protocol address. Restart the Exchange Routing Engine service.

  • Verify that the domain machine account password has not expired.

  • If the membership of the routing group spans multiple domains, ensure that the cause of the problem is not a child domain or root domain problem from a DNS misconfiguration.

  • Check for any non-Microsoft applications or group policy objects that restrict permissions or security.

  • Configure another server in the routing group as the routing group master. This approach offers an interim solution. Reassigning the routing group master role can provide relief until the problem is resolved.

  • If the routing group master or any routing group members are missing the SendAs permission, WinRoute will show the server as Am I connected to the Master?: NO. Verify that this server or the groups it belongs to are not explicitly denied the SendAs permission on the routing group master.
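
Some of the checks in the preceding list can be scripted. The following sketch is one possible approach, not a supported tool: port_is_open tests whether TCP port 691 accepts a connection, looks_like_fqdn checks that a network address value is a dotted host name rather than an IP address or a single-label name, and established_on_port filters pasted netstat -a -n output. All three function names are hypothetical, and the sketch does not read the Routing Engine banner or query Active Directory.

```python
import ipaddress
import socket

LINK_STATE_PORT = 691  # TCP port used for link state within a routing group


def port_is_open(host, port=LINK_STATE_PORT, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def looks_like_fqdn(value):
    """Return True for a dotted host name; False for an IP address or a
    single-label (NetBIOS-style) name."""
    try:
        ipaddress.ip_address(value)
        return False
    except ValueError:
        pass
    labels = value.rstrip(".").split(".")
    return len(labels) >= 2 and all(labels)


def established_on_port(netstat_output, port=LINK_STATE_PORT):
    """Return the ESTABLISHED TCP lines whose local or remote endpoint
    uses the given port."""
    suffix = f":{port}"
    hits = []
    for line in netstat_output.splitlines():
        fields = line.split()
        if len(fields) >= 4 and fields[0] == "TCP" and fields[3] == "ESTABLISHED":
            if fields[1].endswith(suffix) or fields[2].endswith(suffix):
                hits.append(line.strip())
    return hits
```

For example, port_is_open("master.example.com") could be run from each affected member, where master.example.com is a placeholder for the routing group master.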

Conflicts Between Routing Group Masters

The first server that is installed into the routing group is automatically designated as the master node or routing group master. As other servers are installed, you can designate another server as the routing group master.

At any point in time, only one server should be recognized by itself and other servers as the master. This configuration is enforced by an algorithm where (N/2) +1 servers in the routing group must agree and acknowledge the master. N denotes the number of servers in the routing group. The member nodes consequently send link state ATTACH data to the master.
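
As a worked example of the (N/2) + 1 rule, the sketch below computes how many servers must acknowledge the master. It assumes that N/2 is rounded down (integer division), which this topic does not state explicitly; the function names are hypothetical.

```python
def master_quorum(server_count):
    """Return the number of servers that must agree on the master,
    using (N/2) + 1 with N/2 rounded down (an assumption)."""
    if server_count < 1:
        raise ValueError("a routing group needs at least one server")
    return server_count // 2 + 1


def master_is_acknowledged(acknowledgements, server_count):
    """Return True when enough servers acknowledge the master."""
    return acknowledgements >= master_quorum(server_count)
```

Under this assumption, a routing group of five servers needs three acknowledgements, and a routing group of four servers also needs three.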

Sometimes, two or more servers mistake the wrong server for the routing group master. For example, if a routing group master was moved or was deleted without choosing another master node, it is possible for msExchRoutingMasterDN, the attribute in Active Directory that designates the routing group master, to point to a deleted server because the attribute is not linked.

This situation can also occur when an old routing group master refuses to detach as master, or a rogue node keeps sending link state ATTACH information to an old routing group master. In Exchange 2003, if msExchRoutingMasterDN points to a deleted object, the master node relinquishes its role as master and initiates a shutdown of the master role.

Take the following steps to resolve this issue:

  • Check for healthy link state propagation within the routing group on port 691. Verify that a firewall or SMTP filters are not blocking communication.

  • Verify that no Exchange service is stopped.

  • Check Active Directory replication latencies by using the Active Directory Replication Monitor tool (Replmon.exe) that is available in the Windows Resource Kit.

  • Check for network problems and latency.

  • Check for deleted routing group masters or servers that no longer exist. If this is the case, a transport event 958 is logged in the application log of Event Viewer that states that a routing group master no longer exists. Verify this information by using a directory access tool, such as LDP (ldp.exe) or ADSI Edit (adsiEdit.msc).
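
When reviewing exported application log entries, a small lookup table of the transport event IDs named in this topic can help with triage. The sketch below is a convenience only; the descriptions paraphrase this topic, and the names ROUTING_EVENTS, describe_event, and is_failure are hypothetical.

```python
# Transport event IDs mentioned in this topic, with paraphrased meanings.
ROUTING_EVENTS = {
    958: "routing group master no longer exists",
    961: "member server failed to authenticate with its routing group master",
    962: "client node failed to authenticate with the routing service (RESvc)",
    995: "routing group member authenticated with its routing group master",
    996: "client routing node authenticated with the routing engine service",
}

# Events that indicate a problem rather than a successful authentication.
FAILURE_EVENTS = {958, 961, 962}


def describe_event(event_id):
    """Return the paraphrased meaning of an event ID, if it is listed."""
    return ROUTING_EVENTS.get(event_id, "not covered in this topic")


def is_failure(event_id):
    """Return True for event IDs that indicate a routing problem."""
    return event_id in FAILURE_EVENTS
```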

Problems Caused by Deleted Routing Groups

When routing groups are deleted after servers are moved out of them or for other reasons, WinRoute may display the text "object_not_found_in_DS" for the objects.

Exchange servers maintain the link state table that still references the objects, but these objects are missing from Active Directory when the routing engine service initializes and checks Active Directory to find the related objects.

Exchange routing cannot automatically remove deleted routing groups and their members (that is, servers and connectors) from the link state table. In fact, routing treats the deleted routing groups no differently than existing, functional routing groups. In rare cases, deleted routing groups can cause a malfunction in routing as well as mail loops. Deleted routing groups can severely affect topologies in which an Exchange 5.5 site joins an Exchange 2003 organization.

Additionally, these deleted routing group objects may significantly contribute to the size of the link state table, and thereby increase the network traffic that is incurred in the exchange of link state information.

Finally, if a Personal Address Book (PAB) or Offline Address Book (OAB) contains a legacyExchangeDN value that matches a deleted routing group, the deleted routing group objects cause mail that is sent from the PAB or OAB to non-existent users to be placed in the Messages with an unreachable destination queue. After the default timeout of two days, the mail is returned to the sender with a non-delivery report (NDR). Without the deleted routing group object, mail sent to non-existent users is returned to the sender with an NDR immediately, without being queued first.

Deleted routing group in WinRoute


To resolve this issue, first verify that the account that you are using to view routing information on the server has adequate permissions. If possible, run WinRoute under the system account by using the AT command with the /interactive switch. Lack of adequate read permissions can result in erroneous object_not_found_in_DS messages in WinRoute.

You can purge deleted routing groups from the link state information by using one of the following methods:

  • Shut down all servers in the organization at the same time to refresh routing cache information, and purge deleted routing groups and connectors.

  • Shut down all Exchange and Windows Management Instrumentation (WMI) services on all Exchange servers in the organization simultaneously.

You can also resolve this issue by using Remonitor.exe to reduce the size of the deleted routing group footprint and to mark the routing group as deleted.

Remonitor.exe is a tool that can inject a custom routing packet into an Exchange organization. The custom packet is a modified version of the deleted routing group packet. This modified version does not have server or connector entries, which significantly reduces the size of the routing group object. Also, this version eliminates the possibility of a malfunction or delay in routing that can occur due to a connector entry in which the routing group is deleted. Because the tool injects a modified packet that does not have connector entries, there cannot be a connector entry in which the routing group is deleted. Finally, Remonitor.exe updates the version number of the modified routing group so that no server or connection entries can be added to this deleted routing group.

For detailed instructions, see How to Run Remonitor.exe as Local System Account in Inject Mode.

After running Remonitor.exe, the routing packet no longer contains any server members or connectors. The routing group addresses are now prefaced by the key word deleted. Also, the version number of the routing group object is incremented.
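
The effect of the injected packet on a routing group entry can be modeled as follows. This is a conceptual sketch, not the Remonitor.exe implementation: the RoutingGroupEntry type, its field names, and the exact form of the "deleted" prefix are assumptions made for illustration.

```python
from dataclasses import dataclass, field, replace
from typing import List


@dataclass(frozen=True)
class RoutingGroupEntry:
    """Hypothetical model of one routing group in the link state table."""
    address: str
    version: int
    servers: List[str] = field(default_factory=list)
    connectors: List[str] = field(default_factory=list)


def mark_deleted(entry):
    """Return a copy with no servers or connectors, an address prefixed
    with the key word 'deleted', and an incremented version number."""
    return replace(
        entry,
        address="deleted " + entry.address,  # prefix format is assumed
        version=entry.version + 1,
        servers=[],
        connectors=[],
    )
```

Because the version number is higher than the one held by other servers, the emptied entry takes precedence over older copies that still list servers and connectors.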

Deleted routing group in WinRoute after running Remonitor.exe


Connectors Are Not Marked as "Down"

There are some instances where a connector's link state may be marked as "up" when it is in fact unavailable or "down." Routing does not mark link state on a connector as down in the following situations:

  • Connectors that use DNS to route to a domain in the address space (for example, SMTP connectors using DNS).

  • Exchange 5.5 or custom EDK (Exchange Development Kit) connectors because they do not use link state routing.

  • Routing group connectors that are configured to use any local server as the local bridgehead. You select this option by clicking Any local server can send mail over this connector when you create the routing group connector.

  • Routing group connectors where one bridgehead server is an Exchange 5.5 server.

Other unusual instances include:

  • Situations where, within a routing group, relay mail fails to prevent message transfer agent (MTA) loops within a routing group.

  • Connectors configured with a smart host that has changed very recently.

For routing to mark a connector as down, all source bridgehead servers have to be down with a state of VS_CONN_NOT_AVAILABLE or VS_CONN_NOT_STARTED. You can check the status by using WinRoute.
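
The rule for marking a connector as down can be expressed as a short check. The two state names come from this topic; the function name and the treatment of a connector with no source bridgeheads are assumptions.

```python
# States that count as "down" for a source bridgehead server.
DOWN_STATES = {"VS_CONN_NOT_AVAILABLE", "VS_CONN_NOT_STARTED"}


def connector_marked_down(bridgehead_states):
    """Return True only when every source bridgehead is in a down state.

    Assumption: a connector with no listed bridgeheads is not marked down.
    """
    states = list(bridgehead_states)
    return bool(states) and all(state in DOWN_STATES for state in states)
```

A single source bridgehead in any other state is enough to keep the connector marked as up, which is why a connector can appear available while some of its bridgeheads are failing.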

Oscillating Connections

Connectors that are on an unreliable network and are marked as "up" and then "down" repeatedly cause excessive link state updates between servers. These changes cause expensive and frequent recalculation of routes within Exchange. In Event Viewer, event ID 4005 is logged frequently and appears with the text "reset routes." Exchange 2003 mitigates these changes if it detects a frequently changing connector state by leaving the state marked as up within a single polling window, the period during which a server monitors the change. However, if these changes occur in different polling periods, an oscillating connection can still generate link state traffic. The default state change delay interval is 10 minutes for Exchange 2003 servers.
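
The following sketch models why oscillation inside one polling window is absorbed while oscillation across windows still produces link state updates. It is a simplified model under stated assumptions, not the Exchange algorithm: windows are fixed 600-second slots, a window that contains any "up" report keeps the connector up, and the function name is hypothetical.

```python
def states_to_propagate(reports, window_seconds=600):
    """Return the (window index, is_up) changes that would be propagated.

    reports is an iterable of (timestamp_seconds, is_up) tuples.
    """
    windows = {}
    for timestamp, is_up in reports:
        windows.setdefault(int(timestamp // window_seconds), []).append(is_up)

    propagated = []
    last_state = None
    for index in sorted(windows):
        # Assumption: the connector stays up unless every report in the
        # window says it is down.
        state = any(windows[index])
        if state != last_state:
            propagated.append((index, state))
            last_state = state
    return propagated
```

In this model, four state changes within the first 10 minutes yield a single "up" entry, whereas one change in each of three consecutive windows yields three entries.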

Exchange routing chooses the optimal path and locates the next server for a message to make its next hop to, giving this "next-hop" server name to queuing. The optimal path is chosen considering variables such as cost, message type, and restrictions. Consequently, because of the oscillating state of a connector, Exchange has to recalculate the optimal path repeatedly, which involves queries to Active Directory and performance costs.
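
To illustrate why a connector state change forces a recalculation, the sketch below selects the lowest-cost route with a generic shortest-path search that skips connectors marked as down. It considers cost only and ignores message type and restrictions; the data layout and the function name are assumptions, and the sketch is not a description of the Exchange routing engine internals.

```python
import heapq


def cheapest_route(connectors, source, destination):
    """Return (total_cost, path) for the lowest-cost route, or None.

    connectors maps a routing group name to a list of
    (neighbor, cost, is_up) tuples.
    """
    queue = [(0, source, [source])]
    visited = set()
    while queue:
        cost, node, path = heapq.heappop(queue)
        if node == destination:
            return cost, path
        if node in visited:
            continue
        visited.add(node)
        for neighbor, link_cost, is_up in connectors.get(node, []):
            # Connectors marked as down are excluded from the search.
            if is_up and neighbor not in visited:
                heapq.heappush(queue, (cost + link_cost, neighbor, path + [neighbor]))
    return None
```

Each time a connector flips between up and down, the result of this kind of search can change, so every flip that is propagated triggers another calculation on every server.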

When Exchange queuing notices a link failure to the bridgehead server on a connector, routing relays this information to the routing group master. The routing group master suppresses this information for up to 10 minutes to prevent connector state fluctuations. If routing marks the connector as down, this change is propagated to all Exchange servers in the organization, including the server on which the original failure occurred. This notification is called a reset route, and it is a highly expensive process in terms of CPU usage. Mail no longer queues on the connector, and routing must generate new paths for delivery. The same process occurs for marking a connector as up.

An oscillating connection occurs in the following situations:

  • Network problems, which can be seen in a network trace.

  • Reactions to link status notification callbacks from underlying protocol services (SMTP and MTA) due to interference on the X.400 or SMTP protocol levels by non-Microsoft applications. In this scenario, only a network monitor capture can reveal the issues. In addition, you can use the Remonitor.exe tool that is available from Microsoft Product Support Services.

You can use Network Monitor (Netmon.exe) or the Remonitor.exe tool in monitor mode to identify and address the root causes of oscillating connections. Additionally, if the oscillating connections are causing excessive propagation traffic, you can suppress the propagation of link state changes until you solve the root cause.

For detailed instructions, see How to Suppress Link State Information on a Server.

For more information about suppressing link state traffic, see "Suppressing Link State Traffic for Connectors" in Advanced Routing Configuration.

Broken Link State Propagation

Exchange 5.5 servers do not use link state information, but instead they rely on the gateway address routing table (GWART) to route messages. In a mixed-mode organization, Exchange 2000 and later versions recognize this limitation and read the configuration of Exchange 5.5 servers directly from Active Directory. Thus, Exchange 2000 and Exchange 2003 servers do not expect Exchange 5.5 servers to exchange link state information with them.

When an Exchange 5.5 bridgehead server in an Exchange routing group is upgraded to an Exchange 2000 or Exchange 2003 server and designated as a bridgehead server, it begins to participate in the exchange of link state information and it no longer has a major version number of zero. Exchange 2000 and Exchange 2003 servers use version numbers in the link state table to compare link state tables and ensure that servers have the most recent information about link state. A major version number of zero indicates a server that does not participate in link state information or has never exchanged link state information. All pure Exchange 5.5 sites have a version number of zero because they do not exchange link state information. When the server is upgraded to an Exchange 2000 or Exchange 2003 server, it begins to participate in link state information and increments its major version number. So, bridgehead servers in other routing groups expect the newly upgraded server to inform them of link state changes in its routing group.

A problem occurs if you now designate an Exchange 5.5 server as the bridgehead server for this routing group. Other servers still expect the Exchange 5.5 bridgehead server, the former Exchange 2000 or Exchange 2003 bridgehead server, to participate in link state propagation and wait for this server to give them updated link state information. However, because the server has reverted to Exchange 5.5, it no longer has a link state table. Therefore, this routing group now becomes isolated and does not participate in dynamic link state updates in the organization.
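
The condition that isolates a routing group can be summarized in a small check. The sketch assumes that the major version number and the bridgehead server version are already known, for example from WinRoute; the function names are hypothetical.

```python
def participates_in_link_state(major_version):
    """A major version of zero means the routing group has never
    exchanged link state information."""
    return major_version > 0


def is_isolated(major_version, bridgehead_is_exchange_55):
    """Return True when other servers expect link state updates
    (major version above zero) but the current bridgehead is an
    Exchange 5.5 server that cannot send them."""
    return participates_in_link_state(major_version) and bridgehead_is_exchange_55
```

A pure Exchange 5.5 site returns False because its major version is still zero, so other servers never wait for updates from it.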

This isolated routing group is problematic in a situation as shown in Figure 11.4. Specifically, because the Exchange 5.5 bridgehead server was formerly an Exchange 2000 or Exchange 2003 bridgehead server, other servers expect it to participate in link state propagation. The Exchange 5.5 Internet Mail Connector and Exchange 2003 SMTP connector in the following figure both use a single smart host to route mail to the Internet. The smart host becomes unavailable, so the Exchange 2003 bridgehead server marks the route through its SMTP connector as unavailable. However, because the bridgehead server expects the Exchange 5.5 server to send link state information about its routing groups and connectors, it assumes that the route through the Internet Mail Connector is available and attempts to deliver messages through this route. After one failure, the Exchange 2003 server detects a possible loop and does not attempt delivery through this route.

Exchange 5.5 and Exchange 2003 servers connecting to a smart host


Link state propagation can also be broken if a firewall that is blocking link state propagation is added to the system. For example, ports 25 and 691 are required within a routing group, and port 25 is required between routing groups. Also, the Extended Simple Mail Transfer Protocol (ESMTP) command X-LINK2STATE must not be blocked by a firewall.
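
The port requirements above can be captured in a short helper for reviewing firewall rules. This is a sketch with hypothetical function names; it covers only the TCP ports named in this topic and does not test whether a firewall filters the X-LINK2STATE command itself.

```python
SMTP_PORT = 25
LINK_STATE_PORT = 691


def required_ports(same_routing_group):
    """Return the TCP ports that link state propagation needs."""
    if same_routing_group:
        return {SMTP_PORT, LINK_STATE_PORT}
    return {SMTP_PORT}


def blocked_required_ports(same_routing_group, open_ports):
    """Return the required ports that are missing from open_ports."""
    return required_ports(same_routing_group) - set(open_ports)
```

For example, a rule set that opens only port 25 between two servers in the same routing group leaves port 691 blocked.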

To resolve this problem, the following solutions are available:

  • Upgrade the Exchange 5.5 bridgehead server to an Exchange 2000 or Exchange 2003 server, or use another Exchange 2000 or Exchange 2003 server to send link state information for this routing group again. Either of these options provides the preferred and simplest resolution.

  • To reset non-connected routing groups to link state major version number 0, shut down all Exchange servers in your organization simultaneously, and then restart all Exchange servers.

  • Configure the firewall so that link state propagation is not prevented.

For more information about isolated or disjointed routing groups and the major version numbers, see Microsoft Knowledge Base article 842026, "Routing status information is not propagated correctly to all servers in Exchange 2000 Server or in Exchange Server 2003."