One Network Day at Microsoft

Archived content. No warranty is made as to technical accuracy. Content may contain URLs that were valid when originally published, but now link to sites or pages that no longer exist.

By Alexander Levin, Microsoft Information Technology Group, Network and Systems Analysis

Editor's Note: This article is excerpted from "Optimizing Network Traffic," which is part of the Microsoft Press Notes From the Field series that outlines the best system management practices and procedures. For more information on this and other Microsoft Press books, go to https://www.microsoft.com/mspress/.

Microsoft has a large internal information infrastructure. This chapter discusses how it works and what flavors and levels of traffic cross it during a normal day. Networks are all the same (in essence) and yet all different (in particular). Looking at a day in the life of one network, however, especially a large and complex one, should show some patterns from which useful observations and inferences can be drawn. Then, too, the previous chapter discusses the various types of network traffic, so this is a good time to show how these types occur and move on a daily basis.

On This Page

In Focus
Infrastructure
Capturing and Analyzing Traffic
Browsing and Replication Traffic Between Domain Controllers
Directory Services Traffic
WINS Replication Traffic
DHCP Traffic
Backup Services
Case Studies

In Focus

Enterprise

An inside view of some network analysis done by Microsoft's Information Technology Group.

Network

This discussion contains a collection of best practices and recommendations derived from looking at a variety of intranet applications put into production on Microsoft's corporate network.

Challenge

Improving the performance of an application, especially over the wide area network.

Solution

Build efficiencies in network communication into your corporate intranet applications, developing them with WAN performance in mind.

What You'll Find In This Chapter

  • A discussion of traffic capturing and analysis techniques, based on analyses conducted on Microsoft's network infrastructure.

  • A clear analogy to understand the complexity of network traffic, and a review of some basic definitions.

  • Basic traffic capture and analysis concepts.

  • A description of a test file-transfer case.

  • A case study of a troubleshooting analysis conducted over the WAN.

Infrastructure

First things first. It is difficult to discuss this topic (as the paragraph above shows) without resorting to the term infrastructure. What is an infrastructure?

At the Mall

Suppose you go to the opening of a new mall—one of those state-of-the-mallmaker-art capitalist shrines currently replacing suburban scenery all over America and the world. This one has been super-hyped (now there's something new) and promises to have the best stuff to buy (ditto), the cleanest, most convenient setup. The works. On opening day you go to have a look.

But you see problems immediately. For starters, you can't get into the parking lot because it is packed with trucks delivering goods. You park in the next county and hike in: the entranceways are a gauntlet of advertisements and promotion booths. People are stacking up like jets over Newark. And when you finally get inside, you can't even get into the stores along the court because too many people are cleaning floors, stocking shelves, using the doorways, stacking the highly-vaunted first-class merchandise in the aisles.

All of these impediments are support activities. They are necessary, but they obviously should not supersede the mall's primary activity, which is to sell things by the ton. These functions and the store buildings and walkways are its infrastructure: an underlying foundation or framework that is (or should be) seamless. These services have to be there but they should be invisible to you when you try to buy something. Internal support should help, not prevent or degrade, the primary activity.

An Information Infrastructure

A business network is a combination of computers (2 or 20,000) and networks (software and hardware connections) linked so that they can share data and peripheral devices for a purpose, usually creating a product or providing a service. The support structure here is the information infrastructure—the computers and connections that acquire, store, and transfer information. As with the stocks at the mall, the information infrastructure should be hidden from the user, who should focus on business, not on the network's foundation or its support and repair. By creating, implementing, and maintaining a seamless infrastructure, information technology specialists add significant value to the company.

Figure 2.1 shows one possible way to represent the information infrastructure of any company of significant size.

Figure 2.1: Representation of network traffic in relation to network infrastructure.

Three Windows 2000 infrastructure services generate significantly more network traffic than all others: Server Browser (meaning the browsing function in general—there are several flavors of browsing), WINS Replication, and Directory Services. These services are integral to the operating system, running automatically when computers are booted or connected. They can be tuned, but doing so requires testing, study, and in-depth understanding of your network operations.

A fourth service, backup, is different in that network designers and implementers have to plan and configure it, and if they do these things poorly the infrastructure can take a beating from lots of unnecessary traffic, slow communications, impaired business operation, and surly users.

The rest of this chapter shows how these services look at Microsoft. Tests show how Server Browser (the browsing function), WINS Replication, and Directory Services affect the network: the type and amount of traffic they generate, how it accrues and increments, etc.

A section on backup shows how it can create special problems. The demands it places on the network and on technicians at Microsoft are enormous. In one case shown in the tests, a server's backup volume was too high to complete during off hours, and technicians had to resort to completing it during normal business hours, loading the network. The charts presented in the backup section clearly show the high levels of additional traffic that resulted during normal business hours. Short of resorting to partial backups (an unacceptable alternative) this problem requires increasing backup capacity by adding hardware.

Capturing and Analyzing Traffic

Most servers at the Microsoft corporate data center are connected to Cisco Catalyst switches with further ATM uplinks to the corporate network backbone. With a Switched Port Analyzer (SPAN), you can analyze traffic on any of these servers without service interruption, because Catalyst switches allow you to mirror traffic on any port. This means you can create a copy of the traffic as it passes through, divert it, then analyze it by connecting a sniffer or RMON probe to the port. This is a standard capturing technique, and the users on the network are in no way inconvenienced by the en-route capture of a mirrored stream of their traffic.

The analysis that follows was performed with two software tools: Microsoft Network Monitor (version 4.00.352) and NetXRay (version 3.0.3) from Network General (McAfee). Network Monitor (NetMon) was used for detailed packet-by-packet analysis because it can decode MSRPC packets. These packets are used in several types of network traffic (replication, e-mail, etc.), and if you can't decode them they are shown as plain TCP packets, which makes it almost impossible to understand what higher-level protocol is being used. Version 3.0.3 of NetXRay cannot decode MSRPC packets (the SnifferPro product version 2.0.01 can), but it is a good tool for gathering statistics over a period of time.

The difficulty behind this sort of monitoring is to find the correct network point (connector, hub, router, etc.) where the traffic you want to analyze will pass through, then use a tool to capture it (by port mirroring, for example). In the good old days (in terms of networking technology, a few months ago) it was almost always possible to find such a capture point.

Now, however, the proliferation of switches and ATM connections from the WAN to the ATM company backbone is making it harder and harder to capture specific data streams without very expensive equipment and circuit intrusion—sometimes impossible. When data flowed only through routers or shared hubs, it was possible to find a point at which to intercept the traffic you were interested in. Data streams that go into switches over high-speed media, full-duplex transfer modes, and ATM, however, cannot always be parsed and captured, so traffic analysis is developing a new set of challenges.

Browsing and Replication Traffic Between Domain Controllers

To return to the case at hand, the capture was easy enough to effect. A computer running Network Monitor and NetXRay was connected to a port on the Catalyst switch to which a domain controller's to/from traffic was mirrored. An IP filter was configured on both tools to capture traffic between the primary domain controller (PDC) and one of the backup domain controllers (BDCs). Figure 2.2 is a snapshot of that traffic.

Figure 2.2: Traffic volume between PDC and one BDC.

Spikes on this graph are produced by the browser function, which is invoked at 12-minute intervals and delivers a substantial amount of information within a short period of time. The browser is a network resource enumeration tool. It maintains a list, called a browse list, of all available servers, workgroups, and domains. When a user attempts to connect to any network resource using Network Neighborhood, the list of servers that appears is provided by a browser in the local computer's workgroup or domain. This is a Windows service and it can generate substantial traffic.

NetXRay averaged the spike level over each 30-second interval, so the actual browser spike level is a little higher than the value shown. The data transfer on the LAN after the announcement lasts 7 to 13 seconds and moves 115 to 135 KB, which means that peak levels go as high as 140 Kbits/sec. On a highly utilized 10-Mbps Ethernet, these data transfers can cause traffic slowdowns and even some packet discards. They can be even more disruptive on a WAN, especially on slow links. You can tune announcement periodicity in the registry, but that sort of procedural detail is beyond the scope of this chapter.

Directory Services Traffic

Figure 2.3 shows that actual data replication (directory replication) traffic is pretty low on the LAN. Keep in mind that the graphed levels are averaged over 30-second intervals; the actual average traffic level, calculated from the data capture, is around 30 Kbits/sec. This graph represents traffic levels between the 12-minute browser spikes, and the two figures use different scales. Note that the graph covers approximately 11 minutes, just a fragment of the larger capture.

Figure 2.3: Domain replication traffic.

Figure 2.3 shows a normal mode of partial replication, which is accompanied by these messages in the Windows NT EVENT.LOG. Notice that the more recent entry is at the top.

10/5/98

11:59:08 PM NETLOGON Information None 5715 N/A NAME The partial synchronization replication of the Security Accounts Manager (SAM) database from the primary domain controller \\Controller_NAME completed successfully. 7 change(s) is(are) applied to the database.

10/5/98

11:49:08 PM NETLOGON Information None 5715 N/A NAME The partial synchronization replication of the SAM database from the primary domain controller \\Controller_NAME completed successfully. 2 change(s) is(are) applied to the database.

Four types of accounts are stored in the SAM database: user, computer, global group, and local group.

But problems can occur in the replication process, in which case messages such as these (more recent entry first) appear in the EVENT.LOG:

10/6/98

12:44:26 AM NETLOGON Information None 5717 N/A NAME The full synchronization replication of the SAM database from the primary domain controller \\Controller_NAME completed successfully.

10/6/98

12:37:45 AM NETLOGON Warning None 5716 N/A NAME The partial synchronization replication of the SAM database from the primary domain controller \\Controller_NAME failed with the following error: (error in HEX)

After a failure such as the one above, a traffic graph of the full synchronization process looks like this:

Figure 2.4: Traffic volume spike during full synchronization between PDC and one BDC.

This is a pretty high level (up to 1.4 Mb/sec.) of traffic with a long duration (approximately 1.5 hours).

WINS Replication Traffic

WINS servers are dedicated to name resolution, which is necessary in a network environment. In dynamic environments there is no manual IP address – server name administration, and computers' IP addresses can change whenever they boot up or change location. Even in a static environment people cannot be expected to remember IP addresses.

Users can remember server names reasonably well, so most network servers are given names. That helps users immeasurably, but it does nothing for traffic on the wire, which can find servers only by address, not by name. So the network uses a mechanism such as WINS to resolve names, that is, to match them to their assigned IP addresses. Most organizations have more than one WINS server (for fault tolerance as well as performance), and these servers create network traffic when they update databases among themselves. A capture of the traffic to/from one of the WINS servers on Microsoft's network looks like this (presented in Kbits/sec. and packets/sec.).

Figure 2.5: Traffic volume to/from one of the WINS servers in Kbits/sec.

Figure 2.6: Traffic volume to/from one of the WINS servers in packets/sec.

The good news is that WINS resolution traffic is pretty low—no higher than 8-10 Kbits/sec. (7-10 packets/sec.). But there is more to look at. The capture shows that every 12 minutes the WINS server provides browser data for browse masters (browsing is explained in the "Browser Service (Browsing) Traffic" section of Chapter 1). Each request generates a response; the total response volume in the analyzed case was 247 packets of data and acknowledgements. The capture size of the saved response for the browser request consisted of 164 full-size packets (1514 bytes each). It takes 0.34 seconds to complete a transaction, so the spike level calculates out to approximately 5.8 Mbits/sec. (captured on the 100-Mbps Ethernet).

Notice that on the chart (Figure 2.6) it looks much lower; this is because the capturing tool averaged the traffic level for the 30-second period. The WINS server that was observed was in communication with 11 major servers—some were domain controllers, some were WINS servers, but all of them were master-browsers. A spike level such as this can be dangerous on lower-speed networks or highly-utilized Ethernets because longer response transmission times can overlap and create a substantial "noise" level.
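The spike arithmetic above, and the way a 30-second averaging window flattens it, can be reproduced with a short sketch. The packet count, frame size, and transaction time come from the text; the averaged value assumes that nothing else crosses the wire during the window, which is a simplification made only for illustration.

```python
FULL_FRAME_BYTES = 1514        # full-size Ethernet frame, as quoted in the text
RESPONSE_FRAMES = 164          # full-size packets in the saved browser response
TRANSACTION_SECONDS = 0.34     # time to complete one transaction
AVERAGING_WINDOW_SECONDS = 30  # averaging period used by the capture tool


def rate_bits_per_sec(total_bytes: int, seconds: float) -> float:
    """Average bit rate for total_bytes moved within the given time."""
    return total_bytes * 8 / seconds


response_bytes = RESPONSE_FRAMES * FULL_FRAME_BYTES
spike_bps = rate_bits_per_sec(response_bytes, TRANSACTION_SECONDS)
# Same bytes spread over the whole averaging window
# (assumption: no other traffic in that window).
averaged_bps = rate_bits_per_sec(response_bytes, AVERAGING_WINDOW_SECONDS)

print(f"spike:    {spike_bps / 1e6:.1f} Mbits/sec")
print(f"averaged: {averaged_bps / 1e3:.1f} Kbits/sec")
```

Under that assumption a single response that peaks near 5.8 Mbits/sec contributes only about 66 Kbits/sec to a 30-second average, which is why the chart understates the spike.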

Figures 2.7 and 2.8 show the traffic distribution (in percentages and MBs, respectively) of the traffic between one WINS server and the 20 others with which it communicates to exchange lists of domains, servers, and user computers. This represents administrative traffic only. It makes sense to learn traffic patterns on your network and tune this service, because data exchange and traffic volumes correlate to the return on investment of the company network.

Figure 2.7: Traffic distribution to/from one of the WINS servers over 24 hours in percentages.

Figure 2.8: Traffic volumes to/from one of the WINS servers over 24 hours in total MBs.

DHCP Traffic

DHCP (Dynamic Host Configuration Protocol) makes it easier to manage large TCP/IP networks by offering automatic TCP/IP configuration. Computers can be configured to obtain IP addresses automatically from the DHCP server, so during initial configuration users need not be concerned with configuring IP manually and registering the static IP address in the database. It helps even more when users travel, making it possible for them to connect a laptop computer in any office of the company all around the world and automatically obtain and register a valid IP address for the network they are connected to.

A capture of the traffic to/from one of the DHCP servers on Microsoft's network looks like this (presented in Kbits/sec. and packets/sec.).

Figure 2.9: Traffic volumes to/from one of the DHCP servers in packets/sec.

Figure 2.10: Traffic volumes to/from one of the DHCP servers in Kbits/sec.

Server characteristics:

  • Total number of scopes (subnets maintained by a server): 71

  • Total addresses: 65130

  • Addresses in use: 39884 (61%)

  • Addresses available: 25246 (39%)

Traffic to/from a DHCP server consists of IP address leases and other related services provided by the server. For the server examined in this test, it appears that little of its traffic is related to DHCP—around 20-22 Kbits/sec. (8-10 packets/sec.).

This would be considered low traffic on a fast LAN, but it wouldn't make sense to have this kind of "background" administrative traffic on a WAN, where speeds are slower and costs are higher. For this reason, Microsoft offices have local DHCP servers so that users do not have to send IP address requests across the country or around the world. If a local DHCP server fails, requests can be serviced elsewhere on the net, but this is rare.

The sample capture of DHCP-related traffic between 8:30 a.m. and 2:30 p.m. at Microsoft shows an average of 74 requests for IP addresses per minute, each of which creates six small packets (about 2121 bytes total). So the calculation represented on the graph is: 74 requests x 2121 bytes/request x 8 bits/byte divided by 60 seconds/minute = 20.9 Kbits/sec. This has no significant impact on the LAN. The next graph shows only DHCP-related traffic:

Figure 2.11: DHCP traffic in Kbps.
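The DHCP load calculation quoted above is simple enough to check in a few lines. The request rate and bytes per request come from the text. The 256-Kbps link used for comparison is only a hypothetical slow WAN circuit, chosen to illustrate why this background traffic is better kept local.

```python
REQUESTS_PER_MINUTE = 74   # average IP address requests per minute (from the text)
BYTES_PER_REQUEST = 2121   # six small packets per request, total bytes (from the text)


def dhcp_load_bps(requests_per_minute: float, bytes_per_request: float) -> float:
    """Average DHCP load in bits per second."""
    return requests_per_minute * bytes_per_request * 8 / 60


load_bps = dhcp_load_bps(REQUESTS_PER_MINUTE, BYTES_PER_REQUEST)
print(f"DHCP load: {load_bps / 1000:.1f} Kbits/sec")

# Hypothetical comparison: the same load carried over a 256-Kbps WAN link.
HYPOTHETICAL_WAN_BPS = 256_000
print(f"Share of a 256-Kbps link: {load_bps / HYPOTHETICAL_WAN_BPS:.1%}")
```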

Spikes on Figures 2.9 and 2.10 can be explained by some additional services, not related to DHCP (browser, database replications between servers, System Management Server operations), running on the analyzed server. These take much more bandwidth than IP address distribution, and the spikes hit 200 Kbits/sec. (35-40 packets/sec.) and higher. Remember that the tool used to produce the graph averaged spikes over the 30-second period, so the graph shows them lower than their actual level, which jumps up to 300-350 Kbits/sec.

Backup Services

Traffic volumes on one of the backup servers look like this:

Figure 2.12: Traffic volume on backup server with four tape drives in Mbits/sec.

Figure 2.13: Traffic volume on backup server with four tape drives in packets/sec.

The backup services send many full-size packets (1514 bytes) and small acknowledgements (60 bytes) and this distribution (shown in Figure 2.14) is easy to explain: the backup software uses full-size Ethernet packets whenever possible.

Figure 2.14: Packet size distribution of backup traffic.

Case Studies

If you have ever distributed content on the Internet or moved big batches of data around, you probably have some experience complaining about slow links and long waiting times. Most people immediately think that additional bandwidth will alleviate whatever frustration they are experiencing. Maybe it will, and maybe it won't. What can be done to speed things up without spending money to upgrade bandwidth?

The Copying Process

Suppose you want to move a directory with lots of files across the country from Redmond, WA, to Boston, MA, over a 1.536-Mbps (T1) circuit. How long can you expect it to take?

Here are two ways to effect the move:

  1. Go to the command prompt and use the command copy *.* to move all the files to their destination.

  2. Compress the directory into one file and move it using the same command.

These methods were tested in a lab, using two routers and a WAN simulator to introduce a cross-country delay of 45 ms one way, or 90 ms for a round trip. The 45 ms approximates the time required for the electromagnetic wave to "fly" from coast to coast (a propagation speed set by the laws of physics), although it also includes some delay introduced by multiple network devices on the transmission path. The test copied two objects: a Windows NT source directory with 2477 files (53.1 MB total) and a Windows NT source directory ZIPped into a single 49-MB file.

Summary of test results of transferring large files by two methods.

Object transferred                        Size      Amount of data on the network  Transfer time  Transfer rate
One file                                  ~49 MB    52.5 MB                        ~13.5 minutes  516 Kbits/sec.
Windows NT source directory (2477 files)  ~53.1 MB  65.5 MB                        ~56.5 minutes  154.4 Kbits/sec.

Obviously, copying multiple files creates significant overhead on the wire (12.4 MB vs. 3.5 MB). This is because the process involves numerous iterations of creating, opening, and closing files, which rapidly increases the number of round trips between source and destination.
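The overhead and rate figures from the test can be recomputed as a quick sanity check. This sketch assumes decimal megabytes (1 MB = 1,000,000 bytes) and uses the rounded times reported for the test, so the computed rates land close to, but not exactly on, the reported values.

```python
from dataclasses import dataclass

BYTES_PER_MB = 1_000_000  # assumption: decimal megabytes


@dataclass
class Transfer:
    name: str
    payload_mb: float  # size of the object being copied
    wire_mb: float     # amount of data seen on the network
    minutes: float     # approximate transfer time

    @property
    def overhead_mb(self) -> float:
        """Extra data on the wire beyond the payload itself."""
        return self.wire_mb - self.payload_mb

    @property
    def rate_kbits_per_sec(self) -> float:
        """Average rate of all data on the wire over the transfer time."""
        return self.wire_mb * BYTES_PER_MB * 8 / (self.minutes * 60) / 1000


one_file = Transfer("One file", 49.0, 52.5, 13.5)
directory = Transfer("Directory (2477 files)", 53.1, 65.5, 56.5)

for t in (one_file, directory):
    print(f"{t.name}: overhead {t.overhead_mb:.1f} MB, "
          f"rate {t.rate_kbits_per_sec:.0f} Kbits/sec")
```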

Sniff fragment of transferring one file with opening/closing overhead.

Frame  Time (between packets)  Src Name     Dst Name     Protocol  Description
1      0                       MyRouter     GNETTOOLS74  SMB       C transact2 Query path info, File = \ntsource\WOWFAXUI.DL_
2      0.002                   GNETTOOLS74  MyRouter     SMB       R transact2 - NT error, System, Error, Code = (52) STATUS_OBJECT_
3      0.099                   MyRouter     GNETTOOLS74  SMB       C NT create & X, File = \ntsource\WOWFAXUI.DL_
4      0.003                   GNETTOOLS74  MyRouter     SMB       R NT create & X, FID = 0x8806
5      0.097                   MyRouter     GNETTOOLS74  SMB       C write, FID = 0x8806, Write 0x0 at 0x00001C47
6      0.001                   GNETTOOLS74  MyRouter     SMB       R write, Wrote 0x0
7      0.112                   MyRouter     GNETTOOLS74  SMB       C write block raw, FID = 0x8806, Write 0x10c4 of 0x1c47 at 0x0
8      0.008                   MyRouter     GNETTOOLS74  NBT       SS: Session Message Cont., 1460 Bytes
9      0.001                   GNETTOOLS74  MyRouter     TCP       .A...., len: 0, seq: 251972888-251972888, ack: 202967151, win:
10     0.007                   MyRouter     GNETTOOLS74  NBT       SS: Session Message Cont., 1440 Bytes
11     0.001                   GNETTOOLS74  MyRouter     SMB       R write block raw
12     0.112                   MyRouter     GNETTOOLS74  NBT       SS: Session Message, Len: 2947
13     0.008                   MyRouter     GNETTOOLS74  NBT       SS: Session Message Cont., 1460 Bytes
14     0                       MyRouter     GNETTOOLS74  NBT       SS: Session Message Cont., 31 Bytes
15     0                       GNETTOOLS74  MyRouter     TCP       .A...., len: 0, seq: 251972929-251972929, ack: 202971542, win:
16     0.097                   MyRouter     GNETTOOLS74  SMB       C transact2 Set file info, FID = 0x8806
17     0.001                   GNETTOOLS74  MyRouter     SMB       R transact2 Set file info (response to frame 16)
18     0.097                   MyRouter     GNETTOOLS74  SMB       C close file, FID = 0x8806
19     0                       GNETTOOLS74  MyRouter     SMB       R close file

Total time: 0.646 s

The trace above shows the overhead during the transfer of one of the NT source files (WOWFAXUI.DL_: 7,239 bytes). It took 0.646 seconds (6 round trips) to complete this transaction; it takes only 0.025 seconds to copy the same amount of data contained in one big file. Clearly, the number of round trips on the WAN is a significant component of transfer time, a fact that is built into file-transfer logic.
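The round-trip count can be recovered from the inter-packet times in the single-file trace. The sketch below treats any gap of at least 45 ms (the simulated one-way delay) as a wait for the far end. That threshold is an assumption chosen for illustration, not something the capture tools report.

```python
# Inter-packet times in seconds for frames 1-19 of the single-file trace.
GAPS = [0, 0.002, 0.099, 0.003, 0.097, 0.001, 0.112, 0.008, 0.001, 0.007,
        0.001, 0.112, 0.008, 0, 0, 0.097, 0.001, 0.097, 0]

# Assumption: a gap at least as long as the 45-ms one-way delay is a round-trip wait.
ROUND_TRIP_THRESHOLD = 0.045

waits = [gap for gap in GAPS if gap >= ROUND_TRIP_THRESHOLD]
total_time = sum(GAPS)
waiting_time = sum(waits)

print(f"round trips:  {len(waits)}")
print(f"total time:   {total_time:.3f} s")
print(f"time waiting: {waiting_time:.3f} s ({waiting_time / total_time:.0%})")
```

By this count, nearly all of the 0.646 seconds is spent waiting on the WAN rather than putting data on the wire.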

Sniff fragment of data transfer without opening/closing overhead.

Frame  Time (between packets)  Src Name     Dst Name     Protocol  Description
1      0                       MyRouter     GNETTOOLS74  NBT       SS: Session Message Cont., 1460 Bytes
2      0.001                   GNETTOOLS74  MyRouter     TCP       .A...., len: 0, seq: 194679713-194679713, ack: 138262571, win:
3      0.008                   MyRouter     GNETTOOLS74  NBT       SS: Session Message Cont., 1460 Bytes
4      0.008                   MyRouter     GNETTOOLS74  NBT       SS: Session Message Cont., 1460 Bytes
5      0                       GNETTOOLS74  MyRouter     TCP       .A...., len: 0, seq: 194679713-194679713, ack: 138265491, win:
6      0.008                   MyRouter     GNETTOOLS74  NBT       SS: Session Message Cont., 1460 Bytes

Total time: 0.025 s

Note that the time between packets is 8 ms, which is the time required to put a full-size Ethernet packet (1514 bytes x 8/1536000 = 7.9 ms) on the wire. The actual packet size on the serial side of the router is not exactly 1514 bytes, but the difference is irrelevant in this type of calculation.
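The serialization time quoted above follows directly from frame size and link speed. A minimal sketch, using the 1514-byte full-size frame and the 1.536-Mbps T1 rate from the text, with the 60-byte acknowledgement size from the backup section added for comparison:

```python
T1_BITS_PER_SEC = 1_536_000  # T1 circuit rate used in the test


def serialization_delay_ms(frame_bytes: int, link_bits_per_sec: int) -> float:
    """Time needed to clock one frame onto the link, in milliseconds."""
    return frame_bytes * 8 / link_bits_per_sec * 1000


full_frame_ms = serialization_delay_ms(1514, T1_BITS_PER_SEC)
ack_ms = serialization_delay_ms(60, T1_BITS_PER_SEC)

print(f"full-size frame: {full_frame_ms:.1f} ms")
print(f"60-byte ack:     {ack_ms:.2f} ms")
```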

Troubleshooting a High Utilization Problem on the WAN

Microsoft has lots of remote offices around the world, and they are connected to the main campus in Redmond, WA, by means of point-to-point links and through hubs and shared WAN links. When one office complained about slow network connectivity, the IT group studied the traffic levels and flavors between the remote site and campus to find out what was going on.

Figuring Out How To Test

To perform this study, Microsoft's IT people used a home-made Web-based tool that reports link utilization parameters for all WAN links in the corporation, and gathers some other statistics that are important for capacity planning. The tool showed that utilization on the 256-Kbps link (the troubled one) stayed high 24 hours a day. This is an immediate and obvious indication that something is out of whack. Utilization during normal business hours was about 100%, which is high, but it was more significant that it stayed at around 30% at night. Obviously, levels should drop significantly after the end of the business day, so this consistently high utilization rate indicated a problem. What was it?

The high levels after regular business hours indicated unknown additional traffic. This meant that traffic analysis during nights or weekends would provide more coherent troubleshooting indicators because the link would be running "clean" (without the usual business applications).

Now that the ITG had decided to ruin a weekend by sitting around sniffing the network, there remained the task of planning the sniff and how to make it. The WAN connection to be examined came directly to Redmond through a router with multiple serial links on the WAN side and a Fiber Distributed Data Interface (FDDI) connection on the LAN side.

This was good news because it meant it was possible to use an old-style Network General sniffer with the FDDI card and 10% beam splitter to see all the traffic coming to the corporate network. The bad news was that the old sniffer was an MS-DOS-based device without good RMON capabilities. This meant two things. The first was that this tool could not be used to gather, save, and adequately format detailed source-destination information over a period of time. The second was that another solution was needed.

After some deliberation, it was decided that Network General's NetXRay would work. This software tool, which loads on top of Windows NT, gathers statistical information that can be very helpful. The next step was to try to insert a PC with NetXRay and an FDDI network card through the beam splitter. This would simplify things by allowing easy access to all traffic coming in to the campus without looking for another access point on switches "closer" to the backbone. Unfortunately, this attempt was unsuccessful because when information is transmitted by means of light you need a certain power of the incoming light or you can't capture the data, and in this case there was not enough light power for the FDDI card.

The next step was to look higher in the connection hierarchy and finally there was some good luck. As an uplink, the router still had an old shared FDDI hub, which allowed the test devices to be connected to any available port, showing all the traffic that moved through it. (The Microsoft corporate network now has an ATM backbone and traffic analysis (packet sniffing) capabilities on this backbone are significantly reduced. This type of troubleshooting can be accomplished on ATM connections only with very expensive tools.)

Figuring Out the Capture

Once a computer with NetXRay was successfully connected to the shared FDDI hub, all site traffic became visible on the wire. The troubled remote site had two IP address ranges assigned to the Ethernet interface, so the capture had to show all traffic between Redmond and these two IP ranges. Using knowledge of the address ranges, the team identified and isolated the appropriate traffic, and set up the sniff. After several hours the capture was stopped and this is what it showed (in Kbits/sec. and packets/sec.).

Figure 2.15: Traffic levels to/from a remote site coming to the corporate network in Kbits/sec.

Figure 2.16: Traffic levels to/from a remote site coming to the corporate network in packets/sec.

These graphs verify that traffic levels are a little high for the weekend and night time from Saturday to Sunday. The reason still was not apparent, so a deeper look was arranged into the top 20 pairs of communicating servers between the remote site and corporate campus. The diagram of this distribution is shown in Figure 2.17.

Figure 2.17: Traffic distribution between top 20 server pairs communicating between the remote site and the corporate campus.

A real-life chart should represent bursty and hardly predictable network traffic, so it is startling when one comes out with the pie cut into nice even pieces. It is a sure-fire indication that something is not right. In this case, closer examination of the IP addresses for the communicating pairs showed that the first nine pairs were communicating with only one corporate network server.
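The two analysis steps described here, selecting traffic between the remote site's address ranges and the campus and then ranking the communicating pairs, can be sketched with the Python standard library. The address ranges, host addresses, and byte counts below are hypothetical (documentation prefixes), and the record layout is an assumption. The real study used NetXRay's own statistics rather than code like this.

```python
import ipaddress
from collections import Counter, defaultdict

# Hypothetical stand-ins for the two IP ranges assigned to the remote site.
REMOTE_RANGES = [
    ipaddress.ip_network("192.0.2.0/24"),
    ipaddress.ip_network("198.51.100.0/24"),
]


def is_remote(addr: str) -> bool:
    """True if the address falls inside one of the remote-site ranges."""
    ip = ipaddress.ip_address(addr)
    return any(ip in net for net in REMOTE_RANGES)


def summarize(records):
    """records: iterable of (src, dst, nbytes) tuples.

    Returns bytes per (remote, corporate) pair and, for each corporate
    host, the set of remote hosts that talk to it.
    """
    pair_bytes = Counter()
    peers = defaultdict(set)
    for src, dst, nbytes in records:
        if is_remote(src) == is_remote(dst):
            continue  # keep only traffic that crosses the WAN link
        remote, corp = (src, dst) if is_remote(src) else (dst, src)
        pair_bytes[(remote, corp)] += nbytes
        peers[corp].add(remote)
    return pair_bytes, peers


# Made-up capture records for illustration only.
records = [
    ("192.0.2.10", "203.0.113.5", 1514),
    ("203.0.113.5", "192.0.2.10", 60),
    ("192.0.2.11", "203.0.113.5", 1514),
    ("198.51.100.7", "203.0.113.5", 1514),
    ("198.51.100.8", "203.0.113.9", 400),
    ("203.0.113.5", "203.0.113.9", 999),  # campus-only traffic, ignored
]

pair_bytes, peers = summarize(records)
print(pair_bytes.most_common(20))  # top pairs by volume
for corp, remotes in sorted(peers.items(), key=lambda kv: len(kv[1]), reverse=True):
    print(f"{corp} talks to {len(remotes)} remote host(s)")
```

A corporate host that appears opposite many remote hosts, each with a similar byte count, is the pattern that produced the suspiciously even pie chart.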

Traffic levels between nine servers at the remote site and a single server in Redmond looked like this:

Figure 2.18: Traffic levels between nine remote servers and a single server on the corporate network.

Figure 2.19: Traffic levels between nine remote servers and a single server on the corporate network.

Once this behavior was discovered, a Network Monitor capture showed that nine remote site servers were requesting browser lists from one corporate server. Each request created a response of about 155 full-size Ethernet packets. The transaction took about 22 seconds, which meant that the spike was about 85 Kbits/sec. The screen capture below shows how frequently these requests were coming to the corporate server.

Figure 2.20: Network Monitor screen capture showing browser list request periodicity from different servers.
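The spike level for a single browser-list response can be checked against the capacity of the troubled link. The packet count and duration come from the Network Monitor capture described above. The calculation ignores any difference between Ethernet framing and the framing on the serial link, which is a simplifying assumption.

```python
FULL_FRAME_BYTES = 1514      # full-size Ethernet packet
RESPONSE_FRAMES = 155        # packets per browser-list response
TRANSACTION_SECONDS = 22     # duration of one transaction
LINK_BITS_PER_SEC = 256_000  # the troubled 256-Kbps WAN link

spike_bps = RESPONSE_FRAMES * FULL_FRAME_BYTES * 8 / TRANSACTION_SECONDS
link_share = spike_bps / LINK_BITS_PER_SEC

print(f"spike:      {spike_bps / 1000:.1f} Kbits/sec")
print(f"link share: {link_share:.0%} while one response is in flight")
```

One response in flight therefore occupies roughly a third of the link, before any business traffic is counted.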

What did this mean in terms of fixing the slow link? It meant that the solution lay in tuning, not additional bandwidth. The remote office's servers have to request browser lists, but configuring all of them to request the lists over the WAN was disastrously inefficient. The system was reconfigured so that only one active server communicates with the corporate master browser; the other servers request browser lists locally. This took the redundant requests and their responses off the WAN and dropped line utilization back into an acceptable and understandable range.

The NetMon capture shows another interesting feature: the interpacket delay in the "Time" column indicates that requests from the remote site are coming to the corporate network at irregular intervals. Delays after packets 91, 469, 862, 1270, etc., are very short, while the delays after all other packets are longer. What can cause these short delays?

The search for a cause begins with a closer look at the source and destination MAC addresses. These belong to the routers, which is as it should be. The analysis tool was connected on the corporate network. Packets coming from the remote office crossed several routers on the way, then had to cross one more router to get to the server on the corporate network. Thus, on the MAC level, source and destination addresses should be router MAC addresses, specifically:

  • 080002 0691D5 (3Com router)—the MAC address of the WAN router that the remote site packets come from

  • 00E034 B43840 (Cisco router)—the MAC address of the router that the destination server is "sitting" behind

So the question is: what is the MAC address (080002 1068AB, another 3Com router) shown for packets 92, 470, 863, 1271, etc.?

Here is what this was found to indicate. Packet 91, for example, comes from the remote site to the corporate network and, instead of being forwarded directly to the Cisco router with MAC address 00E034 B43840, it goes to the 3Com router with MAC address 080002 1068AB. This "mystery" router forwards it to its destination through the Cisco router with MAC address 00E034 B43840, which in turn forwards it to the destination server connected to the corporate network. This means:

  • Packets make an additional hop (they are rerouted one more time).

  • Packets are delivered with an additional delay varying from 1 ms (packet 1271) to 50 ms (packet 3453). The upper figure is comparable to the propagation delay across the USA, a significant amount of time.

  • Packets from the remote site are going through the shared hub twice.

  • This "double flow" creates an additional load on the shared hub.
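The MAC-level check described above amounts to flagging frames whose source or destination is not one of the two expected routers. A minimal sketch follows. The MAC addresses and frame numbers 91 and 92 come from the text, frame 93 is invented to show a normal frame, and the `(frame_number, src_mac, dst_mac)` record layout is an assumption rather than a Network Monitor export format.

```python
# Routers expected on the path, keyed by normalized MAC address.
EXPECTED_ROUTERS = {
    "0800020691D5": "3Com WAN router (remote-site side)",
    "00E034B43840": "Cisco router (in front of the destination server)",
}


def normalize(mac: str) -> str:
    """Strip separators and upper-case a MAC address string."""
    return mac.replace(" ", "").replace(":", "").replace("-", "").upper()


def unexpected_frames(frames):
    """frames: iterable of (frame_number, src_mac, dst_mac).

    Returns (frame_number, [unknown MACs]) for frames that involve
    an address outside EXPECTED_ROUTERS.
    """
    hits = []
    for number, src, dst in frames:
        unknown = [m for m in (normalize(src), normalize(dst))
                   if m not in EXPECTED_ROUTERS]
        if unknown:
            hits.append((number, unknown))
    return hits


frames = [
    (91, "080002 0691D5", "080002 1068AB"),  # sent to the "mystery" router
    (92, "080002 1068AB", "00E034 B43840"),  # forwarded on to the Cisco router
    (93, "080002 0691D5", "00E034 B43840"),  # invented example of the direct path
]

for number, macs in unexpected_frames(frames):
    print(f"frame {number}: unexpected MAC {', '.join(macs)}")
```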

This path is indeed confusing. And it is a reminder that sometimes it makes sense to take a closer look at routing to verify that data streams are going through the correct paths. In a heavily utilized network, false routing (especially through the WAN) can badly affect network utilization and, as a result, application performance.

In this case, however, further investigation proved that there was no mystery and that this was a correct path. Ongoing network reengineering had resulted in the concurrent use of old and new backbones and this resulted in a necessary redirection of traffic flow. The "strange" evidence indicated only business as usual.