Microsoft Exchange Server Analyzer Retrospective

In this article, I’ll give you some history behind the family of Microsoft Exchange Server Analyzer Tools, how the technology developed, and how it has been used up to this point. I’ll also introduce you to the team behind the tools and talk about where we’re going with these tools in the future.

How It All Started

At Microsoft, we work hard to optimize customer service and support by identifying CritSits, or "critical situations": cases in which a customer needs urgent assistance to resolve a problem that is interrupting service to end users and affecting important business operations. When a CritSit is initiated, a whole team of Microsoft personnel from Microsoft Customer Service and Support, Microsoft Services (our consulting arm), account management, and product development cooperates to get the customer up and running as soon as possible.

At the end of calendar year 2003, we on the Exchange Server team started noticing three important CritSit trends for Exchange Server:

  • CritSits were becoming more frequent.
  • Over 60 percent of all Exchange Server CritSits were caused by configuration problems, not bugs in the product.
  • In some new CritSit cases, other customers had experienced the same CritSit problem just months earlier.

It was frustrating to see that a simple configuration change could have such a dramatic effect on a business-critical messaging environment, and we weren't happy to see similar CritSits recurring across customers. What we needed was a tool that programmatically analyzed Exchange servers and flagged any issues that were known to cause performance, scalability, and availability problems.

In January 2004, I sent an e-mail to Jon Avner, one of the lead developers on the Exchange team, with an idea for such a tool. It just so happened that Jon had been thinking along similar lines, and so, in a matter of days, we fleshed out the specification and functionality of the tool over e-mail. Then, over one weekend, Jon hammered out what was to be the first version of the Microsoft Exchange Server Best Practices Analyzer Tool, or what we lovingly call the ExBPA tool.

At the time, we didn’t truly understand the importance of the technology that we had created. Jon had many years of development experience and the insight to create a generic framework that could be applied to almost any scenario, Exchange or otherwise. I had spent several years designing and troubleshooting large Exchange topologies and had the background knowledge and experience to immediately put together a list of checks that the tool should make.

Part Way There

Jon’s tool was great at collecting data from many different namespaces: Active Directory directory service, the registry, Microsoft Windows Management Instrumentation (WMI), and others. When a customer was having a problem, a Customer Service and Support engineer asked the administrator to run the tool, and then the support engineer examined the rich information that the tool collected to try to diagnose the root cause of the customer's problem.

The tool was collecting a lot of great information, but we needed a way to automatically parse and analyze it. That's when we called on the services of Jack Bennetto, an expert in automation who knows lots about XML Path Language (XPath), a comprehensive language used for navigating through the hierarchy of an XML document. In a few weeks, Jack had developed an analysis engine that we grafted onto Jon’s collection engine. With Jack's new engine we could author a series of "rules" that would generate a message based on a given threshold. If the collected value was outside the tolerable range, a rule would "fire."
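
The idea of a rule that "fires" when a collected value falls outside a tolerable range can be sketched in a few lines. This is only an illustration, not the tool's actual rule format: the XML layout, the setting names, the thresholds, and the `evaluate_rules` function are all assumptions, and Python's built-in XPath subset stands in for whatever the analysis engine really used.

```python
import xml.etree.ElementTree as ET

# Hypothetical collected data; element and setting names are invented.
COLLECTED_XML = """
<scan>
  <server name="EX01">
    <setting name="ExampleBufferCount">9000</setting>
    <setting name="ExampleQueueLimit">200</setting>
  </server>
</scan>
"""

# Hypothetical rules: an XPath expression selects the collected value,
# and the rule fires when that value is outside [low, high].
RULES = [
    {
        "xpath": ".//setting[@name='ExampleBufferCount']",
        "low": 128,
        "high": 512,
        "message": "ExampleBufferCount is outside the recommended range",
    },
    {
        "xpath": ".//setting[@name='ExampleQueueLimit']",
        "low": 100,
        "high": 1000,
        "message": "ExampleQueueLimit is outside the recommended range",
    },
]


def evaluate_rules(collected_xml, rules):
    """Return one message for every rule that fires."""
    root = ET.fromstring(collected_xml)
    fired = []
    for rule in rules:
        for node in root.findall(rule["xpath"]):
            value = int(node.text)
            if not rule["low"] <= value <= rule["high"]:
                fired.append(f"{rule['message']} (found {value})")
    return fired


for message in evaluate_rules(COLLECTED_XML, RULES):
    print(message)
```

Because the rules here are plain data rather than code, new checks could be added without touching the evaluation function.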

Things were going great, and we decided to run our new tool against the internal Exchange topology here at Microsoft. This did not go well. It took over 24 hours for the tool to run. That was much too long, because this tool was designed to help customers who were in the grip of an e-mail downtime nightmare. We needed a performance expert who understood how to make this tool scale better. Enter Kevin Chase, one of the finest debugging engineers you're likely to meet at Microsoft. After Kevin made his changes, our collection time for the global Exchange topology was down to two hours, and individual servers would scan in less than five minutes.

The tool was now working really well, but its analysis capabilities were limited. We needed a richer set of rules to apply to the collected data. Fortunately, the rules for data collection and analysis were stored in an XML file that we could easily update without having to recompile the code.

After we had tried the tool at several customer sites, it became apparent that e-mail issues aren’t isolated to the Exchange Server software itself. Many times, e-mail problems occur because the underlying infrastructure, such as the operating system or name resolution scheme, isn’t working correctly. We expanded the scope of the rules to cover such dependencies. We had to look holistically at the entire ecosystem, not only at Exchange and the components that run under Exchange, but also at applications that run on top of Exchange. This is where Independent Software Vendors (ISVs), such as the antivirus vendors, were engaged to supply a set of rules for their software.

All this hard work and collaboration paid off when, in early September 2004, we released Exchange Server Best Practices Analyzer Tool v1.0 as a free download.

Change is Good

A key success factor of the Exchange Server Best Practices Analyzer has been our ability to respond quickly to customer needs and product updates. Every week we find better ways of running an Exchange system. The best practice rules that are encoded into the Exchange Server Best Practices Analyzer come from a variety of sources: our own development team, Microsoft IT, Customer Service and Support, Microsoft Services, and customers. The purpose of the Exchange Server Best Practices Analyzer is to encode all this knowledge into a single application. Today, the tool performs well over 1,500 configuration checks on each Exchange server that it scans. It tells you whether the server is having problems coping with the load, whether your software (from Microsoft or from another manufacturer) is out of date, and whether there are any best practices that you should implement.

We update the rules database every month, so that customers know that they are receiving the latest authoritative information about how to best deploy and run Exchange Server. To date, over 700,000 copies of the Exchange Server Best Practices Analyzer have been downloaded.

Room for New Analyzers

One of the design philosophies behind the Exchange Server Best Practices Analyzer was ease of use. The administrator doesn’t have to understand much about the tool to get it running. The Exchange Server Best Practices Analyzer automatically detects Exchange servers, understands how those servers are being used, and figures out which versions of Exchange Server and Windows are installed.

A subsequent review of the CritSit cases showed that the Exchange Server Best Practices Analyzer was having a significant impact: the number of configuration-related issues was rapidly declining. Now performance and disaster recovery were the primary customer problems, and the Exchange Server Best Practices Analyzer could not be used to tackle these issues. A good troubleshooting tool, one that can pinpoint the root cause of an issue in the shortest amount of time, requires a wizard-based user interface. We couldn't use the Best Practices Analyzer engine for this because it would only start, collect, analyze, and stop, with little or no user interaction. In short, we needed a new mechanism to request user input that would support a different troubleshooting workflow.

To help create this new mechanism, two new members joined our team: Nicole Allen, an Exchange troubleshooting expert, and Weiguo Zhang, the development lead for our Sustained Engineering team, the folks who produce hot fixes and service packs for the Exchange product. They set about building an implementation of a finite state machine, an engine that would support the workflow and branching logic necessary for an interactive troubleshooting tool. The final result was the "Wizard Engine," which would operate alongside and integrate with the Best Practices Analyzer engine.
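
To make the finite state machine idea concrete, here is a minimal sketch of how branching wizard steps might be modeled. It does not reflect how the Wizard Engine is actually built: the state names, the answers, and the `run_wizard` function are all hypothetical.

```python
# Hypothetical wizard states. Each non-terminal state maps a user answer
# to the next state; states missing from the table are terminal.
TRANSITIONS = {
    "choose_problem": {
        "performance": "ask_symptom",
        "recovery": "ask_backup",
    },
    "ask_symptom": {
        "slow_clients": "collect_counters",
        "growing_queues": "collect_queue_data",
    },
    "ask_backup": {
        "yes": "show_restore_steps",
        "no": "show_repair_steps",
    },
}


def run_wizard(answers, start="choose_problem"):
    """Walk the state machine with a sequence of answers; return the path taken."""
    state = start
    path = [state]
    for answer in answers:
        options = TRANSITIONS.get(state)
        if options is None:
            break  # terminal state reached; ignore any further answers
        if answer not in options:
            raise ValueError(f"{answer!r} is not a valid answer in state {state!r}")
        state = options[answer]
        path.append(state)
    return path


print(run_wizard(["performance", "growing_queues"]))
```

Each answer selects the next step, which is the kind of interaction that a fixed start, collect, analyze, and stop sequence cannot express.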

Thanks to the hard work of Nicole, Weiguo, and others, in November 2005, we introduced two new members of the Exchange Server Analyzer Tools family:

  • Microsoft Exchange Server Performance Troubleshooting Analyzer Tool
  • Microsoft Exchange Server Disaster Recovery Analyzer Tool

Applying What We've Learned

Today, on the basis of feedback from our three Analyzer tools, we have a much better understanding of the common problems that surface in an Exchange environment. Our ultimate goal is to strengthen Exchange Server against such problems so that it protects itself from service outages. We’ve started integrating this learning into the next version of the Exchange Server product, code-named Exchange Server 2007. Here are a couple of examples of what we're working on:

  • Automatically setting sensible defaults   Today some of our customers have to tune their servers as soon as they install Exchange Server. We already know many of the parameters that they are tuning. Wouldn’t it be great if the product automatically made these changes for you when you install it?

  • Implementing an override range   Although manual overrides, which are usually found in the registry, may still be required for some Exchange components, the user should not be allowed to enter a value that doesn't make sense. The product should protect itself, even when it has been overridden.

    Note

    This scenario is more common than you might think. For example, many Microsoft Knowledge Base articles talk about how to edit the registry and make recommendations that are typically expressed in decimal. However, in the Registry Editor (RegEdit.exe), the default input type is hexadecimal. It’s all too easy to enter a value that is misinterpreted. For example, typing 31000 while the hexadecimal option is selected stores 0x31000, so an intended value of 31,000 may unwittingly become an actual value of 200,704.
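
The arithmetic in the note, and the kind of range check an override could go through, can be sketched as follows. The function names and the permitted range of 1,000 to 50,000 are assumptions made for illustration; they are not actual Exchange Server values.

```python
def parse_entry(text, base):
    """Interpret the typed digits in the given base (10 or 16)."""
    return int(text.replace(",", ""), base)


def checked_override(value, low, high):
    """Accept a manual override only if it lies within the permitted range."""
    if not low <= value <= high:
        raise ValueError(f"override {value} is outside the range {low}..{high}")
    return value


intended = parse_entry("31,000", 10)  # what the article recommends
misread = parse_entry("31,000", 16)   # same digits read as hexadecimal

print(intended, misread)

# Hypothetical permitted range for this setting.
LOW, HIGH = 1_000, 50_000

checked_override(intended, LOW, HIGH)     # accepted
try:
    checked_override(misread, LOW, HIGH)  # rejected
except ValueError as err:
    print(err)
```

With a check like this in place, the mistyped hexadecimal value is rejected instead of silently taking effect.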

The Future

Exchange Server 2007 is going to be more robust than earlier versions of Exchange Server, thanks in part to the work I've described here, but there will always be a need for Exchange Server Analyzer Tools. Therefore, we’re including these tools in the box for Exchange Server 2007. The original team is still intact, with Jon, Jack, Kevin, Nicole, Weiguo, and me spending most of our time researching and implementing new ways of making Exchange Server the best messaging product on the market.