Monitoring and Troubleshooting Microsoft.com

Technical Case Study

Published: February 9, 2006

Microsoft.com is one of the busiest yet most available sites on the Internet. A combination of proactive and reactive monitoring is a critical factor in maximizing site availability and scalability.

Situation

Microsoft.com is one of the busiest IIS-based Web sites on the Internet with thousands of supporting servers and applications. The operations team for this Web site needs to be able to quickly identify, troubleshoot, and correct any issues with Web site availability and performance.

Solution

By using a combination of various Microsoft technologies and an independent, global third-party monitoring service, the Microsoft.com operations team has implemented systems and processes to accurately determine what server assets they have, how they are configured, and how they are performing in the data center. The system also provides information about how the overall site is performing from an end-user perspective from 35 different regions around the world.

Benefits

  • Microsoft.com is the most available Web site as measured by Keynote (99.83 percent for three years running).
  • The operations team helps to drive Microsoft product improvements by quickly and accurately identifying issues that can arise under the most demanding volumes. Because the operations team is an early adopter, the feedback frequently occurs before products are released to customers.
  • Valuable employee time is focused on designing future systems instead of solving problems on current systems.

Products & Technologies

  • Microsoft Windows Server 2003
  • Internet Information Services 6.0
  • Microsoft Operations Manager 2005
  • Microsoft SQL Server 2005
  • Microsoft Identity Integration Server 2003
  • Active Directory and Active Directory Application Mode
  • Performance Monitor
  • SQL Server Reporting Services

Introduction

Microsoft.com is one of the busiest Internet Information Services (IIS)-based Web sites on the Internet, and it consists of thousands of supporting servers and applications. The operations team that is responsible for the Microsoft.com site must quickly identify, troubleshoot, and correct any issues with Web site availability and performance. The team must also predict and plan for capacity increases for the site.

To accomplish its tasks, the operations team for Microsoft.com has many partnerships with other Microsoft organizations, such as application development, test, and release management teams. The operations team also collaborates with various product teams at Microsoft, interacts with providers of services (such as the data centers that host the servers), and interacts with the content delivery network (CDN) partners Akamai Technologies, Inc., and SAVVIS Communications for global load balancing and content caching. Troubleshooting performance or availability issues related to Microsoft.com often requires interaction with one or more of these partner organizations. Monitoring the overall site performance involves collecting and correlating data from various sources.

The purpose of this case study is to provide a high-level overview of the monitoring solution that the Microsoft.com operations team designed and developed and to demonstrate the value of current Microsoft products for highly available, high-performance Web sites. This case study briefly discusses the reactive and proactive monitoring elements of the solution, best practices for monitoring and troubleshooting, and some of the benefits that Microsoft.com gained from such a solution.

This case study assumes that readers are technical decision makers and are already familiar with Microsoft® Windows Server™ 2003 Web server technologies, including IIS version 6.0 and Performance Monitor, in addition to associated technologies, such as Microsoft Operations Manager (MOM) 2005, the Active Directory® directory service, Microsoft Identity Integration Server (MIIS) 2003, and Microsoft SQL Server™ Reporting Services. Any organization can apply the principles and techniques that this case study describes to plan a Web monitoring system, and the design considerations for monitoring a highly available Web server infrastructure can be applied to almost any enterprise-scale IT environment. However, this case study is based on the experience and recommendations of the Microsoft.com operations team. It is not intended to serve as a procedural guide. Each enterprise environment has unique circumstances; therefore, each organization should adapt the plans and lessons learned described in this case study to meet its specific needs.

Situation

The sheer scale of the Microsoft.com Web site presents an enormous challenge with regard to monitoring and troubleshooting. Spanning three data centers and consisting of thousands of servers and applications that service an average of 13 million unique users a day, Microsoft.com is one of the largest, busiest, and most complex Web sites on the Internet. More than 600 developers from hundreds of organizations, many outside Microsoft, submit code and content from all over the world. Many of them submit content and application updates such as ASP.NET pages, XML files, and VBScript directly to their respective virtual directories by using their own release management processes of varying maturity levels.

Microsoft.com is constantly under attack from various locations around the globe. Because of this activity and complexity, monitoring all of the individual components that make up Microsoft.com in addition to the overall site is a monumental task.

Monitoring Philosophy

The philosophy of the Microsoft.com operations team is that a monitoring solution should provide the ability to quickly and accurately identify error sources so they can be isolated, investigated, and ultimately resolved. The monitoring solution should be intelligent so that it facilitates the correlation of events from multiple sources and issues actionable alerts.

Left to their default configurations, most monitoring systems generate an excessive number of alerts that become like spam to administrators. Especially with large systems, it is important for organizations to carefully define what should be monitored and what events or combination of events should be raised to the attention of operations personnel. An organization must also plan to learn from the data collected. As with alert planning, this aspect of the solution is a significant undertaking. It requires creating data retention and aggregation policies, and combining and correlating all of the data into a data warehouse from which administrators can generate both predefined and impromptu reports.

The actions taken as a response to intelligent alerts or analysis of historical data should drive improvements in one of three general categories:

  • Applications

  • Networks

  • Operations excellence

In many cases, performance problems with an application are due to problems with the way the application was designed or coded. The source of the problem can range from bugs to memory leaks to inefficient database queries. Application issues are the most challenging to monitor and identify, but they offer the most significant improvement opportunities.

Network issues are rare, but they can and do happen from time to time. Redundancy of components at every layer minimizes the risk of network outages, but when they do occur, they are critical and require immediate action.

The operations excellence category encompasses what many organizations simply call human error. The humans in question are typically network, server, storage area network (SAN), or application administrators. The Microsoft.com operations team uses the term operations excellence to emphasize that a production system should have controls, processes, and procedures in place to minimize the potential occurrence and the impact of human errors.

Solution

Monitoring the Microsoft.com Web site involves always knowing the health of the individual network and server components and individual applications, in addition to the overall availability, performance, and capacity of the site. No single technology is capable of monitoring all of these things, and certainly no product exists that can intelligently monitor such a complex environment out of the box. For these reasons, the operations team developed a monitoring solution built on multiple Microsoft products and some third-party products.

The monitoring solution implemented for Microsoft.com starts with the operations workbench. The operations workbench, created by the operations team, is a framework of best practices that aggregates many loosely coupled solution components. The framework is a custom written, extensible Microsoft .NET Framework application that communicates with and coordinates the other aspects of the monitoring solution, typically by using Web services. The following technologies make up the other primary components of the monitoring solution:

  • Active Directory is the enterprise directory service that is included with Windows Server 2003 and that hosts the security objects for all the servers, arranged by property-specific organizational units (OUs).

  • Active Directory Application Mode (ADAM) is a version of Active Directory designed for use as an application-specific data store. It stores detailed information about the assets and objects that make up Microsoft.com.

  • MIIS 2003 is a metadirectory and synchronization engine that automates updates between Active Directory and the ADAM-based data store that the operations workbench uses.

  • Windows Management Instrumentation (WMI) scripts query servers for detailed configuration information that is added to the ADAM-based data store.

  • MOM 2005 is an enterprise-class event and performance monitoring infrastructure that provides the majority of the event and performance collection functionality for the solution.

  • Microsoft SQL Server 2000 and SQL Server 2005 provide enterprise-class data storage, presentation, reporting, and correlation.

  • Mercury SiteScope is a third-party monitoring product that provides application-specific, end-to-end transaction testing inside the data center.

  • Cluster Sentinel, in a highly customized version based on the one included in the Microsoft Windows® 2000 Server Resource Kit, provides server monitoring inside the data center, in addition to automated cluster membership management.

  • Keynote Global 35 is a third-party service that monitors the availability of the overall Microsoft.com site and conducts application-specific transaction testing from 35 regions around the world.

  • IIS Log Monitor is a custom application that collects the IIS logs. Log Parser parses the enormous volumes of IIS logs.
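As a rough illustration of the log parsing mentioned in the last item above, the following Python sketch counts HTTP 5xx responses per URL in W3C extended format log lines, the format that IIS writes by default. It is not the IIS Log Monitor or Log Parser implementation; the function name count_server_errors and the sample log lines are hypothetical.

```python
from collections import Counter


def count_server_errors(lines):
    """Count HTTP 5xx responses per URI stem in W3C extended log lines."""
    fields = []
    errors = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith("#Fields:"):
            # The #Fields directive names the columns of the entries that follow.
            fields = line[len("#Fields:"):].split()
            continue
        if line.startswith("#"):
            continue  # other directives such as #Version or #Date
        if not fields:
            continue  # no #Fields directive seen yet, so the entry cannot be read
        entry = dict(zip(fields, line.split()))
        status = entry.get("sc-status", "")
        if status.isdigit() and 500 <= int(status) <= 599:
            errors[entry.get("cs-uri-stem", "-")] += 1
    return errors


# Hypothetical sample data, not taken from Microsoft.com logs.
sample = [
    "#Version: 1.0",
    "#Fields: date time cs-method cs-uri-stem sc-status",
    "2006-02-09 00:00:01 GET /default.aspx 200",
    "2006-02-09 00:00:02 GET /search.aspx 500",
    "2006-02-09 00:00:03 GET /search.aspx 503",
    "2006-02-09 00:00:04 GET /missing.htm 404",
]
print(dict(count_server_errors(sample)))
```

Reading the column names from the #Fields directive, rather than assuming fixed positions, keeps the sketch usable when servers log different field sets.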

The overall system processes over 60,000 alerts a day, conducts approximately 11.5 million availability tests a day, parses 1.7 terabytes of IIS log data a day, and collects 185 million performance counters a day at a sampling interval of 45 seconds. However, reaching this degree of monitoring sophistication was a long process that required significant effort and cross-organizational coordination. The development of the solution followed a four-step progression, with each step building upon the previous one:

  1. Asset management

  2. Reactive monitoring

  3. Proactive testing and monitoring

  4. Reporting and analysis

Asset Management

After creating the operations workbench framework, the next step in creating the monitoring solution was to gain control of asset management. Before any operations team can accurately and completely monitor all the servers that make up a system, it must know what servers are deployed, where they are, how they are being used, and who or what owns them. The team must also know how to determine when a server has reached its maximum capacity or the end of its useful life.

For asset management, Microsoft.com uses Active Directory to organize all servers into OUs by property type and network connectivity. Only Microsoft.com administrators and specific data center personnel have permission to manage servers in these OUs. An MIIS management agent then regularly searches Active Directory for updates to synchronize information with a more detailed, ADAM-based asset store. The operations team chose ADAM as the asset store instead of a relational database because the team wanted to organize the data in a hierarchical structure for which ADAM is specifically designed.

After the data in ADAM is updated from Active Directory, WMI-based scripts query each server to gather detailed characteristics. The ADAM store then reflects new data and updates.
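The flow described above (directory listing, synchronization into a hierarchical store, then a per-server detail query) can be sketched in Python with plain dictionaries. This is a minimal sketch under stated assumptions, not the MIIS management agent or the team's WMI scripts: the dictionaries stand in for Active Directory and ADAM, and sync_assets and fake_wmi_query are hypothetical names. Handling of decommissioned servers is not described in this case study and is left out.

```python
def sync_assets(directory, asset_store, query_details):
    """Mirror servers from a directory listing into a hierarchical asset store.

    directory:     {ou_name: [server_name, ...]}, a stand-in for Active Directory OUs
    asset_store:   {ou_name: {server_name: details}}, a stand-in for the ADAM store
    query_details: callable returning a dict of details for one server,
                   a stand-in for the WMI-based scripts
    """
    for ou, servers in directory.items():
        branch = asset_store.setdefault(ou, {})
        for server in servers:
            # Add new servers and refresh the details of known ones.
            branch[server] = query_details(server)
    return asset_store


def fake_wmi_query(server):
    # Hypothetical detail query; real scripts would interrogate the server.
    return {"name": server, "cpu_count": 4}


directory = {"Web": ["web01", "web02"], "SQL": ["sql01"]}
store = {"Web": {"web01": {"name": "web01", "cpu_count": 2}}}
sync_assets(directory, store, fake_wmi_query)
print(sorted(store["Web"]), sorted(store["SQL"]))
```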

Reactive Monitoring

After the operations team accurately identified and characterized the population of servers on a regular basis, the team implemented a real-time monitoring system to react to problems as soon as possible after they arose. The team uses MOM 2005 for event and performance data collection, utilizing a subset of settings from various MOM management packs and events forwarded from several other sources.

The operations team implemented two main types of reactive monitoring for Microsoft.com. One type of monitoring was designed to measure server and application performance and availability. The other, and perhaps the more important and challenging type, is monitoring customer-perceived performance and availability.

Server and Application Performance

The Microsoft.com operations team applies a multi-layered approach to monitoring the overall health of the environment. The approach includes monitoring server hardware and basic server network connectivity, key elements of the operating system for individual servers, overall Network Load Balancing (NLB) server cluster availability, and the response of specific features of applications. The team also co-monitors network anomaly detection and mitigation devices.

For a global view of application health and to provide automated fail-over to redundant systems, the operations team partners with Akamai and SAVVIS to utilize their global load balancing services. These services are configured to constantly monitor each NLB cluster of servers from distributed non-Microsoft networks and will automatically pull clusters out of rotation should they begin to fail. These services are a key aspect of the team's high availability model and business continuity design.

Server monitoring starts with hardware agents provided by enterprise-class server vendors that monitor various aspects of the physical components of a server, such as power supply and cooling fan operation, CPU temperature, and hard disk drive function. In some cases, the hardware agents are able to predict imminent failure of a particular component based on trends in data sampling. In addition to providing a high degree of component reliability and redundancy, employing enterprise-class servers to build critical production systems can yield the added benefit of seamless integration with most monitoring frameworks. In the case of Microsoft.com, the hardware agents forward event and alert notifications to the MOM 2005 infrastructure.

As described earlier, the operations team has highly customized Cluster Sentinel to monitor disk space and provide monitoring to determine whether a server is responding to simple network requests, also known as heartbeat monitoring. In addition, Cluster Sentinel uses a combination of custom ASP and ASP.NET files to provide end-to-end testing of applications based on specific application health criteria defined by the application developers. This implementation of Cluster Sentinel can automatically remove a server from the cluster if these tests fail. For more in-depth application monitoring, the operations team uses SiteScope specifically for transactional tests that require interaction to determine successful completion of a transaction. Lastly, IIS Log Monitor is a custom application that collects the IIS logs, after which Log Parser parses all IIS errors generated by applications.
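The heartbeat and automatic removal behavior described above can be sketched as follows. This is not Cluster Sentinel's logic: the class name ClusterMonitor and the rule of removing a server after a fixed number of consecutive failed tests (max_failures) are assumptions made for illustration, since this case study states only that a server can be removed when tests fail.

```python
class ClusterMonitor:
    """Track health test results and drop failing servers from rotation."""

    def __init__(self, servers, max_failures=3):
        self.in_rotation = set(servers)
        self.failures = {server: 0 for server in servers}
        self.max_failures = max_failures  # assumed threshold, not from the source

    def record(self, server, healthy):
        """Record one test result; return True if the server is still in rotation."""
        if healthy:
            self.failures[server] = 0  # a passing test resets the failure streak
        else:
            self.failures[server] += 1
            if self.failures[server] >= self.max_failures:
                self.in_rotation.discard(server)
        return server in self.in_rotation


monitor = ClusterMonitor(["web01", "web02"], max_failures=2)
monitor.record("web01", healthy=False)  # first failure, still in rotation
monitor.record("web01", healthy=False)  # second failure, removed
print(sorted(monitor.in_rotation))
```

The sketch never returns a server to rotation on its own; whether recovery is automatic or manual is a design decision that this case study does not describe.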

For SQL Server database servers, MOM 2005 agents also monitor the service status. Cluster Sentinel regularly executes the sp_who stored procedure to test that the SQL Server service is responding in a timely fashion.

Performance counters are regularly collected on all servers at intervals of 45 seconds. A standard set of 42 counters is configured, collected, and forwarded to the MOM 2005 infrastructure. These counters represent a cross section of hardware-based (CPU, memory utilization, disk utilization) and software-based (IIS, SQL, TCP) objects that provide further insight into the overall health of the system. For more detail on the performance counters collected by the Microsoft.com operations team, see the Troubleshooting and Debugging Web Applications webcast at http://msevents.microsoft.com/CUI/EventDetail.aspx?EventID=1032283908&Culture=en-US.
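A back-of-the-envelope calculation shows how these figures relate to the daily volume of 185 million performance counters cited earlier in this case study. The sketch assumes that every server reports all 42 counters at every 45-second interval, which the case study does not state, so the implied server count is only a rough consistency check against "thousands of servers".

```python
SECONDS_PER_DAY = 24 * 60 * 60

interval_seconds = 45             # sampling interval from the case study
counters_per_server = 42          # standard counter set from the case study
samples_per_day = 185_000_000     # daily volume cited in the case study

samples_per_counter = SECONDS_PER_DAY // interval_seconds        # 1920
samples_per_server = samples_per_counter * counters_per_server   # 80640

# Assumption: all servers report the full counter set at every interval.
implied_servers = samples_per_day / samples_per_server

print(samples_per_counter, samples_per_server, round(implied_servers))
```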

Microsoft.com uses Cisco Guard devices to detect traffic anomalies that may represent denial of service (DoS) or other attacks. These devices forward Simple Network Management Protocol (SNMP) traps to a SQL Server database that provides a view of the traffic patterns and specific events in the operations workbench. The Cisco Guard devices and the remainder of the network devices are primarily monitored and managed by the respective operations teams for the data centers that host the Microsoft.com servers.

Customer-Perceived Performance

Microsoft.com partners with a third party, Keynote Systems, to constantly measure and report on the overall site and specific application availability and performance as experienced from 35 locations around the world.

The Keynote Global 35 service has agents on servers that run a test approximately once a minute by trying to load the home page of Microsoft.com. If anything fails to load, such as a .gif file, a text string, or any other component of the home page that is being called (even if it is not visible), the test generates an error. The operations team has written a Web service to forward Keynote metrics to the Microsoft.com monitoring system as they are gathered. At the end of each day, Keynote also provides a summary availability report.
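The all-or-nothing rule described above, in which one missing component fails the whole page test, can be sketched as follows. The function names and sample data are hypothetical, and the way Keynote actually computes its availability figures is not described in this case study.

```python
def page_test(component_results):
    """Return (passed, failed_components) for one page load.

    component_results maps each component (image, text string, script)
    to True if it loaded. A single failure fails the whole test.
    """
    failed = sorted(name for name, ok in component_results.items() if not ok)
    return (not failed, failed)


def availability(test_outcomes):
    """Percentage of tests that passed, given a list of booleans."""
    if not test_outcomes:
        raise ValueError("no test outcomes to summarize")
    return 100.0 * sum(test_outcomes) / len(test_outcomes)


# Hypothetical results for one test run.
passed, failed = page_test({"/": True, "/logo.gif": False, "/style.css": True})
print(passed, failed)
print(availability([True] * 9 + [False]))
```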

In addition to caching content in multiple global locations along with SAVVIS, CDN partner Akamai monitors specific servers from outside the data center and has the ability to remove a specific server from a cluster if it is not responding.

Proactive Testing and Monitoring

After implementing and stabilizing the asset management and reactive monitoring systems, the focus of the operations team shifted to proactive testing of applications and defining proactive monitoring events. Extensive end-to-end application transaction testing and application stress testing helps to expose many potential problems prior to releasing the application to production. The testing process also helps to determine what events are meaningful, and what corrective actions are appropriate in the case of those events. All of the information learned from transactional and stress testing is thoroughly documented as part of the release management process of the Microsoft Solutions Framework (MSF) that many of the development teams use.

The same systems that provide alerts for events requiring immediate, reactive action are also used to provide alerts that are predictive in nature, indicating whether a condition is trending toward a problem situation. After thorough application testing and observing the behavior of the components of the overall system over time, combined with the results of problem resolution activities, it becomes possible to identify the symptoms or causes that precipitate certain application errors. Defining and refining proactive events over time exemplifies the evolutionary nature of the monitoring system, and it emphasizes the importance of constantly learning from the data and putting the knowledge to use.
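One simple way to turn collected data into a predictive alert is to fit a straight line to recent samples and estimate when it will cross a threshold. The following sketch uses an ordinary least-squares fit; it is an illustrative technique chosen for this example, not the method the operations team uses, and the function name and sample values are hypothetical.

```python
def hours_until_threshold(samples, threshold):
    """Estimate hours from the last sample until a fitted line hits threshold.

    samples is a list of (hour, value) pairs. Returns None when there is
    too little data or the trend is not heading toward the threshold.
    """
    n = len(samples)
    if n < 2:
        return None
    mean_x = sum(x for x, _ in samples) / n
    mean_y = sum(y for _, y in samples) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in samples)
    if sxx == 0:
        return None  # all samples share one timestamp
    slope = sum((x - mean_x) * (y - mean_y) for x, y in samples) / sxx
    if slope == 0:
        return None  # flat trend never reaches the threshold
    intercept = mean_y - slope * mean_x
    current = slope * samples[-1][0] + intercept
    remaining = (threshold - current) / slope
    return remaining if remaining >= 0 else None


# Hypothetical free disk space in GB, sampled hourly; alert level is 10 GB.
free_space = [(0, 100), (1, 90), (2, 80), (3, 70)]
print(hours_until_threshold(free_space, threshold=10))
```

An alert could then be raised when the estimate falls below a chosen lead time, which gives support personnel time to act before the condition becomes an outage.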

Reporting and Analysis

The Microsoft.com operations team has developed custom Web services to merge data in real time from all sources to provide a holistic overview. Each night, an automated process aggregates the data into a data warehouse for longer-term analysis. The data is normalized into a common schema and presents related data from different sources side by side. The types of data include Keynote Global 35 data and all other monitoring data. Detailed performance data is aggregated into hourly and daily averages, with sample count, standard deviation, minimum, maximum and standard error retained for detailed trend analysis.
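The hourly roll-up described above can be sketched with the Python standard library. This is a minimal sketch, not the team's warehouse process: the function name and sample values are hypothetical, and the use of the sample (n - 1) standard deviation is an assumption, because the case study does not say which form is stored.

```python
import math
import statistics
from collections import defaultdict


def aggregate_hourly(samples):
    """Summarize (hour, value) samples into one record per hour."""
    buckets = defaultdict(list)
    for hour, value in samples:
        buckets[hour].append(value)

    summary = {}
    for hour, values in buckets.items():
        count = len(values)
        # Sample standard deviation needs at least two values.
        stdev = statistics.stdev(values) if count > 1 else 0.0
        summary[hour] = {
            "count": count,
            "mean": statistics.mean(values),
            "stdev": stdev,
            "min": min(values),
            "max": max(values),
            "stderr": stdev / math.sqrt(count),
        }
    return summary


# Hypothetical CPU utilization samples as (hour, percent).
cpu = [(0, 10), (0, 20), (0, 30), (1, 50)]
for hour, row in sorted(aggregate_hourly(cpu).items()):
    print(hour, row)
```

Keeping the sample count and standard deviation alongside each average allows later analysis to weight or combine hourly rows without returning to the raw samples.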

A snapshot of the asset configuration store is also created daily. Server-specific information that is maintained in this asset store includes the following data:

  • Detailed asset information (such as who owns the server, who administers the server, and the server model)

  • Warranty expiration

  • Performance trends

  • Change request history

  • Disk space utilization

  • IIS error trends

The operations team then uses SQL Server Reporting Services to generate standard daily, 30-day, 90-day, and year-to-date reports that include the following information:

  • Assets and objects

  • Performance trends

  • Availability (both internal and external)

  • IIS log errors

  • Egress reports

  • Disk space utilization

  • Service level agreement (SLA) performance

Debugging teams and other organizations need access to the data in the warehouse for deeper analysis. Instead of granting access to the data warehouse, the operations team uses SQL Server Data Transformation Services (DTS) to provide data feeds for custom and recurring needs. By limiting access to the data warehouse, the operations team avoids the risk of unmanaged, one-time query activity that may have adverse performance impacts on the data warehouse.

Monitoring and Reporting Tomorrow

The operations team has plans to be an early adopter of Microsoft System Center Reporting Manager 2006 to replace some of the currently customized functionality in the operations workbench. The team is also investing considerable effort in developing a configuration management enhancement to the operations workbench based on XML manifests which define standard configurations. Both platform-specific and application-specific configuration manifests will be created as part of the development cycle prior to release to production.

Other efforts are underway to standardize application instrumentation. Hundreds of developers provide code to Microsoft.com, all using different methods to instrument their code for event logging at varying levels of detail. The operations team wants to create a common eventing and logging class, based on recommendations from the Microsoft Patterns and Practices group, with deep application tracing. For more information about Microsoft Patterns and Practices, go to http://msdn.microsoft.com/practices/.

Problem management is another focus. The operations team wants to learn as much as possible from the enormous amount of data by correlating data sources and annotating those correlations. The primary benefit of the data warehouse is to learn from the data. The operations team wants to use artificial intelligence engines with genetic algorithms that can learn normal patterns to search through data and look for statistical anomalies. Those algorithms can then be applied to live data to predict problems well before they manifest themselves.
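The idea of learning a normal pattern and then flagging statistical anomalies in live data can be illustrated with a much simpler technique than the genetic algorithms mentioned above. The following sketch uses a plain z-score test against a baseline; it is a stand-in chosen for brevity, not the approach the operations team plans, and the function name, threshold, and sample values are hypothetical.

```python
import statistics


def find_anomalies(baseline, live, z_limit=3.0):
    """Return live values that lie more than z_limit standard deviations
    from the mean of the baseline values."""
    mean = statistics.mean(baseline)
    stdev = statistics.stdev(baseline)
    if stdev == 0:
        # A perfectly flat baseline makes any different value anomalous.
        return [value for value in live if value != mean]
    return [value for value in live if abs(value - mean) / stdev > z_limit]


# Hypothetical requests-per-second figures.
baseline = [50, 52, 48, 51, 49, 50]
live = [51, 47, 90]
print(find_anomalies(baseline, live))
```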

As with many organizations, Microsoft.com envisions a "lights out data center" where administrators rarely if ever need to physically visit and all servers can be managed remotely and automatically to the greatest extent possible. For example, remote control boards embedded in the servers can be instructed programmatically to turn servers off when usage is low and to power them back on when needed. The "lights out strategy" is intended to enable systems engineers to focus their time on the most important activities such as deeper engagement with the development teams they support, and to focus on new technologies and developing best practices.

Troubleshooting

Performance issues with the Microsoft.com site can generally be grouped into three categories:

  • Applications

  • Networks

  • Operations excellence

The monitoring system was designed to present only events that prompt support personnel to take action. However, the operations team wants to document all events during application testing in such a way that most events will have associated corrective actions that can be performed at the first level of support. Some events require escalation to the next level of support. Like most large organizations, the Microsoft.com operations team has several tiers of support for incident, or ticket, escalation.

Tier 1

Tier 1 provides support 24 hours a day, seven days a week and provides initial recognition, routing, and if possible, resolution of tickets. Tier 1 personnel use monitoring information and troubleshooting guides which are dynamically associated with the asset or property by the operations workbench to perform initial research and to aid in incident resolution. If an incident cannot be resolved by Tier 1 support personnel, it is assigned and routed to Tier 2 support.

Tier 2

Tier 2 provides infrastructure support for server hardware issues, and specific application support, by personnel with in-depth knowledge of the applications. Tier 2 personnel also use monitoring information and troubleshooting guides, accessed through the operations workbench, to aid in incident resolution. If an incident cannot be resolved by Tier 2 support personnel, it is assigned and routed to Tier 3 support. If the incident is resolved at this tier, the troubleshooting guides are updated with any newly discovered information that can aid in future resolutions.

Tier 3

Tier 3 performs in-depth analysis and consists of systems engineers and database administrators assigned to specific Web properties, and product managers and developers for specific applications. Tier 3 personnel use monitoring information to develop troubleshooting guides, and as administrators, they may also perform other troubleshooting activities on a server. If an incident cannot be resolved by Tier 3 support personnel, it is assigned and routed to Tier 4 support, also known as the debug team. If the incident is resolved at this tier, the troubleshooting guides are updated with any newly discovered information that can aid in future resolutions.

Tier 4

Tier 4, the debug team, performs application and kernel debugging. Generally, this type of invasive troubleshooting requires the server to be removed from production for the duration of testing. Tier 3 personnel review the troubleshooting history to clearly identify the incident and reproduce the problem steps. The goal of the debug team is to identify the root cause of the problem so that comprehensive information can be routed to the appropriate application development team to implement an update. If the incident is resolved at this tier, the troubleshooting guides are updated with any newly discovered information that can aid in future resolutions.

Note: For more detail on debugging steps that the Microsoft.com operations team commonly uses, see the webcast titled "Troubleshooting & Debugging Web Applications" at http://blogs.technet.com/mscom/.
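The tiered escalation described above amounts to passing an incident along an ordered list of support tiers until one of them resolves it. The following sketch shows that routing only; the function name, the incident fields, and the resolver rules are hypothetical and greatly simplified compared with a real ticketing system.

```python
def route_incident(incident, tiers):
    """Offer an incident to each tier in order.

    tiers is a list of (name, can_resolve) pairs, where can_resolve is a
    callable that takes the incident and returns True if that tier fixes it.
    Returns (resolving_tier_or_None, tiers_visited).
    """
    visited = []
    for name, can_resolve in tiers:
        visited.append(name)
        if can_resolve(incident):
            return name, visited
    return None, visited


# Hypothetical rules: each tier handles a wider class of problem.
tiers = [
    ("Tier 1", lambda i: i["kind"] == "known issue"),
    ("Tier 2", lambda i: i["kind"] in ("hardware", "application")),
    ("Tier 3", lambda i: i["kind"] == "configuration"),
    ("Tier 4", lambda i: i["kind"] == "needs debugging"),
]
print(route_incident({"kind": "configuration"}, tiers))
```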

Best Practices

Best practices for monitoring include the following:

  • Centralize as much as possible. Design a solution that aggregates as much meaningful information as possible into a central location to aid in incident and problem management. A MOM 2005 infrastructure is capable of collecting a great number of events, data that is crucial to a monitoring solution. Processes and troubleshooting guides should also be centralized and standardized so that all members of the team approach problems in a similar manner and have access to the same information, and can update documentation when appropriate.

  • Manage assets. Know which assets are in use, their purpose, and who owns them, and keep this information up to date by using an automated process. A hierarchical structure provides the ability to document and view the relationships among assets and, potentially, operations personnel and documentation.

  • Determine which data is important. Most monitoring products generate enormous amounts of data and events by default. It is extremely important to identify and enable only meaningful data and events that are useful and actionable.

  • Implement both reactive monitoring and proactive testing and monitoring. After the reactive monitoring system is in place, focus on proactive testing of applications. Attempt to identify events that are predictive in nature.

  • Learn from the data. Aggregate, correlate, and annotate data from different sources to identify patterns. Consider using genetic algorithms to mine the data to determine baselines and identify anomalies.

Best practices for troubleshooting include the following:

  • Stress test applications. In addition to end-to-end transaction testing, applications should be stress tested to see how they perform under a heavy load. Include alert definitions and associated corrective actions learned from the testing process as part of the release management process. This activity can help identify proactive monitoring events in particular, in addition to exposing scalability issues for a given application before it is released to production.

  • Make all events actionable. Provide as much context as possible for Tier 1 and Tier 2 support groups so that they can resolve known problems without escalation. Make troubleshooting guides easily accessible so that they can capture valuable lessons learned from problem resolutions.

  • Conduct reviews. The most serious problems are often caused by lack of operational excellence. Conduct a thorough review of these situations to help identify processes that need improvement.

Benefits

A primary objective of the Microsoft.com operations team is to achieve the highest availability on the Internet. Achieving that availability can be accomplished only by using a comprehensive monitoring solution that includes both reactive monitoring and proactive testing and monitoring, actionable alerts and ready access to intelligent troubleshooting information. By implementing such a solution, Microsoft.com has achieved 99.83 percent availability over the past three years, as measured by the Keynote Global 35. During those three years, Microsoft.com has ranked first in availability compared to all other major Web sites in the industry.
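To put the 99.83 percent figure in perspective, the following sketch converts an availability percentage into the equivalent number of unavailable hours in a year. Keynote derives availability from periodic tests rather than from measured outage time, so treating the result as continuous downtime is an assumption made only for illustration.

```python
HOURS_PER_YEAR = 365 * 24  # 8760, ignoring leap years


def downtime_hours(availability_percent, period_hours):
    """Hours of unavailability implied by an availability percentage."""
    return (100.0 - availability_percent) / 100.0 * period_hours


# 99.83 percent over one year corresponds to roughly 14.9 hours.
print(round(downtime_hours(99.83, HOURS_PER_YEAR), 1))
```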

Being able to quickly and accurately identify potential issues with Microsoft products when implemented under the most demanding volumes enables the operations team to help drive Microsoft product improvements. As an early adopter of many Microsoft technologies, the operations team frequently provides information on potential problems that may adversely affect product scalability and reliability, before those products are released to customers. Products that are scalable and reliable provide a solid foundation for highly available, high volume services.

Another huge benefit of a comprehensive monitoring solution is that valuable engineers can focus on engineering, rather than spending their time reacting to current problems. When architects and engineers are able to spend their time designing and testing future systems and upgrades, the resulting systems are far more likely to be well architected and tested prior to deployment.

The Microsoft.com operations team is currently planning to enhance the operations workbench by adding a manifest-based configuration management system for servers and applications. This enhancement will provide the additional benefit of ensuring complete consistency across all systems. Currently, automated scripts used to initially install and update servers and applications ensure consistency in those particular actions, but no method currently exists to detect configuration changes outside those actions. In the future, a periodic, comprehensive configuration scan will detect any such anomalies.
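A configuration scan of this kind can be pictured as a comparison between the settings declared in a manifest and the settings found on a server. The following sketch is a guess at the general shape only: the XML layout, the setting names, and the functions load_manifest and find_drift are all hypothetical, because this case study does not describe the manifest format.

```python
import xml.etree.ElementTree as ET

# Hypothetical manifest layout; the real manifest schema is not published here.
MANIFEST = """
<manifest role="web">
  <setting name="iis.version" value="6.0"/>
  <setting name="aspnet.version" value="2.0"/>
</manifest>
"""


def load_manifest(xml_text):
    """Parse a manifest into a {setting_name: expected_value} dictionary."""
    root = ET.fromstring(xml_text.strip())
    return {s.get("name"): s.get("value") for s in root.findall("setting")}


def find_drift(expected, actual):
    """Return {name: (expected, found)} for every setting that differs.

    A setting missing from the server is reported with found=None.
    """
    drift = {}
    for name, want in expected.items():
        have = actual.get(name)
        if have != want:
            drift[name] = (want, have)
    return drift


# Hypothetical settings as they might be read from one server.
server_settings = {"iis.version": "6.0", "aspnet.version": "1.1"}
print(find_drift(load_manifest(MANIFEST), server_settings))
```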

Conclusion

Monitoring Microsoft.com, one of the busiest Web sites in the world, requires always knowing the health of the individual hardware components and individual applications, in addition to the overall availability, performance, and capacity of the entire site from various places around the world.

No single technology is available to monitor all of these aspects, and certainly no product exists that can intelligently monitor such a complex environment out of the box. Consequently, the operations team developed a monitoring solution built on multiple Microsoft products and several third-party products and external monitoring services.

A comprehensive monitoring solution includes reactive monitoring and proactive testing and monitoring to detect and avoid problem situations before they arise. The foundation of the Microsoft.com solution is an extensible framework that aggregates loosely coupled system components and an asset and object management system that dynamically tracks all of the objects that make up the system. A strong release management process is in place to ensure applications are properly tested and documented prior to release to production. Documentation of important events and how to react to them provides streamlined incident management and troubleshooting after the systems are released to production.

The solution that the Microsoft.com operations team developed incorporates the aspects of asset management, reactive monitoring, proactive testing and monitoring, reporting and analysis, and intelligent troubleshooting support. The design and implementation of the solution took substantial time, effort, and coordination among several organizations, but the resulting benefits are clearly evident in the ability to achieve the highest Web site availability in the industry.

For More Information

For more information about Microsoft products or services, call the Microsoft Sales Information Center at (800) 426-9400. In Canada, call the Microsoft Canada Information Centre at (800) 563-9048. Outside the 50 United States and Canada, please contact your local Microsoft subsidiary. To access information via the World Wide Web, go to:

http://www.microsoft.com

http://www.microsoft.com/technet/itshowcase

© 2009 Microsoft Corporation. All rights reserved.