Monitoring and Troubleshooting Microsoft.com
Technical Case Study
Published: February 9, 2006
Microsoft.com is one of the busiest yet most available sites on the Internet. A
combination of proactive and reactive monitoring is a critical factor in maximizing
site availability and scalability.
Situation
Microsoft.com is one of the busiest IIS-based Web sites on the Internet with thousands
of supporting servers and applications. The operations team for this Web site needs
to be able to quickly identify, troubleshoot, and correct any issues with Web site
availability and performance.
Solution
By using a combination of various Microsoft technologies and an independent, global
third-party monitoring service, the Microsoft.com operations team has implemented
systems and processes to accurately determine what server assets they have, how
they are configured, and how they are performing in the data center. The system
also provides information about how the overall site is performing from an end-user
perspective from 35 different regions around the world.
Benefits
- Microsoft.com is the most available Web site as measured by Keynote (99.83
percent for three years running).
- The operations team helps to drive Microsoft product improvements by quickly
and accurately identifying issues that can arise under the most demanding volumes.
Because the operations team is an early adopter, the feedback frequently occurs
before products are released to customers.
- Valuable employee time is focused on designing future systems instead of solving
problems on current systems.
Products & Technologies
- Microsoft Windows Server 2003
- Internet Information Services 6.0
- Microsoft Operations Manager 2005
- Microsoft SQL Server 2005
- Microsoft Identity Integration Server 2003
- Active Directory and Active Directory Application Mode
- Performance Monitor
- SQL Server Reporting Services
Introduction
Microsoft.com is one of the busiest Internet Information Services (IIS)-based Web
sites on the Internet, and it consists of thousands of supporting servers and applications.
The operations team that is responsible for the Microsoft.com site must quickly
identify, troubleshoot, and correct any issues with Web site availability and performance.
The team must also predict and plan for capacity increases for the site.
To accomplish its tasks, the operations team for Microsoft.com has many partnerships
with other Microsoft organizations, such as application development, test, and release
management teams. The operations team also collaborates with various product teams
at Microsoft, interacts with providers of services (such as the data centers that
host the servers), and interacts with the content delivery network (CDN) partners
Akamai Technologies, Inc., and SAVVIS Communications for global load balancing and
content caching. Troubleshooting performance or availability issues related to Microsoft.com
often requires interaction with one or more of these partner organizations. Monitoring
the overall site performance involves collecting and correlating data from various
sources.
The purpose of this case study is to provide a high-level overview of the monitoring
solution that the Microsoft.com operations team designed and developed and to demonstrate
the value of current Microsoft products concerning highly available, high-performance
Web sites. This case study briefly discusses the reactive and proactive monitoring
elements of the solution, best practices for monitoring and troubleshooting, and
some of the benefits that Microsoft.com gained from such a solution.
This case study assumes that readers are technical decision makers and are already
familiar with Microsoft® Windows Server™ 2003 Web server technologies, including
IIS version 6.0 and Performance Monitor, in addition to associated technologies,
such as Microsoft Operations Manager (MOM) 2005, the Active Directory® directory
service, Microsoft Identity Integration Server (MIIS) 2003, and Microsoft SQL Server™
Reporting Services. Any organization can apply the principles and techniques that
this case study describes to plan a Web monitoring system, and the design considerations
for monitoring a highly available Web server infrastructure can be applied to almost
any enterprise-scale IT environment. However, this case study is based on the experience
and recommendations of the Microsoft.com operations team. It is not intended to
serve as a procedural guide. Each enterprise environment has unique circumstances;
therefore, each organization should adapt the plans and lessons learned described
in this case study to meet its specific needs.
Situation
The sheer scale of the Microsoft.com Web site presents an enormous challenge with
regard to monitoring and troubleshooting. Spanning three data centers and consisting
of thousands of servers and applications that service an average of 13 million unique
users a day, Microsoft.com is one of the largest, busiest, and most complex Web
sites on the Internet. More than 600 developers from hundreds of organizations,
many outside Microsoft, submit code and content from all over the world. Many of
them submit content and application updates such as ASP.NET pages, XML files, and
VBScript directly to their respective virtual directories by using their own release
management processes of varying maturity levels.
Microsoft.com is constantly under attack from various locations around the globe.
Because of this activity and complexity, monitoring all of the individual components
that make up Microsoft.com in addition to the overall site is a monumental task.
Monitoring Philosophy
The philosophy of the Microsoft.com operations team is that a monitoring solution
should provide the ability to quickly and accurately identify error sources so they
can be isolated, investigated, and ultimately resolved. The monitoring solution
should be intelligent so that it facilitates the correlation of events from multiple
sources and issues actionable alerts.
Left to their default configurations, most monitoring systems generate an excessive
number of alerts that become like spam to administrators. Especially with large
systems, it is important for organizations to carefully define what should be monitored
and what events or combination of events should be raised to the attention of operations
personnel. An organization must also plan to learn from the data collected. As with
alert planning, this aspect of the solution is a significant undertaking. It requires
creating data retention and aggregation policies, and combining and correlating
all of the data into a data warehouse from which administrators can generate both
predefined and impromptu reports.
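The correlation idea described above can be illustrated with a minimal Python sketch. It raises an actionable alert only when events from at least two independent sources implicate the same server within a short window. The event fields, the source names, and the five-minute window are illustrative assumptions and are not details of the Microsoft.com system.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    source: str       # hypothetical source name, e.g. "perf" or "iis_log"
    server: str       # server the event refers to
    timestamp: float  # seconds since an arbitrary epoch


def correlate(events, window_seconds=300.0, min_sources=2):
    """Return servers flagged by several distinct sources within one window."""
    by_server = defaultdict(list)
    for event in events:
        by_server[event.server].append(event)

    actionable = []
    for server, server_events in sorted(by_server.items()):
        server_events.sort(key=lambda e: e.timestamp)
        start = 0
        for end in range(len(server_events)):
            # Advance the window start until the span fits in window_seconds.
            while (server_events[end].timestamp
                   - server_events[start].timestamp) > window_seconds:
                start += 1
            sources = {e.source for e in server_events[start:end + 1]}
            if len(sources) >= min_sources:
                actionable.append(server)
                break
    return actionable
```

In this sketch, repeated events from a single source never raise an alert on their own, which is one simple way to keep alert volume manageable.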
The actions taken as a response to intelligent alerts or analysis of historical
data should drive improvements in one of three general categories:
- Applications
- Networks
- Operations excellence
In many cases, performance problems with an application are due to problems with
the way the application was designed or coded. The source of the problem can range
from bugs to memory leaks to inefficient database queries. Application issues are
the most challenging to monitor and identify, but they offer the most significant
improvement opportunities.
Network issues are rare, but they can and do happen from time to time. Redundancy
of components at every layer minimizes the risk of network outages, but when they
do occur, they are critical and require immediate action.
The operations excellence category encompasses what many organizations simply call
human error. The humans in question are typically network, server, storage area
network (SAN), or application administrators. The Microsoft.com operations team
uses the term operations excellence to emphasize that a production system should
have controls, processes, and procedures in place to minimize the potential occurrence
and the impact of human errors.
Solution
Monitoring the Microsoft.com Web site involves always knowing the health of the
individual network and server components and individual applications, in addition
to the overall availability, performance, and capacity of the site. No single technology
is capable of monitoring all of these things, and certainly no product exists that
can intelligently monitor such a complex environment out of the box. For these reasons,
the operations team developed a monitoring solution built on multiple Microsoft
products and some third-party products.
The monitoring solution implemented for Microsoft.com starts with the operations
workbench. The operations workbench, created by the operations team, is a framework
of best practices that aggregates many loosely coupled solution components. The
framework is a custom written, extensible Microsoft .NET Framework application that
communicates with and coordinates the other aspects of the monitoring solution,
typically by using Web services. The following technologies make up the other primary
components of the monitoring solution:
- Active Directory is the enterprise directory service that is included with Windows
Server 2003 and that hosts the security objects for all the servers, arranged by
property-specific organizational units (OUs).
- Active Directory Application Mode (ADAM) is a version of Active Directory designed
for use as an application-specific data store. It stores detailed information about
the assets and objects that make up Microsoft.com.
- MIIS 2003 is a metadirectory and synchronization engine that automates updates between
Active Directory and the ADAM-based data store that the operations workbench uses.
- Windows Management Instrumentation (WMI) scripts query servers for detailed configuration
information that is added to the ADAM-based data store.
- MOM 2005 is an enterprise-class event and performance monitoring infrastructure
that provides the majority of the event and performance collection functionality
for the solution.
- Microsoft SQL Server 2000 and SQL Server 2005 provide enterprise-class data storage,
presentation, reporting, and correlation.
- Mercury SiteScope is a third-party monitoring product that provides application-specific,
end-to-end transaction testing inside the data center.
- Cluster Sentinel, in a highly customized version based on the version included in
the Microsoft Windows® 2000 Server Resource Kit, provides server monitoring
inside the data center, in addition to automated cluster membership management.
- Keynote Global 35 is a third-party service that monitors the availability of the
overall Microsoft.com site and conducts application-specific transaction testing
from 35 regions around the world.
- IIS Log Monitor is a custom application that collects the IIS logs. Log Parser parses
the enormous volumes of IIS logs.
The overall system processes over 60,000 alerts a day, conducts approximately 11.5
million availability tests a day, parses 1.7 terabytes of IIS log data a day, and
collects 185 million performance counters a day at a sampling rate of 45 seconds.
However, reaching this degree of monitoring sophistication was a long process that
required significant effort and cross-organizational coordination. The development
of the solution followed a four-step progression, with each step building upon the
previous one:
- Asset management
- Reactive monitoring
- Proactive testing and monitoring
- Reporting and analysis
Asset Management
After creating the operations workbench framework, the next step in creating the
monitoring solution was to gain control of asset management. Before any operations
team can accurately and completely monitor all the servers that make up a system,
it must know what servers are deployed, where they are, how they are being used,
and who or what owns them. The team must also know how to determine when a server
has reached its maximum capacity or the end of its useful life.
For asset management, Microsoft.com uses Active Directory to organize all servers
into OUs by property type and network connectivity. Only Microsoft.com administrators
and specific data center personnel have permission to manage servers in these OUs.
An MIIS management agent then regularly searches Active Directory for updates to
synchronize information with a more detailed, ADAM-based asset store. The operations
team chose ADAM as the asset store instead of a relational database because the
team wanted to organize the data in a hierarchical structure for which ADAM is specifically
designed.
After the data in ADAM is updated from Active Directory, WMI-based scripts query
each server to gather detailed characteristics. The ADAM store then reflects new
data and updates.
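The synchronization step described above can be sketched in Python as a simple reconciliation between a directory listing and an asset store. The dictionary shapes, the OU path strings, and the function name are hypothetical; the actual solution uses an MIIS management agent and ADAM rather than in-memory dictionaries.

```python
def sync_assets(directory, asset_store):
    """Bring asset_store in line with directory and report what changed.

    directory:   {server_name: ou_path} as read from the directory service
    asset_store: {server_name: {"ou": ou_path, "details": {...}}}
    """
    added, moved, retired = [], [], []
    for name, ou in directory.items():
        record = asset_store.get(name)
        if record is None:
            # New server: create an empty record to be filled in later.
            asset_store[name] = {"ou": ou, "details": {}}
            added.append(name)
        elif record["ou"] != ou:
            # Server was moved to a different organizational unit.
            record["ou"] = ou
            moved.append(name)
    for name in list(asset_store):
        if name not in directory:
            # Server no longer exists in the directory.
            del asset_store[name]
            retired.append(name)
    return sorted(added), sorted(moved), sorted(retired)
```

In this sketch, the servers returned in the first list are the ones that a follow-up query (the WMI scripts in the real solution) would then inspect to populate the empty details record.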
Reactive Monitoring
After the operations team accurately identified and characterized the population
of servers on a regular basis, the team implemented a real-time monitoring system
to react to problems as soon as possible after they arose. The team uses MOM 2005
for event and performance data collection, utilizing a subset of settings from various
MOM management packs and events forwarded from several other sources.
The operations team implemented two main types of reactive monitoring for Microsoft.com.
One type of monitoring was designed to measure server and application performance
and availability. The other, and perhaps the more important and challenging type,
is monitoring customer-perceived performance and availability.
Server and Application Performance
The Microsoft.com operations team applies a multi-layered approach to monitoring
the overall health of the environment. The approach includes monitoring server hardware
and basic server network connectivity, key elements of the operating system for
individual servers, overall Network Load Balancing (NLB) server cluster availability,
and the response of specific features of applications. The team also co-monitors
network anomaly detection
and mitigation devices.
For a global view of application health and to provide automated fail-over to redundant
systems, the operations team partners with Akamai and SAVVIS to utilize their global
load balancing services. These services are configured to constantly monitor each
NLB cluster of servers from distributed non-Microsoft networks and will automatically
pull clusters out of rotation should they begin to fail. These services are a key
aspect of the team's high availability model and business continuity design.
Server monitoring starts with hardware agents provided by enterprise-class server
vendors that monitor various aspects of the physical components of a server, such
as power supply and cooling fan operation, CPU temperature, and hard disk drive
function. In some cases, the hardware agents are able to predict imminent failure
of a particular component based on trends in data sampling. In addition to providing
a high degree of component reliability and redundancy, employing enterprise-class
servers to build critical production systems can yield the added benefit of seamless
integration with most monitoring frameworks. In the case of Microsoft.com, the hardware
agents forward event and alert notifications to the MOM 2005 infrastructure.
As described earlier, the operations team has highly customized Cluster Sentinel
to monitor disk space and provide monitoring to determine whether a server is responding
to simple network requests, also known as heartbeat monitoring. In addition, Cluster
Sentinel uses a combination of custom ASP and ASP.NET files to provide end-to-end
testing of applications based on specific application health criteria defined by
the application developers. This implementation of Cluster Sentinel can automatically
remove a server from the cluster if these tests fail. For more in-depth application
monitoring, the operations team uses SiteScope specifically for transactional tests
that require interaction to determine successful completion of a transaction. Lastly,
IIS Log Monitor is a custom application that collects the IIS logs, after which
Log Parser parses all IIS errors generated by applications.
For SQL Server database servers, MOM 2005 agents are also used to monitor the service
status. Cluster Sentinel regularly executes an sp_who query to test that
the SQL Server service is responding in a timely fashion.
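The health checks described in this section can be illustrated with a minimal Python sketch of the decision logic only. It assumes that each probe reports a heartbeat result, an application test result, and the time a timed query took to return. The two-second limit and the rule of three consecutive failures are illustrative assumptions; the case study does not state the criteria that Cluster Sentinel applies.

```python
from dataclasses import dataclass


@dataclass
class ProbeResult:
    heartbeat_ok: bool       # server answered a simple network request
    app_test_ok: bool        # application health test reported success
    response_seconds: float  # time the timed query took to return


class ServerHealth:
    """Tracks probe results for one server (hypothetical helper class)."""

    def __init__(self, max_response_seconds=2.0, failures_before_removal=3):
        self.max_response_seconds = max_response_seconds
        self.failures_before_removal = failures_before_removal
        self.consecutive_failures = 0
        self.in_rotation = True

    def record(self, result):
        """Record one probe and return whether the server stays in rotation."""
        healthy = (result.heartbeat_ok
                   and result.app_test_ok
                   and result.response_seconds <= self.max_response_seconds)
        if healthy:
            self.consecutive_failures = 0
        else:
            self.consecutive_failures += 1
            if self.consecutive_failures >= self.failures_before_removal:
                # Once removed, the server stays out until an operator
                # returns it; this sketch never re-adds it automatically.
                self.in_rotation = False
        return self.in_rotation
```

Requiring several consecutive failures is one common way to avoid removing a server because of a single transient timeout.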
Performance counters are regularly collected on all servers at intervals of 45 seconds.
A standard set of 42 counters is configured, collected, and forwarded to the MOM
2005 infrastructure. These counters represent a cross section of hardware-based
(CPU, memory utilization, disk utilization) and software-based (IIS, SQL, TCP) objects
that provide further insight into overall health of the system. For more detail
on the performance counters collected by the Microsoft.com operations team, see
the Troubleshooting and Debugging Web Applications webcast at:
http://msevents.microsoft.com/CUI/EventDetail.aspx?EventID=1032283908&Culture=en-US.
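As a rough consistency check, the following Python arithmetic relates the 45-second interval and the 42 standard counters to the collection volume of 185 million performance counters a day that this case study quotes. It assumes that every server reports exactly the standard counter set around the clock, which the case study does not state, so the resulting server count is only an estimate.

```python
SECONDS_PER_DAY = 24 * 60 * 60
SAMPLE_INTERVAL_SECONDS = 45
COUNTERS_PER_SERVER = 42
SAMPLES_COLLECTED_PER_DAY = 185_000_000  # volume quoted in this case study

# Samples taken per counter per day at one sample every 45 seconds.
samples_per_counter = SECONDS_PER_DAY // SAMPLE_INTERVAL_SECONDS

# Samples one server produces per day with the standard counter set.
samples_per_server = samples_per_counter * COUNTERS_PER_SERVER

# Number of servers that would produce the quoted daily volume.
implied_servers = SAMPLES_COLLECTED_PER_DAY / samples_per_server

print(samples_per_counter, samples_per_server, round(implied_servers))
```

Under these assumptions, each counter yields 1,920 samples a day, each server yields 80,640 samples a day, and the quoted volume corresponds to roughly 2,300 servers, which is in line with the "thousands of servers" described earlier.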
Microsoft.com uses Cisco Guard devices to detect traffic anomalies that may represent
denial of service (DoS) or other attacks. These devices forward Simple Network Management
Protocol (SNMP) traps to a SQL Server database that provides a view of the traffic
patterns and specific events in the operations workbench. The Cisco Guard devices
and the remainder of the network devices are primarily monitored and managed by
the respective operations teams for the data centers that host the Microsoft.com
servers.
Customer-Perceived Performance
Microsoft.com partners with a third party, Keynote Systems, to constantly measure
and report on the overall site and specific application availability and performance
as experienced from 35 locations around the world.
The Keynote Global 35 service has agents on servers that run a test approximately
once a minute by trying to load the home page of Microsoft.com. If anything fails
to load, such as a .gif file, a text string, or any component that makes up the
home page that is being called (even if not visible), the test will generate an error.
The operations team has written a Web service to forward Keynote metrics to the
Microsoft.com monitoring system as they are gathered. At the end of each day, Keynote
also provides a summary availability report.
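The pass or fail rule described above, and the way individual tests roll up into an availability figure, can be sketched in Python as follows. The function names and the shape of the input data are assumptions made for illustration; they do not represent the Keynote service or its reporting format.

```python
def page_test_passed(component_results):
    """A page test passes only if every component of the page loaded.

    component_results maps a component name (for example, an image or a
    text string check) to True if it loaded and False otherwise.
    """
    return all(component_results.values())


def availability_percent(test_outcomes):
    """Return the share of passed tests as a percentage."""
    if not test_outcomes:
        raise ValueError("no test outcomes to summarize")
    return 100.0 * sum(test_outcomes) / len(test_outcomes)
```

For example, a test in which the page body loads but a single image does not counts as a failure, and three passes out of four tests yield an availability of 75 percent.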
In addition to caching content in multiple global locations along with SAVVIS, CDN
partner Akamai monitors specific servers from outside the data center and has the
ability to remove a specific server from a cluster if it is not responding.
Proactive Testing and Monitoring
After implementing and stabilizing the asset management and reactive monitoring
systems, the focus of the operations team shifted to proactive testing of applications
and defining proactive monitoring events. Extensive end-to-end application transaction
testing and application stress testing helps to expose many potential problems prior
to releasing the application to production. The testing process also helps to determine
what events are meaningful, and what corrective actions are appropriate in the case
of those events. All of the information learned from transactional and stress testing
is thoroughly documented as part of the release management process of the Microsoft
Solutions Framework (MSF) that many of the development teams use.
The same systems that provide alerts for events requiring immediate, reactive action
are also used to provide alerts that are predictive in nature, indicating whether
a condition is trending toward a problem situation. After thorough application testing
and observing the behavior of the components of the overall system over time, combined
with the results of problem resolution activities, it becomes possible to identify
the symptoms or causes that precipitate certain application errors. Defining and
refining proactive events over time exemplifies the evolutionary nature of the monitoring
system, and it emphasizes the importance of constantly learning from the data and
putting the knowledge to use.
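One simple form of a predictive alert is a trend projection. The following Python sketch fits a least-squares line to recent samples and estimates how long it will take for the value to reach a threshold. The choice of a linear fit, the function name, and the sample data are assumptions for illustration; the case study does not describe how the team defines its predictive events.

```python
def hours_until_threshold(samples, threshold):
    """Fit a least-squares line to (hour, value) samples and extrapolate.

    Returns the hours from the last sample until the fitted line crosses
    the threshold, or None if the values are not trending toward it.
    """
    n = len(samples)
    if n < 2:
        return None
    mean_x = sum(x for x, _ in samples) / n
    mean_y = sum(y for _, y in samples) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in samples)
    if sxx == 0:
        return None  # all samples share one timestamp; no slope to fit
    slope = sum((x - mean_x) * (y - mean_y) for x, y in samples) / sxx
    intercept = mean_y - slope * mean_x
    current = slope * samples[-1][0] + intercept
    if slope == 0:
        return None  # flat trend never reaches the threshold
    remaining = (threshold - current) / slope
    return remaining if remaining >= 0 else None


# Hypothetical free disk space in GB, sampled hourly and falling 2 GB an hour.
free_space = [(0, 50.0), (1, 48.0), (2, 46.0), (3, 44.0)]
print(hours_until_threshold(free_space, threshold=10.0))
```

An alert based on such a projection can be raised while there is still time to act, instead of after the threshold has been crossed.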
Reporting and Analysis
The Microsoft.com operations team has developed custom Web services to merge data
in real time from all sources to provide a holistic overview. Each night, an automated
process aggregates the data into a data warehouse for longer-term analysis. The
data is normalized into a common schema and presents related data from different
sources side by side. The types of data include Keynote Global 35 data and all other
monitoring data. Detailed performance data is aggregated into hourly and daily averages,
with sample count, standard deviation, minimum, maximum and standard error retained
for detailed trend analysis.
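The hourly aggregation described above can be sketched in Python with the standard library. The input format of (epoch seconds, value) pairs and the field names in the result are assumptions; the statistics retained match those listed in the preceding paragraph.

```python
import math
import statistics
from collections import defaultdict


def hourly_rollup(samples):
    """Aggregate (epoch_seconds, value) samples into per-hour statistics."""
    buckets = defaultdict(list)
    for timestamp, value in samples:
        buckets[int(timestamp // 3600)].append(value)

    rollup = {}
    for hour, values in sorted(buckets.items()):
        count = len(values)
        # The sample standard deviation needs at least two data points.
        stdev = statistics.stdev(values) if count > 1 else 0.0
        rollup[hour] = {
            "count": count,
            "mean": statistics.fmean(values),
            "stdev": stdev,
            "min": min(values),
            "max": max(values),
            # Standard error of the mean for this hour.
            "stderr": stdev / math.sqrt(count),
        }
    return rollup
```

Retaining the sample count and standard deviation alongside the average makes it possible to combine hourly rows into daily figures later, and to judge how much an average can be trusted.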
A snapshot of the asset configuration store is also created daily. Server-specific
information that is maintained in this asset store includes the following data:
The operations team then uses SQL Server Reporting Services to generate standard
daily, 30-day, 90-day, and year-to-date reports that include the following information:
Debugging teams and other organizations need to access the data in the warehouse
for deeper analysis. Instead of granting access to the data warehouse, the operations
team uses SQL Server Data Transformation Services (DTS) to provide data feeds for
custom and recurring needs. By limiting access to the data warehouse, the operations
team avoids the risk of unmanaged, one-time query activity that may have adverse
performance impacts on the data warehouse.
Monitoring and Reporting Tomorrow
The operations team has plans to be an early adopter of Microsoft System Center
Reporting Manager 2006 to replace some of the currently customized functionality
in the operations workbench. The team is also investing considerable effort in developing
a configuration management enhancement to the operations workbench based on XML
manifests which define standard configurations. Both platform-specific and application-specific
configuration manifests will be created as part of the development cycle prior to
release to production.
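The manifest-based approach can be illustrated with a minimal Python sketch that compares a standard configuration, expressed as XML, with the settings observed on a server. The manifest schema, the setting names, and the values shown are entirely hypothetical, because the case study does not describe the manifest format.

```python
import xml.etree.ElementTree as ET

# Hypothetical manifest; element and attribute names are assumptions.
MANIFEST = """
<manifest role="web">
  <setting name="iis.version" value="6.0"/>
  <setting name="app_pool.recycle_minutes" value="1740"/>
</manifest>
"""


def load_manifest(xml_text):
    """Parse a manifest into a {setting_name: expected_value} dictionary."""
    root = ET.fromstring(xml_text.strip())
    return {s.get("name"): s.get("value") for s in root.iter("setting")}


def find_drift(expected, actual):
    """Return {setting: (expected_value, actual_value)} for every mismatch.

    A setting that is missing on the server is reported with None as its
    actual value.
    """
    return {
        name: (value, actual.get(name))
        for name, value in expected.items()
        if actual.get(name) != value
    }
```

A periodic scan could call find_drift for each server and raise an event only when the returned dictionary is not empty.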
Other efforts are underway to standardize application instrumentation. Hundreds
of developers provide code to Microsoft.com, all using different methods to instrument
their code for event logging at varying levels of detail. The operations team wants
to create a common eventing and logging class, based on recommendations from the
Microsoft Patterns and Practices group, with deep application tracing. For more
information about Microsoft Patterns and Practices, go to
http://msdn.microsoft.com/practices/.
Problem management is another focus. The operations team wants to learn as much
as possible from the enormous amount of data by correlating data sources and annotating
those correlations. The primary benefit of the data warehouse is to learn from the
data. The operations team wants to use artificial intelligence engines with genetic
algorithms that can learn normal patterns to search through data and look for statistical
anomalies. Those algorithms can then be applied to live data to predict problems
well before they manifest themselves.
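The idea of learning a normal pattern and then flagging statistical anomalies can be shown with a much simpler technique than the genetic algorithms that the team has in mind. The following Python sketch learns a baseline as a mean and standard deviation from historical values and flags live values that deviate by more than a chosen number of standard deviations. The threshold of three standard deviations and the sample values are assumptions.

```python
import statistics


def learn_baseline(history):
    """Learn a simple baseline (mean, sample standard deviation)."""
    return statistics.fmean(history), statistics.stdev(history)


def find_anomalies(baseline, live_values, z_threshold=3.0):
    """Return live values that deviate from the baseline by more than
    z_threshold standard deviations."""
    mean, stdev = baseline
    if stdev == 0:
        # A perfectly constant history makes any change an anomaly.
        return [v for v in live_values if v != mean]
    return [v for v in live_values if abs(v - mean) / stdev > z_threshold]


# Hypothetical requests-per-second history and live readings.
baseline = learn_baseline([100, 102, 98, 101, 99])
print(find_anomalies(baseline, [100, 103, 120]))
```

A production system would likely need separate baselines by time of day and day of week, because normal traffic on a large Web site varies considerably over those periods.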
As with many organizations, Microsoft.com envisions a "lights out data center" where
administrators rarely if ever need to physically visit and all servers can be managed
remotely and automatically to the greatest extent possible. For example, remote
control boards embedded in the servers can be instructed programmatically to turn
servers off when usage is low and to power them back on when needed. The "lights
out strategy" is intended to enable systems engineers to focus their time on the
most important activities such as deeper engagement with the development teams they
support, and to focus on new technologies and developing best practices.
Troubleshooting
Performance issues with the Microsoft.com site can generally be grouped into three
categories:
- Applications
- Networks
- Operations excellence
The monitoring system was designed to present only events that prompt support personnel
to take action. However, the operations team wants to document all events during
application testing in such a way that most event actions will have associated corrective
actions that can be performed at the first level of support. Some events require
escalation to the next level of support. Like most large organizations, the Microsoft.com
operations team has several tiers of support for incident, or ticket, escalation.
Tier 1
Tier 1 provides support 24 hours a day, seven days a week and provides initial recognition,
routing, and if possible, resolution of tickets. Tier 1 personnel use monitoring
information and troubleshooting guides which are dynamically associated with the
asset or property by the operations workbench to perform initial research and to
aid in incident resolution. If an incident cannot be resolved by Tier 1 support
personnel, it is assigned and routed to Tier 2 support.
Tier 2
Tier 2 provides infrastructure support for server hardware issues, and specific
application support, by personnel with in-depth knowledge of the applications. Tier
2 personnel also use monitoring information and troubleshooting guides by using
the operations workbench to aid in incident resolution. If an incident cannot be
resolved by Tier 2 support personnel, it is assigned and routed to Tier 3 support.
If the incident is resolved at this tier, the troubleshooting guides are updated
with any newly discovered information that can aid in future resolutions.
Tier 3
Tier 3 performs in-depth analysis and consists of systems engineers and database
administrators assigned to specific Web properties, and product managers and developers
for specific applications. Tier 3 personnel use monitoring information to develop
troubleshooting guides, and as administrators, they may also perform other troubleshooting
activities on a server. If an incident cannot be resolved by Tier 3 support personnel,
it is assigned and routed to Tier 4 support, also known as the debug team. If the
incident is resolved at this tier, the troubleshooting guides are updated with any
newly discovered information that can aid in future resolutions.
Tier 4
In Tier 4, the debug team performs application and kernel debugging. Generally, this
type of invasive troubleshooting requires the server to be removed from production
for the duration of testing. Tier 3 personnel review the troubleshooting history
to clearly identify the incident and reproduce the problem steps. The goal of the
debug team is to identify the root cause of the problem so that comprehensive information
can be routed to the appropriate application development team to implement an update.
If the incident is resolved at this tier, the troubleshooting guides are updated
with any newly discovered information that can aid in future resolutions.
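The escalation path through the four tiers can be summarized in a short Python sketch. The function and its input are hypothetical and model only the routing rule described above: an incident moves to the next tier whenever the current tier cannot resolve it.

```python
TIERS = ["Tier 1", "Tier 2", "Tier 3", "Tier 4"]


def route_incident(can_resolve):
    """Walk an incident up the tiers until one of them resolves it.

    can_resolve maps a tier name to True if that tier can resolve the
    incident. Returns (resolving_tier, escalation_path), where
    resolving_tier is None if no tier resolves the incident.
    """
    path = []
    for tier in TIERS:
        path.append(tier)
        if can_resolve.get(tier, False):
            return tier, path
    return None, path
```

The returned path records every tier that handled the incident, which is the kind of history that later tiers review and that feeds updates to the troubleshooting guides.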
Note: For more detail on debugging steps that the Microsoft.com operations
team commonly uses, see the webcast titled "Troubleshooting & Debugging Web
Applications" at http://blogs.technet.com/mscom/.
Best Practices
Best practices for monitoring include the following:
- Centralize as much as possible. Design a solution that aggregates as much
meaningful information as possible into a central location to aid in incident and
problem management. A MOM 2005 infrastructure is capable of collecting a great number
of events, data that is crucial to a monitoring solution. Processes and troubleshooting
guides should also be centralized and standardized so that all members of the team
approach problems in a similar manner and have access to the same information, and
can update documentation when appropriate.
- Manage assets. Know which assets are in use, their purpose, and who owns
them, and keep this information up to date by using an automated process. A hierarchical
structure provides the ability to document and view the relationships among assets
and, potentially, operations personnel and documentation.
- Determine which data is important. Most monitoring products generate enormous
amounts of data and events by default. It is extremely important to identify and
enable only meaningful data and events that are useful and actionable.
- Implement both reactive monitoring and proactive testing and monitoring.
After the reactive monitoring system is in place, focus on proactive testing of
applications. Attempt to identify events that are predictive in nature.
- Learn from the data. Aggregate, correlate, and annotate data from different
sources to identify patterns. Consider using genetic algorithms to mine the data
to determine baselines and identify anomalies.
Best practices for troubleshooting include the following:
- Stress test applications. In addition to end-to-end transaction testing,
applications should be stress tested to see how they perform under a heavy load.
Include alert definitions and associated corrective actions learned from the testing
process as part of the release management process. This activity can help identify
proactive monitoring events in particular, in addition to exposing scalability issues
for a given application before it is released to production.
- Make all events actionable. Provide as much context as possible for Tier
1 and Tier 2 support groups so that they can resolve known problems without escalation.
Make troubleshooting guides easily accessible so that they can capture valuable
lessons learned from problem resolutions.
- Conduct reviews. The most serious problems are often caused by lack of operational
excellence. Conduct a thorough review of these situations to help identify processes
that need improvement.
Benefits
A primary objective of the Microsoft.com operations team is to achieve the highest
availability on the Internet. Achieving that availability can be accomplished only
by using a comprehensive monitoring solution that includes both reactive monitoring
and proactive testing and monitoring, actionable alerts and ready access to intelligent
troubleshooting information. By implementing such a solution, Microsoft.com has
achieved 99.83 percent availability over the past three years, as measured by the
Keynote Global 35. During those three years, Microsoft.com has ranked first in availability
compared to all other major Web sites in the industry.
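To put the availability figure in perspective, the following Python arithmetic converts 99.83 percent into the downtime it allows. It assumes a 365-day year and treats the figure as a simple fraction of elapsed time, which is a simplification, because Keynote measures availability through sampled tests.

```python
AVAILABILITY_PERCENT = 99.83
HOURS_PER_YEAR = 365 * 24
MINUTES_PER_DAY = 24 * 60

# Fraction of time the site is allowed to be unavailable.
unavailable_fraction = (100 - AVAILABILITY_PERCENT) / 100

downtime_hours_per_year = unavailable_fraction * HOURS_PER_YEAR
downtime_minutes_per_day = unavailable_fraction * MINUTES_PER_DAY

print(round(downtime_hours_per_year, 2), round(downtime_minutes_per_day, 2))
```

Under these assumptions, 99.83 percent availability corresponds to slightly less than 15 hours of unavailability a year, or about two and a half minutes a day.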
Being able to quickly and accurately identify potential issues with Microsoft products
when implemented under the most demanding volumes enables the operations team to
help drive Microsoft product improvements. As an early adopter of many Microsoft
technologies, the operations team frequently provides information on potential problems
that may adversely affect product scalability and reliability, before those products
are released to customers. Products that are scalable and reliable provide a solid
foundation for highly available, high volume services.
Another huge benefit of a comprehensive monitoring solution is that valuable engineers
can focus on engineering, rather than spending their time reacting to current problems.
When architects and engineers are able to spend their time designing and testing
future systems and upgrades, the resulting systems are far more likely to be well
architected and tested prior to deployment.
The Microsoft.com operations team is currently planning to enhance the operations
workbench by adding a manifest-based configuration management system for servers
and applications. This enhancement will provide the additional benefit of ensuring
complete consistency across all systems. Currently, automated scripts used to initially
install and update servers and applications ensure consistency in those particular
actions, but no method currently exists to detect configuration changes outside
those actions. In the future, a periodic, comprehensive configuration scan will
detect any such anomalies.
Conclusion
Monitoring Microsoft.com, one of the busiest Web sites in the world, requires always
knowing the health of the individual hardware components and individual applications,
in addition to the overall availability, performance, and capacity of the entire
site from various places around the world.
No single technology is available to monitor all of these aspects, and certainly
no product exists that can intelligently monitor such a complex environment out
of the box. Consequently, the operations team developed a monitoring solution built
on multiple Microsoft products and several third-party products and external monitoring
services.
A comprehensive monitoring solution includes reactive monitoring and proactive testing
and monitoring to detect and avoid problem situations before they arise. The foundation
of the Microsoft.com solution is an extensible framework that aggregates loosely
coupled system components and an asset and object management system that dynamically
tracks all of the objects that make up the system. A strong release management process
is in place to ensure applications are properly tested and documented prior to release
to production. Documentation of important events and how to react to them provides
streamlined incident management and troubleshooting after the systems are released
to production.
The solution that the Microsoft.com operations team developed incorporates the aspects
of asset management, reactive monitoring, proactive testing and monitoring, reporting
and analysis, and intelligent troubleshooting support. Designing and implementing
the solution took substantial time, effort, and coordination among several organizations,
but the resulting benefits are clearly evident in the ability
to achieve the highest Web site availability in the industry.
For More Information
For more information about Microsoft products or services, call the Microsoft Sales
Information Center at (800) 426-9400. In Canada, call the Microsoft Canada Information
Centre at (800) 563-9048. Outside the 50 United States and Canada, please contact
your local Microsoft subsidiary. To access information via the World Wide Web, go
to:
http://www.microsoft.com
http://www.microsoft.com/technet/itshowcase