Enhancing Service Support with System Center Operations Manager 2007 R2 and Management Packs
Technical Case Study
Published: April 2011
Microsoft IT created a workflow process that optimizes its implementation of Microsoft System Center Operations Manager 2007 R2 management packs into its enterprise monitoring service. The iterative process is used to analyze and tune management pack alerts, thresholds, and troubleshooting information for a particular service or application. The data is exposed to all relevant support organizations for the service. By involving all of the support organizations, a predictable implementation is achieved, and support organizations are thoroughly prepared to handle related issues.
Microsoft IT determined there was a need to more precisely manage the deployment of management packs in its monitoring environment. Tuning management packs to meet the needs of the MSIT environment was overly time consuming and required operations staff to deeply understand what each management pack monitored.
Using the Microsoft Operations Framework as a model, MSIT introduced rigorous change control into the monitoring environment and defined a process for introducing new management packs quickly and efficiently. A workflow process was developed that helped MSIT tune management pack settings, prepare its infrastructure capacity for the effect of management packs, and involve and inform all service stakeholders.
Effectively supporting a service means getting the right information to the right people at the right time. In the Microsoft® enterprise monitoring environment, effective service support is a collaborative effort between multiple parties, such as incident response teams, service managers, and service desk support personnel.
This paper describes the process that Microsoft IT (MSIT) developed to more effectively tune and implement management packs into its Microsoft System Center Operations Manager 2007 R2 production service monitoring environment. Note that this document should not function as a procedural roadmap, because each organization's operational environments are unique. This paper is intended for Technical Decision Makers and IT Professionals.
Measuring the health of any enterprise IT environment is challenging. Management packs contain health measurement criteria for specific applications and services that help IT organizations quickly achieve depth and breadth of health monitoring. Once imported, health monitoring begins, based on the default configurations and thresholds set by the management pack.
Note: For more information please see "Introduction to Management Packs" at http://technet.microsoft.com/library/cc974491.aspx.
Management packs are typically configured to address a wide range of health conditions. Deploying unmodified management packs can cause excessive levels of alerting, create unnecessary work, and increase support costs.
MSIT tuned management packs for release in their production environment, but did not have an efficient process in place to inform and engage with all of the service stakeholders. This resulted in lengthy delivery times and made it more difficult to coordinate with all of the services that might be impacted by the release of a new management pack in the environment. MSIT needed to develop a more consistent management pack tuning and implementation process that would make it easier and faster to tune and deploy management packs into the production monitoring environment. MSIT wanted a workflow process that would help them prepare the infrastructure capacity for the impact of management packs and involve and inform all service stakeholders.
The Tune, Review, and Implement Process, also known as TRIP, was developed to increase management pack release consistency and quality. TRIP delivers two critical tuning components. First, only actionable alerts are raised. Actionability can be defined as the ability to act on the monitoring information provided by the configured management pack. Second, specific troubleshooting knowledge is developed and delivered that helps incident support teams efficiently mitigate any issues.
By design, TRIP uniformly processes System Center Operations Manager 2007 R2 management packs. It supports management packs provided by both Microsoft-internal product development groups and external parties. Because publicly available management packs are considered mature due to development and testing rigor, TRIP limits testing to alert volume analysis and to product knowledge analysis and augmentation. Allowed management pack modifications include:
- Performance rule threshold changes.
- Alert suppression or delay, based on alert volume or duration thresholds.
- Disabling rules not suitable for the monitoring environment.
- Monitor additions, if required for collection or alerting purposes.
- Knowledge area modifications or additions. This applies to either product or company-specific knowledge areas.
Roles and Responsibilities
The roles and responsibilities of the key resources that support the end-to-end TRIP process described in this document are detailed below.
Service Providers initiate the request for a TRIP engagement. Typically, the Service Providers manage one of the internal business applications at Microsoft. They also provide the subject matter expertise to request changes to the management pack, assess the actionability, and develop and deliver training for the configured management pack. The Service Owner reviews and approves decisions at milestone gates on behalf of the service, and owns meeting Service Provider schedule deliverables. The Subject Matter Expert (SME) provides management pack recommendations. These include changes to monitors and rules, threshold additions and changes, and troubleshooting guide (TSG) training development and review.
TRIP Engineering resources manage the engagement and provide technical expertise to analyze and modify the management packs. The TRIP Project Manager drives an individual engagement from start to finish, mitigates issues, and executes on a communication plan that ensures that all parties clearly understand the ongoing state of the engagement. The TRIP Engineer provides System Center Operations Manager 2007 R2 and MSIT enterprise production environment expertise that supports the analysis, tuning, and testing of the management pack throughout the TRIP life cycle.
Microsoft IT Incident Management team members provide review and approval at milestone gates. Their input and expertise are specific to alert volumes and alert troubleshooting documentation.
Microsoft IT Release Management team members provide release gate guidance and ensure release process adherence throughout the engagement.
Product Group Representative
Microsoft Product Development Group (PG) representatives provide design input to the Service Provider and TRIP Engineering teams. PG participation also ensures that management pack configuration change feedback is factored into future product recommendations.
The TRIP Process
TRIP has two execution paths. New Onboarding represents the complete end-to-end process workflow. The Sustain workflow utilizes only the subset of processes within the New Onboarding workflow that are required to implement a quality release based on the requested change. Both paths are explained in detail next.
TRIP New Onboarding has four primary components:
- Analysis
- Test and Tune
- Documentation Review
- Implementation
These components are separated by stakeholder collaboration and sign-off activities. The process is broken down in the following graphic.
Figure 1. TRIP Process Overview
Once the service area has initiated a request for a TRIP engagement and it becomes active, the first step is to validate the request and establish an execution framework. Initialization is performed during a formal meeting where the following occurs:
- Roles and responsibilities are discussed.
- Stakeholders for each role are identified, and resource commitments are secured.
- A high-level description of the process is provided for new service area representatives.
- Initial schedule dates are provided and the next steps are discussed.
Upon completion of the initialization meeting, the process moves into Analysis.
Management Pack Analysis consists of two steps: MP Code Analysis and MP Content Review.
Figure 2. Management Pack Analysis
MP Code Analysis
A TRIP Engineer uses various means to examine the management pack code. The engineer determines if the management pack is configured per enterprise class standards. Specifically:
- The management pack structure is reviewed against the Management Pack Best Practice Analyzer (MPBPA).
- The group structure is analyzed for compatibility and effectiveness.
- The discovery methods are reviewed for frequency and environmental cost.
As the final Analysis step, the management pack code is converted into an easily readable format, which allows Service Area SMEs to review the monitors and rules for tuning and actionability consideration.
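The conversion into a readable format could look something like the following minimal sketch. It is not the tooling MSIT used. It assumes a simplified, hypothetical XML fragment in which rules and monitors carry ID, Enabled, and Target attributes; the Contoso names and the build_digest and format_digest helpers are illustrative only.

```python
import xml.etree.ElementTree as ET

# Simplified, hypothetical fragment for illustration only. A real management
# pack contains many more elements and attributes than are shown here.
SAMPLE_MP = """
<ManagementPack>
  <Monitoring>
    <Rules>
      <Rule ID="Contoso.App.CollectCpu" Enabled="true" Target="Contoso.App.Server"/>
      <Rule ID="Contoso.App.AlertOnEvent1001" Enabled="false" Target="Contoso.App.Server"/>
    </Rules>
    <Monitors>
      <UnitMonitor ID="Contoso.App.ServiceState" Enabled="true" Target="Contoso.App.Server"/>
    </Monitors>
  </Monitoring>
</ManagementPack>
"""


def build_digest(mp_xml):
    """Return one row per rule or monitor, sorted by ID for easy review."""
    root = ET.fromstring(mp_xml.strip())
    rows = []
    for kind, path in (("rule", ".//Rules/Rule"), ("monitor", ".//Monitors/*")):
        for element in root.findall(path):
            rows.append({
                "kind": kind,
                "id": element.get("ID"),
                "target": element.get("Target"),
                # Treat a missing Enabled attribute as enabled (an assumption).
                "enabled": element.get("Enabled", "true").lower() == "true",
            })
    return sorted(rows, key=lambda row: row["id"])


def format_digest(rows):
    """Render the rows as a plain-text table that an SME can annotate."""
    lines = ["KIND     ENABLED  ID"]
    for row in rows:
        enabled = "yes" if row["enabled"] else "no"
        lines.append(f"{row['kind']:<8} {enabled:<8} {row['id']}")
    return "\n".join(lines)


if __name__ == "__main__":
    print(format_digest(build_digest(SAMPLE_MP)))
```

A flat listing of this kind gives the Service Area SME one line per rule or monitor to mark as keep, tune, or disable.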
MP Content Review
The Service Area SME reviews the documentation provided from the MP Code Analysis, and determines the following:
- Valuable collection and alerting monitors and rules
- Sample frequency adequacy for the enabled monitors and rules
- Appropriate alert thresholds for the enabled monitors and rules
If appropriate, the Service Area SME then recommends additional monitors and rules for collection and alerting. These help provide a clear understanding of specific server conditions and overall service health.
Tuning and Testing
Tuning and Testing are the core technical steps that match the general-purpose management pack downloaded from the catalog to the specific needs of the business. This phase is made up of four steps, three required and one optional, as illustrated and described next.
Figure 3. Management Pack Tune and Test
Based on the findings of the MP Code Analysis, and the recommendations and needs identified during the MP Content Review, TRIP Engineering develops a management pack deployment package. Several factors can expand the time required to complete this stage, such as:
- MP Code Analysis may uncover problems with the discovery or class structure and require remediation. From basic tuning to complete redevelopment, this critical stage ensures that the management pack can work within the MSIT production environment, without impacting other monitoring.
- MP Content Review may produce complex detection, frequency, or threshold changes or additions that increase development time.
As part of the Alert Volume Generation and Usability Testing Analysis stages, a test System Center Operations Manager environment is used to review the workflow functionality and volume footprint of the tuned management pack. To enable realistic collection and alerting, a sample set of production level servers that are running the target technology is used. This enables alert generation and usability results to be as accurate as possible.
Alert Volume Generation/Analysis
Alert volume generation is one of the most critical TRIP deliverables. During a specific period of time, by default seven calendar days, collection and alerting volumes are generated from the sample set of production servers configured in the Environmental Setup stage. At the end of the collection, a report is generated that enables the following decisions as part of the Implementation Sign-off:
- Are there unnecessary false alerts during a full day's test with no outages? This can be mitigated by turning off rules that generate false or non-actionable alerts.
- Are there adequate remediation steps in the knowledge articles for each alert that is actionable? If not, augment the documentation with specific company knowledge that is suitable for the environment.
- Will the additional collection (events, counters, and alerts) introduced by this management pack overflow the database capacity at current allocation levels? If this is the case, plan for increased capacity levels in the database and data warehouse based on expected counter, event, state change, and alerts-per-day measures from test.
These questions are action items for TRIP Engineering, Service Area, and Incident Management key stakeholders. If any stakeholder has an issue with the volumes, the pack may require retuning, or could be rejected from production environment onboarding. In some cases, where alert volumes are excessive but collections are within tolerances, management pack alerting can be turned off.
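The capacity question can be approached with simple arithmetic. The following sketch assumes that volume scales linearly with server count, which is a simplification and not a method stated in this paper. The SampleTotals type, the function names, and every number in the usage example are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class SampleTotals:
    """Totals observed in the test environment over the collection period."""
    days: int
    sample_servers: int
    alerts: int
    events: int
    counters: int
    state_changes: int


def project_daily_volumes(sample, production_servers):
    """Scale per-server, per-day averages linearly to the production fleet."""
    server_days = sample.sample_servers * sample.days
    if server_days <= 0:
        raise ValueError("sample must cover at least one server for one day")
    totals = {
        "alerts": sample.alerts,
        "events": sample.events,
        "counters": sample.counters,
        "state_changes": sample.state_changes,
    }
    return {name: total * production_servers / server_days
            for name, total in totals.items()}


def measures_over_budget(projected, daily_budget):
    """Return the measures whose projection is above the agreed daily budget."""
    return sorted(name for name, value in projected.items()
                  if value > daily_budget[name])


if __name__ == "__main__":
    # Hypothetical numbers: 20 sample servers observed for 7 days.
    sample = SampleTotals(days=7, sample_servers=20, alerts=280,
                          events=14000, counters=700000, state_changes=420)
    projected = project_daily_volumes(sample, production_servers=500)
    budget = {"alerts": 800, "events": 100000,
              "counters": 5000000, "state_changes": 2000}
    print(projected)
    print("Over budget:", measures_over_budget(projected, budget))
```

Any measure that lands over budget points back to the options described above: retune the pack, plan additional database and data warehouse capacity, or turn alerting off while keeping collection.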
Some management packs target very complex technologies and may not produce any alert volumes in certain areas. In some cases, a Usability Testing stage can be added to the process that allows the TRIP Engineer and Service Area SME to manually or synthetically trigger alerting to verify acceptable detection and alerting. Usability Testing is optional, but when needed, it ensures that additional quality bars are in place before production release. The Usability Testing requirement may be established at the initialization meeting, or after the Alert Volume Generation Analysis, if volume generation fails due to complexity of the management pack's detection matrix.
The documentation review process evaluates and augments the product and company knowledge areas of the management packs that technicians use to resolve detected health issues in a timely and efficient manner. The product knowledge that typically comes with a public management pack provides information about the alert, why it was raised, and some best practices for issue resolution. Because the provided documentation is intended for multiple audiences, it typically requires some supplementation or augmentation for use by monitoring and support personnel in a specific enterprise-level monitoring scenario. Reviewing and augmenting the product knowledge helps ensure that the knowledge delivered with the alert is accurate and contextually appropriate for the enterprise. The phase is broken down into two subphases. One is for the actual development of the troubleshooting knowledge, and the other is optional, for technology training purposes.
Figure 4. Management Pack Actionability Development
The troubleshooting guide (TSG) development and review brings together the Service Area SMEs, Release, and Incident Management team resources to ensure that the alert knowledge presented to operational resources is actionable in a predictable and efficient manner. It also includes some Knowledge Base development and management pack augmentation, depending on the results of the initial analysis.
The Service Area SME reviews the management pack product knowledge provided during the initial MP Code Analysis. This information is then transformed into a format that can be understood and used by Level 1 Service Desk resources. A provided template helps the Service Area SME complete this step in a timely and consistent manner.
Incident Management resources review and socialize the information with operational staff at the appropriate support levels to ensure that they understand and can execute on the knowledge presented.
Once the TSG Review is complete, a TSG sign-off is required to ensure that Incident Management agrees to the content and actionability. Based on the TSG content, Incident Management can request additional training to augment the overall technology knowledge of its staff. If sign-off is not achieved, a revision is requested from the Service Area SMEs, which adds time to the overall release schedule. If sign-off is achieved, repository and management pack insertion tasks are then completed.
Adding to the Knowledge Base
A resource adds the information into the Knowledge Base repository based on the template provided in the TSG Development Stage. After the information is added to the Knowledge Base, a link to the knowledge is produced and delivered to the TRIP Engineer for management pack inclusion.
Adding Alert Link
The TRIP Engineer edits the management pack code and inserts the knowledge link into the company data, enabling it in the alert stream when the pack is implemented into the production environment.
Training ensures that support personnel are ready to troubleshoot and resolve issues as new technology is introduced into the Incident Management ecosystem. Specific training for management pack release is not always necessary, but sometimes the level of complexity to interpret management pack product or company knowledge does require additional training. Training development and delivery is coordinated between the Service Area Representative, the Service Area SME, Release, and Incident Management. Typically, the training is conducted by a recorded Microsoft Office Live Meeting session to accommodate both immediate and future training needs.
Once the Tune and Test phase and Actionability phase are complete, a review of the Alert Volumes and final TSG confirmation is performed. This is conducted via a meeting with all stakeholders. This critical path item, if not achieved, blocks further progress until issues are resolved.
After implementation sign-off, the resulting management pack can be introduced into the target production environment and reviewed further. The following sections discuss this in detail.
Figure 5. Management Pack Implementation
Implementation takes approximately one week, based on change and release processes. The actual import and validation work takes approximately one business day.
Import, or production injection, is the process of importing the generated management pack into the MSIT core production environment. The following must occur to enable the import:
- Store the final, versioned management pack into source control.
- Import the management pack into one or more target environments using the System Center Operations Manager 2007 console and native import functionality.
Once the management pack has been introduced into the target environment, a review to determine whether the following conditions have been met must be performed before a management pack release can be considered complete:
- Workflows have successfully deployed.
- Discovery is occurring on a sample set of target assets.
- A sample set of target assets shows a configuration refresh indicating that the rules and monitors have been received and are awaiting trigger events.
Post Implementation Review Period
Because the Tune and Test phase typically applies to only a subset of production assets, alert volume projections may not align with actual production volumes. To ensure that the monitoring platform and consumers of alerts are protected from the impact of management pack activity that is above projected volume and functionality, a warranty period is provided for up to two weeks, depending on the deployment. During this period, both the Service Owners and the consumers of alerts have elevated-priority access to TRIP Engineering resources to tune the pack for unforeseen issues. Examples include discovery issues with qualified assets and alert functionality issues such as alert flooding, both of which are described next.
- Discovery Issues. Basic discovery is checked during Implementation Validation. However, issues can still be encountered once the management pack is introduced into the production environment. Discovery issues are limited to assets that meet all prerequisites established by the service. In most cases, these are the primary issues, and TRIP Engineering can provide a report of valid and healthy candidates to allow the service owner to modify either asset configurations or Configuration Management Database (CMDB) configurations that trigger platform qualification detection.
- Alert Volume Issues. As stated previously, only a sample of the production assets is used to validate management pack functionality in a preproduction environment. As such, alert projections can be incorrect, resulting in an overload of the system, the recipient's alert console, or ticket queues. During the Post Implementation Review Period, the alert consumer can notify TRIP Engineering of these issues and either rapid tuning or full rollback of the management pack can occur.
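One way to frame the choice between rapid tuning and rollback is to compare actual production volume against the pre-production projection. This paper does not state the criteria MSIT used, so the ratio thresholds and the review_action helper in the following sketch are assumptions for illustration only.

```python
def review_action(projected_per_day, actual_per_day,
                  tune_ratio=1.25, rollback_ratio=3.0):
    """Suggest a response based on how far actual volume exceeds projection.

    tune_ratio and rollback_ratio are hypothetical cut-offs; each
    organization would agree on its own values with alert consumers.
    """
    if projected_per_day < 0 or actual_per_day < 0:
        raise ValueError("volumes must be non-negative")
    if projected_per_day == 0:
        # Any alerts at all exceed a projection of zero.
        ratio = float("inf") if actual_per_day > 0 else 0.0
    else:
        ratio = actual_per_day / projected_per_day
    if ratio >= rollback_ratio:
        return "full rollback"
    if ratio >= tune_ratio:
        return "rapid tuning"
    return "within projection"


if __name__ == "__main__":
    for actual in (1100, 2000, 5000):
        print(actual, "->", review_action(1000, actual))
```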
Once the review period is over, additional management pack changes can be requested through the TRIP Sustain processes. TRIP Sustain processes generally take less time to review, tune, and implement.
Sustain processes are needed to keep up with ongoing business needs. Depending on the requested change, a Sustain engagement can use all or very few TRIP processes to achieve the desired result. In most cases, TRIP Sustain processes have a shorter implementation timeframe. The following sections describe some of the Sustain processes.
Rule or Monitor Additions
Minor rule or monitor additions may be needed between management pack release cycles to cover gaps. Changes for a single Sustain execution are limited to 10. Otherwise, a New Onboarding TRIP engagement is required. Rule and monitor additions include the following TRIP execution components:
- Tune and Test
- Actionability Development
Alert Threshold Tuning
Service Owners may wish to modify thresholds or collection frequencies to enhance alerting for more proactive actionability. This includes both increasing and decreasing sensitivity.
Decreases in rule or monitor sensitivity effectively reduce alert volumes. Based on this, the Tune/Develop portion of the Tune and Test phase is utilized, along with a limited Implementation cycle that includes only basic management pack validation and excludes discovery validation and agent configuration acceptance.
Increasing rule and monitoring sensitivity has the effect of increasing alert volumes, which requires alert volume analysis and alert consumer buy-off prior to implementation. Based on this, the complete Tune and Test phase is utilized to minimize risk to the platform, and the alert consumer's ability to act in a timely manner is validated before release.
Disable Alerting or Collections
Disabling alerting or collection rules/monitors reduces platform capacity requirements and alert consumption requirements. Because of this, only the actual tuning portion of the Tune and Test phase is utilized, along with the Import step of the Implementation phase.
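The routing of these change types can be summarized as a lookup table. The following sketch restates the steps described above; the change-type keys and the plan_engagement helper are hypothetical names, and applying the limit of 10 only to rule or monitor additions is one reading of the text.

```python
# Limit on changes in a single Sustain execution, as stated for additions.
MAX_ADDITIONS_PER_SUSTAIN = 10

# TRIP steps used for each Sustain change type, as described in the text.
SUSTAIN_STEPS = {
    "add_rule_or_monitor": ["Tune and Test", "Actionability Development"],
    "decrease_sensitivity": ["Tune/Develop",
                             "Implementation (basic validation only)"],
    "increase_sensitivity": ["Tune and Test (complete)"],
    "disable_rule_or_monitor": ["Tune/Develop",
                                "Implementation (Import only)"],
}


def plan_engagement(change_type, change_count=1):
    """Return (path, steps) for a requested change."""
    if change_type not in SUSTAIN_STEPS:
        raise ValueError(f"unknown change type: {change_type}")
    if (change_type == "add_rule_or_monitor"
            and change_count > MAX_ADDITIONS_PER_SUSTAIN):
        # Too many additions for Sustain; the full process applies instead.
        return ("New Onboarding", [])
    return ("Sustain", list(SUSTAIN_STEPS[change_type]))


if __name__ == "__main__":
    print(plan_engagement("add_rule_or_monitor", change_count=3))
    print(plan_engagement("add_rule_or_monitor", change_count=12))
    print(plan_engagement("disable_rule_or_monitor"))
```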
Knowledge Change Where New Link is Needed or Generated
Typically, modifying content in the Knowledge Base does not modify the link that has been included in the management pack. In the rare instance where a link does change or additional links need to be added for new content included in the Knowledge Base, a Sustain phase has been provided to facilitate the change and release process.
New Version Review
Periodically, new versions of management packs become available and are requested to be reviewed for production implementation consideration. In these cases, the full TRIP New Onboarding process is utilized and expanded. Expansion includes the following and may add additional time to the MP Code Analysis and Alert Volume Generation reporting steps:
- Rule comparison provided as part of the MP Digest for the MP Content Review step.
- Side-by-side comparison between production and pre-production alerting volumes.
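A rule comparison between the production version and a new version can be sketched as a set difference. The following example assumes each version has already been reduced to a mapping of rule ID to settings; the compare_rule_sets helper and the sample rule IDs are hypothetical.

```python
def compare_rule_sets(current, candidate):
    """Compare two versions, each given as {rule_id: settings_dict}."""
    current_ids = set(current)
    candidate_ids = set(candidate)
    return {
        # Rules that exist only in the new version.
        "added": sorted(candidate_ids - current_ids),
        # Rules that the new version no longer contains.
        "removed": sorted(current_ids - candidate_ids),
        # Rules present in both versions whose settings differ.
        "changed": sorted(rule_id for rule_id in current_ids & candidate_ids
                          if current[rule_id] != candidate[rule_id]),
    }


if __name__ == "__main__":
    production = {
        "Contoso.App.CpuHigh": {"threshold": 90},
        "Contoso.App.DiskLow": {"threshold": 10},
        "Contoso.App.OldEvent": {"enabled": True},
    }
    new_version = {
        "Contoso.App.CpuHigh": {"threshold": 95},
        "Contoso.App.DiskLow": {"threshold": 10},
        "Contoso.App.QueueDepth": {"threshold": 500},
    }
    print(compare_rule_sets(production, new_version))
```

Only the added and changed entries would then need SME attention during the MP Content Review, which keeps the review focused on what differs from the version already in production.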
TRIP provided both technical and process benefits to MSIT and the Microsoft enterprise.
From a process perspective, the iterative nature of TRIP and participation of stakeholders at all critical milestones meant that Monitoring and Service Owners better understood the alerts generated by Operations Manager 2007 R2 prior to and after implementation. Authors of the management pack were better able to understand the needs of Service Owners and Operations, and could incorporate that into subsequent management pack releases. Service Owners were able to effectively monitor the applications that they supported. The result was a process that enabled review, configuration, and reporting of management pack collection and alerting, prior to production release.
From a technical standpoint, the solution allowed more efficient use of the Operations Manager 2007 R2 monitoring environment. By measuring and tuning the alert volumes prior to production release, Monitoring resources were used for only desired alerts and collections. This helped to ensure that capacity was consumed only on collections that are valuable to collect, to react to, or both.
The TRIP process provided critical data that improved MSIT's implementation decisions. Examples are described next.
Infrastructure Capacity Analysis
By generating realistic alert volumes against sample production sets, Monitoring Operations teams were able to determine whether incoming management packs could be supported from a capacity perspective. This ability has helped MSIT identify management packs that, if introduced into the production environment, could have caused negative monitoring platform impacts. This possibly prevented a decline of monitoring services as a whole.
Workload Impact Analysis
By taking the same alert volume projections generated for infrastructure capacity analysis, Incident Management teams have been better able to understand the new workload resulting from a management pack. Based on the impact projections, MSIT has been able either to put additional resources in place or reject management packs based on volume predictions that were deemed unsupportable. Additional management pack tuning can be performed until it meets the acceptable criteria for release.
Precise Troubleshooting Content
Service Owners were able to review and modify the default product knowledge with company-specific content that more closely reflected the target environment. This allowed the Incident Management team to control the quality of the troubleshooting steps that their technicians follow to resolve underlying issues. By reviewing and augmenting this information, Incident Management was able to react to alerts and resolve issues more quickly.
By involving all of the organizations that support a service in the process, Microsoft IT created a flexible collection and alert tuning and implementation process. The purpose and intent of management packs is better understood by all parties in the process. The iterative nature of the process provides the desired level of alerts, information, and customized content to the variety of organizations that collectively support a service.
All TRIP process stakeholders have benefited from the solution and Microsoft IT has dramatically shortened its delivery timeframe for management pack implementation. Service Owners have a predictable process with clear milestones and timeframes. Incident Management has enhanced its level of service support, and gained value from troubleshooting guide actionability and volume analysis. Both of these features allow Incident Management team members to understand incoming issues from Service Owners, and ensure they can react effectively to alerting, based on the Service Owner's needs and guidance.
For More Information
For more information about Microsoft products or services, call the Microsoft Sales Information Center at (800) 426-9400. In Canada, call the Microsoft Canada Information Centre at (800) 563-9048. Outside the 50 United States and Canada, please contact your local Microsoft subsidiary. To access information via the World Wide Web, go to:
For information about the Management Pack Best Practice Kit, see http://www.microsoft.com/downloads/en/details.aspx?FamilyID=9104af8b-ff87-45a1-81cd-b73e6f6b51f0&displaylang=en (contained in the System Center Operations Manager 2007 R2 Authoring Resource Kit)
© 2011 Microsoft Corporation. All rights reserved.
This document is for informational purposes only. MICROSOFT MAKES NO WARRANTIES, EXPRESS OR IMPLIED, IN THIS SUMMARY. Microsoft, Windows, and Windows Server are either registered trademarks or trademarks of Microsoft Corporation in the United States and/or other countries. The names of actual companies and products mentioned herein may be the trademarks of their respective owners.