Impact of Failure in Operations Manager 2007

05/13/2011

Applies To: Operations Manager 2007 R2, Operations Manager 2007 SP1

Various Microsoft System Center Operations Manager 2007 servers and components can fail, and each failure affects Operations Manager functionality differently.

The amount of data and functionality lost during a failure is different in each failure scenario. It depends on the role of the failing component, on the Operations Manager deployment, on the length of time it takes to restore the failing component, and on the availability of backups.

Impact of Failure

The impact of failure is minimized if the Operations Manager deployment includes failover servers or clustering. Without clustering or failover management servers, the impact is greater because it takes longer to restore a failed component. The longer it takes to restore the functions that the failed component provided, the greater the risk of data loss and, when data loss does occur, the greater the amount of data lost. For more information about minimizing the impact of failure, see Reduce the Impact of Failure below.

In some failure scenarios, Operations Manager is able to continue to function properly for a short period of time without losing data. Then, after you repair the failing component, complete functionality is automatically restored without any further intervention.
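
The relationship between restore time and data loss can be illustrated with a rough model. The sketch below assumes that data accumulates in a fixed-size queue at a constant rate during an outage and that anything beyond the queue's capacity is lost. The function name, the rate, and the capacity are hypothetical values chosen for illustration; they are not Operations Manager settings.

```python
def data_lost_mb(outage_minutes, inflow_mb_per_minute, queue_capacity_mb):
    """Estimate data lost during an outage under a simple fixed-queue model.

    Assumption: data arrives at a constant rate and is lost only after
    the queue is full. Real behavior depends on the failed component.
    """
    generated = outage_minutes * inflow_mb_per_minute
    return max(0.0, generated - queue_capacity_mb)


# A short outage fits in the queue; a longer one overflows it.
short_outage = data_lost_mb(10, 1.0, 15.0)
long_outage = data_lost_mb(30, 1.0, 15.0)
```

Under these assumptions, an outage that ends before the queue fills loses nothing, which matches the case where functionality is restored without intervention. Every minute beyond that point adds to the loss.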

The following table lists the impact of failure of various Operations Manager components. In this table, the assumption is that each server listed performs only a single role, as specified.

Failed Component	Impact: Best-Case Scenario	Impact: Worst-Case Scenario
Management server	Workload on the remaining management servers in the management group is increased until the failed management server is restored.	Data is queued on managed computers and is not processed because agents are unable to send it to the management server. Agentless computers are not managed. Gateway servers cannot transfer data from agents to the management server.
Root management server	With at least one server in the cluster functioning, there is no impact.	Operations consoles and Web consoles are unable to connect and manage the configuration of the management group. Configuration management for the management group is unavailable. Connections to other management groups are unavailable. Cannot perform any Operations Manager administrative tasks, such as viewing, editing, or managing objects. The Operations console does not function. Connections to third-party management products using connectors are unavailable.
Operations Manager reporting server (OperationsManager database is intact)		Reports are not accessible. The Reporting view in the Operations console does not function.
OperationsManager database	If the OperationsManager database has been installed in a failover cluster, and as long as one of the cluster nodes is functioning, there is no impact. If log shipping is implemented, services might be reduced until the database is rebuilt.	Data from managed computers is not processed and is not stored in the database. This data might eventually be lost. Management servers start to queue data in their cache. When the cache is full, the management servers start to drop data, starting with performance data. Cannot perform any Operations Manager administrative tasks, such as viewing, editing, or managing objects. Reports do not contain up-to-date information. The Operations console does not function. Changes to management packs are not propagated to agents.
Data warehouse server (OperationsManagerDW database is intact)	With at least one server in the cluster functioning, there is no impact. If the OperationsManagerDW database has failed, clustering does not reduce the impact of failure. (See the next column for impact.)	Cannot view, edit, or manage reports. Cannot view, edit, or manage Audit Collection Services (ACS) reports if ACS is installed on the same computer that is running SQL Server.
Gateway server	With multiple gateway servers deployed, agents can fail over to another gateway server, and communication with management servers is not interrupted.	Up-to-date monitoring data is not available because agents and management servers cannot communicate.
Audit Collection Database	If the Audit Collection Database is intact, with at least one server in the cluster functioning, there is no impact. If the Audit Collection Database has failed, clustering does not reduce the impact of failure. (See the next column for impact.)	Security events are queued on managed computers and are not processed. This data might eventually be lost. Performance on managed computers is degraded because of the accumulated data. ACS reports do not contain up-to-date information.
Computer hosting the Operations console	Not applicable.	Cannot use the console on the failed computer.
ACS Collector Server	Not applicable.	Audit events from ACS Forwarders are not processed.
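
The worst-case behavior in the OperationsManager database row (management servers queue data in their cache and, when the cache is full, drop data starting with performance data) can be sketched as a bounded cache with a drop priority. This is a toy model, not the product's implementation: the class name, the item kinds, and the rule for what is dropped once no performance data remains are all assumptions made for illustration.

```python
class ManagementServerCache:
    """Toy model of a management server cache during a database outage."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.items = []    # (kind, payload) tuples in arrival order
        self.dropped = []  # items discarded to make room

    def enqueue(self, kind, payload):
        if len(self.items) >= self.capacity:
            # Drop performance data first, as the table describes.
            if not self._drop_oldest_of("performance"):
                # Assumption: with no performance data left,
                # discard the oldest item of any kind.
                self.dropped.append(self.items.pop(0))
        self.items.append((kind, payload))

    def _drop_oldest_of(self, kind):
        for index, item in enumerate(self.items):
            if item[0] == kind:
                self.dropped.append(self.items.pop(index))
                return True
        return False


cache = ManagementServerCache(capacity=3)
cache.enqueue("alert", "a1")
cache.enqueue("performance", "p1")
cache.enqueue("event", "e1")
cache.enqueue("alert", "a2")  # cache is full, so p1 is dropped first
```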

Reduce the Impact of Failure

The effects of some server failures can be reduced significantly by adding redundancy or implementing a failover solution, such as clustering. This also reduces the urgency of restoration.

The following list includes configuration options that add redundancy and clustering to the Operations Manager deployment. Implementing any of these options reduces the impact of failure and contributes to the high availability of Operations Manager in your organization:

Add management servers.
Install the root management server into a Cluster service failover cluster.
Place the databases in a Cluster service failover cluster.
Configure gateway servers for failover.
Configure log shipping.
Configure multihoming of agents across management groups.

Each option is further described below. For further information about deployment options that help ensure high availability and help reduce the impact of failure, see the Operations Manager 2007 Deployment Guide (https://go.microsoft.com/fwlink/?LinkId=93785).

Add Management Servers

Deploy more than one management server in a management group. This allows agents to fail over if a management server has failed.

If a management server has failed, the agents that report to that management server automatically start reporting to another management server in the same management group. After the failed management server is restored, agents can resume reporting to the original management server.
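
The failover behavior described above can be sketched as a simple selection rule. This is an illustration only: the function name, the server names, and the availability check are hypothetical, and the order in which a real agent tries failover servers is not specified here.

```python
def select_management_server(primary, failover_servers, is_available):
    """Return the server an agent should report to, or None if none is up.

    The primary is preferred whenever it is available, so an agent
    returns to it after it is restored.
    """
    if is_available(primary):
        return primary
    for server in failover_servers:
        if is_available(server):
            return server
    return None  # no server reachable; the agent queues data locally


# Hypothetical server names for illustration.
up = {"MS02", "MS03"}
during_outage = select_management_server("MS01", ["MS02", "MS03"], up.__contains__)

up.add("MS01")  # the failed management server is restored
after_restore = select_management_server("MS01", ["MS02", "MS03"], up.__contains__)
```

When no server is reachable, the function returns None, which corresponds to the worst case in the table: data is queued on the managed computer until a management server becomes available.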

If the root management server has failed, you can promote an existing management server to the root management server role. After the original root management server is restored, you can demote the temporary root management server and promote the restored server back to its original role.

Install the Root Management Server into a Cluster Service Failover Cluster

Install the root management server (RMS) into a Cluster service failover cluster. If a node in the RMS cluster fails, the RMS role moves to another cluster node. This allows the RMS to continue to function normally.

After you restore the failed cluster node, you can move the RMS back to the original node or leave it running on another node in the failover cluster.

Place Databases in a Cluster Service Failover Cluster

Place the OperationsManager, OperationsManagerDW, and OperationsManagerAC databases in a Cluster service failover cluster. As with the root management server cluster, if a node fails, the databases move to another node in the cluster and continue to function normally. If a database becomes corrupted, however, you may need to restore it from your most recent backup.

Configure Gateway Server Failover

Deploy multiple gateway servers to allow agents to fail over between gateway servers and to distribute the management workload.

Gateway servers can also be configured for failover between collection management servers in a management group if multiple collection management servers are available.

Configure Log Shipping

Log shipping maintains a copy of an Operations Manager database on a separate server running Microsoft SQL Server 2005 or SQL Server 2008. Log shipping keeps the copy of the database up to date by sending the transaction logs from the source database in the active management group to the destination database in the standby management group.
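
The idea behind log shipping can be illustrated with a small in-memory model. This sketch does not use SQL Server; the class and function names are hypothetical, and each "transaction" is reduced to a key and value pair. It shows only the principle: the standby replays log records it has not yet applied, so it matches the source as of the last shipment.

```python
class SourceDatabase:
    """Toy source database that records every write in a transaction log."""

    def __init__(self):
        self.rows = {}
        self.log = []  # ordered (key, value) records

    def write(self, key, value):
        self.rows[key] = value
        self.log.append((key, value))


class StandbyDatabase:
    """Toy standby copy that is updated only by replaying shipped logs."""

    def __init__(self):
        self.rows = {}
        self.applied = 0  # number of log records already replayed

    def restore(self, records):
        for key, value in records:
            self.rows[key] = value
        self.applied += len(records)


def ship_logs(source, standby):
    """Send log records the standby has not seen yet and replay them."""
    pending = source.log[standby.applied:]
    standby.restore(pending)
    return len(pending)


source, standby = SourceDatabase(), StandbyDatabase()
source.write("alert:1", "open")
source.write("alert:1", "closed")
shipped = ship_logs(source, standby)  # standby now matches the source
source.write("alert:2", "open")       # not on the standby until the next shipment
```

In this model, writes made after the most recent shipment exist only on the source, so a standby copy can lag behind the source by up to one shipping interval.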

If a database becomes corrupted, you can configure Operations Manager to temporarily use the standby database. After the original database is restored, you can reconfigure Operations Manager to use that database.