Causes of Failure
Even with the best maintenance practices, hardware might occasionally fail and data might become corrupted, interrupting MOM functionality. If you notice early signs of failure, respond immediately to minimize the impact of a possible failure.
The most common causes of failure, and tips for minimizing the risk of each, are listed below.
Disk Failure
A disk failure on a MOM server prevents that server from providing proper MOM functionality. The impact of a disk failure depends on which MOM server experiences the problem, and on your specific MOM deployment.
You can minimize the risk of a disk failure by using RAID arrays and by performing regular disk checks. This is especially important for critical MOM servers, such as the MOM database server, and for deployments where clustering is not implemented. For more information about minimizing the risk of a disk failure, see the "Preparing the hard disk drives" topic in the "Deploying MOM 2005 across Multiple Computers" chapter of the Microsoft Operations Manager 2005 Deployment Guide.
Security Breach/Virus Infection
A security breach or a virus infection on MOM servers can delete or corrupt data in the MOM databases or on the hard drives. In those scenarios, MOM stops operating properly, and the integrity of the data is no longer guaranteed. The impact of a security breach or a virus infection depends on which MOM server experiences the problem, and on your specific MOM deployment.
You can minimize the risk of a security breach or a virus infection by ensuring that the appropriate security policies are enforced. For information about MOM security, see the Microsoft Operations Manager 2005 Security Guide.
Data Corruption
Data can become corrupted for various reasons, such as software failure and human error. A MOM upgrade might also fail, leaving the newly upgraded database corrupted. If the changes that caused the corruption cannot be reversed, the only way to restore MOM functionality might be to restore the data from a backup, returning to a point before the corruption started.
Corruption or Loss of Account and NT Group Information
While trying to follow security recommendations, administrators might accidentally delete or corrupt NT group information or SQL Server login information. For example, an administrator might accidentally remove the DAS account from the MOM Users group. This prevents MOM from communicating with the MOM database, which stops all MOM functionality. In addition, changed passwords for important accounts such as DAS might be lost, resulting in a similar loss of functionality.
The Repair option of momserver.msi does not restore operating system and SQL Server account information.
MOM administrators should document changes made to MOM and SQL Server accounts, such as changed passwords or changes to default accounts (for example, using the Network Service account for DAS). Users added to the various MOM groups should also be documented, so that group membership can be re-created if it is lost. As mentioned earlier in this chapter, timely backup of the databases used by MOM permits restoration of SQL Server account information while minimizing loss. For MOM and SQL Server account information, see the Microsoft Operations Manager 2005 Security Guide. Additional information about MOM service and SQL Server accounts can be found in Chapters 2 and 4 of the Microsoft Operations Manager 2005 Operations Guide.
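One possible way to keep the change record described above is a small append-only log. The following Python sketch is an illustration only, not part of MOM: the function names, the field names, and the JSON Lines file format are all assumptions chosen for the example. It records that a change happened and who made it; it should never be used to store the passwords themselves.

```python
import json
from datetime import datetime, timezone
from pathlib import Path


def record_account_change(log_path, account, change, changed_by):
    """Append one account change to a JSON Lines log and return the entry.

    All field names here are hypothetical; adapt them to your own
    change-management conventions. Do not record password values.
    """
    entry = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "account": account,
        "change": change,
        "changed_by": changed_by,
    }
    # Open in append mode so earlier entries are never overwritten.
    with Path(log_path).open("a", encoding="utf-8") as log:
        log.write(json.dumps(entry) + "\n")
    return entry


def read_account_changes(log_path):
    """Return all recorded changes, oldest first (empty list if no log exists)."""
    path = Path(log_path)
    if not path.exists():
        return []
    with path.open(encoding="utf-8") as log:
        return [json.loads(line) for line in log if line.strip()]
```

For example, calling record_account_change("mom_account_changes.jsonl", "DAS", "password changed", "admin1") appends one entry, and read_account_changes returns the full history for review during a restore. Store the log file somewhere that is included in your regular backups.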
If you have made extensive changes to your MOM and SQL Server accounts, or have particular security requirements, you might want to contact Microsoft Product Support Services for assistance in restoring account information.
Physical Disaster
If a physical disaster such as a fire or flood occurs, MOM servers might be physically damaged, and part or all of the MOM data might be lost. Restoration in these cases is possible only if a backup of the data is still available, for example because it was previously stored off-site.
Physically protect MOM servers in the same way that you protect other key servers in your organization. You can minimize the impact of a physical disaster by maintaining management groups in different geographical locations. Implement alert forwarding and a multitiered structure in your MOM deployment, as appropriate. Also, ensure that you keep complete backups of MOM data in different physical locations.
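When backup files are copied to other locations, it can help to confirm that each copy is present and identical to the original. The following Python sketch shows one way to do that by comparing SHA-256 hashes. It is an illustration under stated assumptions: the function names are hypothetical, and the script assumes that the backup is a single file and that each off-site copy is reachable as a file path. A matching hash shows only that the copy is identical to the primary file, not that the backup itself can be restored, so periodic test restores are still needed.

```python
import hashlib
from pathlib import Path


def file_sha256(path, chunk_size=1024 * 1024):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with Path(path).open("rb") as handle:
        # Read fixed-size chunks so large backup files do not fill memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_backup_copies(primary, copies):
    """Return the copy paths that are missing or differ from the primary file."""
    expected = file_sha256(primary)
    failed = []
    for copy in copies:
        copy_path = Path(copy)
        if not copy_path.is_file() or file_sha256(copy_path) != expected:
            failed.append(str(copy))
    return failed
```

An empty list returned by verify_backup_copies means that every listed copy exists and matches the primary backup file; any path in the returned list should be copied again and rechecked.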