Appendix J: Maintenance and Monitoring Tasks

Published: September 1, 2004

Not all tasks are necessary on each site, and not all tasks are run on the same schedule. Follow best practices to determine how often to run each task. The following is a list of all maintenance tasks for Microsoft® Systems Management Server (SMS) sites.

On This Page

Check SMS Site Database Status
Check Site Server Status
Check Site Systems Status
Check Clients Status
Check the Operating System Event Log
Check the SQL Server Error Log
Check System Performance
Check SMS System Folders
Check Status Filter Rules
Delete Unnecessary Files
Delete Unnecessary SMS Objects
Produce and Distribute End-User Reports
Run Disk Defragmentation Tools
Back Up Application, Security, and System Event Logs
Check SMS Site Database Available Space
Check Available Disk Space
Verify That Scheduled Maintenance Tasks Are Running
Back Up Account Data
Change Accounts and Passwords
Check Network Performance
Review the Security Plan
Review the Maintenance Plan
Perform Recovery Tests in a Test Lab
Check Hardware
Check Site’s Overall Health
Check the Backup Snapshot

Check SMS Site Database Status

Use the Microsoft SQL Server™ DBCC command to check the health of the SMS site database. Use any other tools available to test the health of the SMS site database.

Check Site Server Status

View site status summary information in the SMS Administrator console, or create reports that summarize the server activity and status (such as the Clients that Received a Specific Advertised Program report). If necessary, check the status messages of individual components. If status messages indicate that a problem exists, view the relevant log files for further details. Isolate and fix conditions that generate errors or warnings. If appropriate, reconfigure the status system so that only relevant and helpful messages are recorded.

Check the status of items such as:

  • Site components and services. Check if any site server component or service is experiencing any problems.

  • Packages and advertisements. Check the status of packages and advertisements in your site. Check package and advertisement status messages to ensure that package source files reach distribution points, and that advertised programs reach clients. Check status messages that are returned from clients to see whether the clients run programs successfully or not.

  • Site-to-Site Communication. Check communication between the site and its parent and child sites (if they exist). Check status messages and, if necessary, check log files of the Replication Manager, Scheduler, and Senders on the site to determine whether the site is having communication problems.

Check Site Systems Status

Check the state of site systems throughout the SMS hierarchy. Use the status system and, if necessary, use log files to determine if site systems are having problems, such as:

  • Low level of available disk space.

  • SMS components that cannot connect with a site system.

It is especially important to check the status of management points.

Check Clients Status

Check the state of clients in the hierarchy. Run queries on status messages to detect any problems that clients might be having, such as:

  • Client components are experiencing problems.

  • Clients are failing to install.

  • Clients are not reporting software inventory or hardware inventory.

  • Clients are not reporting heartbeat discovery data regularly (or have not reported it for the past x days).

  • The client count is unexpectedly increasing or decreasing at a fast rate.

You can monitor a client’s status only if it creates status messages and these status messages reach the site server. If the client, the CAP, or the management point is experiencing problems that prevent status messages from reaching the site server, you will not be aware of those problems. To detect clients from which you are missing status messages, run a query that returns all clients that have not reported a status message within the last <n> days. In this query, <n> is the number of days within which you would expect to receive a status message from that client, taking into account the frequency of hardware or software inventory and the time that status messages normally take to reach the site server.
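The filtering logic behind such a query can be sketched in Python. This is a minimal illustration, not an SMS query: the function name, the client names, and the sample timestamps are all hypothetical, and a real check would read the last status message time for each client from the SMS site database.

```python
from datetime import datetime, timedelta


def clients_missing_status(last_status_by_client, n_days, now=None):
    """Return names of clients whose latest status message is older than n_days.

    last_status_by_client maps a client name to the datetime of its most
    recent status message, or None if no message was ever received.
    """
    now = now or datetime.now()
    cutoff = now - timedelta(days=n_days)
    return sorted(
        name
        for name, last_seen in last_status_by_client.items()
        if last_seen is None or last_seen < cutoff
    )


# Hypothetical sample data standing in for values read from the site database.
now = datetime(2004, 9, 1, 12, 0)
sample = {
    "CLIENT-A": datetime(2004, 8, 31, 9, 0),
    "CLIENT-B": datetime(2004, 8, 20, 9, 0),
    "CLIENT-C": None,
}
stale = clients_missing_status(sample, n_days=7, now=now)
print(stale)
```

Clients that have never reported are treated as stale here, because a client that cannot reach the site server at all is exactly the case this check is meant to catch.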

Check the Operating System Event Log

On key servers, check the application, system, and security event logs. You can access these logs through the Event Viewer administrative tool on Microsoft Windows® 2000 Server. Look for messages that indicate error conditions or developing problems. Isolate and repair the conditions that generate error or warning messages.

By default, an SMS site server is configured at installation to write status messages to the event log. This helps you identify any developing problems with SMS.

Note
When SMS is configured to write status messages to Windows event logs, SMS error status messages are written as Information events, not Error events.

Save instances of the most recent event log files for future comparison. When you can compare current log files with previous log files, it is easier to detect problems that are developing. After saving the log files, you can clear them from the event log so it is easier to detect new problems.

Check the SQL Server Error Log

Check the SQL Server error log in SQL Enterprise Manager. Look for messages that indicate error conditions. Isolate and repair the conditions that generate error or warning messages.

Check System Performance

To check whether the site server and component servers have sufficient resources and that SMS site services are running optimally, you must monitor site server and component server performance. Use performance-monitoring tools such as System Monitor in the Performance console on Windows 2000 Server. Check the status of critical components on the site server, on the computer running SQL Server, and other SMS site systems.

SMS installs many performance monitor counters, but you can add, remove and configure counters as needed. You can also use the SQL Server performance counters.

Save performance log files for future comparison. It is easier to detect performance trends, and to identify potential bottlenecks, when you compare current performance log files with previous performance log files.

For more information about SMS performance counters, see the SMS 2003 Performance Monitor Counters section in Appendix I, “Maintenance and Monitoring Resources.”

Check SMS System Folders

To confirm that files are being sent up the hierarchy and processed correctly, you need to check the SMS system folders.

Check for backlogs in the folders on site servers shown in Table J.1. For several of these folders, you can use the corresponding predefined SMS 2003 performance monitor counters to monitor their state.

Table J.1   Component Folder Locations and Contents

| Component/Feature | Folder | Content |
|---|---|---|
| Data Discovery Manager | SMS\inboxes\DDM.box | DDR files |
| Client Configuration Manager | SMS\inboxes\CCR.box | Configuration requests for client installation |
| Client Configuration Manager | SMS\inboxes\CCRretry.box | Configuration requests for client installation |
| Replication Manager | SMS\inboxes\replMgr.box\outbound | Replication requests |
| Scheduler | SMS\inboxes\schedule.box\tosend | Files waiting to be sent to another site |
| Inventory Processor | SMS\inboxes\inventry.box | Hardware inventory data files |
| Inventory Data Loader | SMS\inboxes\dataldr.box | MIF files that are being processed |
| Inventory Data Loader | SMS\inboxes\dataldr.box\process | MIF files waiting to be processed |
| Inventory Data Loader | SMS\inboxes\dataldr.box\process\BadMifs | Bad MIF files |
| Inventory Data Loader | SMS\inboxes\auth\dataldr.box\process\BadMifs | Bad MIF files |
| Inventory Data Loader | SMS\inboxes\dataldr.box\process\orphans | MIF files from orphan clients |
| Software Inventory Processor | SMS\inboxes\sinv.box\BadMifs | Bad software inventory files |
| Status System | SMS\inboxes\statmgr.box\queue | Status messages |
| Status System | SMS\inboxes\statmgr.box\statmsgs | Status messages |
| Status System | SMS\inboxes\statmgr.box\futureq | Status messages |
| Status System | SMS\inboxes\statmgr.box\retry | Status messages |
| Status System | SMS\inboxes\statmgr.box\outbound | Status messages |
| Status System | SMS\inboxes\compsumm.box | Status messages |
| Status System | SMS\inboxes\compsumm.box\repl | Status messages |

On client access points (CAPs), check for backlogs in the folders shown in Table J.2.

Table J.2   CAP Folder Locations and Contents

| Component/Feature | Folder | Content |
|---|---|---|
| Status messages | CAP_<site code>\statmsgs.box | Status message files |
| Hardware inventory | CAP_<site code>\inventry.box | Hardware inventory MIF files |
| Software inventory | CAP_<site code>\sinv.box | Software inventory files |
| Client configuration | CAP_<site code>\CCR.box | Client configuration change requests |
| Client discovery | CAP_<site code>\DDR.box | Data discovery files |

An unusual backlog in any of these folders means that data is not being updated in the SMS site database. If there is a backlog, you need to isolate and repair the problem.
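A backlog check of this kind can be scripted. The following Python sketch counts the files in a list of inbox folders and reports any folder that holds more than a chosen number of files. The function name and the threshold are assumptions made for illustration, and the demonstration runs against temporary folders that merely borrow the DDM.box and CCR.box names; what counts as an unusual backlog depends on the normal load at your site.

```python
import os
import tempfile


def count_backlog(folders, threshold):
    """Return {folder: file_count} for folders holding more than threshold files.

    Folders that do not exist are reported with a count of None so that a
    missing inbox is not mistaken for an empty one.
    """
    flagged = {}
    for folder in folders:
        if not os.path.isdir(folder):
            flagged[folder] = None
            continue
        count = sum(1 for entry in os.scandir(folder) if entry.is_file())
        if count > threshold:
            flagged[folder] = count
    return flagged


# Demonstration against temporary stand-in folders, not a real SMS installation.
with tempfile.TemporaryDirectory() as root:
    ddm = os.path.join(root, "inboxes", "DDM.box")
    ccr = os.path.join(root, "inboxes", "CCR.box")
    os.makedirs(ddm)
    os.makedirs(ccr)
    for i in range(3):
        with open(os.path.join(ddm, f"sample{i}.ddr"), "w") as handle:
            handle.write("placeholder")
    backlog = count_backlog([ddm, ccr], threshold=2)
    backlog_counts = {os.path.basename(p): n for p, n in backlog.items()}

print(backlog_counts)
```

A single count says little on its own. Recording the counts at regular intervals makes it easier to tell a temporary spike from a folder that is no longer being processed.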

Check Status Filter Rules

Check whether it is possible to reduce the amount of traffic generated by status messages being replicated throughout the hierarchy. If the site is currently healthy, it might be possible to add status filter rules that prevent the replication of status messages that are not necessary.

Delete Unnecessary Files

If Management Information Format (MIF) or IDMIF files are used to extend hardware inventory in your site, all invalid IDMIF files are placed in the SMS\inboxes\auth\dataldr.box\process\BadMifs folder, and all invalid NOIDMIF files are placed in the SMS\inboxes\dataldr.box\process\BadMifs folder. Because SMS never removes those files, you need to empty these folders manually. If a large number of MIF files are placed in a BadMifs folder, it is likely that a MIF-generating tool is producing MIF files in an incorrect format. Investigate and repair the cause of the bad MIF files.
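One way to make this cleanup repeatable is to list the files that have been in a BadMifs folder longer than a chosen number of days, review the list, and only then delete them. The Python sketch below illustrates that approach under stated assumptions: the function names and the 14-day age limit are hypothetical, and the demonstration uses a temporary folder rather than a real SMS inbox. Keep recent files until you have investigated why they were rejected.

```python
import os
import tempfile
import time


def files_older_than(folder, max_age_days, now=None):
    """Return sorted paths of files in folder last modified more than max_age_days ago."""
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400
    return sorted(
        entry.path
        for entry in os.scandir(folder)
        if entry.is_file() and entry.stat().st_mtime < cutoff
    )


def purge(paths, dry_run=True):
    """Delete the given files unless dry_run is set; return the number removed."""
    removed = 0
    for path in paths:
        if not dry_run:
            os.remove(path)
            removed += 1
    return removed


# Demonstration in a temporary folder that stands in for a BadMifs folder.
with tempfile.TemporaryDirectory() as bad_mifs:
    old_file = os.path.join(bad_mifs, "old.mif")
    new_file = os.path.join(bad_mifs, "new.mif")
    for path in (old_file, new_file):
        open(path, "w").close()
    thirty_days_ago = time.time() - 30 * 86400
    os.utime(old_file, (thirty_days_ago, thirty_days_ago))  # age one file artificially

    candidates = files_older_than(bad_mifs, max_age_days=14)
    candidate_names = [os.path.basename(p) for p in candidates]
    removed_count = purge(candidates, dry_run=False)
    remaining = sorted(os.listdir(bad_mifs))

print(candidate_names, removed_count, remaining)
```

The `dry_run` default means that nothing is deleted unless the caller asks for it explicitly, which leaves room to inspect the candidate list first.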

Delete Unnecessary SMS Objects

Delete objects, such as collections, queries, and packages that are no longer needed at the site. Deleting unnecessary objects saves disk space, reduces intersite replications, and increases performance.

Caution
When you delete a collection, any advertisements to that collection are also deleted.

Produce and Distribute End-User Reports

SMS provides a wide range of pre-defined reports, and also allows you to build custom reports. Produce reports as needed and distribute them to management and administrators in your organization as required.

For information about SMS reports, see Chapter 11, “Creating Reports,” in the Microsoft Systems Management Server 2003 Operations Guide.

Run Disk Defragmentation Tools

Over time, disk volumes on SMS site systems become fragmented. Site operations such as distributing large software packages might significantly increase fragmentation on site servers and distribution points. As fragmentation increases, disk operations take longer, and overall site performance decreases.

Run disk defragmentation tools on the SMS site server and all other site systems to maintain the performance level of disk operations.

Back Up Application, Security, and System Event Logs

Windows event logs can get full, and by default, new items will start to overwrite older items. To diagnose problems, and for other reasons, it might be necessary to refer to an older event log. It is recommended that you back up Windows event logs, and store the backups in a safe and accessible location. If necessary, increase the default log file size to accommodate larger amounts of data.

Check SMS Site Database Available Space

To find the amount of space used by database devices, run the SQL Server stored procedure sp_spaceused against the SMS site database. For more details about sp_spaceused, see the SQL Server Help. Check the tempdb device at peak usage, when several instances of the SMS Administrator console are using the database and the site is actively processing objects.
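When it is run without arguments, sp_spaceused reports the database size and the unallocated space as text values such as "2000.00 MB". The Python sketch below shows one way to turn two such values into a rough free-space ratio. The sample values, the function names, and the 20 percent warning level are assumptions for illustration only. The ratio is approximate, because the reported database size also includes the log files.

```python
def parse_size_mb(text):
    """Convert a size string such as '2000.00 MB' or '512 KB' to megabytes."""
    value, unit = text.split()
    factors = {"KB": 1 / 1024, "MB": 1.0, "GB": 1024.0}
    return float(value) * factors[unit.upper()]


def unallocated_fraction(database_size, unallocated_space):
    """Return unallocated space as a fraction of the reported database size."""
    return parse_size_mb(unallocated_space) / parse_size_mb(database_size)


# Hypothetical values copied by hand from sp_spaceused output.
fraction = unallocated_fraction("2000.00 MB", "250.00 MB")
needs_attention = fraction < 0.20  # assumed warning level, adjust for your site

print(f"{fraction:.1%} unallocated, needs attention: {needs_attention}")
```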

Check Available Disk Space

Check the amount of available disk space on the site server, the SMS site database server, and other SMS servers. Ensure that the amount of free disk space is sufficient for SMS and SQL Server to perform properly during regular and increased activity load.
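For a quick check on a single server, free disk space can also be read with a short script. The Python sketch below uses the standard library function shutil.disk_usage to report the free fraction of the volume that holds each given path. The function name and the 15 percent minimum are assumptions for illustration. The script would need to be run on each server, or pointed at paths that the server running it can reach.

```python
import os
import shutil


def free_space_report(paths, min_free_fraction):
    """Return a list of (path, free_fraction, ok) tuples, one per path."""
    report = []
    for path in paths:
        usage = shutil.disk_usage(path)
        free_fraction = usage.free / usage.total if usage.total else 0.0
        report.append((path, free_fraction, free_fraction >= min_free_fraction))
    return report


# Check the volume that holds the current working directory.
report = free_space_report([os.getcwd()], min_free_fraction=0.15)
for path, fraction, ok in report:
    print(f"{path}: {fraction:.1%} free, {'OK' if ok else 'LOW'}")
```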

To use the Status System to view information about site system disk space

  1. In the SMS Administrator console, navigate to Site System Status.

Systems Management Server > Site Database > System Status > Site Status > <site name> > Site System Status

  2. In the details pane, view status information of site systems such as free disk space.

Verify That Scheduled Maintenance Tasks Are Running

After you schedule maintenance tasks, check the SMSdbmon.log file to ensure that the tasks are running as scheduled. If maintenance tasks are not running, the database will grow indefinitely. The increased size of the database affects operation and performance of the site.

Back Up Account Data

To properly recover a site server, you must have information about the accounts that SMS used before the site failed. Account data is stored in domain controllers.

Use Microsoft tools, such as the NTBackup.exe tool that comes with Windows 2000 Server, or third-party tools to back up account data as follows:

  • If there are multiple domain controllers in your infrastructure that contain the SMS account database, you need to periodically back up the account database. (If Active Directory® directory service is implemented in your organization, then such a task might be included in the Active Directory maintenance plan.)

  • If the account database is stored on a single domain controller, then back up the account database frequently. Depending on the frequency of changes to account data, you might need to add this task to the site’s daily or weekly maintenance tasks.

  • If the account data is stored on member servers, then regularly back up the whole operating system that contains the account data, using software that backs up account lists and the account database.

  • Whenever there is a change to the password of the Client Push Installation account or to the site system connection accounts, you should note that change. For security reasons, SMS 2003 encrypts the Client Push Installation account and the site system connection accounts. You need to be able to retrieve these accounts’ passwords so that you can re-enter them during a site recovery operation.

In between account database backups, document any changes to accounts. Write down and save any changes made to SMS accounts and share rights so that you can apply those changes again after recovering the site.

Change Accounts and Passwords

To maintain the level of security in your hierarchy, you must periodically change the passwords and the accounts that SMS sites use. Report any changes to the security staff so that security administrators know that these changes are planned and authorized.

To develop an effective security maintenance plan for your SMS hierarchy, you must thoroughly understand how security is deployed in your hierarchy and make the following decisions:

  • Which accounts need to be changed, and for which accounts is it sufficient to change only the password.

  • How often to change passwords and accounts.

  • How to change passwords and accounts (such as by running SMS site reset).

  • Which accounts cannot be configured by the administrator (either the account name cannot be changed, or the password cannot be manually modified).

    Note
    After changing passwords or accounts that SMS sites use, you must update the backup of the account database. Follow your SMS backup plan to initiate an immediate backup of the account database.

For more information about risk assessment and maintaining security in your hierarchy, see Chapter 5, “Understanding SMS Security,” in the Microsoft Systems Management Server 2003 Concepts, Planning, and Deployment Guide.

Check Network Performance

Check the available bandwidth and error rates on the networks used by the SMS hierarchy. Use Network Monitor to capture and analyze network frames so you can diagnose network problems and look for optimization opportunities.

For more information about using Network Monitor, see Chapter 10, “Maintaining and Monitoring the Network,” in the Microsoft Systems Management Server 2003 Operations Guide.

Review the Security Plan

SMS evolves with time. User roles change, and people might no longer need access to some or any of the SMS functions. Although most changes in access permission should be implemented after role or staff changes, you should also periodically review the access for all users or groups to identify and delete unauthorized access permissions.

The security plan implemented for the SMS hierarchy in your organization needs to support the risk assessment of your organization. As your organization changes, policies can become ineffective.

Review security-related settings such as:

  • Who has access to SQL Server and to the SMS site database.

  • Who can download from SMS distribution points.

  • Which accounts have permissions within SMS security.

Periodically re-evaluate the risk assessment of your organization, and then review and update the security plan accordingly.

For more information about risk assessment and SMS security issues, see Chapter 5, “Understanding SMS Security,” in the Microsoft Systems Management Server 2003 Concepts, Planning, and Deployment Guide.

Review the Maintenance Plan

Use the maintenance plan document to review the SMS maintenance plan. SMS evolves with time, and it might be necessary to adjust the maintenance plan to accommodate growth, development, and other changes in your organization.

If there were any changes in your organization’s security strategy, backup and recovery strategy, or any other strategy that affects SMS, then determine if the maintenance plan needs to be adjusted to reflect these changes.

Review maintenance tasks configuration. Check the amount of data in the site database and evaluate the usefulness of that data against the amount of space that it occupies in the database. If necessary, adjust the settings that determine the number of days that data is retained in the database.

Update the maintenance plan document to reflect any changes to the maintenance plan, and then distribute it to all SMS administrators who are using it.

Perform Recovery Tests in a Test Lab

The best way to be fully prepared for a site recovery operation is to ensure that the recovery plan is adequate and that administrators are familiar with the recovery process. After you develop a recovery plan for your site, and prepare a test lab for recovery tests, it is recommended that you perform periodic recovery tests in the test lab.  

A recovery test should follow the recovery plan developed for the production environment. Plan to perform a recovery test of the central site, and of any other systems deployed in your hierarchy. A recovery test should test all phases of recovery, including:

  • Backing up a site.

  • Archiving the backup snapshot.

  • Simulating a site failure, such as by turning a server off.

  • Recovering the failed site.

  • Verifying the success of the recovery operation.

You might schedule periodic recovery tests. Company policy might require that new administrators always perform a recovery test. It is strongly recommended that you always include a recovery test when testing major changes to the hierarchy.

For example, before upgrading site server operating systems, you should probably first test the upgrade in the test lab. After completing the upgrade in the test lab, you should perform a recovery test to identify any issues or adjustments to the recovery plan associated with the operating system upgrade. This ensures that if you upgrade the servers in the production environment, you will still be able to successfully recover a failed site.

Include a recovery test in every major deployment test, such as:

  • A major operating system upgrade (not service pack).

  • A major change to the networking infrastructure.

  • New equipment deployment or building relocation.

  • An SMS major version site upgrade.

Check Hardware

Even high-quality hardware occasionally fails. Sometimes, it fails gradually, so there might be early signs. Replacing hardware before it completely fails is a key step in preventing site failure. Both Windows and SMS provide performance counters, which you can use to monitor the performance and state of the hardware used in the site.

As soon as you notice any signs of unreliable hardware behavior, especially on an SMS site server, replace the affected computer. To properly replace the site server computer, you must use the Recovery Expert.

Check Site’s Overall Health

It is recommended that you periodically perform a more thorough health check, as follows:

  • Ensure that all SMS services are running.

  • Review the Status Message System for Critical status.

  • Ensure that all the latest service packs are installed.

  • Ensure that the latest critical security patches are installed.

  • Examine the System and Application Event logs for errors.

    Note
    When SMS is configured to write status messages to the system’s event log, SMS error status messages are written as information events, not error events.

  • Run a query to determine if discovery data is being updated correctly in the SMS site database. The query should list all installed clients in which System Resource - Agent Time is not within the heartbeat interval. It is expected that some clients might be offline, but in other cases, it might indicate a problem.

  • Run a query to determine if software inventory data is being updated correctly in the SMS site database. The query should list all installed clients in which Last Software Scan - Last Inventory Collection is not within the software inventory interval. It is expected that some clients might be offline, but in other cases, it might indicate a problem.

  • Run a query to determine if hardware inventory data is being updated correctly in the SMS site database. The query should list all installed clients in which Workstation Status - Last Hardware Scan is not within the hardware inventory interval. It is expected that some clients might be offline, but in other cases, it might indicate a problem.

If any of these tests fail, you need to diagnose the problem and repair it.

Check the Backup Snapshot

At the end of every site backup cycle you should check the validity of the backup snapshot. Periodically, you should perform a more thorough check to ensure that the site’s backup snapshots can be successfully used for recovery.

Restore a recent backup snapshot to a disk and examine file continuity, file size, and other file properties to ensure that they do not seem corrupted. Check critical files by restoring these files to their respective applications to ensure that the application can use the restored file.
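Part of this check can be automated by comparing the restored files with the snapshot they came from. The Python sketch below compares file size and a SHA-256 hash for every file in a reference folder against a restored copy and lists files that are missing or different. It assumes that the original snapshot folder is still available as a reference, and the function names and sample files are hypothetical. It does not replace opening critical files in their own applications.

```python
import hashlib
import os
import tempfile


def file_fingerprint(path, chunk_size=65536):
    """Return (size_in_bytes, sha256_hex) for a file, reading it in chunks."""
    digest = hashlib.sha256()
    size = 0
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
            size += len(chunk)
    return size, digest.hexdigest()


def compare_trees(reference_dir, restored_dir):
    """Return sorted (relative_path, problem) pairs for missing or differing files."""
    problems = []
    for folder, _subdirs, names in os.walk(reference_dir):
        for name in names:
            ref_path = os.path.join(folder, name)
            rel_path = os.path.relpath(ref_path, reference_dir)
            restored_path = os.path.join(restored_dir, rel_path)
            if not os.path.isfile(restored_path):
                problems.append((rel_path, "missing"))
            elif file_fingerprint(ref_path) != file_fingerprint(restored_path):
                problems.append((rel_path, "differs"))
    return sorted(problems)


# Demonstration with temporary folders standing in for a snapshot and its restore.
with tempfile.TemporaryDirectory() as snapshot, tempfile.TemporaryDirectory() as restored:
    for name, data in {"a.dat": b"same", "b.dat": b"original", "c.dat": b"lost"}.items():
        with open(os.path.join(snapshot, name), "wb") as handle:
            handle.write(data)
    with open(os.path.join(restored, "a.dat"), "wb") as handle:
        handle.write(b"same")
    with open(os.path.join(restored, "b.dat"), "wb") as handle:
        handle.write(b"damaged!")  # same length as the original, different content
    problems = compare_trees(snapshot, restored)

print(problems)
```

Comparing a hash as well as the size catches files such as b.dat in the demonstration, whose length is unchanged but whose content is not.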