Assess MOM Database and Management Server Activity

If there is a general alert latency problem, notifications that the server queue is full, or slow performance on either the database or Management Server, conduct the following assessment.

Verify that the OnePoint database is available

Check the following on the database server:

  • The Microsoft SQL Server instance for the OnePoint database is started.

  • Look in the Application log for any database failures, such as:

    • The OnePoint database is full.

    • The OnePoint log is full.

    • The tempDB is full.

Verify that the Management Server is functioning normally

Check the following on the Management Server:

  • Look in the Application log to see if there are any errors that indicate that the Management Server is not able to contact the database. Confirm that:

    • The Management Server can communicate with the database server.

    • There are no permission issues between the Data Access Server (DAS) and the database.

  • See if the MOMService process is restarting frequently. This process restarts if its private bytes exceed 300 MB on the server. This can occur when the number of agents approaches the supported limit, or if there are several Management Packs installed. To change this limit, edit the value of the HKEY_LOCAL_MACHINE\SOFTWARE\Mission Critical Software\OnePoint\MaxServerPrivateBytes registry key.

  • See if the server queue is filling up frequently. If so, conduct the analysis in "Server queue assessment".

  • See if other applications are consuming too many resources on the Management Server, and the SQL instance for the OnePoint database on the database server.
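The 300 MB restart threshold above can be turned into a quick headroom check. The following is a minimal sketch, assuming you have already read a private bytes sample for the MOMService process from a performance log. The function name and the convention that 1 MB equals 2**20 bytes are assumptions made for illustration, not part of MOM.

```python
MB = 1024 * 1024  # assumption: 1 MB is taken as 2**20 bytes


def private_bytes_headroom_mb(private_bytes, limit_mb=300):
    """Return the MB remaining before the restart limit (negative if exceeded).

    private_bytes: a sampled private bytes value, in bytes.
    limit_mb: the restart limit; 300 MB is the default quoted in the text.
    """
    return limit_mb - private_bytes / MB


# Example: a process at 250 MB has 50 MB of headroom left.
print(private_bytes_headroom_mb(250 * MB))
```

A headroom value that repeatedly approaches zero before dropping back suggests the process is cycling through restarts.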

Server queue assessment

Check the following indicators to assess the server queue.

High CPU utilization on the database server

If CPU utilization is above 80% on the database server:

  • Check if the performance counter \MOM Server(*)\DB Disc Simple Count is greater than zero for more than a few minutes. If it is, this indicates that service data was received from a large number of agents.

  • Check your service discovery rules to see what scripts were synchronized to run recently, to see if there is a valid reason for discovery data to change on all of the agents.

    Note

    If discovery data is filling the queue, this situation should resolve itself within 1-4 hours. The length of time depends on the number of agents on which data was changed, and what Management Pack the data changed for. For example, the IIS service discovery packet is larger than the Windows Base Operating System service discovery packet.

  • If the discovery simple count is 0, but the performance counter \MOM Server(*)\DB Alert Simple Count is greater than zero for more than a few minutes, the system may be experiencing an alert storm. Use the Operator console to view the alerts that are coming in, to see if the number of alerts is much higher than usual.

  • Check to see if there are several Operator consoles in the management group that are currently refreshing views. If this is the case, check to see if any queries are taking longer than expected to return data to the consoles.

  • If the Reporting database is installed on the same computer as the operational database, check to see if the reporting DTS job is running. If it is, investigate to see if the job is taking longer than usual to complete.

  • Check to see if there are other SQL Server jobs running at the time that CPU usage is high.

  • If you still haven't identified the issue, use the SQL Profiler to see what queries are running either with high CPU usage, or for a long time.
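The first checks in this list (discovery count first, then alert count) can be sketched as a small triage routine. This is an illustration only: it assumes you have exported evenly spaced samples of the two counters to lists, and it reads "more than a few minutes" as five minutes, which is an assumption rather than a documented threshold. The function names are hypothetical.

```python
import math


def sustained_above_zero(samples, interval_seconds, min_minutes=5):
    """Return True if the counter stayed above zero for min_minutes in a row.

    samples: counter values taken every interval_seconds.
    min_minutes: assumed meaning of "more than a few minutes".
    """
    needed = max(1, math.ceil(min_minutes * 60 / interval_seconds))
    run = 0
    for value in samples:
        run = run + 1 if value > 0 else 0
        if run >= needed:
            return True
    return False


def classify_db_cpu_cause(disc_counts, alert_counts, interval_seconds=60):
    """Apply the discovery check first, then the alert check."""
    if sustained_above_zero(disc_counts, interval_seconds):
        return "discovery data"
    if sustained_above_zero(alert_counts, interval_seconds):
        return "possible alert storm"
    return "other"


# Example: discovery count is flat at zero while alerts stay above zero.
print(classify_db_cpu_cause([0] * 10, [3] * 6))
```

A result of "other" means neither counter explains the load, so the remaining checks in the list (Operator consoles, the reporting DTS job, other SQL Server jobs, SQL Profiler) still apply.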

Activity on the disk where the OnePoint database resides is high

If disk idle time is less than 20%:

  • Check the size of the current SampledNumericData partition using Enterprise Manager. Find the table name where Current=1 in the PartitionTables table, and then check the size of that SampledNumericData (SND) table. If the table holds more than 5 million rows for every 1 GB of memory that SQL Server can use, check to see if it recovers soon after the next partitioning job.

    If you see this pattern every night, you may have to add more memory to the database server, and ensure that SQL Server is using the extra memory. Another option is to reduce your performance data load.

  • Repeat the preceding process with the current Event partition. If this is the problem area, then you may have to add more memory to the database server and ensure that SQL Server is using the extra memory. Another option is to reduce your performance data load.

  • Check to see if there are any SQL jobs, in particular the Re-index job, running at the time that disk activity is high.

  • Check to see if the reporting DTS job is running. If it is, see if it is running for longer than usual.

  • If you still haven't identified the issue, use the SQL Profiler to see what queries are running with high disk activity.
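The partition size guideline above is simple arithmetic, sketched below. The sketch reads the 5 million figure as a row count per 1 GB of memory available to SQL Server, which is an assumption about the unit. The function name is hypothetical, and the row count is expected to come from your own inspection of the current partition table.

```python
ROWS_PER_GB = 5_000_000  # guideline from the text, read here as rows per GB


def partition_over_guideline(row_count, sql_memory_gb):
    """Return True if the current partition exceeds the size guideline.

    row_count: rows in the current partition table.
    sql_memory_gb: memory, in GB, that SQL Server can use.
    """
    return row_count > ROWS_PER_GB * sql_memory_gb


# Example: with 2 GB for SQL Server the guideline is 10 million rows,
# so a 12-million-row partition is over it.
print(partition_over_guideline(12_000_000, 2))
```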

High CPU utilization by the MOMService process

If the MOMService process CPU usage is over 80% on the Management Server:

  • Repeat the steps used to check the discovery simple count ("High CPU utilization on the database server").

Activity on the Management Server disk is high

If the idle time is less than 20% on the Management Server:

  • Make sure MOM 2005 RTM is installed. MOM 2005 RC had an issue with disk utilization on the Management Server.

Server queue filling up from time to time

If resource consumption is not high, but the server queue is filling up from time to time, check the following patterns:

  • If the server queue is filling up every 15 minutes, it could be the performance counter collection. Check the database disk idle time to see if there is a spike that corresponds to the queue filling up. If this is the case, there is a disk bottleneck and either faster or additional disks are required.

  • If the server queue is filling up towards the end of the day, and it recovers at midnight, there is probably a high volume of performance data or events. There should be corresponding high disk activity on the database server.

  • If the pattern for the server queue filling up is periodic, look for SQL jobs running at those times.

  • If the server queue is constantly at 100%, see if the server queue simple count is constant. If so, make sure MOM 2005 RTM is installed. MOM 2005 RC had an issue with the server queue getting deadlocked.

  • If there is no pattern for the server queue filling up, enable tracing and check mc8 logs on the server. Look for errors that correspond to the queue filling up.
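To tell a regular pattern from no pattern, it can help to compute the gaps between the times at which the queue filled up. The following is a minimal sketch, assuming you have collected those times (for example, from queue-full notifications) as datetime values. The function names and the 2-minute tolerance are assumptions made for illustration.

```python
from datetime import datetime, timedelta


def fill_gaps_minutes(fill_times):
    """Return the gaps, in minutes, between consecutive queue-full times."""
    ordered = sorted(fill_times)
    return [(b - a).total_seconds() / 60 for a, b in zip(ordered, ordered[1:])]


def looks_periodic(fill_times, period_minutes=15, tolerance_minutes=2):
    """Return True if every gap is within tolerance of period_minutes."""
    gaps = fill_gaps_minutes(fill_times)
    return bool(gaps) and all(
        abs(gap - period_minutes) <= tolerance_minutes for gap in gaps
    )


# Example: four fill events spaced exactly 15 minutes apart.
start = datetime(2005, 1, 1, 8, 0)
events = [start + timedelta(minutes=15 * i) for i in range(4)]
print(looks_periodic(events))
```

Passing a different period_minutes lets the same check be reused for other suspected schedules, such as the interval of a recurring SQL job.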

If there are no resource bottlenecks on the database server or the Management Server, and the \MOM Server(*)\Queue Space Percent Used counter never exceeds 10% for more than a few minutes, but alert latency is still high, then the cause of the latency is likely the agent.