Mapping Requirements to a Design for Operations Manager 2007

Applies To: Operations Manager 2007 R2, Operations Manager 2007 SP1

Mapping Requirements to a Design

In the previous section, you completed the following three tasks:

  • You gathered the business requirements, which help you plan which features of Operations Manager to implement.

  • You gathered the IT requirements, which help you plan the management group topology.

  • You inventoried how your company currently performs monitoring, which helps you plan how to configure Operations Manager.

This section guides you through the design decisions that map the information and knowledge you have collected to an actual design. It does this by applying best practices for sizing and capacity planning for server roles and components, including Audit Collection Services (ACS), management servers, the root management server (RMS), agentless exception monitoring (AEM), gateway servers, and collective client monitoring.

Management Group Design

All Operations Manager 2007 implementations consist of at least one management group, and given the scalability of Operations Manager 2007, for some implementations, a single management group might be all that is needed. Depending on the requirements of the company, additional management groups might be needed immediately or might be added over time. The process of distributing Operations Manager services among multiple management groups is called partitioning.

This section addresses the general criteria that would necessitate multiple management groups. Planning the composition of individual management groups, such as determining the sizing of servers and distribution of Operations Manager roles among servers in a management group, is covered in the "Management Group Composition" section.

One Management Group

Approach your Operations Manager management group planning with the same mindset as you have with Active Directory domain planning: start with one management group, and add on as necessary. A single Operations Manager 2007 R2 management group can scale along the following recommended limits:

  • 3,000 agents reporting to a management server.

  • Most scalability, redundancy, and disaster recovery requirements can be met by using from three to five management servers in a management group.

  • 50 Operations consoles open simultaneously.

  • 1,500 agents reporting to a gateway server.

  • 25,000 Agentless Exception Monitoring (AEM) machines reporting to a dedicated management server.

  • 100,000 AEM machines reporting to a dedicated management group.

  • 2,500 Collective Monitoring agents reporting to a management server.

  • 10,000 Collective Monitoring agents reporting to a management group.

  • 6,000 total agents and UNIX or Linux computers per management group with 50 open consoles.

  • 10,000 total agents and UNIX or Linux computers per management group with 25 open consoles.

  • 500 UNIX or Linux computers monitored per dedicated management server.

  • 100 UNIX or Linux computers monitored per dedicated gateway server.

  • 3,000 URLs monitored per dedicated management server.

  • 12,000 URLs monitored per dedicated management group.

  • 50 URLs monitored per agent.

For the corresponding recommended limits for Operations Manager 2007 SP1, see the Operations Manager 2007 SP1 documentation.

When you consider these limits in conjunction with the security scopes offered through the use of Operations Manager roles to control access to data in the Operations console, a single management group is very scalable and will suffice in many situations.

Multiple Management Groups and Partitioning

As scalable as a management group is, if your requirements include any of the following scenarios, you will need more than one management group:

  • Production and Pre-Production Functionality—In Operations Manager, it is a best practice to have a production implementation that is used for monitoring your production applications and a pre-production implementation that has minimal interaction with the production environment. The pre-production management group is used for testing and tuning management pack functionality before it is migrated into the production environment. In addition, some companies employ a staging environment for servers where newly built servers are placed for a burn-in period prior to being placed into production. The pre-production management group can be used to monitor the staging environment to ensure the health of servers prior to production rollout.

  • Dedicated ACS Functionality—If your requirements include the need to collect the Windows Audit Security log events, you will be implementing the Audit Collection Service (ACS). It might be beneficial to implement a management group that exclusively supports the ACS function if your company's security requirements mandate that the ACS function be controlled and administered by a separate administrative group other than that which administers the rest of the production environment.

  • Disaster Recovery Functionality—In Operations Manager 2007, all interactions with the OperationsManager database are recorded in transaction logs before they are committed to the database. Those transaction logs can be sent to another server running Microsoft SQL Server 2005 SP1 or later or Microsoft SQL Server 2008 SP1 and committed to a copy of the OperationsManager database there. This technique is called log shipping. The failover location must contain the failover SQL Server that receives the shipped logs and at least one management server that is a member of the source management group. If it becomes necessary to fail over, you edit the registry on the management server in the failover location to point it to the failover SQL Server, restart the System Center Management service, and then promote that management server to the RMS role. To complete the failover and return the management group to full functionality, you then change the registry on all the remaining management servers in the management group to point to the failover SQL Server and restart the System Center Management service on each management server (a minimal sketch of this registry change appears after this list).

  • Increased Capacity—Operations Manager 2007 has no built-in limits regarding the number of agents that a single management group can support. Depending on the hardware that you use and the monitoring load (more management packs deployed means a higher monitoring load) on the management group, you might need multiple management groups in order to maintain acceptable performance.

  • Consolidated Views—When multiple management groups are used to monitor an environment, a mechanism is needed to provide a consolidated view of the monitoring and alerting data from them. This can be accomplished by deploying an additional management group (which might or might not have any monitoring responsibilities) that has access to all the data in all other management groups. These management groups are then said to be connected. The management group that is used to provide a consolidated view of the data is called the Local Management Group, and the others that provide data to it are called Connected Management Groups.

  • Installed Languages—All servers that have an Operations Manager server role installed on them must be installed in the same language. That is to say that you cannot install the RMS using the English version of Operations Manager 2007 and then deploy the Operations console using the Japanese version. If the monitoring needs to span multiple languages, additional management groups will be needed for each language of the operators.

  • Security and Administrative—Partitioning management groups for security and administrative reasons is very similar in concept to the delegation of administrative authority over Active Directory Organizational Units or Domains to different administrative groups. Your company might include multiple IT groups, each with their own area of responsibility. The area might be a certain geographical area or business division. For example, in the case of a holding company, it can be one of the subsidiary companies. Where this type of full delegation of administrative authority from the centralized IT group exists, it might be useful to implement management group structures in each of the areas. Then they can be configured as Connected management groups to a Local management group that resides in the centralized IT data center.
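For illustration of the disaster recovery step described above, the following is a minimal Python sketch of the registry change and service restart on a failover management server. The registry key path, the DatabaseServerName value name, and the server name are assumptions based on a typical Operations Manager 2007 installation rather than details taken from this document; verify them in your environment before using anything like this.

  import subprocess
  import winreg

  # Assumed registry location for the Operations Manager 2007 database pointer;
  # verify the key path and value name on your own management servers.
  SETUP_KEY = r"SOFTWARE\Microsoft\Microsoft Operations Manager\3.0\Setup"
  FAILOVER_SQL = r"FAILOVERSQL\INSTANCE1"  # hypothetical failover SQL Server name

  # Point this management server at the failover SQL Server.
  with winreg.OpenKey(winreg.HKEY_LOCAL_MACHINE, SETUP_KEY, 0, winreg.KEY_SET_VALUE) as key:
      winreg.SetValueEx(key, "DatabaseServerName", 0, winreg.REG_SZ, FAILOVER_SQL)

  # Restart the System Center Management service (service name: HealthService)
  # so that the new database server setting takes effect.
  subprocess.run(["net", "stop", "HealthService"], check=True)
  subprocess.run(["net", "start", "HealthService"], check=True)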

The preceding scenarios should give you a clear picture of how many management groups you will need in your Operations Manager infrastructure. The next section covers the distribution of server roles within a management group and the sizing requirements for those systems.

Management Group Composition

There are few limitations on the arrangement of Operations Manager server components in a management group. They can all be installed on the same server (except the gateway server role), or they can be distributed across multiple servers in various combinations. Some roles can be installed into a Cluster service (formerly known as MSCS) failover cluster for high availability, and multiple management servers can be installed to allow agents to fail over between them. You should choose how to distribute Operations Manager server components and what types of servers will be used based on your IT requirements and optimization goals.

Server Role Compatibility

An Operations Manager 2007 management group can provide a multitude of services. These services can be distributed to specific servers, thereby classifying a server into a specific role. Not all server roles and services can coexist. The following list shows the compatibilities and dependencies for each role and notes whether the role can be placed in a quorum failover cluster:

  • Operational database. Compatible with other roles: Yes. Requirements: SQL Server. Failover cluster: Yes.

  • Audit Collection Services (ACS) database. Compatible with other roles: Yes. Requirements: SQL Server. Failover cluster: Yes.

  • Reporting Data Warehouse database. Compatible with other roles: Yes. Requirements: SQL Server. Failover cluster: Yes.

  • Reporting. Compatible with other roles: Yes. Requirements: dedicated SQL Server Reporting Services instance; not on a domain controller. Failover cluster: No.

  • Root management server. Compatible with other roles: Yes. Requirements: not compatible with the management server or gateway server role. Failover cluster: Yes.

  • Management server. Compatible with other roles: Yes. Requirements: not compatible with the root management server. Failover cluster: No.

  • Administrator console. Compatible with other roles: Yes. Requirements: Windows XP, Windows Vista, Windows Server 2003, or Windows Server 2008. Failover cluster: N/A.

  • ACS collector. Compatible with other roles: Yes. Requirements: can be combined with the gateway server and audit database. Failover cluster: No.

  • Gateway server. Compatible with other roles: Yes. Requirements: can be combined with the ACS collector only; must be a domain member. Failover cluster: No.

  • Web console server. Compatible with other roles: Yes. Failover cluster: N/A.

  • Agent. Compatible with other roles: Yes. Automatically deployed to the root management server and management servers in a management group. Failover cluster: N/A.

All the recommendations made here are based on these assumptions:

  • The disk subsystem figures are based on drives that can sustain 125 random I/O operations per second per drive. Many drives can sustain higher I/O rates, and this might reduce the number of drives required in your configuration.

  • In management groups that have management servers deployed in addition to the RMS, all agents should use the management servers as their primary and secondary management servers and no agents should be using the RMS as their primary or secondary management server.

  • The Agentless Exception Monitoring guidance assumes that there are approximately one to two crashes per machine per week, with an average CAB file size of 500 KB.

  • Collective Client Monitoring includes only out-of-the-box client-specific management packs, including the Windows Vista, Windows XP, and Information Worker Management Packs.

  • All connectivity between agents and servers is at 100 Mbps or better.

Availability

The need for high availability for the databases, the RMS, management servers, and gateway servers can be addressed by building redundancy into the management group.

  • Database—All databases used in Operations Manager 2007 require Microsoft SQL Server 2005 SP1 or later or Microsoft SQL Server 2008 SP1 or later, which can be installed into an MSCS quorum node failover configuration.

    Note

    For more information on Cluster services, refer to the Windows Server 2003 and Windows Server 2008 online help.

  • RMS—The System Center Data Access service and System Center Management Configuration service run only on the RMS, and this makes them a single point of failure in the management group. Given the critical role that the RMS plays, if your requirements include high availability, the RMS should also be installed into its own two-node failover cluster. For complete details on how to cluster the RMS, see the Operations Manager 2007 Deployment Guide.

  • Management servers—In Operations Manager, agents in a management group can report to any management server in that group. Therefore, having more than one management server available provides redundant paths for agent/server communication. The best practice then is to deploy one or two management servers in addition to the RMS and to use the Agent Assignment and Failover Wizard to assign the agents to the management servers and to exclude the RMS from handling agents.

  • Gateway servers—Gateway servers serve as a communications intermediary between management servers and agents that lie outside the Kerberos trust boundary of the management servers. Agents can fail over between gateway servers just as they can between management servers if communications with the primary server of either one is lost. Likewise, gateway servers can be configured to fail over between management servers, providing a fully redundant set of pathways from the agents to the management servers. See the Operations Manager 2007 Deployment Guide for procedures on how to deploy this configuration.

Cost

The more distributed the management group server roles are, the more resources will be needed to support that configuration. This includes hardware, environment, licensing, operations, and maintenance overhead. Designing with cost control as the optimization goal moves you in the direction of a single-server implementation or minimal role distribution; this in turn reduces redundancy and, potentially, performance.

Performance

With performance as an optimization goal, you will be better served by a more distributed configuration and higher-end hardware. Commensurately, cost will rise.

Console Distribution and Location of Access Points

The Operations console communicates directly with the RMS and, when the Reporting component is installed, with the Reporting server. Planning the location of the RMS and the database servers in relation to the Operations console is therefore critical to performance. Be sure to keep these components in close network proximity to each other.

The following scenarios present recommendations for component distribution and platform sizing for Operations Manager 2007 R2. For the corresponding recommendations for Operations Manager 2007 SP1, see the Operations Manager 2007 SP1 documentation. In these scenarios, DB is the SQL Server hosting the OperationsManager database, DW is the SQL Server hosting the Reporting data warehouse database, RS is the Reporting server, RMS is the root management server, and MS is a management server. Basic ACS design and planning is presented later in this paper.

Note

For additional information on sizing Operations Manager infrastructure, see the Operations Manager 2007 R2 Sizing Helper at https://go.microsoft.com/fwlink/?LinkID=200081

Single Server, All-in-One Scenario

Monitored devices: 15 to 250 Windows computers, 200 UNIX or Linux computers

  • DB, DW, RS, RMS: 4-disk RAID 0+1, 8 GB RAM, quad processors

Multiple Server, Small Scenario

Monitored devices: 250 to 500 Windows computers, 500 UNIX or Linux computers

  • DB, DW, RS: 4-disk RAID 0+1, 4 GB RAM, dual processors

  • RMS: 2-disk RAID 1, 4 GB RAM, dual processors

Multiple Servers, Medium Scenario

To allow for redundancy, you can deploy multiple management servers, each with the described minimum configuration. To provide high availability for the database and RMS servers, you can deploy them into a cluster, with each node having the described minimum configuration plus connections to an externally shared disk for cluster resources.

Monitored devices: 500 to 750 Windows computers, 500 UNIX or Linux computers

  • DB: 4-disk RAID 0+1, 4 GB RAM, dual processors

  • MS: 2-disk RAID 1, 4 GB RAM, dual processors

  • DW, RS: 4-disk RAID 0+1 (data), 2-disk RAID 1 (logs), 4 GB RAM, dual processors

  • RMS: 2-disk RAID 1, 8 GB RAM, dual processors

Multiple Servers, Large Scenario

To allow for redundancy, you can deploy multiple management servers, each with the described minimum configuration. To provide high availability for the database and RMS servers, you can deploy them into a cluster, with each node having the described minimum configuration plus connections to an externally shared disk for cluster resources.

Monitored devices: 750 to 1,000 Windows computers, UNIX or Linux computers

  • DB: 4-disk RAID 0+1 (data), 2-disk RAID 1 (logs), 8 GB RAM, dual processors

  • DW: 4-disk RAID 0+1 (data), 2-disk RAID 1 (logs), 8 GB RAM, dual processors. Note: a RAID 5 configuration with similar performance can be used to fulfill the DW storage needs.

  • RS: 2-disk RAID 1, 4 GB RAM, dual processors

  • RMS: 2-disk RAID 1, 8 GB RAM, dual processors

  • MS: 2-disk RAID 1, 4 GB RAM, quad processors

Multiple Server, Enterprise

To allow for redundancy, you can deploy multiple management servers, each with the described minimum configuration. To provide high availability for the database and RMS servers, you can deploy them into a cluster, with each node having the described minimum configuration plus connections to an externally shared disk for cluster resources.

Monitored devices: 1,000 to 3,000 Windows computers, 500 UNIX or Linux computers

  • DB: 8-disk RAID 0+1 (data), 2-disk RAID 1 (logs), 8 GB RAM, quad processors

  • DW: 8-disk RAID 0+1 (data), 2-disk RAID 1 (logs), 8 GB RAM, quad processors

  • RS: 2-disk RAID 1, 4 GB RAM, quad processors

  • RMS: 4-disk RAID 0+1, 12 GB RAM, 64-bit quad processors

  • MS: 4-disk RAID 0+1, 8 GB RAM, quad processors

Monitored devices: 3,000 to 6,000 Windows computers, UNIX or Linux computers

  • DB: 14-disk RAID 0+1 (data), 2-disk RAID 1 (logs), 16 GB RAM, quad processors

  • DW: 14-disk RAID 0+1 (data), 2-disk RAID 1 (logs), 16 GB RAM, quad processors. Note: a RAID 5 configuration with similar performance can be used to meet the DW storage needs.

  • RS: 2-disk RAID 1, 4 GB RAM, quad processors

  • RMS: 4-disk RAID 0+1, 16 GB RAM, quad processors

  • MS: 2-disk RAID 0+1, 8 GB RAM, quad processors

Component Guidelines and Best Practices

In addition to the sizing guidance just given, there are additional considerations and best practices when planning for each of the Operations Manager server components.

Root Management Server Guidelines and Best Practices

On the RMS, the most critical resources are RAM and CPU, as many of the operations that the RMS performs are memory intensive and thus suffer from excessive paging. Factors that influence RMS load include the following:

  • Number of agents in the management group—Because the RMS must compute the configuration for all agents in the management group, increasing the number of agents increases the amount of memory required on the RMS, regardless of the volume of operations data the agents send.

  • Rate of instance space changes—The instance space is the data that Operations Manager maintains to describe all the monitored computers, services, and applications in the management group. Whenever this data changes frequently, additional resources are needed on the RMS to compute configuration updates for the affected agents. The rate of instance space changes increases as you import additional management packs into your management group. Adding new agents to the management group also temporarily increases the rate of instance space changes.

  • Number of concurrent Operations consoles and other SDK clients—Examples of other SDK clients include the Web console and many third-party tools that interface with Operations Manager. Because the SDK Service is hosted on the RMS, each additional connection uses memory and CPU.

Some best practices around sizing RMS include the following:

  • Use 64-bit hardware and operating system—Using 64-bit hardware enables you to easily increase memory beyond 4 GB. Even if your current deployment does not require more than 4 GB of RAM, using 64-bit hardware gives you room for growth if the requirements change in the future.

  • Limit the number of agents reporting to the RMS, or eliminate them entirely—In management groups with smaller agent counts, it’s typically fine to have agents report directly to the RMS. This reduces the overall cost of the hardware required for your installation. However, as the number of agents increases, you should consider preventing any agents from reporting directly to the RMS. Moving the agent workload to other management servers reduces the hardware requirements for the RMS and generally results in better performance and reliability from the management group.

  • Ensure high bandwidth network connectivity to the OperationsManager database and the Data Warehouse—The RMS frequently communicates with the Operations Database and Data Warehouse. In general, these SQL connections consume more bandwidth and are more sensitive to network latency than connections between agents and the RMS. Therefore, you should generally ensure that the RMS, OperationsManager database, and Data Warehouse database are on the same local area network.

Operations Database Guidelines and Best Practices

As with all database applications, the Operations database performance is most affected by the performance of the disk subsystem. Because all Operations Manager data must flow through the OperationsManager database, the faster the disk the better the performance. CPU and memory affect performance as well. Factors that influence the load on the OperationsManager database include the following:

Note

To calculate the OperationsManager database size, use the Operations Manager 2007 R2 Sizing Helper Tool at https://go.microsoft.com/fwlink/?LinkID=200081

  • The rate of data collection—The primary driver of load on the OperationsManager database is the rate at which operational data (alerts, state changes, events, and performance data) is collected by agents and written to the database. This rate depends chiefly on which management packs are deployed in the management group and on the number of agents reporting into it.

  • The rate of instance space changes—The instance space is the data that Operations Manager maintains to describe all the monitored computers, services, and applications in the management group. Updating this data in the OperationsManager database is costly relative to writing new operational data to the database. Additionally, when instance space data changes, the RMS makes additional queries to the OperationsManager database to compute configuration and group changes. The rate of instance space changes increases as you import additional management packs into your management group. Adding new agents to the management group also temporarily increases the rate of instance space changes.

  • Concurrent Operations console and other SDK clients—Each open instance of the Operations console reads data from the OperationsManager database. Querying this data consumes potentially large amounts of disk activity as well as CPU and RAM. Consoles displaying large amounts of operational data in the Events View, State View, Alerts View, and Performance Data View tend to put the largest load on the database. To achieve maximum scalability, consider scoping views to include only necessary data.

Following are some best practices for sizing the OperationsManager database server:

  • Choose an appropriate disk subsystem—The disk subsystem for the OperationsManager database is the most critical component for overall management group scalability and performance. The disk volume for the database should typically be RAID 0+1 with an appropriate number of spindles. RAID 5 is typically an inappropriate choice for this component because it optimizes storage space at the cost of performance. Because the primary factor in choosing a disk subsystem for the OperationsManager database is performance rather than overall storage space, RAID 0+1 is more appropriate. When your scalability needs do not exceed the throughput of a single drive, RAID 1 is often an appropriate choice because it provides fault tolerance without a performance penalty.

  • The placement of data files and transaction logs—For lower-scale deployments, it is often most cost-effective to combine the SQL data file and transaction logs on a single physical volume because the amount of activity generated by the transaction log isn’t very high. However, as the number of agents increases, you should consider placing the SQL data file and transaction log on separate physical volumes. This allows the transaction log volume to perform reads and writes more efficiently. This is because the workload will consist of mostly sequential writes. A single two-spindle RAID 1 volume is capable of handling very high volumes of sequential writes and should be sufficient for almost all deployments, even at a very high scale.

  • Use 64-bit hardware and operating system—The OperationsManager database often benefits from large amounts of RAM, and this can be a cost-effective way of reducing the amount of disk activity performed on this server. Using 64-bit hardware enables you to easily increase memory beyond 4 GB. Even if your current deployment does not require more than 4 GB of RAM, using 64-bit hardware gives you room for growth if your requirements change in the future.

  • Use a battery-backed write-caching disk controller—Testing has shown that the workload on the OperationsManager Database benefits from write caching on disk controllers. When configuring read caching vs. write caching on disk controllers, allocating 100 percent of the cache to write caching is recommended. When using write-caching disk controllers with any database system, it is important to ensure they have a proper battery backup system to prevent data loss in the event of an outage.

Data Warehouse Guidelines and Best Practices

In Operations Manager 2007, data is written to the Data Warehouse in near real time. This makes the load on it similar to the load on the OperationsManager database server. Because the Data Warehouse is a SQL Server database, the disk subsystem is the most critical factor for overall performance, followed by memory and CPU. Operations Manager Reporting Services also places a slightly different load on the Data Warehouse server. Factors that affect the load on the Data Warehouse include the following:

  • Rate of data insertion—To allow for more efficient reporting, the Data Warehouse computes and stores aggregated data in addition to a limited amount of raw data. Doing this extra work means that operational data collection to the Data Warehouse can be slightly more costly than to the OperationsManager database. This additional cost is typically balanced out by the reduced cost of processing discovery data on the Data Warehouse as opposed to processing it on the OperationsManager database.

  • Numbers of concurrent reporting users—Because reports frequently summarize large volumes of data, each reporting user can put a significant load on the system. Both the number of reports run at the same time and the type of reports being run affect overall capacity needs. Generally, reports that query large date ranges or large numbers of objects demand more system resources.

Following are some best practices when sizing the Data Warehouse server:

  • Choose an appropriate disk subsystem—Because the Data Warehouse is now an integral part of the overall data flow through the management group, choosing an appropriate disk subsystem for the Data Warehouse is very important. As with the OperationsManager database, RAID 0+1 is often the best choice. In general, the disk subsystem for the Data Warehouse should be similar to the disk subsystem for the OperationsManager database.

  • Placement of the data files and the transaction logs—As with the OperationsManager database, separating SQL data and transaction logs is often an appropriate choice as you scale up the number of agents. If both the OperationsManager database and Data Warehouse are located on the same physical machine and you want to separate data and transaction logs, you must put the transaction logs for the OperationsManager database on a separate physical volume from the Data Warehouse to see any benefit. The data files for the OperationsManager database and Data Warehouse can share the same physical volume, as long as it is appropriately sized.

  • Use 64-bit hardware and operating system—The Data Warehouse often benefits from large amounts of RAM, and this can be a cost-effective way of reducing the amount of disk activity performed on this server. Using 64-bit hardware enables you to easily increase memory beyond 4 GB. Even if your current deployment does not require more than 4 GB of RAM, using 64-bit hardware gives you room for growth if your requirements change in the future.

  • Use dedicated server hardware for the Data Warehouse—Although lower-scale deployments can often consolidate the OperationsManager database and Data Warehouse onto the same physical machine, it is advantageous to separate them as the number of agents increases and, consequently, the volume of incoming operational data increases as well. You will also see better reporting performance if the Data Warehouse and Reporting servers are separated.

  • Use a battery-backed write-caching disk controller—Testing has shown that the workload on the Data Warehouse benefits from write caching on disk controllers. When configuring read caching versus write caching on disk controllers, allocating 100 percent of the cache to write caching is recommended. When using write-caching disk controllers with any database system, it is important to ensure they have a proper battery backup system to prevent data loss in the event of an outage.

Management Server Guidelines and Best Practices

The largest portion of the load on a management server is from the collection of operational data and the insertion of that data into the OperationsManager and Data Warehouse databases. It is important to note that management servers perform these operations directly without depending on the RMS. Management servers perform most of the data queuing in memory rather than depending on a slower disk, thereby increasing performance. The most important resource for management servers is the CPU, but testing has shown that they typically do not require high-end hardware. Factors that affect load on a management server include the following:

  • Rate of operational data collection—Because operations data collection is the primary activity performed by a management server, this rate has the biggest impact on overall server utilization. However, testing has shown that management servers can typically sustain high rates of operational data processing with low to moderate utilization. The primary factor affecting the rate of operational data collection is which management packs are deployed in the management group.

Following are some best practices when sizing a management server:

  • Do not oversize management server hardware—For most scenarios, using a standard utility server is sufficient for the work performed by a management server. Following the hardware guidelines in this document should be sufficient for most workloads.

  • Do not exceed an agent-to-management-server ratio of about 3,000 to 1—Actual server performance will vary based on the volume of operations data collected, but testing has shown that management servers typically do not have issues supporting 2,000 agents each with a relatively high volume of operational data coming in. Having 2,000 agents per management server is a guideline based on test experience and not a hard limit. You might find that a management server in your environment is able to support a higher or lower number of agents.

  • To maximize the UNIX or Linux computer-to-management-server ratio (500:1), use dedicated management servers for cross-platform monitoring.

  • Use the minimum number of management servers per management group to satisfy redundancy requirements—The main reason for deploying multiple management servers should be to provide for redundancy and disaster recovery rather than scalability. Based on testing, most deployments will not need more than three to five management servers to satisfy these needs.

Gateway Server Guidelines and Best Practices

Gateway servers relay communications between management servers and agents that lie on opposite sides of a Kerberos trust boundary from each other. The gateway server uses certificate-based authentication to perform mutual authentication with the management server, and it does so over a single connection rather than the many connections that would otherwise be required between the agents and the management server. This makes certificate-based authentication to untrusted domains much easier to manage. Factors that affect load on a gateway server include the following:

  • Rate of operations data collection—The primary factor that influences the load on a gateway is the rate of operations data collection. This rate is a function of the number of agents reporting to the gateway and the management packs deployed within the management group.

Following are some best practices when sizing a gateway server:

  • Gateway servers can be beneficial in managing bandwidth utilization—From a performance perspective, gateways are recommended as a tool for optimizing bandwidth utilization in low-bandwidth environments because they compress all communications with the management server.

  • Do not exceed an agent-to-Gateway-Server ratio of about 1,500 to 1—Testing has shown that having more than 1,000 agents per gateway can adversely affect the ability to recover in the event of a sustained (multi-hour) outage that causes a gateway to be unable to communicate with the management server. If you need more agents than this to be reporting to a gateway, consider using multiple gateway servers. If you want to exceed 1,500 agents per gateway, it is highly recommended that you test your system to ensure that the gateway is able to quickly empty its queue after a sustained outage between the gateway and the management server if gateway recovery time is a concern in your environment.

  • For large numbers of gateways and gateway connected agents, use a dedicated management server—Having all gateways connect to a single management server with no other agents connected to it can speed recovery time in the event of a sustained outage.

Application Error Monitoring Guidelines and Best Practices

The management server used for AEM receives data from the Error Reporting client and stores it on a file share. If that file share is local to the management server, the associated disk I/O places additional load on that server.

Following are some best practices when planning for AEM:

  • Disk storage for the file share can be local or on a Network Attached Storage (NAS) or storage area network (SAN) device.

  • The disk used for AEM should be separate from the disk used for the Data Warehouse or OperationsManager databases.

  • If the storage is set up on a Distributed File System (DFS), DFS replication should be disabled.

  • A gateway server should not be used as an AEM collector.

Recommended management server configuration for the AEM file share, by number of monitored devices:

  • 0 to 10,000 monitored devices: 200 GB of disk as a 2-drive RAID 1, 4 GB RAM, dual processors

  • 10,000 to 25,000 monitored devices: 500 GB of disk as a 2-drive RAID 1, 8 GB RAM, quad processors
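To relate the file share guidance above to the AEM assumptions stated earlier in this paper (approximately one to two crashes per machine per week, with an average CAB file size of 500 KB), the following Python sketch estimates weekly CAB storage growth; the machine count and crash rate are illustrative values, not measurements.

  # Rough AEM file share growth estimate based on the assumptions stated earlier
  # in this paper: one to two crashes per machine per week, ~500 KB per CAB file.
  machines = 25000          # illustrative number of monitored devices
  crashes_per_week = 2      # high end of the assumed crash rate
  cab_size_kb = 500         # assumed average CAB file size in KB

  weekly_kb = machines * crashes_per_week * cab_size_kb
  weekly_gb = weekly_kb / 1024 / 1024
  print(f"Approximate CAB data per week: {weekly_gb:.1f} GB")        # about 24 GB
  print(f"Weeks a 500 GB share could hold: {500 / weekly_gb:.0f}")   # about 21 weeks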

URL Monitoring Guidelines and Best Practices

URL monitoring can be performed by the Health Service of an agent or a management server. If you are monitoring more than 1,000 URLs from a management server, you should increase the Health Service version store size from the default of 5120 pages to 10240 pages. To do this, set the Persistence Version Store Maximum value under HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\HealthService\Parameters. A management server that is performing URL monitoring will have a heavy load placed on its CPU and disk resources, so a battery-backed write-caching disk controller is recommended.
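A minimal Python sketch of the registry change described above follows; the key path and value name are taken from the preceding paragraph, and you should confirm them on your own management server before applying the change.

  import winreg

  # Registry location and value name from the preceding paragraph; verify before use.
  KEY_PATH = r"SYSTEM\CurrentControlSet\Services\HealthService\Parameters"
  VALUE_NAME = "Persistence Version Store Maximum"
  NEW_PAGE_COUNT = 10240  # increased from the default of 5120 pages

  with winreg.OpenKey(winreg.HKEY_LOCAL_MACHINE, KEY_PATH, 0, winreg.KEY_SET_VALUE) as key:
      winreg.SetValueEx(key, VALUE_NAME, 0, winreg.REG_DWORD, NEW_PAGE_COUNT)

  # Restart the System Center Management service (HealthService) afterward so that
  # the Health Service picks up the larger version store.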

Collective Client Monitoring Guidelines and Best Practices

Collective Health Monitoring is performed by gathering event and performance data from many machines and aggregating the data based on groups of systems for reporting and analysis. For example, individual memory performance data is gathered from Windows XP and Windows Vista clients on different types of hardware. Collective Health Monitoring aggregates this data and provides reports based on memory performance for specific groups of systems, such as by operating system or by hardware vendor. This makes analysis of overall performance easier than the alternative of digging through long lists of individual system performance reports. Collective monitoring mode also enables alerting and monitoring at a collective, rather than an individual, level.

Collective Client monitoring management packs include the following: Information Worker, Windows Client, Windows XP, Windows Vista, Network Access Protection, and other client-focused management packs.

Each client that is monitored by an agent periodically generates summary events, and these events are used to calculate the collective health of the client population. Alerting on the individual agent is disabled; hence, the agents running on the clients do not generate any alert data.

Depending on the number of management packs deployed and agent traffic, each management server can manage up to 3,000 to 4,000 agent-managed clients.

When planning the rollout of collective monitoring clients, the agents should be approved in batches of no more than 1,000 at a time to allow the agents to get synchronized with the latest configuration.

Designing Audit Collection Services

This section provides high-level guidance to help you get started with planning your ACS implementation.

ACS is not a stand-alone solution. ACS can be hosted only in an existing management group because its agent is integrated with and installed alongside the Operations Manager agent, and the ACS Collector can be installed only on a management server or gateway server. The remaining components, the ACS database and ACS Reporting, can be installed on the same SQL Server 2005 server or instance as the OperationsManager database and reporting components. However, for performance, capacity, and security reasons, you will probably choose to install these on dedicated hardware.

Design Decisions

There are four fundamental design decisions to make when planning your ACS implementation. As you make these decisions, keep in mind that there is a one-to-one relationship between the ACS Collector server and its ACS database. An ACS database can have only one ACS Collector feeding data to it at a time, and every ACS Collector needs its own ACS database. It is possible to have multiple ACS Collector/Database pairs in a management group; however, there are no procedures available out of the box for integrating the data from multiple ACS databases into a single database.

The first decision that must be made is whether to deploy a management group that is used exclusively to support ACS or to deploy ACS into a management group that also provides health monitoring and alerting services. Here are the characteristics of these two ACS deployment scenarios.

  • ACS hosted in a production management group scenario:

    • Scaled usage of ACS—Given that ACS collects every security event from the systems that ACS Forwarders are enabled on, the use of ACS can generate a huge amount of data. Unless you are using dedicated hardware for the ACS Collector and Database roles, processing this data might negatively affect the performance of the hosting management group, particularly in the database layer.

    • Separate administration and security is not required—Because ACS is hosted in a management group, people with administrative control in the management group will have administrative rights in ACS. If the business, regulatory/audit, and IT requirements mandate that ACS be under separate, nonproduction IT control, deploying ACS into a production management group is not an option.

  • ACS hosted on a dedicated management group scenario:

    • Separate administration and security is required—If there is a separate administrative group that is responsible for audit and security controls at your company, hosting ACS on a dedicated management group administered by the audit/security group is recommended.

The second decision that must be made is whether or not to deploy ACS Reporting into the same SQL Server 2005 Reporting Services instance as the Operations Manager 2007 Reporting component. Here are the characteristics of these two scenarios.

  • ACS reporting integrated with Operations Manager Reporting:

    • Single console for all reports—When ACS Reporting is installed with Operations Manager Reporting, the ACS reports are accessed via the Operations Manager Operations console.

    • Common security model—When Operations Manager 2007 Reporting is installed into SQL Server 2005 Reporting Services, it overwrites the default security model, replacing it with the Operations Manager role-based security model. ACS Reporting is compatible with this model. All users who have been assigned the Report Operator role will have access to the ACS Reports as long as they also have the necessary permissions on the ACS database.

    Note

    If Operations Manager Reporting is later uninstalled, the original SRS security model must be restored manually using the ResetSRS.exe utility found on the installation media in the SupportTools directory.

  • ACS reporting installed on a dedicated SQL Server Reporting Services instance:

    • Separate console for ACS and Operations Manager reports—When installed on a dedicated SRS instance, the ACS Reports are accessed via the SRS Web site that is created for it at installation. This provides greater flexibility in configuring the folder structure and in using SRS Report designer.

    • Separate security model—A consequence of using a dedicated SRS instance is that you can create security roles as needed to meet the business and IT requirements to control access to the ACS reports. Note that the necessary permissions must still be granted on the ACS database.

The third design decision that must be made is how many ACS Collector/Database pairs to deploy to support your environment. The rate of ongoing event collection and insertion that a single ACS Collector/Database pair can support is not an absolute number; it depends on the performance of the storage subsystem that the database server is attached to. For example, a low-end SAN solution can typically support up to 2,500 to 3,000 security events per second. Independent of this, the ACS Collector has been observed supporting bursts of 20,000 security events per second. The following factors affect the number of security events generated per second:

  • Audit Policy configuration—The more aggressive the audit policy, the greater the number of Security events generated by audited machines.

  • The role of the machine that the ACS forwarder is enabled on—Given the default audit policy, domain controllers generate the most security events, member servers generate the next highest amount, and workstations generate the least.

Approximate number of unfiltered security events per second generated under high load, by machine role:

  • Windows Server 2003 domain controller: 40 events per second

  • Windows Server 2003 member server: 2 events per second

  • Workstation: 0.2 events per second

  • Using the preceding numbers, a single, high-end ACS Collector/Database pair can support up to 150 domain controllers, 3,000 member servers, or 20,000 workstations (with the appropriate ACS Collector filter applied).

  • The amount of user activity on the network—If your network is used by high-end users conducting a large number of transactions, as is experienced, for example, at Microsoft, more events will be generated. If your network users conduct relatively few transactions, such as might be the case at a retail kiosk or in a warehouse scenario, you should expect fewer security events.

  • The ACS Collector Filter configuration—ACS collects all security events from a monitored machine's security event log. Out of all the events collected, you might be interested in only a smaller subset. ACS provides the ability to filter out the undesired events, allowing only the desired ones to be processed by the Collector and then inserted into the ACS database. As the amount of filtering increases, fewer events will be processed and inserted into the ACS database.

The last design decision that must be made is the edition of SQL Server 2005 or SQL Server 2008 to use for the ACS database. ACS supports SQL Server 2005 Standard or Enterprise edition and SQL Server 2008 Standard or Enterprise edition. The edition used affects how the system behaves during the daily database maintenance window. During the maintenance window, database partitions whose time stamps lie outside the data retention schedule (with 14 days being a typical configuration for data retention) are dropped from the database. If SQL Server Standard edition is used, Security event insertion halts and events queue up on the ACS Collector until maintenance is completed. If SQL Server Enterprise edition is used, insertion of processed Security events continues, but at only 30 percent to 40 percent of the regular rate. This is one reason why you should carefully pick the timeframe for daily database maintenance, selecting a time when there is the least amount of user and application activity on the network.

Sizing Audit Collection Services

This section helps you size ACS hardware components before you deploy them by determining how many disks, ACS collectors, and ACS databases are needed.

Important

To effectively size ACS, you must determine the number of disks required for ACS disk I/O and the ACS database size. The process of calculating these values is detailed in the "Sizing ACS" section. Each ACS collector must have its own ACS database. The rate of data insertion into the database, which is dictated by the performance of the storage subsystem, determines the capacity of a single ACS collector. The more disks that a single disk array can support, the better it can perform.

Tip

ACS supports the use of SQL Server 2005 Standard Edition and SQL Server 2005 Enterprise Edition; however, the edition you use affects how the system performs during the daily database maintenance window. During the maintenance window, database partitions with time stamps outside of the default 14-day data retention schedule are dropped from the database. If SQL Server 2005 Standard Edition is used, Security event insertion halts and events queue on the ACS Collector until maintenance is completed. If SQL Server 2005 Enterprise Edition is used, Security event insertion continues, but only at 30 to 40 percent of the regular rate. Therefore, you should carefully pick the timeframe for daily database maintenance, selecting a time when there is the least amount of user and application activity on the network.

Sizing ACS

The number of ACS collectors, the size of the ACS database, and the sizing of the disk subsystem for the database are dictated entirely by the volume of security events forwarded to ACS, measured in events per second. You perform ACS sizing calculations to find out three things:

  1. The number of ACS Collectors you will need

  2. How much space you will need to allot for the database

  3. How many disks you will need to support the expected throughput on the database

Ideally, you could determine the number of security events generated by computers in your organization by installing a pilot ACS collector to measure the incoming event rate. If you have a pilot ACS collector, you can monitor the ACS Collector\Incoming Event per Sec performance monitor counter. However, if you do not have a pilot ACS collector, you can use the sizing guidelines and script sample that follow to produce similar results.

Use the following procedure to measure the number of events per second for all computers in your organization by using the Events Generated Per Second Script. After you determine the number of events, you use this number to calculate the number of disks required to handle I/O and the total ACS database size, as described in the subsequent sections.

To estimate the number of events per second for all computers

  1. Identify groups of computers that perform similar functions; for example, domain controllers, member servers, and desktop computers.

  2. Count the number of computers in each group for all computers in your organization.

  3. Run the script sample contained in the Events Generated Per Second Script section over a 48-hour period on at least one computer in each group to record data. The computer you run the script on represents all computers included in its group.

  4. Record the data in a spreadsheet for consolidation and analysis.

  5. Based on the data you collect, identify when peak usage occurs.

  6. For each computer you collect data from, determine how many events occur per second during peak usage and then multiply it by the number of computers in the represented group. Repeat this step for each group.

  7. Add the values together from the previous step to determine the number of events per second for all computers in your organization.

    You will use the total value to calculate the number of disks required to handle I/O and to calculate the total ACS database size in the following sections; a short sketch of the consolidation arithmetic follows this procedure.
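The consolidation arithmetic in steps 6 and 7 can be sketched in Python as follows; the group names, machine counts, and per-machine peak rates are placeholders, not measured values.

  # Placeholder measurements: (group name, number of computers in the group,
  # peak security events per second recorded on the sampled computer).
  groups = [
      ("domain controllers", 10, 40.0),
      ("member servers", 300, 2.0),
      ("desktop computers", 5000, 0.2),
  ]

  # Step 6: multiply each sampled peak rate by the size of the group it represents.
  per_group = {name: count * rate for name, count, rate in groups}

  # Step 7: add the group totals to estimate events per second for the whole organization.
  total_events_per_second = sum(per_group.values())
  print(per_group)
  print(f"Estimated peak events per second, all computers: {total_events_per_second:.0f}")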

Calculating the number of disks required to handle I/O

During testing at Microsoft, the estimated average number of logical disk I/O operations per event was 1.384 for the ACS database transaction log and 0.138 for the ACS database file. However, these values may differ slightly depending on the environment. The calculations assume that disk revolutions per minute (RPM) have a 1:1 ratio with logical disk I/O (that is, each disk services one I/O operation per revolution) and that a RAID 0+1 configuration is used.

You can use the following formulas to calculate the number of disks required to handle I/O.

For the log drives:

([Average number of logical disk I/O per event for the transaction log file] × [Events per second for all computers]) ÷ ([Disk RPM] ÷ 60 sec/minute) × 2 (for the RAID 1 mirror) = [Number of required drives]

Values for the preceding variables are as follows:

  • Average number of logical disk I/O per event (for the transaction log file): 1.384

  • Estimated events per second for all computers: estimated by using the script and the "To estimate the number of events per second for all computers" procedure

  • Disk RPM: varies; determined by the disk device

For the database drives:

([Average number of logical disk I/O per event for the database file] × [Events per second for all computers]) ÷ ([Disk RPM] ÷ 60 sec/minute) × 2 (for the RAID 1 mirror) = [Number of required drives]

Values for the preceding variables are as follows:

  • Average number of logical disk I/O per event (for the database file): 0.138

  • Estimated events per second for all computers: estimated by using the script and the "To estimate the number of events per second for all computers" procedure

  • Disk RPM: varies; determined by the disk device

If the number of disks required to handle I/O for events exceeds the number of disks you can have in a disk array, you will need to divide the events into multiple collectors.
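The following Python sketch applies the two formulas above to a hypothetical environment (1,000 events per second and 15,000-RPM drives) to show how the drive counts fall out; substitute the values you measured for your own organization.

  import math

  events_per_second = 1000      # example value from the events-per-second procedure
  disk_rpm = 15000              # example drive speed; one I/O per revolution assumed
  io_per_event_log = 1.384      # logical disk I/O per event, transaction log file
  io_per_event_db = 0.138       # logical disk I/O per event, database file

  io_per_second_per_drive = disk_rpm / 60  # 250 I/O per second for a 15,000-RPM drive

  def drives_needed(io_per_event):
      data_drives = math.ceil(io_per_event * events_per_second / io_per_second_per_drive)
      return data_drives * 2    # double the count for the RAID 1 mirror

  print("Transaction log drives:", drives_needed(io_per_event_log))  # 12 drives
  print("Database file drives:", drives_needed(io_per_event_db))     # 2 drives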

Calculating the total ACS database size

To determine the total ACS database size, use the following formula:

[Events per second for all computers] × [0.4 KB, the average size of an event] × 60 seconds × 60 minutes × 24 hours × [retention period, in days of data kept in the database] ÷ 1,024 (KB to MB) ÷ 1,024 (MB to GB) ÷ 1,024 (GB to TB) = total size of the database, in terabytes
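As a worked example of this formula, the following Python sketch assumes an illustrative 1,000 events per second and the 14-day retention period mentioned earlier in this paper.

  events_per_second = 1000   # illustrative value from the sizing procedure
  event_size_kb = 0.4        # average size of an event, in KB
  retention_days = 14        # typical data retention period

  kb_per_day = events_per_second * event_size_kb * 60 * 60 * 24
  total_tb = kb_per_day * retention_days / 1024 / 1024 / 1024
  print(f"Estimated ACS database size: {total_tb:.2f} TB")  # roughly 0.45 TB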

Audit Collection Service Guidelines and Best Practices

The overall performance of the ACS system is most affected by the performance of the ACS database and its disk subsystem. Given that many thousands of events per second will be inserted continuously, with potential peaks of tens of thousands per second, this is easy to see. It is not uncommon with a large number of monitored devices, including domain controllers, to accumulate more than a terabyte of data in a 14-day time span in the ACS database. Following are some best practices for ACS:

  • Use 64-bit hardware and operating system for the Collector and SQL Server, along with a high-performance SAN solution.

  • Separate the database files from the transaction logs.

  • Use dedicated hardware to host ACS if warranted.

  • Use tight filters to reduce the number of noise Security events that get inserted into the database.

  • Plan your Windows Audit policy carefully so that only relevant events are logged on monitored systems.

  • Enable the ACS Forwarder only on necessary systems.

  • Configure Security Event logs with sufficient space so that if communication is lost with the ACS Collector, the Security Event log file will not wrap on itself and overwrite previous events, resulting in a loss of event data.