Mapping Requirements to a Design for Operations Manager 2007 SP1

Applies To: Operations Manager 2007 SP1

In the previous section, you completed the following three tasks:

  • You gathered the business requirements, which help you plan which features of Operations Manager to implement.

  • You gathered the IT requirements, which help you plan the management group topology.

  • You inventoried how your company currently performs monitoring, which helps plan how to configure Operations Manager.

This section guides you through the design decisions that map the information you have collected to an actual design by applying best practices for sizing and capacity planning of server roles and components.

Management Group Design

All Operations Manager 2007 implementations consist of at least one management group, and given the scalability of Operations Manager 2007, for some implementations, a single management group might be all that is needed. Depending on the requirements of the company, additional management groups might be needed immediately or might be added over time. The process of distributing Operations Manager services among multiple management groups is called partitioning.

This section addresses the general criteria that would necessitate multiple management groups. Planning the composition of individual management groups, such as determining the sizing of servers and distribution of Operations Manager roles among servers in a management group, is covered in the next section, "Management Group Composition."

One Management Group

Approach your Operations Manager management group planning with the same mindset as you have with Active Directory domain planning: start with one management group, and add on as necessary. A single management group can scale along the following recommended limits:

  • 2000 agents reporting to a management server.

  • Most scalability, redundancy, and disaster recovery requirements can be met by using from three to five management servers in a management group.

  • 50 Operations consoles open simultaneously.

  • 800 agents reporting to a gateway server.

  • 25,000 Agentless Exception Monitoring (AEM) machines reporting to a management server.

  • 100,000 AEM machines reporting to a management group.

  • 2500 Collective Monitoring agents reporting to a management server.

  • 10,000 Collective Monitoring agents reporting to a management group.

When you consider these limits in conjunction with the security scopes offered through the use of Operations Manager roles to control access to data in the Operations console, a single management group is very scalable and will suffice in many situations.
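
To make these limits easier to apply, the following sketch (a hypothetical planning helper, not part of Operations Manager; the figures are the recommended limits listed above) checks a proposed environment against them.

```python
# Hypothetical sizing check against the recommended single-management-group
# limits listed above. The limits come from this guide; the helper itself is
# illustrative only.

RECOMMENDED_LIMITS = {
    "agents_per_management_server": 2000,
    "agents_per_gateway_server": 800,
    "open_operations_consoles": 50,
    "aem_machines_per_management_server": 25000,
    "aem_machines_per_management_group": 100000,
    "collective_agents_per_management_server": 2500,
    "collective_agents_per_management_group": 10000,
}

def exceeds_single_group_limits(environment: dict) -> list:
    """Return the names of any recommended limits the proposed environment exceeds."""
    return [name for name, limit in RECOMMENDED_LIMITS.items()
            if environment.get(name, 0) > limit]

# Example: 1,500 agents per management server and 40 open consoles stay within limits.
proposed = {"agents_per_management_server": 1500, "open_operations_consoles": 40}
print(exceeds_single_group_limits(proposed))  # [] -> a single management group may suffice
```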

Multiple Management Groups and Partitioning

As scalable as a management group is, if your requirements include any of the following scenarios, you will need more than one management group:

  • Production and Pre-Production Functionality--In Operations Manager, it is a best practice to have a production implementation that is used for monitoring your production applications and a pre-production implementation that has minimal interaction with the production environment. The pre-production management group is used for testing and tuning management pack functionality before it is migrated into the production environment. In addition, some companies employ a staging environment for servers where newly built servers are placed for a burn-in period prior to being placed into production. The pre-production management group can be used to monitor the staging environment to ensure the health of servers prior to production rollout.

  • Dedicated ACS Functionality--If your requirements include the need to collect Windows Security event log (audit) data, you will be implementing Audit Collection Services (ACS). It might be beneficial to implement a management group that exclusively supports the ACS function if your company's security requirements mandate that the ACS function be controlled and administered by a separate administrative group from the one that administers the rest of the production environment.

  • Disaster Recovery Functionality--In Operations Manager 2007, all interactions with the OperationsManager database are recorded in transaction logs prior to being committed to the database. Those transaction logs can be sent to another Microsoft SQL Server 2005 server and committed to a copy of the OperationsManager database there. This technique is called log shipping. The destination, or failover, management group does not need to be a fully populated and active management group. It can consist of only a root management server (RMS) and a SQL Server 2005 server. If it is necessary to execute a failover, the remaining management servers in the source management group require a registry change and a restart to start acting as members of the failover management group.

  • Increased Capacity--Operations Manager 2007 has no built-in limits regarding the number of agents that a single management group can support. Depending on the hardware that you use and the monitoring load (more management packs deployed means a higher monitoring load) on the management group, you might need multiple management groups in order to maintain acceptable performance.

  • Consolidated Views--When multiple management groups are used to monitor an environment, a mechanism is needed to provide a consolidated view of the monitoring and alerting data from them. This can be accomplished by deploying an additional management group (which might or might not have any monitoring responsibilities) that has access to all the data in all other management groups. These management groups are then said to be connected. The management group that is used to provide a consolidated view of the data is called the Local Management Group, and the others that provide data to it are called Connected Management Groups.

  • Installed Languages--All servers that have an Operations Manager server role installed on them must be installed in the same language. That is to say that you cannot install the RMS using the English version of Operations Manager 2007 and then deploy the Operations console using the Japanese version. If the monitoring needs to span multiple languages, additional management groups will be needed for each language of the operators.

  • Security and Administrative--Partitioning management groups for security and administrative reasons is very similar in concept to the delegation of administrative authority over Active Directory Organizational Units or Domains to different administrative groups. Your company might include multiple IT groups, each with their own area of responsibility. The area might be a certain geographical area or business division. For example, in the case of a holding company, it can be one of the subsidiary companies. Where this type of full delegation of administrative authority from the centralized IT group exists, it might be useful to implement management group structures in each of the areas. Then they can be configured as Connected management groups to a Local management group that resides in the centralized IT data center.

The preceding scenarios should give you a clear picture of how many management groups you will need in your Operations Manager infrastructure. The next section covers the distribution of server roles within a management group and the sizing requirements for those systems.

Management Group Composition

There are few limitations on the arrangement of Operations Manager server components in a management group. They can all be installed on the same server (except the gateway server role), or they can be distributed across multiple servers in various combinations. Some roles can be installed into a Microsoft Cluster Service (MSCS) failover cluster for high availability, and multiple management servers can be installed to allow agents to fail over between them. You should choose how to distribute Operations Manager server components and what types of servers to use based on your IT requirements and optimization goals.

Important

The manual decision process described here can be used in conjunction with the System Center Capacity Planner (SCCP) tool. SCCP takes your IT requirements, optimization goals, and environment inventory as inputs and calculates a management group design that best meets those goals. In addition, SCCP can run performance simulations as you change certain aspects of the management group, such as changing the number of agents that are monitored or using faster processors or a faster disk for the database server. See Appendix A of this guide, "Planning an Operations Manager 2007 Deployment with Capacity Planner."

Server Role Compatibility

An Operations Manager 2007 management group can provide a multitude of services. These services can be distributed to specific servers, thereby classifying a server into a specific role. Not all server roles and services can co-exist. The following table lists the compatibilities and dependencies and notes whether the role can be installed on a failover cluster:

Server role | Compatible with other roles | Requirements | Can be placed in a failover cluster
Operational database | Yes | SQL Server 2005 | Yes
Audit Collection Services (ACS) database | Yes | SQL Server 2005 | Yes
Reporting Data Warehouse database | Yes | SQL Server 2005 | Yes
Reporting database | Yes | Dedicated SQL Server Reporting Services instance; not on a domain controller | Yes
root management server | Yes | Not compatible with the management server or gateway server role | Yes
management server | Yes | Not compatible with the root management server | No
Administrator console | Yes | Windows XP | N/A
ACS collector | Yes | Can be combined with the gateway server and audit database | No
gateway server | Yes | Can be combined with the ACS collector only; must be a domain member | No
Web console server | Yes |  | N/A
agent | Yes | Automatically deployed to the root management server and management servers in a management group | N/A

All the recommendations made here are based on these assumptions:

  • The disk subsystem figures are based on drives that can sustain 125 random I/O operations per second per drive. Many drives can sustain higher I/O rates, and this might reduce the number of drives required in your configuration.

  • In management groups that have management servers deployed in addition to the RMS, all agents should use the management servers as their primary and secondary management servers and no agents should be using the RMS as their primary or secondary management server.

  • The Agentless Exception Monitoring guidance assumes that there is approximately one crash per machine per day, with an average CAB file size of 500 KB.

  • Collective Client Monitoring includes only out-of-the-box client-specific management packs, including the Windows Vista, Windows XP, and Information Worker Management Packs.

  • All connectivity between agents and servers is at 100 Mbps or better.
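
As a quick illustration of how the disk assumption above translates into spindle counts, here is a minimal sketch. The 125 random I/O operations per second per drive figure comes from the assumptions above; the target I/O rate and the RAID 0+1 write penalty of 2 are assumptions made for the example.

```python
import math

# Assumed sustained random I/O operations per second per drive, from the
# assumptions above. Faster drives reduce the number of spindles needed.
IOPS_PER_DRIVE = 125

def drives_needed(target_iops: float, raid_write_penalty: int = 2) -> int:
    """Rough spindle estimate for a RAID 0+1 data volume.

    Treats the load as write-dominated for simplicity; the penalty of 2
    reflects that RAID 0+1 mirrors every write. target_iops is a
    hypothetical figure you would measure or model (for example with SCCP).
    """
    return math.ceil((target_iops * raid_write_penalty) / IOPS_PER_DRIVE)

# Example: a 250 IOPS write-heavy workload -> 4 drives, in line with the
# 4-drive RAID 0+1 configurations recommended later in this section.
print(drives_needed(250))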

Availability

High-availability needs are addressed by building redundancy into the management group for the OperationsManager database, RMS, management servers, and gateway servers.

  • Database--All databases used in Operations Manager 2007 require Microsoft SQL Server 2005 SP1 or SP2, which can be installed in a failover cluster configuration. Only two-node clusters running in an active-passive configuration are supported.

    Note

    For more information on Cluster services, refer to the Windows Server 2003 and Windows Server 2008 online help.

  • RMS--The SDK and Config Services run only on the RMS, and this makes them a single point of failure in the management group. Given the critical role that the RMS plays, if your requirements include high availability, the RMS should also be installed into its own two-node failover cluster. For complete details on how to cluster the RMS, see the Operations Manager 2007 Deployment Guide.

  • Management servers--In Operations Manager, agents in a management group can report to any management server in that group. Therefore, having more than one management server available provides redundant paths for agent/server communication. The best practice then is to deploy one or two management servers in addition to the RMS and to use the Agent Assignment and Failover Wizard to assign the agents to the management servers and to exclude the RMS from handling agents.

  • Gateway servers--Gateway servers communicate with management servers and agents. Agents can fail over between gateway servers just as they can between management servers if communication with their primary server is lost. Likewise, gateway servers can be configured to fail over between management servers, providing a fully redundant set of pathways from the agents to the management servers. See the Operations Manager 2007 Deployment Guide for procedures on how to deploy this configuration.

Cost

The more distributed the management group server roles are, the more resources are needed to support that configuration. This includes hardware, environment, licensing, operations, and maintenance overhead. Designing with cost control as the optimization goal moves you toward a single-server implementation or minimal role distribution, which in turn reduces redundancy and, potentially, performance.

Performance

With performance as an optimization goal, you will be better served by a more distributed configuration and higher-end hardware. Commensurately, cost will rise.

Console Distribution and Location of Access Points

The Operations console communicates directly with the RMS and, when the Reporting component is installed, with the Reporting server. Planning the location of the RMS and the database servers, then, in relationship to the Operations console is critical to performance. Be sure to keep these components in close network proximity to each other.

The following tables present recommendations for component distribution and platform sizing. In these tables, DB is the SQL Server hosting the OperationsManager database, DW is the SQL Server hosting the Data Warehouse database, RS is the Reporting server, RMS is the root management server, and MS is a management server. Basic ACS design and planning is presented later in this paper.

Single Server, All-in-One Scenario

# of Monitored Devices | Server Roles and Config
15 to 250 | DB, DW, RS, RMS; 4-drive RAID 0+1, 8 GB RAM, dual processors

Multiple Server, Small Scenario

# of Monitored Devices | Server Roles and Config | Server Roles and Config
250 to 500 | DB, DW, RS; 4-drive RAID 0+1, 4 GB RAM, dual processors | RMS; 2-drive RAID 1, 4 GB RAM, dual processors

Multiple Servers, Medium Scenario

To allow for redundancy, you can deploy multiple management servers, each with the described minimum configuration. To provide high availability for the database and RMS servers, you can deploy them into a cluster, with each node having the described minimum configuration plus connections to an externally shared disk for cluster resources.

# of Monitored Devices | Server Role and Config | Server Role and Config | Server Role and Config | Server Role and Config
500 to 750 | DB; 4-drive RAID 0+1, 4 GB RAM, dual processors | MS; 2-drive RAID 1, 2 GB RAM, dual processors | RS; 4-drive RAID 0+1 (data), 2-drive RAID 1 (logs), 4 GB RAM, dual processors | RMS; 2-drive RAID 1, 4 GB RAM, dual processors

Multiple Servers, Large Scenario

To allow for redundancy, you can deploy multiple management servers, each with the described minimum configuration. To provide high availability for the database and RMS servers, you can deploy them into a cluster, with each node having the described minimum configuration plus connections to an externally shared disk for cluster resources.

# of Monitored Devices | Server Role and Config | Server Role and Config | Server Role and Config | Server Role and Config | Server Role and Config
750 to 1000 | DB; 4-drive RAID 0+1 (data), 2-drive RAID 1 (logs), 8 GB RAM, dual processors | DW; 4-drive RAID 0+1 (data), 2-drive RAID 1 (logs), 8 GB RAM, dual processors | RS; 2-drive RAID 1, 2 GB RAM, dual processors | RMS; 2-drive RAID 1, 8 GB RAM, dual processors | MS; 2-drive RAID 1, 4 GB RAM, dual processors

Multiple Server, Enterprise

To allow for redundancy, you can deploy multiple management servers, each with the described minimum configuration. To provide high availability for the database and RMS servers, you can deploy them into a cluster, with each node having the described minimum configuration plus connections to an externally shared disk for cluster resources.

# of Monitored Devices | Server Role and Config | Server Role and Config | Server Role and Config | Server Role and Config | Server Role and Config
1000 to 3000 | DB; 8-drive RAID 0+1 (data), 2-drive RAID 1 (logs), 8 GB RAM, 64-bit dual processors | DW; 8-drive RAID 0+1 (data), 2-drive RAID 1 (logs), 8 GB RAM, 64-bit dual processors | RS; 2-drive RAID 1, 4 GB RAM, dual processors | RMS; 4-drive RAID 0+1, 12 GB RAM, 64-bit dual processors | MS; 2-drive RAID 1, 4 GB RAM, dual processors
3000 to 6000 | DB; 14-drive RAID 0+1 (data), 2-drive RAID 1 (logs), 8 GB RAM, 64-bit dual processors | DW; 14-drive RAID 0+1 (data), 2-drive RAID 1 (logs), 8 GB RAM, 64-bit dual processors | RS; 2-drive RAID 1, 4 GB RAM, dual processors | RMS; 4-drive RAID 0+1, 16 GB RAM, 64-bit dual processors | MS; 2-drive RAID 1, 4 GB RAM, dual processors
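
If it helps to apply the preceding tables programmatically, the following sketch (a hypothetical helper; the ranges and role mixes are copied from the tables above) selects a scenario by monitored-device count.

```python
# Hypothetical helper that maps a monitored-device count onto the scenarios in
# the preceding tables. The ranges and role mixes come from this guide; the
# function itself is illustrative, not an Operations Manager tool.

SCENARIOS = [
    (15, 250, "Single server, all-in-one: DB, DW, RS, RMS on one server"),
    (250, 500, "Multiple server, small: combined DB/DW/RS server plus an RMS server"),
    (500, 750, "Multiple servers, medium: DB, MS, RS, and RMS servers"),
    (750, 1000, "Multiple servers, large: DB, DW, RS, RMS, and MS servers"),
    (1000, 6000, "Multiple server, enterprise: DB, DW, RS, RMS, and MS on 64-bit hardware"),
]

def recommended_scenario(monitored_devices: int) -> str:
    for low, high, description in SCENARIOS:
        if low <= monitored_devices <= high:
            return description
    return "Outside the tabulated ranges; model with SCCP or consider multiple management groups"

print(recommended_scenario(1200))  # -> the enterprise scenario
```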

Component Guidelines and Best Practices

In addition to the sizing guidance just given, there are additional considerations and best practices when planning for each of the Operations Manager server components.

Root Management Server Guidelines and Best Practices

On the RMS, the most critical resources are RAM and CPU, because many of the operations that the RMS performs are memory intensive and therefore suffer when excessive paging occurs. Factors that influence RMS load include the following:

  • Number of agents in the management group--Because the RMS must compute the configuration for all agents in the management group, increasing the number of agents increases the amount of memory required on the RMS, regardless of the volume of operations data the agents send.

  • Rate of instance space changes--The instance space is the data that Operations Manager maintains to describe all the monitored computers, services, and applications in the management group. Whenever this data changes frequently, additional resources are needed on the RMS to compute configuration updates for the affected agents. The rate of instance space changes increases as you import additional management packs into your management group. Adding new agents to the management group also temporarily increases the rate of instance space changes.

  • Number of concurrent Operations consoles and other SDK clients--Examples of other SDK clients include the Web console and many third-party tools that interface with Operations Manager. Because the SDK Service is hosted on the RMS, each additional connection uses memory and CPU.

Some best practices around sizing RMS include the following:

  • Use 64-bit hardware and operating system--Using 64-bit hardware enables you to easily increase memory beyond 4 GB. Even if your current deployment does not require more than 4 GB of RAM, using 64-bit hardware gives you room for growth if the requirements change in the future.

  • Limit the number of agents reporting to the RMS, or eliminate them--In management groups with smaller agent counts, it’s typically fine to have agents report directly to the RMS. This reduces the overall cost of the hardware required for your installation. However, as the number of agents increases, you should consider preventing any agents from reporting directly to the RMS. Moving the agent workload to other management servers reduces the hardware requirements for the RMS and generally results in better performance and reliability from the management group.

  • Ensure high bandwidth network connectivity to the OperationsManager database and the Data Warehouse--The RMS frequently communicates with the Operations Database and Data Warehouse. In general, these SQL connections consume more bandwidth and are more sensitive to network latency than connections between agents and the RMS. Therefore, you should generally ensure that the RMS, OperationsManager database, and Data Warehouse database are on the same local area network.

Operations Database Guidelines and Best Practices

As with all database applications, the Operations database performance is most affected by the performance of the disk subsystem. Because all Operations Manager data must flow through the OperationsManager database, the faster the disk the better the performance. CPU and memory affect performance as well. Factors that influence the load on the OperationsManager database include the following:

  • The rate of data collection--The volume of operational data written to the OperationsManager database is driven primarily by the number of agents in the management group and by which management packs are deployed. The higher the collection rate, the greater the disk, CPU, and memory load placed on the database server.

  • The rate of instance space changes--The instance space is the data that Operations Manager maintains to describe all the monitored computers, services, and applications in the management group. Updating this data in the OperationsManager database is costly relative to writing new operational data to the database. Additionally, when instance space data changes, the RMS makes additional queries to the Operations Manager database to compute configuration and group changes. The rate of instance space changes increases as you import additional management packs into your management group. Adding new agents to the management group also temporarily increases the rate of instance space changes.

  • Concurrent Operations console and other SDK clients--Each open instance of the Operations console reads data from the OperationsManager database. Querying this data consumes potentially large amounts of disk activity as well as CPU and RAM. Consoles displaying large amounts of operational data in the Events View, State View, Alerts View, and Performance Data View tend to put the largest load on the database. To achieve maximum scalability, consider scoping views to include only necessary data.

Following are some best practices for sizing the OperationsManager database server:

  • Choose an appropriate disk subsystem--The disk subsystem for the OperationsManager database is the most critical component for overall management group scalability and performance. The disk volume for the database should typically be RAID 0+1 with an appropriate number of spindles. RAID 5 is typically an inappropriate choice for this component because it optimizes storage space at the cost of performance. Because the primary factor in choosing a disk subsystem for the OperationsManager database is performance rather than overall storage space, RAID 0+1 is more appropriate. When your scalability needs do not exceed the throughput of a single drive, RAID 1 is often an appropriate choice because it provides fault tolerance without a performance penalty.

  • The placement of data files and transaction logs--For lower scale deployments, it is often most cost-effective to combine the SQL data file and transaction logs on a single physical volume because the amount of activity generated by the transaction log isn’t very high. However, as the number of agents increases, you should consider placing the SQL data file and transaction log on separate physical volumes. This allows the transaction log volume to perform reads and writes more efficiently. This is because the workload will consist of mostly sequential writes. A single two-spindle RAID 1 volume is capable of handling very high volumes of sequential writes and should be sufficient for almost all deployments, even at a very high scale.

  • Use 64-bit hardware and operating system--The OperationsManager database often benefits from large amounts of RAM, and this can often be a cost-effective way of reducing the amount of disk activity performed on this server. Using 64-bit hardware enables you to easily increase memory beyond 4 GB. Even if your current deployment does not require more than 4 GB of RAM, using 64-bit hardware gives you room for growth if your requirements change in the future.

  • Use a battery-backed write-caching disk controller--Testing has shown that the workload on the OperationsManager Database benefits from write caching on disk controllers. When configuring read caching vs. write caching on disk controllers, allocating 100 percent of the cache to write caching is recommended. When using write-caching disk controllers with any database system, it is important to ensure they have a proper battery backup system to prevent data loss in the event of an outage.

Data Warehouse Guidelines and Best Practices

In Operations Manager 2007, data is written to the Data Warehouse in near real time. This makes the load on it similar to the load on the OperationsManager database machine. Because it is a SQL Server 2005 server, the disk subsystem is the most critical to overall performance, followed by memory and CPU. Operations Manager Reporting Services also places a slightly different load on the Data Warehouse server. Factors that affect the load on the Data Warehouse include the following:

  • Rate of data insertion--To allow for more efficient reporting, the Data Warehouse computes and stores aggregated data in addition to a limited amount of raw data. Doing this extra work means that operational data collection to the Data Warehouse can be slightly more costly than to the OperationsManager database. This additional cost is typically balanced out by the reduced cost of processing discovery data on the Data Warehouse as opposed to processing it on the OperationsManager database.

  • Numbers of concurrent reporting users--Because reports frequently summarize large volumes of data, each reporting user can put a significant load on the system. Both the number of reports run at the same time and the type of reports being run affect overall capacity needs. Generally, reports that query large date ranges or large numbers of objects demand more system resources.

Following are some best practices when sizing the Data Warehouse server:

  • Choose an appropriate disk subsystem--Because the Data Warehouse is now an integral part of the overall data flow through the Management Group, choosing an appropriate disk subsystem for the Data Warehouse is very important. As with the OperationsManager database, RAID 0+1 is often the best choice. In general, the disk subsystem for the Data Warehouse should be similar to the disk subsystem for the OperationsManager database.

  • Placement of the data files and the transaction logs--As with the OperationsManager database, separating SQL data and transaction logs is often an appropriate choice as you scale up the number of agents. If both the OperationsManager database and Data Warehouse are located on the same physical machine and you want to separate data and transaction logs, you must put the transaction logs for the OperationsManager database on a separate physical volume from the Data Warehouse to see any benefit. The data files for the OperationsManager database and Data Warehouse can share the same physical volume, as long as it is appropriately sized.

  • Use 64-bit hardware and operating system--The Data Warehouse often benefits from large amounts of RAM, and this can often be a cost-effective way of reducing the amount of disk activity performed on this server. Using 64-bit hardware enables you to easily increase memory beyond 4 GB. Even if your current deployment does not require more than 4 GB of RAM, using 64-bit hardware gives you room for growth if your requirements change in the future.

  • Use dedicated server hardware for the Data Warehouse--Although lower-scale deployments can often consolidate the OperationsManager database and Data Warehouse onto the same physical machine, it is advantageous to separate them as the number of agents increases and, consequently, the volume of incoming operational data increases as well. You will also see better reporting performance if the Data Warehouse and Reporting servers are separated.

  • Use a battery-backed write-caching disk controller--Testing has shown that the workload on the Data Warehouse benefits from write caching on disk controllers. When configuring read caching versus write caching on disk controllers, allocating 100 percent of the cache to write caching is recommended. When using write-caching disk controllers with any database system, it is important to ensure they have a proper battery backup system to prevent data loss in the event of an outage.

Management Server Guidelines and Best Practices

Agents communicate only with management servers. The largest portion of the load on a management server is from the collection of operational data and the insertion of that data into the OperationsManager and Data Warehouse databases. It is important to note that management servers perform these operations directly without depending on the RMS. Management servers perform most of the data queuing in memory rather than depending on a slower disk, thereby increasing performance. The most important resource for management servers is the CPU, but testing has shown that they typically do not require high-end hardware. Factors that affect load on a management server include the following:

  • Rate of operational data collection--Because operations data collection is the primary activity performed by a management server, this rate has the biggest impact on overall server utilization. However, testing has shown that management servers can typically sustain high rates of operational data processing with low to moderate utilization. The primary factor affecting the rate of operational data collection is which management packs are deployed in the management group.

Following are some best practices when sizing a management server:

  • Do not oversize management server hardware--For most scenarios, using a standard utility server is sufficient for the work performed by a management server. Following the hardware guidelines in this document should be sufficient for most workloads.

  • Do not exceed an agent to management server ratio of about 2000 to 1--Actual server performance will vary based on the volume of operations data collected, but testing has shown that management servers typically do not have issues supporting 2000 agents each with a relatively high volume of incoming operational data. Having 2000 agents per management server is a guideline based on test experience and not a hard limit. You might find that a management server in your environment is able to support a higher or lower number of agents.

  • Use the minimum number of management servers per management group to satisfy redundancy requirements--The main reason for deploying multiple management servers should be to provide for redundancy and disaster recovery rather than scalability. Based on testing, most deployments will not need more than three to five management servers to satisfy these needs.
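
The following sketch combines the two guidelines above: size for roughly 2000 agents per management server, but never drop below a redundancy floor. The redundancy_minimum parameter is an assumption you would set from your own availability requirements.

```python
import math

AGENTS_PER_MANAGEMENT_SERVER = 2000  # guideline based on testing, not a hard limit

def management_servers_needed(agent_count: int, redundancy_minimum: int = 2) -> int:
    """Hypothetical estimate: enough management servers for the agent load,
    but never fewer than the redundancy floor (two to three is typical per this guide)."""
    for_load = math.ceil(agent_count / AGENTS_PER_MANAGEMENT_SERVER)
    return max(for_load, redundancy_minimum)

print(management_servers_needed(3500))  # -> 2
print(management_servers_needed(7500))  # -> 4
```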

Gateway Server Guidelines and Best Practices

Gateway servers relay communications between management servers and agents that lie on opposite sides of Kerberos trust boundaries from each other. The gateway server uses certificate-based authentication to perform mutual authentication with the management server, and it does so using a single connection rather than multiple connections as would be required between the agents and the management server. This makes managing certificate-based authentication to untrusted domains easier and more manageable. Factors that affect load on a gateway server include the following:

  • Rate of operations data collection--The primary factor that influences the load on a gateway is the rate of operations data collection. This rate is a function of the number of agents reporting to the gateway and the management packs deployed within the management group.

Following are some best practices when sizing a gateway server:

  • Gateway servers can be beneficial in managing bandwidth utilization--From a performance perspective, gateways are recommended as a tool to optimize bandwidth utilization in low-bandwidth environments because a gateway performs a level of compression on all communications with the management server.

  • Do not exceed an agent to Gateway Server ratio of about 800 to 1--Testing has shown that having more than 800 agents per gateway can adversely affect the ability to recover in the event of a sustained (multi-hour) outage that causes a gateway to be unable to communicate with the management server. If you need more agents than this to be reporting to a gateway, consider using multiple gateway servers. If you want to exceed 800 agents per gateway, it is highly recommended that you test your system to ensure that the gateway is able to quickly empty its queue after a sustained outage between the gateway and the management server if gateway recovery time is a concern in your environment.

  • For large numbers of gateways and gateway connected agents, use a dedicated management server--Having all gateways connect to a single management server with no other agents connected to it can speed recovery time in the event of a sustained outage.
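
A similar sketch, assuming the roughly 800-to-1 guideline above, estimates how many gateway servers a set of agents across a trust boundary would need.

```python
import math

AGENTS_PER_GATEWAY = 800  # guideline from testing; exceeding it mainly affects outage recovery time

def gateways_needed(untrusted_agent_count: int) -> int:
    """Hypothetical estimate of gateway servers for agents that sit across a
    Kerberos trust boundary, using the ~800:1 guideline above."""
    return max(1, math.ceil(untrusted_agent_count / AGENTS_PER_GATEWAY))

print(gateways_needed(2000))  # -> 3 gateways for 2,000 agents in an untrusted domain
```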

AEM Guidelines and Best Practices

The management server used for AEM receives data from the Error Reporting Client and stores it on a file share. If that file share is local, the additional load affects the management server itself.

Following are some best practices when planning for AEM:

  • Disk storage for the file share can be local or on a Network Attached Storage (NAS) or storage area network (SAN) device.

  • The disk used for AEM should be separate from the disk used for the Data Warehouse or OperationsManager databases.

  • If the storage is set up on a Distributed File System (DFS), DFS replication should be disabled.

  • A gateway server should not be used as an AEM collector.

# of Monitored Devices | Management server for AEM file share
0 to 10,000 | 125 GB of disk as 2-drive RAID 1, 2 GB RAM
10,000 to 25,000 | 250 GB of disk as 2-drive RAID 1, 4 GB RAM
25,000 to 50,000 | 500 GB of disk as 4-drive RAID 0+1, 4 GB RAM
50,000 to 75,000 | 750 GB of disk as 4-drive RAID 0+1, 4 GB RAM
75,000 to 100,000 | 1000 GB of disk as 4-drive RAID 0+1, 8 GB RAM
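
To relate the assumptions stated earlier in this section (roughly one crash per machine per day with an average CAB file size of 500 KB) to the capacities in this table, the following sketch estimates AEM file share growth. The retention period is an assumption you would choose yourself, not a figure from this guide.

```python
# Hypothetical AEM file share growth estimate based on the assumptions stated
# earlier: roughly one crash per machine per day, 500 KB average CAB file size.
CRASHES_PER_MACHINE_PER_DAY = 1
AVERAGE_CAB_SIZE_KB = 500

def aem_share_growth_gb(monitored_machines: int, retention_days: int = 30) -> float:
    """Rough file share space consumed over a chosen retention period
    (retention_days is an assumption for the example)."""
    daily_kb = monitored_machines * CRASHES_PER_MACHINE_PER_DAY * AVERAGE_CAB_SIZE_KB
    return daily_kb * retention_days / (1024 * 1024)

# About 143 GB over 30 days for 10,000 machines, broadly in line with the
# 125 GB guidance in the table above for that population size.
print(round(aem_share_growth_gb(10000), 1))
```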

Collective Client Monitoring Guidelines and Best Practices

Collective Health Monitoring is performed by gathering event and performance data from many machines and aggregating the data based on groups of systems for reporting and analysis. For example, individual memory performance data is gathered from Windows XP and Windows Vista clients on different types of hardware. Collective Health Monitoring aggregates this data and provides reports based on memory performance for specific groups of systems, such as by operating system or by hardware vendor. This makes analysis of overall performance easier than the alternative of digging through long lists of individual system performance reports. Collective monitoring mode also enables alerting and monitoring at a collective, rather than an individual, level.

Collective Client monitoring management packs include the following: Information Worker, Windows Client, Windows XP, Windows Vista, Network Access Protection, and other client-focused management packs.

Each client that is monitored by an agent typically generates summary events periodically, and these events are used to calculate the collective health of the client population. Alerting on the individual agent is disabled, so no alert data is generated by the agents running on the clients.

Depending on the number of management packs deployed and the volume of agent traffic, each management server can manage roughly 3000 to 4000 agent-managed clients.

When planning the rollout of collective monitoring clients, approve the agents in batches of no more than 1000 at a time so that each batch can synchronize with the latest configuration before the next batch is approved.
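
A minimal sketch of that batching approach follows (a hypothetical helper; the batch size of 1000 comes from the guidance above).

```python
from typing import Iterable, Iterator, List

APPROVAL_BATCH_SIZE = 1000  # per the rollout guidance above

def approval_batches(pending_clients: Iterable[str],
                     batch_size: int = APPROVAL_BATCH_SIZE) -> Iterator[List[str]]:
    """Yield client names in batches so each batch can be approved and allowed
    to synchronize its configuration before the next batch is approved."""
    batch: List[str] = []
    for client in pending_clients:
        batch.append(client)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch

# Example: 2,500 pending clients -> three approval waves of 1000, 1000, and 500.
clients = [f"client{i:04d}" for i in range(2500)]
print([len(b) for b in approval_batches(clients)])  # [1000, 1000, 500]
```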

Designing Audit Collection Services

This section provides high-level guidance to help you get started with planning your ACS implementation.

ACS is not a standalone solution; it can be hosted only in an existing management group because its agent is integrated with and installed alongside the Operations Manager agent, and the ACS Collector can be installed only on a management server or gateway server. The remaining components, the ACS database and ACS Reporting, can be installed on the same SQL Server 2005 server or instance as the rest of the Operations Manager database and reporting components. However, for performance, capacity, and security reasons, you will probably choose to install them on dedicated hardware.

Design Decisions

There are four fundamental design decisions to make when planning your ACS implementation. As you make these decisions, keep in mind that there is a one-to-one relationship between the ACS Collector server and its ACS database. An ACS database can have only one ACS Collector feeding data to it at a time, and every ACS Collector needs its own ACS database. It is possible to have multiple ACS Collector/Database pairs in a management group; however, there are no procedures available out of the box for integrating the data from multiple ACS databases into a single database.

The first decision is whether to deploy a management group that exclusively supports ACS or to deploy ACS into a management group that also provides health monitoring and alerting services. Here are the characteristics of these two ACS deployment scenarios:

  • ACS hosted in a production management group scenario:

    • Scaled usage of ACS--Given that ACS collects every security event from the systems that ACS Forwarders are enabled on, use of ACS can generate a huge amount of data. Unless you are using dedicated hardware for the ACS Collector and Database roles, processing this data might negatively affect the performance of the hosting management group, particularly in the database layer.

    • Separate administration and security is not required--Because ACS is hosted in a management group, people with administrative control in the management group will have administrative rights in ACS. If the business, regulatory/audit, and IT requirements mandate that ACS be under non-production IT control, deploying ACS into a production management group is not an option.

  • ACS hosted on a dedicated management group scenario:

    • Separate administration and security is required--If there is a separate administrative group that is responsible for audit and security controls at your company, hosting ACS on a dedicated management group administered by the audit/security group is recommended.

The second decision is whether to deploy ACS Reporting into the same SQL Server 2005 Reporting Services instance that hosts the Operations Manager 2007 Reporting component. Here are the characteristics of these two scenarios:

  • ACS reporting integrated with Operations Manager Reporting:

    • Single console for all reports--When ACS Reporting is installed with Operations Manager Reporting, the ACS reports are accessed via the Operations Manager Operations console.

    • Common security model--When Operations Manager 2007 Reporting is installed into SQL Server 2005 Reporting Services, it overwrites the default security model, replacing it with the Operations Manager role-based security model. ACS Reporting is compatible with this model. All users who have been assigned the Report Operator role will have access to the ACS Reports as long as they also have the necessary permissions on the ACS database.

    Note

    If Operations Manager Reporting is later uninstalled, the original SRS security model must be restored manually using the ResetSRS.exe utility found on the installation media in the SupportTools directory.

  • ACS reporting installed on a dedicated SQL Server Reporting Services instance:

    • Separate console for ACS and Operations Manager reports--When installed on a dedicated SRS instance, the ACS Reports are accessed via the SRS Web site that is created for it at installation. This provides greater flexibility in configuring the folder structure and in using SRS Report designer.

    • Separate security model--A consequence of using a dedicated SRS instance is that you can create security roles as needed to meet the business and IT requirements to control access to the ACS reports. Note that the necessary permissions must still be granted on the ACS database.

The third design decision is how many ACS Collector/Database pairs to deploy to support your environment. The rate of ongoing event collection and insertion that a single ACS Collector/Database pair can support is not an absolute number; it depends on the performance of the storage subsystem that the database server is attached to. For example, a low-end SAN solution can typically support 2500 to 3000 security events per second, and independent of sustained insertion rates, the ACS Collector has been observed supporting bursts of 20,000 security events per second. The following factors affect the number of security events generated per second:

  • Audit Policy Configuration--The more aggressive the audit policy, the greater the number of Security events generated from audited machines.

  • The role of the machine that the ACS forwarder is enabled on--Given the default Audit Policy, domain controllers generate the most security events, member servers generate the next highest amount, and workstations generate the least, as shown in the following table.

Machine Role | Approximate Number of Unfiltered Security Events per Second Generated Under High Load
Domain Controller | 40 events per second
Member Server | 2 events per second
Workstation | 0.2 events per second

  • Using the numbers in the preceding table, a single, high-end ACS Collector/Database pair can support up to 150 Domain Controllers, 3000 Member Servers, or 20,000 Workstations (with the appropriate ACS Collector filter applied). A worked example of this calculation appears after this list.

    Important

    These estimates will be superseded by the ACS sizing and capacity models in SCCP 2007.

  • The amount of user activity on the network--If your network is used by high-end users conducting a large number of transactions, as is experienced for example at Microsoft, more events will be generated. If your network users conduct relatively few transactions, such as might be the case at a retail kiosk or in a warehouse scenario, you should expect fewer security events.

  • The ACS Collector Filter configuration--ACS collects all security events from a monitored machine's security event log. Out of all the events collected, you might be interested in only a smaller subset. ACS provides the ability to filter out the undesired events, allowing only the desired ones to be processed by the Collector and then inserted into the ACS database. As the amount of filtering increases, fewer events will be processed and inserted into the ACS database.
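
As referenced above, here is a minimal sketch of the Collector/Database pair calculation. The per-role event rates come from the preceding table; the sustained insertion capacity per pair is an assumption you would take from your own storage measurements (this guide cites roughly 2500 to 3000 events per second for a low-end SAN, with higher-end storage supporting more).

```python
import math

# Approximate unfiltered Security events per second per machine role, from the
# table above (default audit policy, high load).
EVENTS_PER_SECOND = {"domain_controller": 40.0, "member_server": 2.0, "workstation": 0.2}

def collector_pairs_needed(machine_counts: dict, sustained_eps_per_pair: float = 2750.0) -> int:
    """Rough count of ACS Collector/Database pairs.

    sustained_eps_per_pair is an assumption; measure your own storage subsystem.
    Filtering at the collector reduces the effective event rate.
    """
    total_eps = sum(EVENTS_PER_SECOND[role] * count for role, count in machine_counts.items())
    return max(1, math.ceil(total_eps / sustained_eps_per_pair))

# Example: 50 domain controllers, 1,000 member servers, and 10,000 workstations
# generate about 6,000 events/sec -> 3 Collector/Database pairs at this capacity.
print(collector_pairs_needed({"domain_controller": 50, "member_server": 1000, "workstation": 10000}))
```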

The last design decision is which edition of SQL Server 2005 to use for the ACS database. ACS supports SQL Server 2005 Standard edition and SQL Server 2005 Enterprise edition. The edition used has an impact on how the system behaves during the daily database maintenance window. During the maintenance window, database partitions whose time stamps lie outside the data retention schedule (with 14 days being a typical configuration for data retention) are dropped from the database. If SQL Server 2005 Standard edition is used, Security event insertion halts and events queue up on the ACS Collector until maintenance is completed. If SQL Server 2005 Enterprise edition is used, insertion of processed Security events continues, but at only 30 to 40 percent of the regular rate. This is one reason why you should carefully pick the timeframe for daily database maintenance, selecting a time when there is the least amount of user and application activity on the network.

ACS Guidelines and Best Practices

The overall performance of the ACS system is most affected by the performance of the ACS database and its disk subsystem. This is easy to see, given that many thousands of events per second are inserted continuously, with potential peaks of tens of thousands per second. With a large number of monitored devices, including domain controllers, it is not uncommon to accumulate more than a terabyte of data in the ACS database over a 14-day time span; the sketch after the following list illustrates how quickly this adds up. Following are some best practices for ACS:

  • Use 64-bit hardware and operating system for the Collector and SQL Server, along with a high-performance SAN solution.

  • Separate the database files from the transaction logs.

  • Use dedicated hardware to host ACS if warranted.

  • Use tight filters to reduce the number of noise Security events that get inserted into the database.

  • Plan your Windows Audit policy carefully so that only relevant events are logged on monitored systems.

  • Enable the ACS Forwarder only on necessary systems.

  • Configure Security Event logs with sufficient space so that if communication is lost with the ACS Collector, the Security Event log file will not wrap on itself and overwrite previous events, resulting in a loss of event data.
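
To see how quickly ACS data accumulates, the following sketch estimates database growth for a sustained event rate. The per-event storage size of roughly half a kilobyte is an assumption made for illustration, not a figure from this guide; the event rate used in the example is consistent with the low-end SAN figure cited earlier.

```python
# Hypothetical check of how quickly ACS data accumulates. BYTES_PER_EVENT_ROW is
# an assumption (roughly half a kilobyte per stored event), not a number from
# this guide; adjust it from your own measurements.
BYTES_PER_EVENT_ROW = 512

def acs_database_growth_tb(events_per_second: float, retention_days: int = 14) -> float:
    """Rough ACS database size for a sustained event rate and retention period."""
    total_events = events_per_second * 86_400 * retention_days
    return total_events * BYTES_PER_EVENT_ROW / 1024 ** 4

# ~3,000 events/sec sustained for 14 days -> roughly 1.7 TB under this assumption,
# consistent with the "more than a terabyte" observation above.
print(round(acs_database_growth_tb(3000), 2))
```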