
Exchange 2010 Tested Solutions: 16000 Mailboxes in a Single Site Deployed on IBM and Brocade Hardware

 

Topic Last Modified: 2012-05-10

Rob Simpson, Program Manager, Microsoft Exchange Server; Roland Mueller, Solutions Architect, IBM

December 2010

In Exchange 2010 Tested Solutions, Microsoft and participating server, storage, and network partners examine common customer scenarios and key design decision points facing customers who plan to deploy Microsoft Exchange Server 2010. Through this series of white papers, we provide examples of well-designed, cost-effective Exchange 2010 solutions deployed on hardware offered by some of our server, storage, and network partners.

You can download this document from the Microsoft Download Center.

Microsoft Exchange Server 2010 release to manufacturing (RTM)

Microsoft Exchange Server 2010 with Service Pack 1 (SP1)

Windows Server 2008 R2

Windows Server 2008 R2 Hyper-V


This document provides an example of how to design, test, and validate an Exchange Server 2010 solution for environments with 16,000 mailboxes deployed on the latest IBM System Storage DS5000 series, IBM XIV Storage System, and Brocade ServerIron ADX Series solutions. Following the step-by-step methodology in this document, we walk through the important design decision points that help address the key challenges in designing single-site Exchange 2010 solutions that involve multiple role servers and lagged database copies. After we determine the optimal solution for this customer, the solution undergoes a standard validation process to ensure that it holds up under simulated production workloads for normal operating, maintenance, and failure scenarios.


The following tables summarize the key Exchange and hardware components of this solution.

Exchange components

| Exchange component | Value or description |
| --- | --- |
| Target mailbox count | 16000 |
| Target mailbox size | 750 megabytes (MB) |
| Tiered mailbox size | 120 @ 5 gigabytes (GB); 2400 @ 1 GB; 13480 @ 600 MB |
| Target message profile | 100 messages per day |
| Total database copy count | 3 |
| High availability database copy count | 2 |
| Lagged database copy count | 1 |
| Volume Shadow Copy Service (VSS) backup | None |
| Site resiliency | No |
| Virtualization | No |
| Exchange server count | 4 |
| Physical server count | 4 |

Hardware components

| Hardware component | Value or description |
| --- | --- |
| Server partner | IBM |
| Server model | X3650 M3 |
| Server type | Rack |
| Processor | Intel Xeon X5680 |
| Storage partner | IBM |
| Storage model | XIV Storage System 2810 |
| Storage type | Storage area network (SAN) |
| Disk type | 1 terabyte 7,200 RPM Serial ATA (SATA) 3.5" |


One of the most important first steps in Exchange solution design is to accurately summarize the business and technical requirements that are critical to making the correct design decisions. The following sections outline the customer requirements for this solution.

Determine mailbox profile requirements as accurately as possible because these requirements may impact all other components of the design. If Exchange is new to you, you may have to make some educated guesses. If you have an existing Exchange environment, you can use the Microsoft Exchange Server Profile Analyzer tool to assist with gathering most of this information. The following tables summarize the mailbox profile requirements for this solution.

Mailbox count requirements

| Mailbox count requirements | Value |
| --- | --- |
| Mailbox count (total number of mailboxes including resource mailboxes) | 14500 |
| Projected growth percent (%) in mailbox count (projected increase in mailbox count over the life of the solution) | Approximately 10% |
| Expected mailbox concurrency % (maximum number of active mailboxes at any given time) | 100 |
| Target mailbox count (mailbox count including growth × expected concurrency) | 16000 |
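
The target mailbox count in the preceding table can be reproduced with a short calculation. The following Python sketch is illustrative only: the function names are hypothetical, and rounding up to the next thousand is an assumption made here to match the "approximately 10%" growth figure and the 16000 target.

```python
import math

def target_mailbox_count(current: int, growth_pct: int, concurrency_pct: int) -> int:
    """Mailbox count including growth, scaled by expected concurrency."""
    # Integer arithmetic avoids floating-point rounding artifacts.
    with_growth = current * (100 + growth_pct) // 100
    return with_growth * concurrency_pct // 100

def round_up(value: int, step: int = 1000) -> int:
    """Round up to the next multiple of step (an assumed planning convention)."""
    return math.ceil(value / step) * step

raw = target_mailbox_count(14500, growth_pct=10, concurrency_pct=100)
planned = round_up(raw)
print(raw, planned)
```

With the values from the table, the raw result is 15950 mailboxes, which this sketch rounds up to 16000 for planning.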

Mailbox size requirements

| Mailbox size requirements | Value |
| --- | --- |
| Average mailbox size in MB | 750 MB |
| Tiered mailbox size | Yes: 120 @ 5 GB; 2400 @ 1 GB; 13480 @ 600 MB |
| Average mailbox archive size in MB | 0 |
| Projected growth (%) in mailbox size in MB (projected increase in mailbox size over the life of the solution) | Included |
| Target average mailbox size in MB | 750 MB |

Mailbox profile requirements

| Mailbox profile requirements | Value |
| --- | --- |
| Target message profile (average total number of messages sent plus received per user per day) | 100 messages per day |
| Tiered message profile | No |
| Target average message size in KB | 75 KB |
| % in MAPI cached mode | 100% |
| % in MAPI online mode | 0 |
| % in Outlook Anywhere | 0 |
| % in Microsoft Office Outlook Web App (Outlook Web Access in Exchange 2007 and previous versions) | Limited |
| % in Exchange ActiveSync | 0 |


Understanding the distribution of mailbox users and datacenters is important when making design decisions about high availability and site resiliency.

The following table outlines the geographic distribution of people who will be using the Exchange system.

Geographic distribution of people

| Mailbox user site requirements | Value |
| --- | --- |
| Number of major sites containing mailbox users | 1 |
| Number of mailbox users in site 1 | 16000 |

The following table outlines the geographic distribution of datacenters that could potentially support the Exchange e-mail infrastructure.

Geographic distribution of datacenters

| Datacenter site requirements | Value |
| --- | --- |
| Total number of datacenters | 1 |
| Number of active mailboxes in proximity to datacenter 1 | 16000 |
| Requirement for Exchange to reside in more than one datacenter | No |


It's also important to define server and data protection requirements for the environment because these requirements will support design decisions about high availability and site resiliency.

The following table identifies server protection requirements.

Server protection requirements

| Server protection requirement | Value or description |
| --- | --- |
| Number of simultaneous server or virtual machine (VM) failures within site | 2 |
| Number of simultaneous server or VM failures during site failure | Not applicable |

The following table identifies data protection requirements.

Data protection requirements

| Data protection requirement | Value or description |
| --- | --- |
| Requirement to maintain a backup of the Exchange databases outside of the Exchange environment (for example, third-party backup solution) | No |
| Requirement to maintain copies of the Exchange databases within the Exchange environment (for example, Exchange native data protection) | Yes |
| Requirement to maintain multiple copies of mailbox data in the primary datacenter | Yes |
| Requirement to maintain multiple copies of mailbox data in a secondary datacenter | No |
| Requirement to maintain a lagged copy of any Exchange databases | Yes |
| Lagged copy period in days | 3 |
| Target number of database copies | 3 |
| Deleted Items folder retention window in days | 14 |


This section includes information that isn't typically collected as part of customer requirements, but is critical to both the design and the approach to validating the design.

The following table describes the peak CPU utilization targets for normal operating conditions, and for server failure or server maintenance conditions.

Server utilization targets

| Target server CPU utilization design assumption | Value |
| --- | --- |
| Normal operating for Mailbox servers | <70% |
| Normal operating for Client Access servers | <70% |
| Normal operating for Hub Transport servers | <70% |
| Normal operating for multiple server roles (Client Access, Hub Transport, and Mailbox servers) | <70% |
| Normal operating for multiple server roles (Client Access and Hub Transport servers) | <70% |
| Node failure for Mailbox servers | <80% |
| Node failure for Client Access servers | <80% |
| Node failure for Hub Transport servers | <80% |
| Node failure for multiple server roles (Client Access, Hub Transport, and Mailbox servers) | <80% |
| Node failure for multiple server roles (Client Access and Hub Transport servers) | <80% |


The following tables summarize some data configuration and input/output (I/O) assumptions made when designing the storage configuration.

Data configuration assumptions

| Data configuration assumption | Value or description |
| --- | --- |
| Data overhead factor | 20% |
| Mailbox moves per week | 1% |
| Dedicated maintenance or restore logical unit number (LUN) | No |
| LUN free space | 20% |
| Log shipping compression enabled | Yes |
| Log shipping encryption enabled | Yes |

I/O configuration assumptions

| I/O configuration assumption | Value or description |
| --- | --- |
| I/O overhead factor | 20% |
| Additional I/O requirements | None |


The following section provides a step-by-step methodology used to design this solution. This methodology takes customer requirements and design assumptions and walks through the key design decision points that need to be made when designing an Exchange 2010 environment.

When designing an Exchange 2010 environment, many design decision points for high availability strategies impact other design components. We recommend that you determine your high availability strategy as the first step in the design process. We highly recommend that you review the following information prior to starting this step:


If you have more than one datacenter, you must decide whether to deploy Exchange infrastructure in a single datacenter or distribute it across two or more datacenters. The organization's recovery service level agreements (SLAs) should define what level of service is required following a primary datacenter failure. This information should form the basis for this decision.

*Design Decision Point*

In this solution, the office is located in a single geographic location, and the server infrastructure is located on the premises. There's no budget to maintain infrastructure in a second geographic location, so a site resilient deployment can't be justified. The Exchange 2010 design will be based on a single site deployment with no site resiliency.

Exchange 2010 includes several new features and core changes that, when deployed and configured correctly, can provide native data protection that eliminates the need to make traditional data backups. Backups are traditionally used for disaster recovery, recovery of accidentally deleted items, long term data storage, and point-in-time database recovery. Exchange 2010 can address all of these scenarios without the need for traditional backups:

  • Disaster recovery   In the event of a hardware or software failure, multiple database copies in a DAG enable high availability with fast failover and no data loss. DAGs can be extended to multiple sites and can provide resilience against datacenter failures.
  • Recovery of accidentally deleted items   With the new Recoverable Items folder in Exchange 2010 and the hold policy that can be applied to it, it's possible to retain all deleted and modified data for a specified period of time, so recovery of these items is easier and faster. For more information, see Messaging Policy and Compliance, Understanding Recoverable Items, Understanding Retention Tags and Retention Policies.
  • Long-term data storage   Sometimes, backups also serve an archival purpose. Typically, tape is used to preserve point-in-time snapshots of data for extended periods of time as governed by compliance requirements. The new archiving, multiple-mailbox search, and message retention features in Exchange 2010 provide a mechanism to efficiently preserve data in an end-user accessible manner for extended periods of time. For more information, see Understanding Personal Archives, Understanding Multi-Mailbox Search, and Understanding Retention Tags and Retention Policies.
  • Point-in-time database snapshot   If a past point-in-time copy of mailbox data is a requirement for your organization, Exchange provides the ability to create a lagged copy in a DAG environment. This can be useful in the rare event that there's a logical corruption that replicates across the databases in the DAG, resulting in a need to return to a previous point in time. It may also be useful if an administrator accidentally deletes mailboxes or user data.

There are technical reasons and several issues that you should consider before using the features built into Exchange 2010 as a replacement for traditional backups. Prior to making this decision, see Understanding Backup, Restore and Disaster Recovery.

*Design Decision Point*

In this example, maintaining tape backups has been difficult, and testing and validating restore procedures hasn't occurred on a regular basis. Using Exchange native data protection in place of traditional backups as a database resiliency strategy would be an improvement.

Recovering from accidental delete operations is a concern, so a strategy is needed to detect and recover from logical database corruption. Subsequent design decisions will be made based on the assumption that the design won't include a VSS backup solution. Using lagged copies to recover from logical corruption will be considered.

The next important decision when defining your database resiliency strategy is to determine the number of database copies to deploy. We strongly recommend deploying a minimum of three copies of a mailbox database before eliminating traditional forms of protection for the database, such as Redundant Array of Independent Disks (RAID) or traditional VSS-based backups.

For additional information, see Understanding Mailbox Database Copies.

*Design Decision Point*

In a previous step, it was decided not to deploy a VSS-based backup solution. Therefore, the design should have a minimum of three copies of each database.

There are two types of database copies:

  • High availability database copy   This database copy is configured with a replay lag time of zero. As the name implies, high availability database copies are kept up-to-date by the system, can be automatically activated by the system, and are used to provide high availability for mailbox service and data.
  • Lagged database copy   This database copy is configured to delay transaction log replay for a period of time. Lagged database copies are designed to provide point-in-time protection, which can be used to recover from store logical corruptions, administrative errors (for example, deleting or purging a disconnected mailbox), and automation errors (for example, bulk purging of disconnected mailboxes).
    With logical store corruptions, data is added, deleted, or manipulated in a way that the user doesn't expect. These cases generally involve third-party applications or accidental deletions caused by administrator error or poorly written scripts. Although it appears as corruption to the user, the Exchange store considers the transaction that produced the logical corruption to be a series of valid MAPI operations. Because lagged database copies are deployed to mitigate operational risks, lagged database copies shouldn't be activated (and typically aren't activated due to the best copy selection process, which is described in Understanding Active Manager). After activation, a mount request is issued, and log replay begins replaying all required log files to bring the database up-to-date and place it in a clean shutdown state, thus losing the point-in-time recovery capability. For more information about how to block activation on Mailbox servers or suspend database copies to prevent a database copy, such as a lagged database copy, from being automatically activated, see Set-MailboxServer and Suspend-MailboxDatabaseCopy.
    If you choose to use lagged copies, be aware of the following implications for their use:
    • Unlike standby continuous replication (SCR) in Microsoft Exchange Server 2007, which had a hard-coded replay lag of 50 log files, there's no hard-coded number of lagged log files. Instead, the replay lag time is an administrator-configured value, and by default, it's disabled.
    • The replay lag time setting has a default setting of 0 days, and a maximum setting of 14 days.
    • Lagged copies aren't considered highly available copies. Instead, they are designed for disaster recovery purposes, to protect against store logical corruption.
    • The greater the replay lag time, the longer the database recovery process. Depending on the number of log files that need to be replayed during recovery, and the speed at which your hardware can replay them, it may take several hours or more to recover a database.
    • We recommend that you determine whether lagged copies are critical for your overall disaster recovery strategy. If using them is critical to your strategy, we recommend using multiple lagged copies, or using RAID to protect a single lagged copy if you don't have multiple lagged copies. That way, if you lose a disk or if corruption occurs, you don't lose your lagged point in time.
    • Lagged copies aren't patchable with the Extensible Storage Engine (ESE) single page restore feature. If a lagged copy encounters database page corruption (for example, a -1018 error), it will have to be reseeded (which will lose the lagged aspect of the copy).

*Design Decision Point*

In this example, logical corruption occurred in the past, due to an issue with a third-party archival application, and a poorly written script resulted in unwanted bulk mailbox deletion. Recovering from these types of occurrences in the future is important. It's difficult to detect logical corruption and determine when the corruption occurred. When recovering from logical corruption, messaging data delivered or modified after the corruption occurred may be lost. Although the Deleted Items folder retention feature can be used to recover deleted content or entire mailboxes, a point-in-time recovery is preferred.

The decision is to use one of the three database copies as a lagged copy. With this decision, three high availability database copies are no longer available. With no VSS-backup solution in place, we recommend that a fourth copy be added to provide three high availability copies (recommended for customers using Exchange native data protection). If this were a site resilient deployment, a fourth copy would be required to maintain two high availability copies in the primary datacenter.

Because this is a single site deployment, a fourth copy doesn't provide any site resiliency value. A redundant SAN solution can be used as the storage back-end server. If a fourth copy were added, it would likely exist on the same physical storage device as one of the other copies. A fourth copy doesn't provide any additional value in protecting against a physical array failure. When considering costs vs. benefits of adding a fourth copy, a decision is made to use two high availability copies and one lagged copy for this solution.

*New Feature Alert*

New database repair cmdlets are introduced in Exchange 2010 Service Pack 1 (SP1) that can help detect and fix logical corruption. This feature further reduces the need to deploy lagged database copies. For more information, see New-MailboxRepairRequest.

If you deploy Exchange 2010 SP1, you may decide that a lagged copy is no longer needed. In this case, a lagged copy can be converted to a high availability copy by setting the replay lag time to 0. For information about optimal database layouts in a high availability three copy configuration, see Solution Variation later in this document.

Replay lag time is a property of a mailbox database copy that specifies the amount of time, in minutes, to delay log replay for the database copy. The replay lag timer starts when a log file has been replicated to the passive copy and has successfully passed inspection. By delaying the replay of logs to the database copy, you have the capability to recover the database to a specific point in time in the past.

When determining your point-in-time recovery objectives and the corresponding duration for the replay lag time, consider the following:

  • Detection   How long will it take for your organization to be able to detect and initiate recovery from an event that results in store logical corruption?
  • Recovery time   What does your organization consider to be an acceptable amount of time to replay logs forward and restore data to high availability databases?
  • Data loss   What is an acceptable amount of data loss should the point-in-time recovery require you to abandon any data received or changed since the time of failure?

*Design Decision Point*

In this example, operating policy requires that changes to the Exchange environment that aren't time sensitive occur from 21:00 to 00:00 (midnight) on Friday evenings. This allows a recovery window of 48 hours over the weekend, should a recovery from a change-related issue be needed. Store logical corruption isn't easily detected by an Exchange administrator. More likely, mailbox users detect and report these types of issues. Any major issues that require a point-in-time recovery are likely to be identified by mailbox users during the next regular work day. The replay lag time will be set to 72 hours, which provides until Monday at 21:00 to detect and initiate a point-in-time recovery. The time required to replay 72 hours of logs, when 60 of those hours occur during an off-peak weekend period, should be quick. If the corruption is identified early on Monday morning and the point-in-time recovery requires loss of data, that data loss should also be minimized.
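
The timeline behind this decision can be checked with simple date arithmetic. The following Python sketch is illustrative only: the calendar date is an arbitrary Friday chosen for the example, and the 09:00 Monday work-day start is an assumption that isn't stated in the requirements.

```python
from datetime import datetime, timedelta

# Replay lag chosen in this design decision.
REPLAY_LAG = timedelta(hours=72)

# Hypothetical change window start: a Friday at 21:00 (2010-12-03 was a Friday).
change_start = datetime(2010, 12, 3, 21, 0)

# Latest moment at which the lagged copy still predates the change.
recovery_deadline = change_start + REPLAY_LAG

# Assumed start of the Monday work day (09:00); an assumption for illustration.
monday_start = datetime(2010, 12, 6, 9, 0)
off_peak_hours = (monday_start - change_start) / timedelta(hours=1)

# Replay lag expressed in minutes.
lag_minutes = REPLAY_LAG // timedelta(minutes=1)

print(recovery_deadline, off_peak_hours, lag_minutes)
```

Under these assumptions the deadline falls on Monday at 21:00, and 60 of the 72 hours elapse before the Monday work day begins.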

A DAG is the base component of the high availability and site resilience framework built into Exchange 2010. A DAG is a group of up to 16 Mailbox servers that hosts a set of replicated databases and provides automatic database-level recovery from failures that affect individual servers or databases.

A DAG is a boundary for mailbox database replication, database and server switchovers and failovers, and for an internal component called Active Manager. Active Manager is an Exchange 2010 component, which manages switchovers and failovers. Active Manager runs on every server in a DAG.

From a planning perspective, you should try to minimize the number of DAGs deployed. You should consider going with more than one DAG if:

  • You deploy more than 16 Mailbox servers.
  • You have active mailbox users in multiple sites (active/active site configuration).
  • You require separate DAG-level administrative boundaries.
  • You have Mailbox servers in separate domains. (DAG is domain bound.)

*Design Decision Point*

Because no site resiliency is required, fewer than 16 Mailbox servers will likely be needed. Because there are no special requirements that warrant more than one DAG, the design will have a single DAG.

Exchange 2010 has been re-engineered for mailbox resiliency. Automatic failover protection is now provided at the mailbox database level instead of at the server level. You can strategically distribute active and passive database copies to Mailbox servers within a database availability group (DAG). Determining how many database copies you plan to activate on a per-server basis is a key aspect to Exchange 2010 capacity planning. There are different database distribution models that you can deploy, but generally we recommend one of the following:

  • Design for all copies activated   In this model, the Mailbox server role is sized to accommodate the activation of all database copies on the server. For example, a Mailbox server may host four database copies. During normal operating conditions, the server may have two active database copies and two passive database copies. During a failure or maintenance event, all four database copies would become active on the Mailbox server. This solution is usually deployed in pairs. For example, if deploying four servers, the first pair is servers MBX1 and MBX2, and the second pair is servers MBX3 and MBX4. In addition, when designing for this model, you will size each server for no more than 40 percent of available resources during normal operating conditions. In a site resilient deployment with three database copies and six servers, this model can be deployed in sets of three servers, with the third server residing in the secondary datacenter. This model provides a three-server building block for solutions using an active/passive site resiliency model.
    This model can be used in the following scenarios:
    • Active/Passive multisite configuration where failure domains (for example, racks, blade enclosures, and storage arrays) require easy isolation of database copies in the primary datacenter
    • Active/Passive multisite configuration where anticipated growth may warrant easy addition of logical units of scale
    • Configurations that aren't required to survive the simultaneous loss of any two Mailbox servers in the DAG
    • Configurations with only two high availability database copies
    This model requires servers to be deployed in pairs for single site deployments and sets of three for multisite deployments. The following table illustrates a sample database layout for this model.
    Mailbox server resiliency strategy
    In the preceding table, the following applies:
    • C1 = active copy (activation preference value of 1) during normal operations
    • C2 = passive copy (activation preference value of 2) during normal operations
    • C3 = passive copy (activation preference value of 3) during site failure event
  • Design for targeted failure scenarios   In this model, the Mailbox server role is designed to accommodate the activation of a subset of the database copies on the server. The number of database copies in the subset will depend on the specific failure scenario that you're designing for. The main goal of this design is to evenly distribute active database load across the remaining Mailbox servers in the DAG.
    This model should be used in the following scenarios:
    • All single site configurations with three or more database copies
    • Configurations required to survive the simultaneous loss of any two Mailbox servers in the DAG
    The DAG design for this model requires between 3 and 16 Mailbox servers. The following table illustrates a sample database layout for this model.
    Mailbox server resiliency strategy
    In the preceding table, the following applies:
    • C1 = active copy (activation preference value of 1) during normal operations
    • C2 = passive copy (activation preference value of 2) during normal operations
    • C3 = passive copy (activation preference value of 3) during normal operations

*Design Decision Point*

In a previous step, it was decided to deploy one of the three database copies as a lagged copy. With only two high availability copies, the design for all copies activated Mailbox server resiliency strategy is generally a good fit.

Design decisions about database copy layout for this model are discussed later in the document.
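
As a rough illustration of the pairing described for the design for all copies activated model, the following Python sketch lays out databases across server pairs and counts the active copies on each server before and after a server failure. The server names follow the MBX1 through MBX4 example in the text. The database names, the figure of four databases per pair, and both helper functions are hypothetical. Only the two high availability copies are modeled, and the lagged copy is left out.

```python
def pair_layout(servers, dbs_per_pair):
    """Map each database to (active_server, passive_server) within a server pair."""
    layout = {}
    db_number = 1
    for i in range(0, len(servers), 2):
        first, second = servers[i], servers[i + 1]
        for j in range(dbs_per_pair):
            # Alternate the active copy so each server in the pair gets half.
            active, passive = (first, second) if j % 2 == 0 else (second, first)
            layout[f"DB{db_number}"] = (active, passive)
            db_number += 1
    return layout

def active_counts(layout, failed=()):
    """Count active databases per surviving server after failing over."""
    counts = {}
    for active, passive in layout.values():
        host = passive if active in failed else active
        if host in failed:
            continue  # both copies are on failed servers
        counts[host] = counts.get(host, 0) + 1
    return counts

layout = pair_layout(["MBX1", "MBX2", "MBX3", "MBX4"], dbs_per_pair=4)
print(active_counts(layout))                   # normal operation
print(active_counts(layout, failed={"MBX1"}))  # MBX1 is down
```

In this sketch each server hosts two active copies during normal operation, and the partner of a failed server hosts four. If load scales in proportion to active copies, which is a simplifying assumption, a server sized at 40 percent for two active copies would run at roughly 80 percent with four.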

The number of Mailbox servers required to support the workload and the minimum number of Mailbox servers required to support the DAG design may be different. In this step, a preliminary result is obtained. The final number of Mailbox servers will be determined in a later step.

*Design Decision Point*

This example uses three database copies (two high availability copies and one lagged copy). To support three copies, a minimum of three Mailbox servers in the DAG is required. The design for all copies activated Mailbox server resiliency model is selected. Because this model deploys servers in pairs, we recommend that a minimum of four Mailbox servers be deployed to support the DAG design.


Many factors influence the storage capacity requirements for the Mailbox server role. For additional information, we recommend that you review Understanding Mailbox Database and Log Capacity Factors.

The following steps outline how to calculate mailbox capacity requirements. These requirements will then be used to make decisions about which storage solution options meet the capacity requirements. A later section covers additional calculations required to properly design the storage layout on the chosen storage platform.

Microsoft has created a Mailbox Server Role Requirements Calculator that will do most of this work for you. To download the calculator, see E2010 Mailbox Server Role Requirements Calculator. For additional information about using the calculator, see Exchange 2010 Mailbox Server Role Requirements Calculator.

Before attempting to determine what your total storage requirements are, you should know what the mailbox size on disk will be. A full mailbox with a 1-GB quota requires more than 1 GB of disk space because you have to account for the prohibit send/receive limit, the number of messages the user sends or receives per day, the Deleted Items folder retention window (with or without calendar version logging and single item recovery enabled), and the average database daily variations per mailbox. The Mailbox Server Role Requirements Calculator does these calculations for you. You can also use the following information to do the calculations manually.

The following calculations are used to determine the mailbox size on disk for the three mailbox tiers in this solution:

  • Tier 1 (600 MB mailbox quota, 100 messages per day message profile, 75 KB average message size)
    • Whitespace = 100 messages per day × 75 ÷ 1024 MB = 7.3 MB
    • Dumpster = (100 messages per day × 75 ÷ 1024 MB × 14 days) + (600 MB × 0.012) + (600 MB × 0.058) = 144 MB
    • Mailbox size on disk = mailbox limit + whitespace + dumpster
      = 600 MB + 7.3 MB + 144 MB
      = 751 MB
  • Tier 2 (1024 MB mailbox quota, 100 messages per day message profile, 75 KB average message size)
    • Whitespace = 100 messages per day × 75 ÷ 1024 MB = 7.3 MB
    • Dumpster = (100 messages per day × 75 ÷ 1024 MB × 14 days) + (1024 MB × 0.012) + (1024 MB × 0.058) = 174 MB
    • Mailbox size on disk = mailbox limit + whitespace + dumpster
      = 1024 MB + 7.3 MB + 174 MB
      = 1205 MB
  • Tier 3 (5120 MB mailbox quota, 100 messages per day message profile, 75 KB average message size)
    • Whitespace = 100 messages per day × 75 ÷ 1024 MB = 7.3 MB
    • Dumpster = (100 messages per day × 75 ÷ 1024 MB × 14 days) + (5120 MB × 0.012) + (5120 MB × 0.058) = 461 MB
    • Mailbox size on disk = mailbox limit + whitespace + dumpster
      = 5120 MB + 7.3 MB + 461 MB
      = 5588 MB
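
The per-tier calculations above can be expressed as one function. The following Python sketch is illustrative only: the function name and parameter names are hypothetical, and the points at which intermediate values are rounded are an assumption chosen to mirror the figures shown in the list.

```python
def mailbox_size_on_disk(quota_mb, msgs_per_day=100, avg_msg_kb=75, retention_days=14):
    """Return (whitespace, dumpster, size_on_disk) in MB for one mailbox."""
    # Whitespace: one day of message traffic, rounded to 0.1 MB as in the text.
    whitespace = round(msgs_per_day * avg_msg_kb / 1024, 1)
    # Dumpster: retained deleted items plus the 1.2% and 5.8% quota terms.
    dumpster = round(whitespace * retention_days + quota_mb * 0.012 + quota_mb * 0.058)
    return whitespace, dumpster, round(quota_mb + whitespace + dumpster)

for tier, quota in (("Tier 1", 600), ("Tier 2", 1024), ("Tier 3", 5120)):
    whitespace, dumpster, on_disk = mailbox_size_on_disk(quota)
    print(f"{tier}: whitespace={whitespace} MB, dumpster={dumpster} MB, on disk={on_disk} MB")
```

Changing the retention window or message profile arguments shows how those requirements affect the on-disk size of each tier.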

In this step, the high level storage capacity required for all mailbox databases is determined. The calculated capacity includes database size, catalog index size, and 20 percent free space.

To determine the storage capacity required for all databases, use the following formulas:

  • Tier 1 (600 MB mailbox quota, 100 messages per day message profile, 75 KB average message size)
    • Database size = (number of mailboxes × mailbox size on disk × database overhead growth factor) × (20% data overhead)
      = (13480 × 751 × 1) × 1.2
      = 12148176 MB
      = 11863 GB
    • Database index size = 10% of database size
      = 1186 GB
    • Total database capacity = (database size + index size) ÷ 0.80 to add 20% volume free space
      = (11863 + 1186) ÷ 0.8
      = 16312 GB
  • Tier 2 (1024 MB mailbox quota, 100 messages per day message profile, 75 KB average message size)
    • Database size = (number of mailboxes × mailbox size on disk × database overhead growth factor) × (20% data overhead)
      = (2400 × 1205 × 1) × 1.2
      = 3470400 MB
      = 3389 GB
    • Database index size = 10% of database size
      = 339 GB
    • Total database capacity = (database size + index size) ÷ 0.80 to add 20% volume free space
      = (3389 + 339) ÷ 0.8
      = 4660 GB
  • Tier 3 (5120 MB mailbox quota, 100 messages per day message profile, 75 KB average message size)
    • Database size = (number of mailboxes × mailbox size on disk × database overhead growth factor) × (20% data overhead)
      = (120 × 5588 × 1) × 1.2
      = 804672 MB
      = 786 GB
    • Database index size = 10% of database size
      = 79 GB
    • Total database capacity = (database size + index size) ÷ 0.80 to add 20% volume free space
      = (786 + 79) ÷ 0.8
      = 1081 GB
  • Total database capacity for all tiers = 16312 + 4660 + 1081
      = 22053 GB
      = 21.5 terabytes
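
The per-tier database capacity formulas can be sketched as follows. This is an illustrative Python sketch under the assumptions stated in the formulas above (growth factor of 1, 20 percent data overhead, 10 percent index, 20 percent volume free space); the names `TIERS` and `database_capacity_gb` are hypothetical.

```python
TIERS = {
    # tier: (number of mailboxes, mailbox size on disk in MB)
    "Tier 1": (13480, 751),
    "Tier 2": (2400, 1205),
    "Tier 3": (120, 5588),
}


def database_capacity_gb(mailboxes, size_on_disk_mb, growth_factor=1.0,
                         data_overhead=0.20, index_ratio=0.10, free_space=0.20):
    """Database plus content index, grossed up for volume free space, in GB."""
    db_gb = mailboxes * size_on_disk_mb * growth_factor * (1 + data_overhead) / 1024
    index_gb = db_gb * index_ratio
    return (db_gb + index_gb) / (1 - free_space)


capacities = {name: database_capacity_gb(*values) for name, values in TIERS.items()}
total_gb = sum(capacities.values())
for name, gb in capacities.items():
    print(f"{name}: {gb:.0f} GB")
print(f"Total: {total_gb:.0f} GB ({total_gb / 1024:.1f} terabytes)")
```

Because the sketch carries unrounded intermediate values, an individual tier can differ from the figures above by about 1 GB; the total still rounds to 22053 GB.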

To ensure that the Mailbox server doesn't sustain any outages as a result of space allocation issues, the transaction log volumes also need to be sized to accommodate all of the logs that will be generated during the backup set. Because this architecture uses the mailbox resiliency and single item recovery features as the backup architecture, the log capacity should allow for three times the daily log generation rate in case a failed copy isn't repaired for three days. (Any failed copy prevents log truncation from occurring.) If the server isn't back online within three days, temporarily remove the copy to allow truncation to occur.

For deployments with lagged copies, the log capacity should be designed to accommodate the log replay lag time. In this solution, the log replay lag time is 72 hours. Log capacity should be sized to accommodate three days of log files.

To determine the storage capacity required for all transaction logs, use the following formulas:

  • Tier 1 (600 MB mailbox quota, 100 messages per day message profile, 75 KB average message size)
    • Log files size = (log file size × number of logs per mailbox per day × log replay lag time in days × number of mailbox users) + (1% mailbox move overhead)
      = (1 MB × 20 × 3 × 13480) + (13480 × 0.01 × 600 MB)
      = 889680 MB = 869 GB
    • Total log capacity = log files size ÷ 0.80 to add 20% volume free space
      = 869 ÷ 0.80
      = 1086 GB
  • Tier 2 (1024 MB mailbox quota, 100 messages per day message profile, 75 KB average message size)
    • Log files size = (log file size × number of logs per mailbox per day × log replay lag time in days × number of mailbox users) + (1% mailbox move overhead)
      = (1 MB × 20 × 3 × 2400) + (2400 × 0.01 × 1024 MB)
      = 168576 MB
      = 165 GB
  • Tier 3 (5120 MB mailbox quota, 100 messages per day message profile, 75 KB average message size)
    • Log files size = (log file size × number of logs per mailbox per day × log replay lag time in days × number of mailbox users) + (1% mailbox move overhead)
      = (1 MB × 20 × 3 × 120) + (120 × 0.01 × 5120 MB)
      = 13344 MB
      = 13 GB
  • Total log capacity for all tiers = 1264 GB
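
The log sizing formula can be sketched as below for Tier 1 and Tier 3. This is a minimal Python sketch, assuming the values used in the formulas above (1 MB log files, 20 logs per mailbox per day, a three-day replay lag, and a 1 percent mailbox move overhead based on the mailbox quota); the function name and parameter names are illustrative.

```python
def log_files_size_mb(mailboxes, quota_mb, logs_per_mailbox_per_day=20,
                      lag_days=3, log_file_mb=1, move_overhead=0.01):
    """Logs retained for the lag period plus logs generated by mailbox moves."""
    generated = log_file_mb * logs_per_mailbox_per_day * lag_days * mailboxes
    move_logs = mailboxes * move_overhead * quota_mb
    return generated + move_logs


tier1_mb = log_files_size_mb(13480, 600)
tier3_mb = log_files_size_mb(120, 5120)

# Gross up Tier 1 so that 20 percent of the log volume stays free.
tier1_with_free_space_gb = tier1_mb / 1024 / 0.80

print(f"Tier 1: {tier1_mb:.0f} MB = {tier1_mb / 1024:.0f} GB")
print(f"Tier 1 with free space: {tier1_with_free_space_gb:.0f} GB")
print(f"Tier 3: {tier3_mb:.0f} MB = {tier3_mb / 1024:.0f} GB")
```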

The following table summarizes the high level storage capacity requirements for this solution. In a later step, you will use this information to make decisions about which storage solution to deploy. You will then take a closer look at specific storage requirements in later steps.

Summary of storage capacity requirements

Disk space requirements Value

Average mailbox size on disk (MB)

855

Database capacity required (GB)

22053

Log capacity required (GB)

1264

Total capacity required (GB)

23317

Total capacity required for 3 database copies (GB)

69951

Total capacity required for 3 database copies (terabytes)

68

The high level storage capacity requirements are approximately 70 terabytes. When choosing a storage solution, ensure that the solution meets this capacity requirement.
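
As a quick check, the summary values can be recomputed from the database and log totals. The following Python sketch only restates the arithmetic in the table; the variable names are illustrative.

```python
database_gb = 22053   # database capacity required
log_gb = 1264         # log capacity required
copies = 3            # database copies in this design

total_gb = database_gb + log_gb
total_all_copies_gb = total_gb * copies
total_all_copies_tb = total_all_copies_gb / 1024

print(total_gb, total_all_copies_gb, round(total_all_copies_tb))
```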

When designing an Exchange environment, you need an understanding of database and log performance factors. We recommend that you review Understanding Database and Log Performance Factors.

Because it's one of the key transactional I/O metrics needed for adequately sizing storage, you should understand the amount of database I/O per second (IOPS) consumed by each mailbox user. Pure sequential I/O operations aren't factored into the IOPS per Mailbox server calculation because storage subsystems can handle sequential I/O much more efficiently than random I/O. These sequential operations include background database maintenance, log transactional I/O, and log replication I/O. In this step, you calculate the total IOPS required to support all mailbox users, using the following:

  • Total required IOPS = IOPS per mailbox user × number of mailboxes × I/O overhead factor
  • Estimated IOPS per mailbox user = 0.10
    Note:
    To determine the IOPS profile for a different message profile, see the table "Database cache and estimated IOPS per mailbox based on message activity" in Understanding Database and Log Performance Factors.
  • Number of mailbox users = 16000
  • I/O overhead factor = 20%
  • Total required IOPS = 0.10 × 16000 × 1.20 = 1920

The high level storage IOPS requirement is approximately 1,920 IOPS. When choosing a storage solution, ensure that the solution meets this requirement.
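
The IOPS calculation above can be expressed as a small function. This is a minimal Python sketch; the function name is illustrative, and the 20 percent overhead factor is the one used in this step.

```python
def total_required_iops(iops_per_mailbox, mailboxes, overhead=0.20):
    """Transactional database IOPS for all mailboxes, including the overhead factor."""
    return iops_per_mailbox * mailboxes * (1 + overhead)


required_iops = total_required_iops(0.10, 16000)
print(round(required_iops))
```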

Exchange 2010 includes improvements in performance, reliability, and high availability that enable organizations to run Exchange on a wide range of storage options.

When examining the storage options available, being able to balance the performance, capacity, manageability, and cost requirements is essential to achieving a successful storage solution for Exchange.

For more information about choosing a storage solution for Exchange 2010, see Mailbox Server Storage Design.

A number of server models on the market today support from 8 through 16 internal disks. These servers are a fit for some Exchange deployments and provide a solid solution at a low price point. If your storage capacity and I/O requirements are met with internal storage and you don't have a specific requirement to use external storage, you should consider using server models with internal disks for Exchange deployments. If your storage and I/O requirements are higher or your organization has an existing investment in SANs, you should examine larger external direct-attached storage (DAS) or SAN solutions.

*Design Decision Point*

In this example, the current Exchange environment is deployed on SAN infrastructure, and SAN is used for storing all of the data in the environment. A SAN solution will continue to be used for Exchange 2010 deployment.

Use the following steps to choose a storage solution.

In this example, the preferred storage vendor is IBM. IBM provides high quality, reliable, and cost effective storage solutions, enhanced by years of experience in the industry.

IBM has two storage platforms that could meet the requirements of this Exchange environment.

Option 1: IBM System Storage DS5000 series (DS5100/DS5300)

The DS5000 series is designed to support transactional applications such as databases, enterprise mail, and online transaction processing (OLTP); throughput-intensive applications such as high-performance computing (HPC) and rich media; and concurrent workloads for consolidation and virtualization.

The DS5000 series storage system offers multi-dimensional scalability, which enables the DS5000 to extend beyond the normal three-year to four-year life cycle, protecting your storage investment by delaying (or even eliminating) the expense of migrating data to a new system, and allowing you to amortize acquisition costs over extended periods of time. This life cycle longevity enables the DS5000 series to continue delivering value long after other systems have been retired.

The DS5000 series drive-level encryption offers affordable data security with no performance penalty. The DS5000 series multiple replication options, drive level encryption, and persistent cache backup can help ensure that any data in cache is captured and safe in the event of a power outage.

For more information about the IBM System Storage DS5000 series, see IBM System Storage DS5000 series.

Option 2: IBM XIV Storage System (Model 2810)

As one of the newer members of the IBM family of storage systems, the IBM XIV Storage System is an enterprise-class array, offering superior performance, solid protection, and ease of management.

The XIV provides reliability. By design, the XIV Storage System architecture is considered self-healing because data is automatically checked for integrity in the background and mirrored to provide data redundancy. Before data is subject to loss, the valid data is already mirrored. Should the XIV lose a 1 terabyte disk, the system triggers an event that alerts IBM support while it rebuilds the faulty disk. The disk rebuild process is handled transparently without user intervention and can finish in as few as 30 minutes when the XIV capacity is 100 percent utilized.

Consistent performance is maintained even during the loss of a hard disk. Unlike on competing storage systems, rebuilding after the loss of one disk doesn't degrade performance.

Should a storage subcomponent fail on a monolithic storage system, about 50 percent of performance (CPU and read/write cache) is lost. This not only impacts productivity, but could also result in lost revenue. Losing an entire data module on the XIV Storage System would temporarily reduce performance by only about 1/15th.

To maintain data redundancy, all data on the storage and in the memory cache is mirrored. The mirrored copy of cache data is never stored in the same data module as the original. This not only provides data cache protection, but also eliminates loss of data in case a data module's power is lost or interrupted.

Each data module is supplied with two field replaceable, hot-swappable power supplies, which are standard equipment in all data modules.

Three UPS devices provide interim power in the event of interrupted or lost main power. Enough power is provided to allow safe cache destaging and a controlled shutdown of the XIV Storage System.

Another innovative feature is the ability to consume disk space only as actual data is written to the volume. Although a logical unit number (LUN) may be allocated 100 GB of capacity, if only 10 GB is in actual use, only 10 GB is physically allocated and reported. This concept is referred to as a volume's hard or physical utilization. This can result in significant savings in both space and cost. This differs from traditional storage systems, in which disk space is flagged as 0s and marked as used storage. Traditional storage systems quickly outgrow space by immediately pre-allocating storage due to the inefficient method of marking free space as reserved. On the XIV Storage System, unused space is tracked by the system as free space and can be used for other hosts, maximizing overall storage utilization.

For more information about the IBM XIV Storage System, see IBM XIV Storage System.

In this solution, the XIV Storage System 2810 is selected as the storage solution. The XIV provides advanced features, like ease of provisioning, capacity and path virtualization, self-healing and tuning, snapshot efficiency, and built-in performance tools, which all benefit Exchange 2010 implementations.

Other key features that set the XIV apart from the DS5100/DS5300 are the ease of management and the quick rebuild times in the event of disk or module failures.

The XIV system's highly intuitive GUI simplifies day-to-day storage administration, enabling most tasks to be executed in just a few clicks. The result is a reduction in the time required to provision storage, resize volumes, create storage pools, and even take snapshots. In most cases, the reduction is from hours and days to a few minutes or less. Ease of use helps reduce costs by requiring fewer IT individuals to handle storage management tasks and data migration tasks. Because there isn't a dedicated SAN support team in this example, easy SAN management is a high priority.

In previous steps, it was determined that the storage solution needed to support 70 terabytes and 1,920 IOPS. A fully populated XIV with 1 terabyte disks provides 79 terabytes of usable storage and over 8,000 IOPS. A single XIV will meet the IOPS and capacity requirements of this solution. However, without any VSS-based backup solution in place, it's a best practice to have the database copies spread across multiple physical storage devices so that any issues with the array won't result in a full outage of the Exchange service or complete data loss.

*Design Decision Point*

Even though the XIV is a highly available storage array with redundant components, it's preferred not to have all of the Exchange data located on a single hardware storage device. Two XIVs will be purchased, which provides storage redundancy in combination with the database resiliency strategy.

The XIV has two disk options available:

  • 1 terabyte 7,200 rpm SATA
  • 2 terabyte 7,200 rpm SATA

The following table summarizes the capacity options based on using 1 terabyte or 2 terabyte disks. In the table, the usable capacity is the capacity that remains after factoring in disk space used by mirroring for redundancy, spares, and metadata.

Capacity options

Number of modules   Number of disks   Usable terabytes (1 terabyte disks)   Usable terabytes (2 terabyte disks)
6                   72                27                                    55
9                   108               43                                    87
10                  120               50                                    102
11                  132               54                                    111
12                  144               61                                    125
13                  156               66                                    134
14                  168               73                                    149
15                  180               79                                    161

*Design Decision Point*

Because only 70 terabytes of total storage capacity and 35 terabytes of capacity per array are required, there is no need to deploy 2 terabyte disks. At the time of testing, 2 terabyte disks were significantly more expensive than 1 terabyte disks. Therefore, 1 terabyte disks are selected for this solution.

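To meet the capacity requirement of 35 terabytes per storage array, a minimum of nine modules (disk shelves) is required. With 1 terabyte disks, six modules provide only 27 terabytes of usable capacity, whereas nine modules provide 43 terabytes.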
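
The module selection can be sketched as a lookup against the capacity options table. The following Python sketch is illustrative only; the dictionary `USABLE_TB` transcribes the table, and the function name `minimum_modules` is hypothetical.

```python
# Usable capacity in terabytes per module count: (1 terabyte disks, 2 terabyte disks).
USABLE_TB = {
    6: (27, 55), 9: (43, 87), 10: (50, 102), 11: (54, 111),
    12: (61, 125), 13: (66, 134), 14: (73, 149), 15: (79, 161),
}


def minimum_modules(required_tb, disk_size_tb=1):
    """Return the smallest listed module count that meets the capacity requirement."""
    column = 0 if disk_size_tb == 1 else 1
    for modules in sorted(USABLE_TB):
        if USABLE_TB[modules][column] >= required_tb:
            return modules
    raise ValueError("requirement exceeds a fully populated array")


print(minimum_modules(35))  # per-array requirement when two arrays are deployed
print(minimum_modules(70))  # the same data placed on a single array
```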

With the IBM Capacity on Demand purchasing model, you can pay for current capacity requirements, and then add available storage as needed. With this model, you can potentially increase mailbox quotas without having to go through a redesign process to accommodate growth. As long as the average message profile for mailbox users in the environment remains the same, the IOPS requirements will be met as you increase capacity.

*Design Decision Point*

In this example, the IBM Capacity on Demand purchasing model is used to pay for the nine modules, which satisfies the current capacity requirements during the initial deployment. With this model, capacity can be increased to accommodate future growth in mailbox sizes, and cost of that growth can be deferred to future budget cycles.

Choosing the XIV Storage System as the storage platform makes the RAID level decision easier. The XIV uses native data protection built into the storage array, referred to as RAID-X.

Rebuild overhead affects overall performance during component failures. The rebuild overhead of a traditional RAID-10 design is well known to be above 25 percent. (We recommend using 35 percent in the Exchange sizing tool.) For RAID-5, the overhead is worse than 50 percent. The XIV RAID-X architecture has a much smaller performance impact during a rebuild (less than 5 percent).

Today's storage administrators must decide which protection scheme to choose for their data: mirroring or parity-based. The XIV system uses mirroring protection, in which each piece of data is written on two disks. When comparing the XIV system to other systems, keep in mind that the proposed configurations of other systems often involve RAID-5 or even RAID-6 protection, which create several performance problems:

  • Each host write translates into two disk writes and two disk reads (or even three writes and three reads in RAID-6) compared to two disk writes in mirroring.
  • RAID-5 or RAID-6-based rebuild time is much longer, extending the time of reduced performance due to disk rebuild whenever a disk fails.
  • With RAID-5 or RAID-6, upon a rebuild, each read request to the failed area is served through multiple reads and computing an XOR, creating performance overhead.
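
To illustrate the first point, the following Python sketch estimates back-end disk I/O for a given host workload under each protection scheme, using the per-write costs listed above. The 40 percent write fraction is a hypothetical value chosen only for illustration, not a measured Exchange figure, and the function name is illustrative.

```python
# Back-end disk operations caused by one host write, per the list above.
BACKEND_IOS_PER_HOST_WRITE = {
    "mirroring": 2,   # two disk writes
    "RAID-5": 2 + 2,  # two disk writes and two disk reads
    "RAID-6": 3 + 3,  # three disk writes and three disk reads
}


def backend_iops(host_iops, write_fraction, scheme):
    """Estimate disk operations per second; host reads are counted once."""
    writes = host_iops * write_fraction
    reads = host_iops - writes
    return reads + writes * BACKEND_IOS_PER_HOST_WRITE[scheme]


for scheme in BACKEND_IOS_PER_HOST_WRITE:
    # 1920 host IOPS with an assumed (hypothetical) 40 percent write fraction.
    print(scheme, round(backend_iops(1920, 0.4, scheme)))
```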

Because the XIV divides data into 1 MB chunks and spreads them across all drives, it provides rapid self-healing and can rebuild a 1 terabyte disk drive in 40 minutes or less.

Sizing memory correctly is an important step in designing a healthy Exchange environment. We recommend that you review Understanding Memory Configurations and Exchange Performance and Understanding the Mailbox Database Cache.

The Extensible Storage Engine (ESE) uses database cache to reduce I/O operations. In general, the more database cache available, the less I/O generated on an Exchange 2010 Mailbox server. However, there's a point where adding additional database cache no longer results in a significant reduction in IOPS. Therefore, adding large amounts of physical memory to your Exchange server without determining the optimal amount of database cache required may result in higher costs with minimal performance benefit.

The IOPS estimates that you completed in a previous step assume a minimum amount of database cache per mailbox. These minimum amounts are summarized in the table "Estimated IOPS per mailbox based on message activity and mailbox database cache" in Understanding the Mailbox Database Cache.

The following table outlines the database cache per user for various message profiles.

Database cache per user

Messages sent or received per mailbox per day (about 75 KB average message size) Database cache per user (MB)

50

3 MB

100

6 MB

150

9 MB

200

12 MB

In this step, you determine high level memory requirements for the entire environment. In a later step, you use this result to determine the amount of physical memory needed for each Mailbox server. Use the following information:

  • Total database cache = profile specific database cache × number of mailbox users
  • Total database cache = 6 × 16000
    = 96000 MB
    = 94 GB

The total database cache requirements for the environment are 94 GB.
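
The cache requirement follows directly from the per-user table. A minimal Python sketch, with the table transcribed into a hypothetical `CACHE_MB_PER_USER` dictionary:

```python
# Database cache per user in MB, keyed by messages sent or received per day.
CACHE_MB_PER_USER = {50: 3, 100: 6, 150: 9, 200: 12}


def total_database_cache_gb(messages_per_day, mailboxes):
    """Total database cache for all mailboxes, converted from MB to GB."""
    return CACHE_MB_PER_USER[messages_per_day] * mailboxes / 1024


cache_gb = total_database_cache_gb(100, 16000)
print(f"{cache_gb:.0f} GB")
```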

Mailbox server capacity planning has changed significantly from previous versions of Exchange due to the new mailbox database resiliency model provided in Exchange 2010. For additional information, see Mailbox Server Processor Capacity Planning.

In the following steps, you calculate the high level megacycle requirements for active and passive database copies. These requirements will be used in a later step to determine the number of Mailbox servers needed to support the workload. Note that the number of Mailbox servers required also depends on the Mailbox server resiliency model and database copy layout.

Using megacycle requirements to determine the number of mailbox users that an Exchange Mailbox server can support isn't an exact science. A number of factors can result in unexpected megacycle results in test and production environments. Megacycles should only be used to approximate the number of mailbox users that an Exchange Mailbox server can support. It's always better to be conservative rather than aggressive during the capacity planning portion of the design process.

The following calculations are based on published megacycle estimates as summarized in the following table.

Megacycle estimates

Messages sent or received   Megacycles per mailbox,    Megacycles per mailbox,            Megacycles per mailbox,
per mailbox per day         active mailbox database    remote passive mailbox database    local passive mailbox database
50                          1                          0.1                                0.15
100                         2                          0.2                                0.3
150                         3                          0.3                                0.45
200                         4                          0.4                                0.6

In this step, you calculate the megacycles required to support the active database copies, using the following:

  • Tier 1 (600 MB mailbox quota, 100 messages per day message profile, 75 KB average message size)
    • Active mailbox megacycles required = profile specific megacycles × number of mailbox users
      = 2 × 13480
      = 26960
  • Tier 2 (1024 MB mailbox quota, 100 messages per day message profile, 75 KB average message size)
    • Active mailbox megacycles required = profile specific megacycles × number of mailbox users
      = 2 × 2400
      = 4800
  • Tier 3 (5120 MB mailbox quota, 100 messages per day message profile, 75 KB average message size)
    • Active mailbox megacycles required = profile specific megacycles × number of mailbox users
      = 2 × 120
      = 240

Total = 32000 megacycles

In a design with three copies of each database, there is processor overhead associated with shipping logs required to maintain database copies on the remote servers. This overhead is typically 10 percent of the active mailbox megacycles for each remote copy being serviced. Calculate the requirements, using the following:

  • Tier 1 (600 MB mailbox quota, 100 messages per day message profile, 75 KB average message size)
    • Remote copy megacycles required = profile specific megacycles × number of mailbox users × number of remote copies
      = 0.2 × 13480 × 2
      = 5392
  • Tier 2 (1024 MB mailbox quota, 100 messages per day message profile, 75 KB average message size)
    • Remote copy megacycles required = profile specific megacycles × number of mailbox users × number of remote copies
      = 0.2 × 2400 × 2
      = 960
  • Tier 3 (5120 MB mailbox quota, 100 messages per day message profile, 75 KB average message size)
    • Remote copy megacycles required = profile specific megacycles × number of mailbox users × number of remote copies
      = 0.2 × 120 × 2
      = 48

Total = 6400 megacycles

In a design with three copies of each database, there is processor overhead associated with maintaining the local passive copies of each database. In this step, you calculate the high level megacycles required to support local passive database copies. You refine these numbers in a later step so that they match the server resiliency strategy and database copy layout. Calculate the requirements, using the following:

  • Tier 1 (600 MB mailbox quota, 100 messages per day message profile, 75 KB average message size)
    • Passive mailbox megacycles required = profile specific megacycles × number of mailbox users × number of passive copies
      = 0.3 × 13480 × 2
      = 8088
  • Tier 2 (1024 MB mailbox quota, 100 messages per day message profile, 75 KB average message size)
    • Passive mailbox megacycles required = profile specific megacycles × number of mailbox users × number of passive copies
      = 0.3 × 2400 × 2
      = 1440
  • Tier 3 (5120 MB mailbox quota, 100 messages per day message profile, 75 KB average message size)
    • Passive mailbox megacycles required = profile specific megacycles × number of mailbox users × number of passive copies
      = 0.3 × 120 × 2
      = 72

Total = 9600 megacycles

Calculate the total requirements, using the following:

  • Total megacycles required = active mailbox + remote copies + local passive copies
    = 32000 + 6400 + 9600
    = 48000

The total megacycles required to support the environment are approximately 48,000.
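
The active, remote copy, and local passive calculations can be combined in one short script. This is a minimal Python sketch that assumes the published estimates from the megacycle table and two remote and two passive copies per database, as in the formulas above; all names are illustrative.

```python
# Messages per day: (active, remote passive, local passive) megacycles per mailbox.
MEGACYCLES_PER_MAILBOX = {
    50: (1, 0.1, 0.15), 100: (2, 0.2, 0.3), 150: (3, 0.3, 0.45), 200: (4, 0.4, 0.6),
}
TIER_MAILBOXES = {"Tier 1": 13480, "Tier 2": 2400, "Tier 3": 120}


def required_megacycles(mailboxes, messages_per_day=100, remote_copies=2, passive_copies=2):
    """Return (active, remote copy, local passive) megacycles for one tier."""
    active, remote, passive = MEGACYCLES_PER_MAILBOX[messages_per_day]
    return (active * mailboxes,
            remote * mailboxes * remote_copies,
            passive * mailboxes * passive_copies)


per_tier = [required_megacycles(count) for count in TIER_MAILBOXES.values()]
active_total, remote_total, passive_total = (sum(column) for column in zip(*per_tier))
total_megacycles = active_total + remote_total + passive_total

print(round(active_total), round(remote_total), round(passive_total), round(total_megacycles))
```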

In a traditional Exchange deployment, you may deploy the Client Access, Hub Transport, and Mailbox server roles on different physical servers. However, there are reasons why you may want to combine the Client Access and Hub Transport server roles on the same physical server or VM. There are also scenarios where deploying the Client Access, Hub Transport, and Mailbox server roles on the same physical server or VM makes sense. For more information, see Understanding Multiple Server Role Configurations in Capacity Planning.

*Design Decision Point*

In this example, multiple role servers (Client Access, Hub Transport, and Mailbox server roles deployed on the same physical server) will be deployed. This provides a building block unit of scale and removes the complexity of determining the optimal number of individual Client Access and Hub Transport servers to deploy in various server failure scenarios.

Several factors are important when considering server virtualization for Exchange. For more information about supported configurations for virtualization, see Exchange 2010 System Requirements.

The main reasons customers use virtualization with Exchange are as follows:

  • If you expect server capacity to be underutilized and anticipate better utilization, you may purchase fewer servers as a result of virtualization.
  • You may want to use Windows Network Load Balancing when deploying Client Access, Hub Transport, and Mailbox server roles on the same physical server.
  • If your organization is using virtualization in all server infrastructure, you may want to use virtualization with Exchange, to be in alignment with corporate standard policy.

*Design Decision Point*

In this solution, there is no perceived value in adding virtualization. The Client Access, Hub Transport, and Mailbox server roles will be deployed on the same physical server. If each of these server roles were instead deployed in its own VM on the same physical server, a root operating system and three additional operating systems (one for each VM) would need to be managed for each physical server deployed. This would result in additional management overhead with minimal benefit.

You can use the following steps to determine the server model for multiple role servers.

In this solution, the preferred server vendor is IBM. IBM is an industry leader with the experience, knowledge, and solutions to address corporate challenges. IBM offers proven technology and innovative approaches to business, providing reliability and performance.

IBM System x enterprise servers meet the requirements of this Exchange environment. IBM System x enterprise servers are built on X-Architecture and feature eX5 technologies, including a unique chipset and other advanced capabilities that provide higher throughput and exceptional reliability.

Within the line of System x enterprise servers, the following models may work for this solution.

Option 1: IBM System x3850 X5

Built on the next generation of IBM enterprise X-Architecture technology and Intel Xeon processors, the IBM System x3850 X5 offers performance, reliability, and flexibility within an energy-efficient and low-cost design.

The x3850 X5 workload-optimized systems are turnkey solutions configured for high performing database or virtualization environments. You can choose a model with a balance of processing power, memory, high IOPS storage, networking, and software for your virtualized environment or database application.

The x3850 X5 server provides flexible configurations plus memory expansion and node partitioning capabilities. You can use the modular building block design to customize the system for current needs and react to changing workloads. You can expand a 4-socket, 64-DIMM x3850 X5 to 4 sockets and 96 DIMMs or up to 8 sockets and 192 DIMMs. You can reallocate resources and repartition systems as the environment changes.

For more information about the System x3850 X5, see IBM System x3850 X5 and x3950 X5.

IBM System x3850 X5

Components Description

Form factor/height

Rack/4U per chassis

Processor (max)

Intel Xeon up to 2.26 gigahertz (GHz) (8-core)/1066 megahertz (MHz) memory access

For information, see Intel Xeon.

Number of processors (standard/max)

2/4 per chassis (up to 2 node)

Cache (max)

Up to 24 MB

Memory (max)

16 GB/1.0 terabyte max PC3-10600 DDR III

64 DIMMs

Expansion slots

7 total PCI half-length, (2 hot-plug)

Disk bays (total/hot-swap)

8/8 2.5" Serial Attached SCSI (SAS) or 16/16 SAS solid-state drive (SSD)

Maximum internal storage

4.0 terabyte SAS per chassis (supports 8 x 73.4 GB, 146.8 GB, 300 GB and 500 GB hard disk drives or 16 x 50 GB SSDs)

Network interface

10 gigabits per second (Gbps) Fibre Channel over Ethernet Dual Channel Converged Network Adapter plus integrated dual Gigabit Ethernet with TCP-IP off-load engine

Power supply (standard/max)

1975 W 220 V 2/2

Hot-swap components

Power supplies, fans, hard disk drives, and solid-state-drives

RAID support

Integrated RAID-0 or RAID-1; optional RAID-5

Option 2: IBM System x3650 M3

The IBM System x3650 M3 provides performance for mission-critical applications. Its energy-efficient design supports cores, memory, and data capacity in a scalable 2U package. With more computing power per watt and the latest Intel Xeon processors, you can reduce costs while maintaining speed and availability.

The x3650 M3 offers a flexible, scalable design, an upgrade path to 16 hard disk drives or SSDs, and 192 GB of memory. Comprehensive systems management tools are included, such as advanced diagnostics, a cable management arm, and the ability to control resources from a single point.

For more information about the System x3650 M3, see IBM System x3650 M3.

IBM System x3650 M3

Components Description

Form factor/height

Rack/2U

Processor (max)

Up to two 3.33 GHz six-core (3.46 GHz four-core) Intel Xeon 5600 series processors with QuickPath Interconnect technology, up to 1333 MHz memory access speed.

Number of processors (standard/max)

2

Cache (max)

Up to 12 MB L3

Memory (max)

192 GB DDR-3 RDIMMs via 18 DIMM slots or 48 GB DDR-3 UDIMMs via 12 DIMM slots

Expansion slots

4

Disk bays (total/hot swap)

Up to 16 2.5" hot-swap Serial Attached SCSI (SAS)/Serial ATA (SATA) hard disk drives or SSDs

Maximum internal storage

Up to 8.0 terabytes hot-swap SAS or up to 8.0 terabytes hot-swap SATA or up to 800 GB hot-swap SSD storage

Network interface

Integrated 2 ports, plus 2 ports optional Gigabit Ethernet

Power supply (standard/max)

1/2; 675 watts each

Hot-swap components

Power supplies, fan modules, disks

RAID support

Hardware RAID-0, RAID-1, RAID-1E or RAID-0, RAID-1, RAID-10 (additional option RAID-5 with Self Encrypting Disk (SED) function) or RAID-0, RAID-1, RAID-10, RAID-5, RAID-50 with 256 MB or 512 MB cache (additional option RAID-6, RAID-60 with SED function and additional option battery backup), model dependent

For this scenario, the choice is the IBM x3650 M3 because it provides the resources needed to support the multiple role model without over-investing in hardware. The x3650 M3 can be configured with up to 192 GB of memory and provides sufficient processing power for this solution.

The x3850 X5 would be a good choice if a virtualization strategy were used. Its large memory capacity and processor core count enable server consolidation scenarios that could support the entire Exchange environment.

In previous steps, you calculated the megacycles required to support the number of active mailbox users. In the following steps, you determine how many available megacycles the server model and processor can support, to determine the number of active mailboxes each server can support.

Because the megacycle requirements are based on a baseline server and processor model, you need to adjust the available megacycles for the server against the baseline. To do this, independent performance benchmarks maintained by Standard Performance Evaluation Corporation (SPEC) are used. SPEC is a non-profit corporation formed to establish, maintain, and endorse a standardized set of relevant benchmarks that can be applied to the newest generation of high-performance computers.

To help simplify the process of obtaining the benchmark value for your server and processor, we recommend you use the Exchange Processor Query tool. This tool automates the manual steps to determine your planned processor's SPECint_rate2006 value. To run this tool, your computer must be connected to the Internet. The tool uses your planned processor model as input, and then runs a query against the Standard Performance Evaluation Corporation Web site, returning all test result data for that specific processor model. The tool also calculates an average SPECint_rate2006 value based on the number of processors planned to be used in each Mailbox server. Use the following calculations:

  • Processor and server platform = IBM x3650 M3
  • SPECint_rate2006 value = 356
  • SPECint_rate2006 value ÷ processor core = 356 ÷ 12
    = 30

In previous steps, you calculated the required megacycles for the entire environment based on megacycle per mailbox estimates. Those estimates were measured on a baseline system (HP DL380 G5 x5470 3.33 GHz, 8 cores) that has a SPECint_rate2006 value of 150 (for an 8 core server), or 18.75 per core.

In this step, you need to adjust the available megacycles for the chosen server and processor against the baseline processor so that the required megacycles can be used for capacity planning.

To determine the megacycles of the IBM x3650 M3 Intel X5670 2.93 GHz platform, use the following formula:

  • Adjusted megacycles per core = (new platform per core value) × (hertz per core of baseline platform) ÷ (baseline per core value)
    = (30 × 3330) ÷ 18.75
    = 5328
  • Adjusted megacycles per server = adjusted megacycles per core × number of cores
    = 5328 × 12
    = 63936

Now that the adjusted megacycles per server is known, you need to adjust for the target maximum processor utilization. In a previous section, it was decided not to exceed 80 percent processor utilization during peak workloads or failure scenarios. Use the following calculation:

  • Adjusted available megacycles = available megacycles × target max processor utilization
    = 63936 × 0.80
    = 51149

Each server has a usable capacity of 51,149 megacycles.
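
The arithmetic in this step can be reproduced with a short script. The following is a minimal sketch rather than part of the tested solution: the function and constant names are illustrative, and the per-core SPECint_rate2006 value is rounded to a whole number in the same way as the worked example above.

```python
# Baseline platform values quoted in the text above.
BASELINE_MHZ_PER_CORE = 3330      # 3.33 GHz baseline processor
BASELINE_SPEC_PER_CORE = 18.75    # SPECint_rate2006 of 150 across 8 cores


def adjusted_megacycles_per_core(spec_per_core):
    """Scale the new platform's per-core benchmark value against the baseline."""
    return spec_per_core * BASELINE_MHZ_PER_CORE / BASELINE_SPEC_PER_CORE


def available_megacycles(spec_rate, cores, max_utilization):
    """Usable megacycles per server at the target maximum processor utilization."""
    spec_per_core = round(spec_rate / cores)  # 356 / 12 is rounded to 30
    per_server = adjusted_megacycles_per_core(spec_per_core) * cores
    return round(per_server * max_utilization)


usable = available_megacycles(spec_rate=356, cores=12, max_utilization=0.80)
print(usable)
```

Changing `spec_rate`, `cores`, or `max_utilization` gives the same estimate for a different server platform or utilization target.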

You can use the following steps to determine the number of multiple role servers based on CPU requirements.

For multiple role servers (Client Access, Hub Transport, and Mailbox server roles deployed on same physical server), we recommend that 50 percent of the available megacycles be allocated to the Mailbox server role, and the remaining 50 percent be allocated to Client Access and Hub Transport server roles.

Server role allocation

| Server role configuration | Recommended processor core ratio |
|---|---|
| Mailbox:Client Access and Hub Transport combined role | 1:1 |

Use the following calculation:

  • Megacycles allocated to mailbox role = total adjusted available megacycles × 50%
    = 51149 × 0.5
    = 25574

Use the following calculation:

  • Minimum number of servers = required megacycles ÷ available megacycles for Mailbox server role
    = 48000 ÷ 25574
    = 1.9

Based on processor capacity, a minimum of two servers is required to support the anticipated peak workload during normal operating conditions.

In a previous step, it was decided to use the design for all copies activated model. This model requires that Mailbox server roles be deployed in pairs. Use the following calculation:

  • Required number of servers = minimum number of servers × 2
    = 2 × 2
    = 4

Based on the DAG design, a minimum of four servers is required.
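
As a cross-check, the server count calculation can be sketched as follows. The variable names are illustrative, and the input values are the ones quoted in the preceding steps.

```python
import math

available_per_server = 51149   # usable megacycles per server
required_megacycles = 48000    # megacycles required by the Mailbox server role
mailbox_share = 0.5            # 1:1 ratio of Mailbox to Client Access and Hub Transport

# Half of each server's usable megacycles is reserved for the Mailbox server role.
mailbox_megacycles = int(available_per_server * mailbox_share)

# Round up, because a fraction of a server can't be deployed.
minimum_servers = math.ceil(required_megacycles / mailbox_megacycles)

# The "design for all copies activated" model deploys servers in pairs.
required_servers = minimum_servers * 2

print(mailbox_megacycles, minimum_servers, required_servers)
```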

You can use the following steps to determine the number of active mailboxes per multiple role server.

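Use the following calculation for normal operating conditions, with the active mailboxes distributed across all four servers:

  • Number of active mailboxes per server = total mailbox count ÷ server count
    = 16000 ÷ 4
    = 4000

Use the following calculation for the worst case failure scenario, with the active mailboxes residing on two of the four servers:

  • Number of active mailboxes per server = total mailbox count ÷ server count
    = 16000 ÷ 2
    = 8000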

You can use the following steps to determine the memory required per multiple role server.

In a previous step, it was determined that the database cache requirement for all mailboxes was 94 GB, and the average cache required per active mailbox was 6 MB.

To design for the worst case failure scenario, calculate based on active mailboxes residing on two of four multiple role servers. Use the following calculation:

  • Memory required for database cache = number of active mailboxes × average cache per mailbox
    = 8000 × 6 MB
    = 48000 MB
    = 47 GB

In this step, reference the following table to determine the recommended memory.

Memory for worst case failure scenario

| Server physical memory (RAM) | Database cache size: Mailbox server role only | Database cache size: Multiple role (for example, Mailbox and Hub Transport server) |
|---|---|---|
| 48 GB | 40 GB | 32 GB |
| 64 GB | 54 GB | 44 GB |
| 72 GB | 61 GB | 50 GB |
| 96 GB | 82 GB | 68 GB |

The recommended memory configuration to support 47 GB of database cache for a multiple role server is 72 GB. Each server will be configured with 18 x 4 GB DIMMs for a total of 72 GB.
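
The memory sizing steps above can be expressed as a small lookup. This is a sketch under the assumptions stated in this section: the table values are copied from the preceding table, and the names are illustrative.

```python
TOTAL_MAILBOXES = 16000
SERVERS = 4
SURVIVING_SERVERS = 2          # worst case: active mailboxes on two of four servers
CACHE_PER_MAILBOX_MB = 6

# (physical RAM in GB, database cache in GB for a multiple role server)
MULTI_ROLE_CACHE_GB = [(48, 32), (64, 44), (72, 50), (96, 68)]


def recommended_ram_gb(required_cache_gb):
    """Return the smallest listed RAM size whose cache covers the requirement."""
    for ram_gb, cache_gb in MULTI_ROLE_CACHE_GB:
        if cache_gb >= required_cache_gb:
            return ram_gb
    raise ValueError("requirement exceeds the largest listed configuration")


normal_mailboxes = TOTAL_MAILBOXES // SERVERS
worst_case_mailboxes = TOTAL_MAILBOXES // SURVIVING_SERVERS
cache_gb = round(worst_case_mailboxes * CACHE_PER_MAILBOX_MB / 1024)

print(normal_mailboxes, worst_case_mailboxes, cache_gb, recommended_ram_gb(cache_gb))
```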

To determine the optimal number of Exchange databases to deploy, use the Exchange 2010 Mailbox Server Role Requirements Calculator. Enter the appropriate information on the input tab and select Yes for Automatically Calculate Number of Unique Databases / DAG.

Input tab of Exchange Mailbox Role Calculator

On the Role Requirements tab, the recommended number of databases appears.

Mailbox Calculator screen shows database count

In this solution, a minimum of 12 databases will be used. The exact number of databases may be adjusted in future steps to accommodate the database copy layout.

Use the following steps to identify failure domains impacting database copy layout.

In a previous step, it was decided to deploy two XIV Storage Systems and to deploy three copies of each database. Two of the three copies are high availability copies, and one of the copies is a lagged copy. To provide maximum protection for each of the high availability database copies, we recommend that no more than one high availability copy of a single database be located on the same physical array. In this scenario, each XIV Storage System represents a failure domain that will impact the layout of database copies in the DAG.

Failure domain with IBM XIV systems

In a previous step, it was determined that four physical servers will be deployed. To align with the failure domains associated with storage, the database copies hosted on two of the servers should reside on the first XIV Storage System, and the database copies hosted on the other two servers should reside on the second XIV Storage System. The two sets of servers should also be deployed in separate racks with separate power sources and with top of rack network switches.

Failure domain with IBM servers

You can use the following steps to design a database copy layout.

The easiest way to determine the optimal number of Exchange databases to deploy is to use the Exchange 2010 Mailbox Server Role Requirements Calculator. To download the calculator, see E2010 Mailbox Server Role Requirements Calculator. For additional information about using the storage calculator, see Exchange 2010 Mailbox Server Role Requirements Calculator. Enter the appropriate information on the input worksheet and then select Yes for Automatically Calculate Number of Unique Databases / DAG.

In a previous step, it was determined that the minimum number of unique databases to deploy is 12. For a single site solution with only two high availability copies, the database count should be equally divisible by the number of servers in the DAG. The database count per server should also be equally divisible by the number of servers in each failure domain (two in this solution), so that the passive copies can be spread evenly:

12 ÷ 4 = 3 databases per server

Lay out the C1 copies (activation preference value of 1), which are the active copies during normal operating conditions, as shown in the following table. With three databases per server, the C2 copies (activation preference value of 2), which are the passive copies during normal operating conditions, can't be distributed equally across the two servers in the other failure domain.

Number of database copies per Mailbox server

| Database | EX1 | EX2 | EX3 | EX4 |
|---|---|---|---|---|
| DB1 | C1 | | C2 | |
| DB2 | C1 | | | C2 |
| DB3 | C1 | | C2 | |

To correctly lay out the C2 copies, increase the unique database count to 16:

16 ÷ 4 = 4 databases per server

Number of database copies per Mailbox server

| Database | EX1 | EX2 | EX3 | EX4 |
|---|---|---|---|---|
| DB1 | C1 | | C2 | |
| DB2 | C1 | | | C2 |
| DB3 | C1 | | C2 | |
| DB4 | C1 | | | C2 |

The total unique database count is 16. The total number of copies is 48, consisting of 16 C1 (active) copies, 16 C2 (passive) copies, and 16 C3 (lagged) copies. Each Mailbox server will host 4 active database copies, 4 passive database copies, and 4 lagged database copies during normal operating conditions.
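
The divisibility rule used to arrive at 16 databases can be written as a short search. This is an illustrative sketch; the function name and parameters are not part of the Mailbox Server Role Requirements Calculator.

```python
def smallest_database_count(minimum, servers, servers_per_domain):
    """Find the smallest count >= minimum that divides evenly across all servers
    and whose per-server share divides evenly across a failure domain."""
    count = minimum
    while True:
        per_server = count // servers
        if count % servers == 0 and per_server % servers_per_domain == 0:
            return count
        count += 1


databases = smallest_database_count(minimum=12, servers=4, servers_per_domain=2)
total_copies = databases * 3           # C1, C2, and C3 copy of each database
copies_per_server = total_copies // 4  # spread evenly across four servers

print(databases, total_copies, copies_per_server)
```

With 12 databases, each server would host three active copies, which can't be split evenly across the two servers in the other failure domain, so the search moves on to 16.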

The next steps cover how to lay out the remainder of the copies across all servers in the DAG.

Of the 12 copies on each Mailbox server, 4 will be active high availability copies, 4 will be passive high availability copies, and 4 will be lagged copies.

Start by equally distributing the C1 database copies (or the copies with an activation preference value of 1) to the 4 servers. These copies will be active during normal operating conditions.

Database layout during normal operating conditions

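| Database | EX1 | EX2 | EX3 | EX4 |
|---|---|---|---|---|
| DB1 | C1 | | | |
| DB2 | C1 | | | |
| DB3 | C1 | | | |
| DB4 | C1 | | | |
| DB5 | | C1 | | |
| DB6 | | C1 | | |
| DB7 | | C1 | | |
| DB8 | | C1 | | |
| DB9 | | | C1 | |
| DB10 | | | C1 | |
| DB11 | | | C1 | |
| DB12 | | | C1 | |
| DB13 | | | | C1 |
| DB14 | | | | C1 |
| DB15 | | | | C1 |
| DB16 | | | | C1 |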

In the preceding table, the following applies:

  • C1 = active copy (activation preference value of 1) during normal operations

Next, distribute the C2 database copies (or the copies with an activation preference value of 2) to the servers in the second failure domain. Distribute the C2 copies across all servers in the alternate failure domain to ensure that a single server failure has a minimal impact on the servers in the alternate failure domain.

Database layout during normal operating conditions

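| Database | EX1 | EX2 | EX3 | EX4 |
|---|---|---|---|---|
| DB1 | C1 | | C2 | |
| DB2 | C1 | | | C2 |
| DB3 | C1 | | C2 | |
| DB4 | C1 | | | C2 |
| DB5 | | C1 | C2 | |
| DB6 | | C1 | | C2 |
| DB7 | | C1 | C2 | |
| DB8 | | C1 | | C2 |
| DB9 | C2 | | C1 | |
| DB10 | | C2 | C1 | |
| DB11 | C2 | | C1 | |
| DB12 | | C2 | C1 | |
| DB13 | C2 | | | C1 |
| DB14 | | C2 | | C1 |
| DB15 | C2 | | | C1 |
| DB16 | | C2 | | C1 |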

In the preceding table, the following applies:

  • C1 = active copy (activation preference value of 1) during normal operations
  • C2 = passive copy (activation preference value of 2) during normal operations

Before distributing the C3 copies, examine what happens in a server failure scenario. In the following example, if server EX1 fails, the active database copies will automatically move to servers EX3 and EX4. Notice that each of the two servers in the alternate failure domain is now running with six active databases, and the active databases are equally distributed across the two servers.

Database copy layout during server failure

In the preceding table, the following applies:

  • C1 = active copy (activation preference value of 1) during normal operations
  • C2 = passive copy (activation preference value of 2) during normal operations
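
The server failure scenario described above can be checked with a short simulation. This is a simplified sketch: it rebuilds the C1 and C2 placement from the layout rules in this section and assumes that a database whose C1 host has failed is activated on its C2 host. The function names are illustrative.

```python
from collections import Counter

SERVERS = ["EX1", "EX2", "EX3", "EX4"]
DOMAINS = [["EX1", "EX2"], ["EX3", "EX4"]]


def build_ha_layout(db_count=16):
    """Place C1 copies in blocks per server and alternate C2 copies
    across the servers in the other failure domain."""
    per_server = db_count // len(SERVERS)
    layout = {}
    for i in range(db_count):
        c1 = SERVERS[i // per_server]
        own_domain = 0 if c1 in DOMAINS[0] else 1
        alternate = DOMAINS[1 - own_domain]
        layout[f"DB{i + 1}"] = {"C1": c1, "C2": alternate[i % len(alternate)]}
    return layout


def active_counts(layout, failed=()):
    """Count active databases per server, failing over from C1 to C2."""
    counts = Counter()
    for copies in layout.values():
        host = copies["C1"] if copies["C1"] not in failed else copies["C2"]
        counts[host] += 1
    return dict(counts)


layout = build_ha_layout()
after_failure = active_counts(layout, failed={"EX1"})
print(after_failure)
```

With EX1 failed, EX3 and EX4 each run six active databases and EX2 continues to run four, which matches the scenario described above.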

In a maintenance scenario, you could move the active mailbox databases from the servers in the first failure domain (EX1 and EX2) to the servers in the second failure domain (EX3 and EX4), complete maintenance activities, and then move the active database copies back to the C1 copies on the servers in the first failure domain. With this configuration, you can conduct maintenance activities on all servers in the primary datacenter in two passes.

Database copy layout during maintenance

In the preceding table, the following applies:

  • C1 = active copy (activation preference value of 1) during normal operations
  • C2 = passive copy (activation preference value of 2) during normal operations

In this step, determine where to place the lagged copy of each database. The lagged copy should be deployed on the same storage array and in the same failure domain as the C1 copy. Because the lagged copy will be used to provide point-in-time recovery, it is unlikely that the C3 copy will be activated. It is more likely that the C3 copy will be used to restore the C1 copy after recovery to a point in time has been completed. Co-locate it on the same storage array as the C1 copy, to allow storage level volume copying, which is usually more efficient than network-based database seeding (especially following a failure event).

To co-locate the C3 copy on the same storage array as the C1 copy, the C3 copies must be placed on the alternate server in the same failure domain, as shown in the following table.

Database layout for lagged copy

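| Database | EX1 | EX2 | EX3 | EX4 |
|---|---|---|---|---|
| DB1 | C1 | C3 | C2 | |
| DB2 | C1 | C3 | | C2 |
| DB3 | C1 | C3 | C2 | |
| DB4 | C1 | C3 | | C2 |
| DB5 | C3 | C1 | C2 | |
| DB6 | C3 | C1 | | C2 |
| DB7 | C3 | C1 | C2 | |
| DB8 | C3 | C1 | | C2 |
| DB9 | C2 | | C1 | C3 |
| DB10 | | C2 | C1 | C3 |
| DB11 | C2 | | C1 | C3 |
| DB12 | | C2 | C1 | C3 |
| DB13 | C2 | | C3 | C1 |
| DB14 | | C2 | C3 | C1 |
| DB15 | C2 | | C3 | C1 |
| DB16 | | C2 | C3 | C1 |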

In the preceding table, the following applies:

  • C1 = active copy (activation preference value of 1) during normal operations
  • C2 = passive copy (activation preference value of 2) during normal operations
  • C3 = lagged copy (activation blocked) during normal operations
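
The placement rules in this section can also be generated and verified programmatically. The following sketch is one way to express them, with illustrative names: C1 copies are assigned in blocks per server, C2 copies alternate across the servers in the other failure domain, and the C3 copy goes to the other server in the same failure domain as the C1 copy.

```python
SERVERS = ["EX1", "EX2", "EX3", "EX4"]
DOMAIN_OF = {"EX1": 0, "EX2": 0, "EX3": 1, "EX4": 1}


def build_layout(db_count=16):
    per_server = db_count // len(SERVERS)
    layout = {}
    for i in range(db_count):
        c1 = SERVERS[i // per_server]
        same = [s for s in SERVERS if DOMAIN_OF[s] == DOMAIN_OF[c1] and s != c1]
        other = [s for s in SERVERS if DOMAIN_OF[s] != DOMAIN_OF[c1]]
        layout[f"DB{i + 1}"] = {
            "C1": c1,
            "C2": other[i % len(other)],  # alternate failure domain
            "C3": same[0],                # same failure domain, alternate server
        }
    return layout


layout = build_layout()

# Verify the design rules for every database.
for db, copies in layout.items():
    assert DOMAIN_OF[copies["C1"]] != DOMAIN_OF[copies["C2"]], db
    assert DOMAIN_OF[copies["C1"]] == DOMAIN_OF[copies["C3"]], db
    assert copies["C1"] != copies["C3"], db

# Print one row per database in EX1..EX4 column order.
for db, copies in layout.items():
    by_server = {server: label for label, server in copies.items()}
    print(db, [by_server.get(s, "") for s in SERVERS])
```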

A well designed storage solution is a critical aspect of a successful Exchange 2010 Mailbox server role deployment. For more information about Mailbox server storage design, see Mailbox Server Storage Design.

The following table summarizes the storage requirements that have been calculated or determined in a previous design step.

Summary of disk space requirements

| Disk space requirements | Value |
|---|---|
| Average mailbox size on disk (MB) | 855 |
| Database space required (GB) | 22053 |
| Log space required (GB) | 1287 |
| Total space required (GB) | 23340 |
| Total space required for three database copies (GB) | 70020 |
| Total space required for three database copies (terabytes) | 68 |

The XIV Storage System offers logical volumes as the basic data storage element for allocating usable storage space to attached hosts. This logical unit concept is well known and is widely used by other storage subsystems and vendors. However, neither the volume segmentation nor its distribution over the physical disks is conventional in the XIV Storage System.

Traditionally, logical volumes are defined within various RAID arrays, where their segmentation and distribution are manually specified. The result is often a suboptimal distribution within and across modules (expansion units) and is significantly dependent upon the administrator’s knowledge and expertise.

The XIV Storage System uses true virtualization as one of the basic principles for its unique design. With XIV Storage System, each volume is divided into tiny 1-MB partitions, and these partitions are distributed randomly and evenly, and duplicated for protection. The result is optimal distribution in and across all modules, which means that for any volume, the physical drive location and data placement are invisible to the user. This method dramatically simplifies storage provisioning, letting the system lay out the user’s volume in an optimal way.

This method offers complete virtualization, without requiring preliminary volume layout planning or detailed and accurate stripe or block size calculation by the administrator. All disks are equally used to maximize the I/O performance and to exploit all the processing power and bandwidth available in the storage system. XIV Storage System virtualization incorporates an advanced snapshot mechanism with unique capabilities, which enables creating a virtually unlimited number of point-in-time copies of any volume, without incurring any performance penalties.

In previous Exchange releases, it was a recommended best practice to separate database files and log files from the same mailbox database to different volumes backed by different physical disks for recoverability purposes. This is still a recommended best practice for stand-alone architectures and architectures using VSS-based backups.

*Design Decision Point*

With the IBM XIV architecture, each LUN is spread across all 180 spindles. The XIV doesn't use traditional RAID algorithms, but rather a proprietary RAID X. Because this architecture doesn't offer spindle isolation, there's no reason to create separate LUNs for database and log files. Subsequent design decisions will be based on a single volume for each database and log set.

In a previous step, it was determined that each primary Mailbox server would support four active databases, four passive database copies, and four lagged database copies. Therefore, there will be a total of 12 volumes for each primary datacenter Mailbox server.

Number of volumes per Mailbox server

| Volumes | Mailbox servers |
|---|---|
| Active database volumes | 4 |
| Passive database volumes | 4 |
| Lagged database volumes | 4 |
| Total volumes | 12 |

In this step, determine the volume size required to support both the database and log capacity requirements. Use the following calculations:

  • Database capacity = [(number of mailbox users × average mailbox size on disk) + (20% data overhead factor)] + (10% content indexing overhead)
    = [(1000 × 855) + (171000)] + (102600)
    = 1128600 MB
    = 1102 GB
  • Log capacity = (log size × number of logs per mailbox per day × number of days required to support lagged copy × number of mailbox users) + (mailbox move % overhead)
    = (1 MB × 20 × 3 × 1000) + (1000 × 0.01 × 855 MB)
    = 68550 MB
    = 67 GB
  • Volume size = (database capacity + log capacity) ÷ (1 − 20% volume free space)
    = (1102 + 67) ÷ 0.8
    = 1461 GB

The required volume size is 1,461 GB.
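
The volume sizing calculation can be reproduced with the following sketch. The function names and default parameters are illustrative. The overhead percentages, log generation rate, and lag period are the values used in the calculation above, and intermediate results are rounded to whole gigabytes as in the text.

```python
MB_PER_GB = 1024


def database_capacity_mb(users, mailbox_mb, data_overhead=0.20, index_overhead=0.10):
    """Mailbox data plus data overhead, then content indexing overhead on top."""
    base = users * mailbox_mb
    with_overhead = base + base * data_overhead
    return with_overhead + with_overhead * index_overhead


def log_capacity_mb(users, mailbox_mb, log_mb=1, logs_per_day=20, lag_days=3, move_pct=0.01):
    """Logs retained for the lagged copy plus mailbox move overhead."""
    return log_mb * logs_per_day * lag_days * users + users * move_pct * mailbox_mb


db_gb = round(database_capacity_mb(1000, 855) / MB_PER_GB)
log_gb = round(log_capacity_mb(1000, 855) / MB_PER_GB)
volume_gb = round((db_gb + log_gb) / 0.8)  # keep 20 percent of the volume free

print(db_gb, log_gb, volume_gb)
```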

The Microsoft Windows operating system reports volume sizes in binary format (for example, 1 GiB = 1024 MiB). The XIV Storage System GUI reports sizes in decimal format (for example, 1 GB = 1000 MB). You need to convert the calculated volume size to decimal format before creating volumes. The conversion factor is 0.9313 (1 GB = 0.9313 GiB). To download an IBM conversion tool, see XIV Volume Sizing Spreadsheet Tool.

To use the tool, enter the desired volume size of 1461 GiB in the first field. The decimal value of 1580 GB will appear in the second field. This is the value you use to allocate storage on the XIV Storage System.

Screen shot of IBM Volume Sizing Tool

The XIV Storage System configures volumes in blocks of volume extents. By default, the volume extent size is 17 GB. To determine the number of extents used, divide the target volume size by 17 GB and round up. In this solution, 93 extents (1580 ÷ 17, rounded up) are required, which corresponds to 1581 GB. After partitioning and formatting, the size of the volume will be slightly smaller, so we recommend adding one more 17 GB extent. The unformatted volume space will be 1598 GB (1581 + 17). On the server, format the newly created volume and verify that the formatted volume meets the original capacity requirement of 1461 GiB.
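
The extent rounding can be sketched as follows. The 1580 GB starting value is the figure reported by the IBM sizing tool in this solution and is taken as given. The function name and the `spare_extents` parameter are illustrative.

```python
import math

EXTENT_GB = 17  # default XIV volume extent size used in this section


def xiv_volume_gb(target_gb, spare_extents=1):
    """Round the target up to whole extents and add spare extents
    to allow for space lost to partitioning and formatting."""
    extents = math.ceil(target_gb / EXTENT_GB)
    return extents, (extents + spare_extents) * EXTENT_GB


extents, unformatted_gb = xiv_volume_gb(1580)
print(extents, unformatted_gb)
```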

In a previous step, it was determined that the Mailbox servers would be deployed in two failure domains. Servers EX1 and EX2 are in one failure domain, and servers EX3 and EX4 are in another failure domain.

Volume layout on XIV Storage System

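| Database | EX1 | EX2 | EX3 | EX4 |
|---|---|---|---|---|
| DB1 | C1 | C3 | C2 | |
| DB2 | C1 | C3 | | C2 |
| DB3 | C1 | C3 | C2 | |
| DB4 | C1 | C3 | | C2 |
| DB5 | C3 | C1 | C2 | |
| DB6 | C3 | C1 | | C2 |
| DB7 | C3 | C1 | C2 | |
| DB8 | C3 | C1 | | C2 |
| DB9 | C2 | | C1 | C3 |
| DB10 | | C2 | C1 | C3 |
| DB11 | C2 | | C1 | C3 |
| DB12 | | C2 | C1 | C3 |
| DB13 | C2 | | C3 | C1 |
| DB14 | | C2 | C3 | C1 |
| DB15 | C2 | | C3 | C1 |
| DB16 | | C2 | C3 | C1 |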

In the preceding table, the following applies:

  • C1 = active copy (activation preference value of 1) during normal operations
  • C2 = passive copy (activation preference value of 2) during normal operations
  • C3 = lagged copy (activation blocked) during normal operations

Ensure that the two high availability copies of each database reside on different physical arrays. In a previous step, it was determined that the database layout shown in the preceding table was the best layout to achieve this goal. This layout results in each XIV Storage System having 24 volumes.

The following table illustrates how the database copies are positioned on the XIV Storage Systems.

Database copies on XIV Storage Systems

| Database | XIV1 | XIV2 |
|---|---|---|
| DB1 | C1, C3 | C2 |
| DB2 | C1, C3 | C2 |
| DB3 | C1, C3 | C2 |
| DB4 | C1, C3 | C2 |
| DB5 | C1, C3 | C2 |
| DB6 | C1, C3 | C2 |
| DB7 | C1, C3 | C2 |
| DB8 | C1, C3 | C2 |
| DB9 | C2 | C1, C3 |
| DB10 | C2 | C1, C3 |
| DB11 | C2 | C1, C3 |
| DB12 | C2 | C1, C3 |
| DB13 | C2 | C1, C3 |
| DB14 | C2 | C1, C3 |
| DB15 | C2 | C1, C3 |
| DB16 | C2 | C1, C3 |
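
The number of volumes on each array follows directly from the placement shown above. As a quick check, the following sketch counts one volume per database copy, with the C1 and C3 copies on the database's home array and the C2 copy on the other array. The variable names are illustrative.

```python
from collections import Counter

volumes = Counter()
for n in range(1, 17):
    # DB1 through DB8 have their C1 and C3 copies on XIV1; DB9 through DB16 on XIV2.
    home, other = ("XIV1", "XIV2") if n <= 8 else ("XIV2", "XIV1")
    volumes[home] += 2   # C1 and C3 volumes
    volumes[other] += 1  # C2 volume

print(dict(volumes))
```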

The XIV Storage System uses the concept of storage pools. Storage pools aren't disk pools and don't represent pools of physically isolated spindles. Storage pools manage a related group of logical volumes and their snapshots. Storage pools offer the following key benefits:

  • Improved management of storage space   Specific volumes can be grouped within a storage pool, providing the flexibility to control the usage of storage space by specific applications, a group of applications, or departments.
  • Improved regulation of storage space   Automatic snapshot deletion occurs when the storage capacity limit is reached for each storage pool independently. When a storage pool’s size is exhausted, only the snapshots that reside in the affected storage pool are deleted.

The following rules apply to the size of storage pools and the associations between volumes and storage pools:

  • The size of a storage pool can range from as small as possible (17.1 GB) to as large as possible (the entire system) without any limitation.
  • The size of a storage pool can always be increased, limited only by the free space on the system.
  • The size of a storage pool can always be decreased, limited only by the space already consumed by the volumes and snapshots in that storage pool.
  • Volumes can be moved between storage pools without any limitations, as long as there is enough free space in the target storage pool.

*Design Decision Point*

In this solution, one large storage pool will be created on each array to hold the 24 Exchange database volumes. Because Exchange is the only application on the XIV Storage Systems, there is no need to add additional administrative or management boundaries.

In Exchange 2010, the DAG uses a minimal set of components from Windows failover clustering. One of those components is the quorum resource, which provides a means for arbitration when determining cluster state and making membership decisions. It is critical that each DAG member have a consistent view of how the DAG's underlying cluster is configured. The quorum acts as the definitive repository for all configuration information relating to the cluster. The quorum is also used as a tiebreaker to avoid split brain syndrome. Split brain syndrome is a condition that occurs when DAG members can't communicate with each other but are available and running. Split brain syndrome is prevented by always requiring a majority of the DAG members (and in the case of DAGs with an even number of members, the DAG witness server) to be available and interacting for the DAG to be operational.
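
The majority requirement can be illustrated with simple vote arithmetic. This is a simplified sketch of the rule described above, not a model of Windows failover clustering itself. It assumes one vote per DAG member, plus one vote for the witness server when the member count is even.

```python
def quorum_votes(dag_members):
    """Return (total voters, votes needed for a majority, whether a witness is used)."""
    uses_witness = dag_members % 2 == 0
    voters = dag_members + (1 if uses_witness else 0)
    majority = voters // 2 + 1
    return voters, majority, uses_witness


voters, majority, uses_witness = quorum_votes(4)
print(voters, majority, uses_witness)
```

For the four-member DAG in this solution, the witness brings the voter count to five, so three voters must be available and communicating for the DAG to remain operational.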

A witness server is a server outside of a DAG that hosts the file share witness, which is used to achieve and maintain quorum when the DAG has an even number of members. DAGs with an odd number of members don't use a witness server. Upon creation of a DAG, the file share witness is added by default to a Hub Transport server (that doesn't have the Mailbox server role installed) in the same site as the first member of the DAG. If your Hub Transport server is running in a VM that resides on the same root server as VMs running the Mailbox server role, we recommend that you move the location of the file share witness to another highly available server. You can move the file share witness to a domain controller, but because of security implications, do this only as a last resort.

*Design Decision Point*

There is only one additional server in the environment: a file and print server, which is also used to host Exchange VSS-based backups. The file and print server is reasonably stable and is managed by the same administrator who supports the Exchange servers, so it's a good choice for the location of the file share witness.

In Exchange 2010, the RPC Client Access service and the Exchange Address Book service were introduced on the Client Access server role to improve the mailbox user's experience when the active mailbox database copy is moved to another Mailbox server (for example, during mailbox database failures and maintenance events). The connection endpoints for mailbox access from Microsoft Outlook and other MAPI clients have been moved from the Mailbox server role to the Client Access server role. Both internal and external Outlook connections must now be load balanced across all Client Access servers in the site to achieve fault tolerance. To associate the MAPI endpoint with a group of Client Access servers rather than a specific Client Access server, you can define a Client Access server array. You can only configure one array per Active Directory site, and an array can't span more than one Active Directory site. For more information, see Understanding RPC Client Access and Understanding Load Balancing in Exchange 2010.

*Design Decision Point*

Because this is a single site deployment with four servers running the Client Access server role, there will be a single Client Access server array. With only four servers to load balance, you can choose to use a Windows Network Load Balancing software solution, which is an installable component of the Windows Server operating system. However, Windows Network Load Balancing can't be enabled on any server that has Windows failover clustering enabled. When you add a Mailbox server to a DAG, Windows failover clustering is enabled. Therefore, you can't use Windows Network Load Balancing to load balance a Client Access server role installed on the same server as a Mailbox server role that's a member of a DAG. A hardware load balancing solution must be deployed.

You can use the following steps to determine a hardware load balancing model.

In this example, the preferred vendor is Brocade because they have been providing reliable, high performance solutions for the datacenter for the past 15 years. Brocade ServerIron intelligent application delivery and traffic management solutions have led the industry for over a decade, helping to mitigate costs and prevent losses by optimizing business-critical enterprise and service provider applications with high availability, security, multisite redundancy, acceleration, and scalability, in more than 3,000 of the world's most demanding organizations.

Brocade offers the ServerIron ADX Series of load balancers and application delivery controllers. They support Layer 4 through 7 switching with industry leading performance. The ServerIron ADX Series comes in an intelligent, modular application delivery controller platform and is offered in various configurations.

The ServerIron ADX comes in three primary platforms: the 1000, 4000, and 10000. For most Exchange environments, a version of the ADX 1000 should suffice. For organizations that need to load balance several applications in the datacenter or when large scale chassis reconfiguration or expansion is required, the unique design of the ServerIron ADX 4000 and 10000 provides a dedicated backplane to support application, data, and management functionality through specialized modules.

The ServerIron ADX 1000 offers an incremental deployment pricing model (known as pay as you grow) so that you can scale the capacity of the ServerIron ADX Series as your organization grows.

The ServerIron ADX 1000 Series includes four models of varying processor and port capacity, all based on the full hardware platform and operating software:

  • ADX 1008-1   1 application core and 8 x 1 gigabit Ethernet (GbE) ports
  • ADX 1016-2   2 application cores and 16 x 1 GbE ports
  • ADX 1016-4   4 application cores and 16 x 1 GbE ports
  • ADX 1216-4   4 application cores, 16 x 1 GbE ports, and 2 x 10 GbE ports

Depending on the model selected, a specific number of application cores, interface ports, hardware acceleration, and software capabilities are enabled. The remaining untapped capacity can be unlocked by applying license upgrade key codes.

For more information about the ADX 1000 and other ADC platforms from Brocade, see Brocade ServerIron ADX Series.

The entry-level ServerIron ADX 1008-1 model with a single application core and 8 x 1 GbE ports is selected. This model will meet the current load balancing requirements and provide flexibility to add additional capacity and features to meet future business needs.

The ServerIron ADX 1000 can be configured in one of the following configurations:

  • Stand-alone device
  • Active/Hot-standby
  • Active/Active

The recommended device resiliency strategy is active/hot-standby. In a typical hot-standby configuration, one ServerIron application delivery controller is the active device and performs all the Layer 2 switching as well as the Layer 4 server load balancing switching. The other ServerIron application delivery controller monitors the switching activities and remains in a hot-standby role. If the active ServerIron application delivery controller becomes unavailable, the standby ServerIron application delivery controller immediately assumes the unavailable ServerIron application delivery controller's responsibilities. The failover from the unavailable ServerIron application delivery controller to the standby ServerIron application delivery controller is transparent to users. Both ServerIron application delivery controller switches share a common MAC address known to the clients. If a failover occurs, the clients still know the ServerIron application delivery controller by the same MAC address. The active sessions running on the clients continue, and the clients and routers don't need an Address Resolution Protocol (ARP) request for the ServerIron MAC address.

*Design Decision Point*

The load balancer device shouldn't be a single point of failure, and there is no reason to deploy an active/active configuration. The load balancer resiliency strategy will include two ServerIron ADX 1000 load balancers in an active/hot-standby configuration.

Exchange protocols and client access services have different load balancing requirements. Some Exchange protocols and client access services require client to Client Access server affinity. Others work without it, but display performance improvements from such affinity. Other Exchange protocols don't require client to Client Access server affinity, and performance doesn't decrease without affinity. For additional information, see Load Balancing Requirements of Exchange Protocols and Understanding Load Balancing in Exchange 2010.

The recommended load balancing method for Outlook client traffic is Source IP Port Persistence. In this method, the load balancer looks at a client IP address and sends all traffic from a certain source/client IP address to a specific Client Access server. The source IP method has two limitations:

  • Whenever the IP address of the client changes, the affinity is lost. However, the user impact is acceptable as long as this occurs infrequently.
  • Having a large number of clients from the same IP address leads to uneven distribution. Distribution of traffic among the Client Access servers then depends on how many clients are arriving from a specific IP address. Clients may arrive from the same IP address because of the following:
    • Network address translation (NAT) or outgoing proxy servers (for example, Microsoft Forefront Threat Management Gateway)   In this case, the original client IP addresses are masked by NAT or outgoing proxy server IP addresses.
    • Client Access server to Client Access server proxy traffic   One Client Access server can proxy traffic to another Client Access server. This typically occurs between Active Directory sites, because most Exchange 2010 traffic needs to be handled by either a Client Access server in the same Active Directory site as the mailbox being accessed or a Client Access server with the same major version as the mailbox being accessed. In a single site configuration, this isn't an issue.
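
The effect of source IP affinity on traffic distribution can be illustrated with a small sketch. This is not how the ServerIron ADX implements persistence. A hash of the client IP address is used here only as a stand-in for any mechanism that always sends the same source address to the same Client Access server. The server names and IP addresses are illustrative.

```python
import hashlib
from collections import Counter

CAS_ARRAY = ["EX1", "EX2", "EX3", "EX4"]


def pick_server(client_ip, servers=CAS_ARRAY):
    """Map a source IP address to one server, always the same one for that address."""
    digest = hashlib.sha256(client_ip.encode()).digest()
    return servers[int.from_bytes(digest[:4], "big") % len(servers)]


# Clients with their own IP addresses can land on different servers.
direct_clients = Counter(pick_server(f"10.0.0.{host}") for host in range(1, 201))

# Clients behind a single NAT or proxy address all land on the same server.
nat_clients = Counter(pick_server("203.0.113.10") for _ in range(200))

print(dict(direct_clients))
print(dict(nat_clients))
```

The second counter always contains a single server, which is the uneven distribution described above for clients arriving from one NAT or proxy address.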

When an Outlook client connects directly to the Client Access server using the RPC Client Access service and the Exchange Address Book service, the endpoint TCP ports for these services are allocated by the RPC endpoint manager. By default, this requires a large range of destination ports to be configured for load balancing without the ability to specifically target traffic for these services based on a port number. You can statically map these services to specific port numbers to simplify load balancing (and perhaps make it easier to enforce restrictions on network traffic via firewall applications or devices). If the ports for these services are statically mapped, the traffic will be restricted to port 135 (used by the RPC port mapper) and the two specific ports that were selected for these services.

*Design Decision Point*

The decision was made to implement static port mapping for the RPC Client Access service and the Exchange Address Book service, using TCP ports 60000 and 60001, respectively.

For information about how to configure static port mapping, see Load Balancing Requirements of Exchange Protocols.


The previous section provided information about the design decisions that were made when considering an Exchange 2010 solution. The following section provides an overview of the solution.

This solution consists of four Exchange 2010 servers in a single site. Each server is deployed as a multiple role server with the Client Access, Hub Transport, and Mailbox server roles. A single namespace is load balanced across an array of Client Access servers. The Mailbox servers are in a single DAG with the file share witness located on a file server in the site.

Logical solution diagram with IBM and Brocade


The solution consists of four IBM x3650 M3 servers attached to two IBM XIV Storage Systems via redundant Brocade 300 FC switches. Two Brocade ServerIron ADX 1000 series devices, deployed in a hot-standby configuration, provide hardware load balancing for the solution. Redundant Brocade FastIron Ethernet switches provide network connectivity.

Physical solution diagram with IBM and Brocade


The following table summarizes the physical server hardware used in this solution.

Server hardware

Component                            Description
Server vendor                        IBM
Server model                         X3650 M3
Processor                            Intel Westmere EP X5670 2.93 GHz/6C
Memory                               1333 MHz, DDR-3 RDIMM
Internal disk                        300 GB 2.5" SAS
Operating system disk configuration  2 x disk RAID-1
RAID controller                      ServeRAID MR10
Network interface                    Integrated 2 ports
Power                                2x 675 watt power supply


The following table summarizes the Exchange server software used in this solution.

Server software

Component             Description
Operating System      Windows Server 2008 R2
Exchange Version      Exchange Server 2010 Enterprise Edition
Exchange Patch Level  Exchange 2010 Update Rollup 3
Antivirus             Microsoft Forefront Protection 2010 for Exchange Server


The following diagram summarizes the database copy layout used in this solution during normal operating conditions.

Diagram of database copy layout


The following table summarizes the storage hardware used in this solution.

Storage hardware

Component        Description
Storage vendor   IBM
Storage model    XIV
Disks            180 x 1 terabyte 7200 rpm SATA
Usable capacity  79 terabytes
FC ports         24
Power usage      8.4 kilovolt-ampere (kVA) (peak) or 7.1 kVA (idle)


Each of the XIV Storage System storage enclosures used in the solution was configured as shown in the following table.

Storage configuration

Component                    Description
Storage enclosures           2
Volumes per enclosure        24
Volumes per server           12
Volume size                  1598 GB
RAID level                   RAID-X
Storage pools                2
Storage pools per enclosure  1

The following table shows how the available storage was designed and allocated between the two XIV Storage Systems.

 

Database  XIV1    XIV2
DB1       C1, C3  C2
DB2       C1, C3  C2
DB3       C1, C3  C2
DB4       C1, C3  C2
DB5       C1, C3  C2
DB6       C1, C3  C2
DB7       C1, C3  C2
DB8       C1, C3  C2
DB9       C2      C1, C3
DB10      C2      C1, C3
DB11      C2      C1, C3
DB12      C2      C1, C3
DB13      C2      C1, C3
DB14      C2      C1, C3
DB15      C2      C1, C3
DB16      C2      C1, C3


The following table summarizes the Fibre Channel switch hardware used in this solution.

Fibre Channel switch hardware

Vendor               Brocade
Model                300 SAN
FC ports             24
Port bandwidth       8 Gbps
Aggregate bandwidth  192 Gbps

For more information about the Brocade 300 SAN switch or other Brocade SAN switches, see Brocade Switches.

The following table summarizes the host bus adapter (HBA) hardware used in this solution.

HBA hardware

Vendor          Brocade
Model           825 8G FC HBA
Ports           Dual
Port bandwidth  8 Gbps (1600 MB per second)

For more information about Brocade HBAs, see Brocade Adapters.


The following table summarizes the network switch hardware used in this solution.

Network switch hardware

Vendor                           Brocade
Model                            FastIron GS 624-P
Power over Ethernet (PoE) ports  24
Port bandwidth                   10/100/1000 Mbps RJ45
10 gigabit Ethernet              2


The following table summarizes the load balancing hardware used in this solution.

Load balancing hardware

Vendor             Brocade
Model              ADX 1000
Licensing option   ADX 1008-1
Application cores  1
GbE ports          8

For more information about this and other platforms from Brocade, see Brocade ServerIron ADX Series.


If you want to deploy this solution with the C3 copy as a high availability copy instead of a lagged copy, there are some design decisions that you may want to revisit.

In a variation to this solution, there's no requirement for a lagged copy. The C3 copy becomes a third high availability copy. In this case, you may want to deploy a different server resiliency model. We recommend a model that designs for targeted failure scenarios. This model distributes C1, C2, and C3 database copies across the four servers in the DAG. Because there are three high availability copies, at least one of those copies resides on a server or storage array in the alternate failure domain. This model works with either 12 or 16 databases. The following example uses the recommended minimum database count of 12.

Solution variation

Database  EX1  EX2  EX3  EX4
DB1       C1   C2   C3   -
DB2       -    C1   C2   C3
DB3       C3   -    C1   C2
DB4       C2   C3   -    C1
DB5       C1   -    C2   C3
DB6       C3   C1   -    C2
DB7       C2   C3   C1   -
DB8       -    C2   C3   C1
DB9       C1   -    C3   C2
DB10      C2   C1   -    C3
DB11      C3   C2   C1   -
DB12      -    C3   C2   C1

(A hyphen indicates that the server holds no copy of the database.)

In the preceding table, the following applies:

  • C1 = active copy (activation preference value of 1) during normal operations
  • C2 = passive copy (activation preference value of 2) during normal operations
  • C3 = passive copy (activation preference value of 3) during normal operations

With this model, the simultaneous failure of any two servers in the DAG can be accommodated. If servers EX1 and EX4 fail at the same time, each of the surviving servers handles the additional workload according to the activation preferences of the database copies. When two servers fail, only six of the nine database copies on each surviving server are activated. In the previous model, all of the database copies on the surviving server become active. This has implications when sizing servers because you must calculate the megacycle requirements for the worst case targeted failure scenario rather than just sizing for all database copies becoming active. For examples about how to size for this scenario, see other papers in White Papers: Exchange 2010 Tested Solutions.
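
The double-failure behavior can be checked against the solution variation layout with a short Python sketch. It assumes that each database is activated on the surviving server that holds the copy with the lowest activation preference value. That is a simplification: Active Manager also weighs factors such as copy health and queue lengths when it selects a copy. The function name and data structure are illustrative only.

```python
from collections import Counter

SERVERS = ["EX1", "EX2", "EX3", "EX4"]

# Activation preference of each database copy per server (0 = no copy),
# transcribed from the solution variation table.
LAYOUT = {
    "DB1":  [1, 2, 3, 0],
    "DB2":  [0, 1, 2, 3],
    "DB3":  [3, 0, 1, 2],
    "DB4":  [2, 3, 0, 1],
    "DB5":  [1, 0, 2, 3],
    "DB6":  [3, 1, 0, 2],
    "DB7":  [2, 3, 1, 0],
    "DB8":  [0, 2, 3, 1],
    "DB9":  [1, 0, 3, 2],
    "DB10": [2, 1, 0, 3],
    "DB11": [3, 2, 1, 0],
    "DB12": [0, 3, 2, 1],
}

def active_copies(failed=()):
    """Return {server: number of active databases} after the given servers fail."""
    counts = Counter({s: 0 for s in SERVERS if s not in failed})
    for db, prefs in LAYOUT.items():
        # Surviving copies of this database as (activation preference, server).
        survivors = [(p, s) for p, s in zip(prefs, SERVERS) if p and s not in failed]
        if not survivors:
            raise ValueError(f"{db} has no surviving copy")
        # Activate the surviving copy with the lowest preference value.
        counts[min(survivors)[1]] += 1
    return dict(counts)

print(active_copies())                       # three active databases per server
print(active_copies(failed=("EX1", "EX4")))  # six active databases on EX2 and on EX3
```

Under this assumption, the loss of EX1 and EX4 leaves EX2 and EX3 with six active databases each, which matches the six-of-nine figure used for sizing.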

Database layout during normal operating conditions

Database  EX1  EX2  EX3  EX4
DB1       C1   C2   C3   -
DB2       -    C1   C2   C3
DB3       C3   -    C1   C2
DB4       C2   C3   -    C1
DB5       C1   -    C2   C3
DB6       C3   C1   -    C2
DB7       C2   C3   C1   -
DB8       -    C2   C3   C1
DB9       C1   -    C3   C2
DB10      C2   C1   -    C3
DB11      C3   C2   C1   -
DB12      -    C3   C2   C1

(A hyphen indicates that the server holds no copy of the database.)


Prior to deploying an Exchange solution in a production environment, validate that the solution was designed, sized, and configured properly. This validation must include functional testing to ensure that the system is operating as desired as well as performance testing to ensure that the system can handle the desired user load. This section describes the approach and test methodology used to validate server and storage design for this solution. In particular, the following tests will be defined in detail:

  • Performance tests
    • Storage performance validation (Jetstress)
    • Server performance validation (Loadgen)
  • Functional tests
    • Database switchover validation
    • Server switchover validation
    • Server failover validation


The level of performance and reliability of the storage subsystem connected to the Exchange Mailbox server role has a significant impact on the overall health of the Exchange deployment. Additionally, poor storage performance will result in high transaction latency, primarily reflected in poor client experience when accessing the Exchange system. To ensure the best possible client experience, validate storage sizing and configuration via the method described in this section.

For validating Exchange storage sizing and configuration, we recommend the Microsoft Exchange Server Jetstress tool. The Jetstress tool is designed to simulate an Exchange I/O workload at the database level by interacting directly with the Extensible Storage Engine (ESE), which is also known as Jet. The ESE is the database technology that Exchange uses to store messaging data on the Mailbox server role. Jetstress can be configured to test the maximum I/O throughput available to your storage subsystem within the required performance constraints of Exchange. Alternatively, Jetstress can accept a target profile of user count and per-user IOPS, and validate that the storage subsystem is capable of maintaining an acceptable level of performance with the target profile. Test duration is adjustable: a test can be run for a minimal period of time to validate adequate performance or for an extended period of time to additionally validate storage subsystem reliability.

The Jetstress tool can be obtained from the Microsoft Download Center.

The documentation included with the Jetstress installer describes how to configure and execute a Jetstress validation test on your server hardware.

There are two main types of storage configurations:

  • Direct-attached storage (DAS) or internal disk scenarios
  • Storage area network (SAN) scenarios

With DAS or internal disk scenarios, there's only one server accessing the disk subsystem, so the performance capabilities of the storage subsystem can be validated in isolation.

In SAN scenarios, the storage utilized by the solution may be shared by many servers and the infrastructure that connects the servers to the storage may also be a shared dependency. This requires additional testing, as the impact of other servers on the shared infrastructure must be adequately simulated to validate performance and functionality.

The following storage validation test cases were executed against the solution and should be considered as a starting point for storage validation. Specific deployments may have other validation requirements that can be met with additional testing, so this list isn't intended to be exhaustive:

  • Validation of worst case database switchover scenario   In this test case, the storage subsystem is subjected to the level of I/O that it's expected to service in a worst case switchover scenario (largest possible number of active copies on the fewest servers). Depending on whether the storage subsystem is DAS or SAN, this test may be required to run on multiple hosts to ensure that the end-to-end solution load on the storage subsystem can be sustained.
  • Validation of storage performance under storage failure and recovery scenario (for example, failed disk replacement and rebuild)   In this test case, the performance of the storage subsystem during a failure and rebuild scenario is evaluated to ensure that the necessary level of performance is maintained for optimal Exchange client experience. The same caveat applies for a DAS vs. SAN deployment: If multiple hosts are dependent on a shared storage subsystem, the test must include load from these hosts to simulate the entire effect of the failure and rebuild.

The Jetstress tool produces a report file after each test is completed. To help you analyze the report, use the guidelines in Reading Jetstress 2010 Test Reports.

Specifically, you should use the guidelines in the following table when you examine data in the Test Results table of the report.

Jetstress results analysis

Performance counter instance Guidelines for performance test

I/O Database Reads Average Latency (msec)

The average value should be less than 20 milliseconds (msec) (0.020 seconds), and the maximum values should be less than 50 msec.

I/O Log Writes Average Latency (msec)

Log disk writes are sequential, so average write latencies should be less than 10 msec, with a maximum of no more than 50 msec.

%Processor Time

Average should be less than 80%, and the maximum should be less than 90%.

Transition Pages Repurposed/sec (Windows Server 2003, Windows Server 2008, Windows Server 2008 R2)

Average should be less than 100.
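
As an illustration, the guidelines in the preceding table can be applied to a set of test results with a short Python sketch. The threshold values come from the table; the function name, the input structure, and the sample values are hypothetical and aren't part of Jetstress or its report format.

```python
# (counter, limit for the average, limit for the maximum or None),
# transcribed from the Jetstress results analysis table.
THRESHOLDS = [
    ("I/O Database Reads Average Latency (msec)", 20.0, 50.0),
    ("I/O Log Writes Average Latency (msec)", 10.0, 50.0),
    ("%Processor Time", 80.0, 90.0),
    ("Transition Pages Repurposed/sec", 100.0, None),
]

def evaluate(results):
    """results maps counter name -> (average, maximum). Returns a list of failures.

    Every limit is treated as a strict "less than" bound, which is marginally
    more conservative than "no more than 50 msec" for the log write maximum.
    """
    failures = []
    for name, avg_limit, max_limit in THRESHOLDS:
        average, maximum = results[name]
        if average >= avg_limit:
            failures.append(f"{name}: average {average} is not below {avg_limit}")
        if max_limit is not None and maximum >= max_limit:
            failures.append(f"{name}: maximum {maximum} is not below {max_limit}")
    return failures

# Hypothetical values read from a report.
sample = {
    "I/O Database Reads Average Latency (msec)": (14.2, 38.0),
    "I/O Log Writes Average Latency (msec)": (1.1, 6.5),
    "%Processor Time": (22.0, 41.0),
    "Transition Pages Repurposed/sec": (0.0, 0.0),
}
print(evaluate(sample) or "All guidelines met")
```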

The report file shows various categories of I/O performed by the Exchange system:

  • Transactional I/O Performance   This table reports I/O that represents user activity against the database (for example, Outlook generated I/O). This data is generated by subtracting background maintenance I/O and log replication I/O from the total I/O measured during the test. This data provides the actual database IOPS generated along with I/O latency measurements required to determine whether a Jetstress performance test passed or failed.
  • Background Database Maintenance I/O Performance   This table reports the I/O generated due to ongoing ESE database background maintenance.
  • Log Replication I/O Performance   This table reports the I/O generated from simulated log replication.
  • Total I/O Performance   This table reports the total I/O generated during the Jetstress test.


After the performance and reliability of the storage subsystem are validated, ensure that all of the components in the messaging system are validated together for functionality, performance, and scalability. This means moving up in the stack to validate client software interaction with the Exchange product as well as any server-side products that interact with Exchange. To ensure that the end-to-end client experience is acceptable and that the entire solution can sustain the desired user load, the method described in this section can be applied for server design validation.

For validation of end-to-end solution performance and scalability, we recommend the Microsoft Exchange Server Load Generator tool (Loadgen). Loadgen is designed to produce a simulated client workload against an Exchange deployment. This workload can be used to evaluate the performance of the Exchange system, and can also be used to evaluate the effect of various configuration changes on the overall solution while the system is under load. Loadgen is capable of simulating Microsoft Office Outlook 2007 (online and cached), Office Outlook 2003 (online and cached), POP3, IMAP4, SMTP, ActiveSync, and Outlook Web App (known in Exchange 2007 and earlier versions as Outlook Web Access) client activity. It can be used to generate a single protocol workload, or these client protocols can be combined to generate a multiple protocol workload.

You can get the Loadgen tool from the Microsoft Download Center.

The documentation included with the Loadgen installer describes how to configure and execute a Loadgen test against an Exchange deployment.

When validating your server design, test the worst case scenario under anticipated peak workload. Based on a number of data sets from Microsoft IT and other customers, peak load is generally equal to 2x the average workload throughout the remainder of the work day. This is referred to as the peak-to-average workload ratio.

Screen shot of Performance Monitor

In this Performance Monitor snapshot, which displays various counters that represent the amount of Exchange work being performed over time on a production Mailbox server, the average value for RPC operations per second (the highlighted line) is about 2,386 when averaged across the entire day. The average for this counter during the peak period from 10:00 through 11:00 is about 4,971, giving a peak-to-average ratio of 2.08.

To ensure that the Exchange solution is capable of sustaining the workload generated during the peak average, modify Loadgen settings to generate a constant amount of load at the peak average level, rather than spreading out the workload over the entire simulated work day. Loadgen task-based simulation modules (like the Outlook simulation modules) utilize a task profile that defines the number of times each task will occur for an average user within a simulated day.

The total number of tasks that need to run during a simulated day is calculated as the number of users multiplied by the sum of task counts in the configured task profile. Loadgen then determines the rate at which it should run tasks for the configured set of users by dividing the total number of tasks to run in the simulated day by the simulated day length. For example, if Loadgen needs to run 1,000,000 tasks in a simulated day, and a simulated day is equal to 8 hours (28,800 seconds), Loadgen must run 1,000,000 ÷ 28,800 = 34.72 tasks per second to meet the required workload definition. To increase the amount of load to the desired peak average, divide the default simulated day length (8 hours) by the peak-to-average ratio (2) and use this as the new simulated day length.

Using the task rate example again, 1,000,000 ÷ 14,400 = 69.44 tasks per second. This reduces the simulated day length by half, which results in doubling the actual workload run against the server and achieving our goal of a peak average workload. You don't adjust the run length duration of the test in the Loadgen configuration. The run length duration specifies the duration of the test and doesn't affect the rate at which tasks will be run against the Exchange server.
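
The task rate arithmetic above can be reproduced with a few lines of Python. This is a sketch of the calculation only; the function names and the sample task profile are hypothetical and don't correspond to Loadgen settings or APIs.

```python
def total_tasks(users, task_profile):
    """Number of users multiplied by the sum of task counts in the task profile."""
    return users * sum(task_profile.values())

def tasks_per_second(tasks, simulated_day_hours):
    """Rate at which tasks must run to complete within the simulated day."""
    return tasks / (simulated_day_hours * 3600)

def peak_day_length_hours(default_day_hours, peak_to_average_ratio):
    """Simulated day length that produces the peak average workload."""
    return default_day_hours / peak_to_average_ratio

# Peak-to-average ratio from the Performance Monitor example.
print(round(4971 / 2386, 2))                                    # 2.08

# Task rate for 1,000,000 tasks with the default 8-hour day, then the peak day.
print(round(tasks_per_second(1_000_000, 8), 2))                 # 34.72
peak_hours = peak_day_length_hours(8, 2)                        # 4.0 hours
print(round(tasks_per_second(1_000_000, peak_hours), 2))        # 69.44

# Hypothetical task profile: 1,000 users, 50 tasks per user per day.
print(total_tasks(1000, {"send": 10, "read": 40}))              # 50000
```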

The following server design validation test cases were executed against the solution and should be considered as a starting point for server design validation. Specific deployments may have other validation requirements that can be met with additional testing, so this list isn't intended to be exhaustive:

  • Normal operating conditions   In this test case, the basic design of the solution is validated with all components in their normal operating state (no failures simulated). The desired workload is generated against the solution, and the overall performance of the solution is validated against the metrics that follow.
  • Single server failure or single server maintenance (in site)   In this test case, a single server is taken down to simulate either an unexpected failure of the server or a planned maintenance operation for the server. The workload that would normally be handled by the unavailable server is now handled by other servers in the solution topology, and the overall performance of the solution is validated.

Exchange performance data has some natural variation within test runs and among test runs. We recommend that you take the average of multiple runs to smooth out this variation. For Exchange tested solutions, a minimum of three separate test runs with durations of eight hours was completed. Performance data was collected for the full eight-hour duration of the test. Performance summary data was taken from a three to four hour stable period (excluding the first two hours of the test and the last hour of the test). For each Exchange server role, performance summary data was averaged between servers for each test run, providing a single average value for each data point. The values for each run were then averaged, providing a single data point for all servers of a like server role across all test runs.
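
The averaging approach described above can be sketched in Python as follows. The sketch assumes that each counter sample is recorded as an (elapsed hour, value) pair and that the stable period is whatever remains after the first two hours and the last hour are dropped. The function names, data layout, and sample values are hypothetical.

```python
from statistics import mean

def stable_window(samples, skip_start_hours=2.0, skip_end_hours=1.0, test_hours=8.0):
    """Keep the values of (hour, value) samples outside the first and last periods."""
    end = test_hours - skip_end_hours
    return [value for hour, value in samples if skip_start_hours <= hour < end]

def summarize(runs):
    """Average one counter for all servers of a like role across all test runs.

    runs is a list of test runs; each run maps server name -> list of (hour, value).
    Each server is averaged over its stable window, servers are averaged within
    a run, and the per-run values are then averaged.
    """
    per_run = [
        mean(mean(stable_window(samples)) for samples in run.values())
        for run in runs
    ]
    return mean(per_run)

# Hypothetical % Processor Time samples for two servers over two runs.
runs = [
    {"EX1": [(0, 90), (3, 40), (4, 60), (7.5, 95)], "EX2": [(3, 30), (5, 50)]},
    {"EX1": [(2, 50)], "EX2": [(6, 60)]},
]
print(summarize(runs))
```

Averaging within each run before averaging across runs keeps a run with more samples from dominating the final value.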

Before you look at any performance counters or start your performance validation analysis, verify that the workload you expected to run matched the workload that you actually ran. Although there are many ways to determine whether the simulated workload matched the expected workload, the easiest and most consistent way is to look at the message delivery rate.

Every message profile consists of the sum of the average number of messages sent per day and the average number of messages received per day. To calculate the message delivery rate, select the average number of messages received per day from the following table.

Peak message delivery rate

Message profile  Messages sent per day  Messages received per day
50               10                     40
100              20                     80
150              30                     120
200              40                     160

The following example assumes that each Mailbox server has 5,000 active mailboxes with a 150 messages per day profile (30 messages sent and 120 messages received per day).

Peak message delivery rate for 5,000 active mailboxes

Description                                            Calculation                          Value
Message profile                                        Number of messages received per day  120
Number of active mailboxes per Mailbox server          Not applicable                       5000
Total messages received per day per Mailbox server     5000 × 120                           600000
Total messages received per second per Mailbox server  600000 ÷ 28800                       20.83
Total messages adjusted for peak load                  20.83 × 2                            41.67

You expect 41.67 messages per second delivered on each Mailbox server running 5,000 active mailboxes with a message profile of 150 messages per day during peak load.

The actual message delivery rate can be measured using the following counter on each Mailbox server: MSExchangeIS Mailbox(_Total)\Messages Delivered/sec. If the measured message delivery rate is within one or two messages per second of the target message delivery rate, you can be confident that the desired load profile was run successfully.
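
The target rate and the comparison with the measured counter can be expressed as a small Python sketch. The function names are hypothetical, the two-message tolerance reflects the "within one or two messages per second" guidance above, and the measured values are made up for illustration.

```python
def peak_delivery_rate(active_mailboxes, received_per_day,
                       day_seconds=28_800, peak_ratio=2.0):
    """Expected messages delivered per second on one Mailbox server at peak load."""
    return active_mailboxes * received_per_day / day_seconds * peak_ratio

def load_matches(measured, target, tolerance=2.0):
    """True if the measured delivery rate is within the tolerance of the target."""
    return abs(measured - target) <= tolerance

target = peak_delivery_rate(5000, 120)
print(round(target, 2))             # 41.67

# Hypothetical averages of Messages Delivered/sec from two test runs.
print(load_matches(40.9, target))   # True: the expected load was generated
print(load_matches(35.0, target))   # False: the run generated too little load
```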

This section describes the Performance Monitor counters and thresholds used to determine whether the Exchange environment was sized properly and is able to run in a healthy state during extended periods of peak workload. For more information about counters relevant to Exchange performance, see Performance and Scalability Counters and Thresholds.

When validating whether a Mailbox server was properly sized, focus on processor, memory, storage, and Exchange application health. This section describes the approach to validating each of these components.

Processor

During the design process, you calculated the adjusted megacycle capacity of the server or processor platform. You then determined the maximum number of active mailboxes that could be supported by the server without exceeding 80 percent of the available megacycle capacity. You also determined what the projected CPU utilization should be during normal operating conditions and during various server maintenance or failure scenarios.

During the validation process, verify that the worst case scenario workload doesn't exceed 80 percent of the available megacycles. Also, verify that actual CPU utilization is close to the expected CPU utilization during normal operating conditions and during various server maintenance or failure scenarios.

For physical Exchange deployments, use the Processor(_Total)\% Processor Time counter and verify that this counter is less than 80 percent on average.

 

Counter                             Target
Processor(_Total)\% Processor Time  <80%

Memory

During the design process, you calculated the amount of database cache required to support the maximum number of active databases on each Mailbox server. You then determined the optimal physical memory configuration to support the database cache and system memory requirements.

Validating whether an Exchange Mailbox server has sufficient memory to support the target workload isn't a simple task. Using available memory counters to view how much physical memory is remaining isn't helpful because the memory manager in Exchange is designed to use almost all of the available physical memory. The information store (store.exe) reserves a large portion of physical memory for database cache. The database cache is used to store database pages in memory. When a page is accessed in memory, the information doesn't have to be retrieved from disk, reducing read I/O. The database cache is also used to optimize write I/O.

When a database page is modified (known as a dirty page), the page stays in cache for a period of time. The longer it stays in cache, the better the chance that the page will be modified multiple times before those changes are written to the disk. Keeping dirty pages in cache also causes multiple pages to be written to the disk in the same operation (known as write coalescing). Exchange uses as much of the available memory in the system as possible, which is why there aren't large amounts of available memory on an Exchange Mailbox server.

It may not be easy to know whether the memory configuration on your Exchange Mailbox server is undersized. For the most part, the Mailbox server will still function, but your I/O profile may be much higher than expected. Higher I/O can lead to higher disk read and write latencies, which may impact application health and client user experience. In the results section, there isn't any reference to memory counters. Potential memory issues will be identified in the storage validation and application health result sections, where memory-related issues are more easily detected.

Storage

If you have performance issues with your Exchange Mailbox server, those issues may be storage-related issues. Storage issues may be caused by having an insufficient number of disks to support the target I/O requirements, having overloaded or poorly designed storage connectivity infrastructure, or by factors that change the target I/O profile like insufficient memory, as discussed previously.

The first step in storage validation is to verify that the database latencies are below the target thresholds. In previous releases, logical disk counters determined disk read and write latency. In Exchange 2010, the Exchange Mailbox server that you are monitoring is likely to have a mix of active and passive mailbox database copies. The I/O characteristics of active and passive database copies are different. Because the size of the I/O is much larger on passive copies, there are typically much higher latencies on passive copies. Latency targets for passive databases are 200 msec, which is 10 times higher than targets on active database copies. This isn't much of a concern because high latencies on passive databases have no impact on client experience. But if you are using the traditional logical disk counters to measure latencies, you must review the individual volumes and separate volumes containing active and passive databases. Instead, we recommend that you use the new MSExchange Database counters in Exchange 2010.

When validating latencies on Exchange 2010 Mailbox servers, we recommend you use the counters in the following table for active databases.

 

Counter                                                             Target
MSExchange Database\I/O Database Reads (Attached) Average Latency   <20 msec
MSExchange Database\I/O Database Writes (Attached) Average Latency  <20 msec
MSExchange Database\IO Log Writes Average Latency                   <1 msec

We recommend that you use the counters in the following table for passive databases.

 

Counter                                                             Target
MSExchange Database\I/O Database Reads (Recovery) Average Latency   <200 msec
MSExchange Database\I/O Database Writes (Recovery) Average Latency  <200 msec
MSExchange Database\IO Log Read Average Latency                     <200 msec

Note:
To view these counters in Performance Monitor, you must enable the advanced database counters. For more information, see How to Enable Extended ESE Performance Counters.

In addition to disk latencies, review the Database\Database Page Fault Stalls/sec counter. This counter indicates the rate of page faults that can't be serviced because there are no pages available for allocation from the database cache. This counter should be 0 on a healthy server.

 

Counter                                  Target
Database\Database Page Fault Stalls/sec  <1

Also, review the Database\Log Record Stalls/sec counter, which indicates the number of log records that can't be added to the log buffers per second because the log buffers are full. This counter should average less than 10.

 

Counter                         Target
Database\Log Record Stalls/sec  <10

Exchange Application Health

Even if there are no obvious issues with processor, memory, and disk, we recommend that you monitor the standard application health counters to ensure that the Exchange Mailbox server is in a healthy state.

The MSExchangeIS\RPC Averaged Latency counter provides the best indication of whether other counters with high database latencies are actually impacting Exchange health and client experience. Often, high RPC averaged latencies are associated with a high number of RPC requests, which should be less than 70 at all times.

 

Counter                            Target
MSExchangeIS\RPC Averaged Latency  <10 msec on average
MSExchangeIS\RPC Requests          <70 at all times

Next, make sure that the transport layer is healthy. Any issues in transport or issues downstream of transport affecting the transport layer can be detected with the MSExchangeIS Mailbox(_Total)\Messages Queued for Submission counter. This counter should be less than 50 at all times. There may be temporary increases in this counter, but the counter value shouldn't grow over time and shouldn't be sustained for more than 15 minutes.

 

Counter                                                      Target
MSExchangeIS Mailbox(_Total)\Messages Queued for Submission  <50 at all times
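
The distinction between a temporary increase and a sustained one can be made concrete with a short Python sketch. It assumes the counter is sampled at a fixed interval (one minute here) and treats any run of consecutive samples at or above the threshold as one sustained period. The function names and sample data are hypothetical.

```python
def longest_sustained_minutes(samples, threshold=50, interval_minutes=1):
    """Longest run of consecutive samples at or above the threshold, in minutes."""
    longest = current = 0
    for value in samples:
        current = current + 1 if value >= threshold else 0
        longest = max(longest, current)
    return longest * interval_minutes

def queue_is_healthy(samples, threshold=50, max_minutes=15, interval_minutes=1):
    """True if the counter never stays at or above the threshold for too long."""
    sustained = longest_sustained_minutes(samples, threshold, interval_minutes)
    return sustained <= max_minutes

# Hypothetical one-minute samples of Messages Queued for Submission.
brief_spike = [10] * 30 + [80] * 10 + [10] * 20   # 10-minute excursion
long_backlog = [10] * 30 + [80] * 20 + [10] * 10  # 20-minute excursion

print(queue_is_healthy(brief_spike))   # True
print(queue_is_healthy(long_backlog))  # False
```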

Next, ensure that maintenance of the database copies is in a healthy state. Any issues with log shipping or log replay can be identified using the MSExchange Replication(*)\CopyQueueLength and MSExchange Replication(*)\ReplayQueueLength counters. The copy queue length shows the number of transaction log files waiting to be copied to the passive copy log file folder and should be less than 1 at all times. The replay queue length shows the number of transaction log files waiting to be replayed into the passive copy and should be less than 5. Higher values don't impact client experience, but result in longer store mount times when a handoff, failover, or activation is performed.

 

Counter                                      Target
MSExchange Replication(*)\CopyQueueLength    <1
MSExchange Replication(*)\ReplayQueueLength  <5

To determine whether a Client Access server is healthy, review processor, memory, and application health. For an extended list of important counters, see Client Access Server Counters.

Processor

For physical Exchange deployments, use the Processor(_Total)\% Processor Time counter. This counter should be less than 80 percent on average.

 

Counter                             Target
Processor(_Total)\% Processor Time  <80%

Application Health

To determine whether the MAPI client experience is acceptable, use the MSExchange RpcClientAccess\RPC Averaged Latency counter. This counter should be below 250 msec. High latencies can be associated with a large number of RPC requests. The MSExchange RpcClientAccess\RPC Requests counter should be below 40 on average.

 

Counter                                          Target
MSExchange RpcClientAccess\RPC Averaged Latency  <250 msec
MSExchange RpcClientAccess\RPC Requests          <40

To determine whether a transport server is healthy, review processor, disk, and application health. For an extended list of important counters, see Transport Server Counters.

Processor

For physical Exchange deployments, use the Processor(_Total)\% Processor Time counter. This counter should be less than 80 percent on average.

 

Counter                             Target
Processor(_Total)\% Processor Time  <80%

Disk

To determine whether disk performance is acceptable, use the Logical Disk(*)\Avg. Disk sec/Read and Logical Disk(*)\Avg. Disk sec/Write counters for the volumes containing the transport logs and database. Both of these counters should be less than 20 msec.

 

Counter                              Target
Logical Disk(*)\Avg. Disk sec/Read   <20 msec
Logical Disk(*)\Avg. Disk sec/Write  <20 msec

Application Health

To determine whether a Hub Transport server is sized properly and running in a healthy state, examine the MSExchangeTransport Queues counters outlined in the following table. All of these queues will have messages at various times. You want to ensure that the queue length isn't sustained and growing over a period of time. If larger queue lengths occur, this could indicate an overloaded Hub Transport server. Or, there may be network issues or an overloaded Mailbox server that's unable to receive new messages. You will need to check other components of the Exchange environment to verify.

 

Counter | Target
MSExchangeTransport Queues(_total)\Aggregate Delivery Queue Length (All Queues) | <3000
MSExchangeTransport Queues(_total)\Active Remote Delivery Queue Length | <250
MSExchangeTransport Queues(_total)\Active Mailbox Delivery Queue Length | <250
MSExchangeTransport Queues(_total)\Retry Mailbox Delivery Queue Length | <100
MSExchangeTransport Queues(_total)\Submission Queue Length | <100
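The text above distinguishes queues that briefly hold messages from queues whose length is sustained and growing. One possible way to express that distinction is sketched below in Python. The window size and the rule itself (never decreasing and ending higher than it started) are assumptions chosen for illustration, not criteria defined in this document.

```python
def is_sustained_growth(samples, window=5):
    """True if the last `window` samples never decrease and end higher than they start."""
    if len(samples) < window:
        return False  # not enough history to judge a trend
    recent = samples[-window:]
    never_drops = all(later >= earlier for earlier, later in zip(recent, recent[1:]))
    return never_drops and recent[-1] > recent[0]

# Hypothetical queue-length samples, for illustration only.
bursty = [5, 40, 12, 3, 20, 8]         # spikes that drain again
backing_up = [3, 10, 25, 60, 140, 300]  # keeps climbing

print(is_sustained_growth(bursty))
print(is_sustained_growth(backing_up))
```

A queue that passes this check would still need to be compared with the targets in the table; the trend check only flags that messages are accumulating faster than they are delivered.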


You can use the information in the following sections for functional validation tests.

A database switchover is the process by which an individual active database is switched over to another database copy (a passive copy), and that database copy is made the new active database copy. Database switchovers can happen both within and across datacenters. A database switchover can be performed by using the Exchange Management Console (EMC) or the Exchange Management Shell.

To validate that a passive copy of a database can be successfully activated on another server, run the following command.

Move-ActiveMailboxDatabase <DatabaseName> -ActivateOnServer <TargetServer>

Success criteria: The active mailbox database is mounted on the specified target server. This result can be confirmed by running the following command.

Get-MailboxDatabaseCopyStatus <DatabaseName>

A server switchover is the process by which all active databases on a DAG member are activated on one or more other DAG members. Like database switchovers, a server switchover can occur both within a datacenter and across datacenters, and it can be initiated by using both the EMC and the Shell.

  • To validate that all passive copies of databases on a server can be successfully activated on other servers hosting a passive copy, run the following command.
    Get-MailboxDatabase -Server <ActiveMailboxServer> | Move-ActiveMailboxDatabase -ActivateOnServer <TargetServer>
    
    Success criteria: The active mailbox databases are mounted on the specified target server. This can be confirmed by running the following command.
    Get-MailboxDatabaseCopyStatus <DatabaseName>
    
  • To validate that one copy of each of the active databases will be successfully activated on another Mailbox server hosting passive copies of the databases, shut down the server by performing the following action.
    Turn off the current active server.
    Success criteria: The active mailbox databases are mounted on another Mailbox server in the DAG. This can be confirmed by running the following command.
    Get-MailboxDatabaseCopyStatus <DatabaseName>
    

A server failover occurs when the DAG member can no longer service the MAPI network, or when the Cluster service on a DAG member can no longer contact the remaining DAG members.

To validate that one copy of each of the active databases will be successfully activated on another Mailbox server hosting passive copies of the databases, turn off the server by performing one of the following actions:

  • Press and hold the power button on the server until the server turns off.
  • Pull the power cables from the server, which results in the server turning off.

Success criteria: The active mailbox databases are mounted on another Mailbox server in the DAG. This can be confirmed by running the following command.

Get-MailboxDatabase -Server <MailboxServer> | Get-MailboxDatabaseCopyStatus 
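The success criteria above can also be checked programmatically once the copy status has been captured. The Python sketch below assumes the output of Get-MailboxDatabaseCopyStatus has been exported and loaded as a list of records with Name (in the form Database\Server) and Status fields. The export step, the server and database names, and the helper names are hypothetical.

```python
def mounted_servers(copy_status):
    """Map each database name to the servers that report its copy as Mounted."""
    mounted = {}
    for row in copy_status:
        database, _, server = row["Name"].partition("\\")
        servers = mounted.setdefault(database, [])  # keep databases with no mounted copy
        if row["Status"] == "Mounted":
            servers.append(server)
    return mounted

def failover_succeeded(copy_status, failed_server):
    """True if every database has exactly one mounted copy, none on the failed server."""
    result = mounted_servers(copy_status)
    return bool(result) and all(
        len(servers) == 1 and servers[0] != failed_server
        for servers in result.values()
    )

# Hypothetical status after MBX1 was powered off, for illustration only.
STATUS_AFTER_FAILOVER = [
    {"Name": "DB1\\MBX2", "Status": "Mounted"},
    {"Name": "DB1\\MBX1", "Status": "ServiceDown"},
    {"Name": "DB2\\MBX2", "Status": "Mounted"},
    {"Name": "DB2\\MBX1", "Status": "ServiceDown"},
]

print(failover_succeeded(STATUS_AFTER_FAILOVER, failed_server="MBX1"))
```

The same check applies to the switchover tests if the server that was switched away from is passed as failed_server.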


Testing was conducted at the Microsoft Enterprise Engineering Center (EEC), a state-of-the-art enterprise solutions validation laboratory on the Microsoft main campus in Redmond, Washington.

The EEC houses more than 125 million dollars in hardware and maintains strong ongoing partnerships with the industry's leading original equipment manufacturers (OEMs), so virtually any production environment can be replicated there. The EEC offers an environment that enables extensive collaboration among customers, partners, and Microsoft product engineers. This helps ensure that Microsoft end-to-end solutions will meet the high expectations of customers.


The following section summarizes the results of the functional and performance validation tests.

The following table summarizes the functional validation test results.

Functional validation results

Test case | Result | Comments
Database switchover | Successful | Completed without errors
Server switchover | Successful | Completed without errors
Server failure | Successful | Completed without errors


The following tables summarize the Jetstress storage validation results. This solution achieved higher than target transactional I/O while maintaining database latencies well under the 20 msec target.

 

Overall test result | Pass

Overall throughput

Transactional I/O per second | Result
Target transactional I/O per second | 960
Achieved transactional I/O per second | 1158
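To put the throughput figures above in perspective, the short Python sketch below computes how far the achieved transactional I/O exceeds the target. The function name headroom_percent is an illustrative choice; the two input values are taken from the table above.

```python
def headroom_percent(target, achieved):
    """Percentage by which the achieved value exceeds the target (negative if it falls short)."""
    return (achieved - target) / target * 100.0

TARGET_IOPS = 960     # target transactional I/O per second
ACHIEVED_IOPS = 1158  # achieved transactional I/O per second

print(f"{headroom_percent(TARGET_IOPS, ACHIEVED_IOPS):.1f}% above target")
```

With these values, the achieved rate is roughly 20 percent above the target.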

Transactional I/O performance: database reads

Database | I/O database reads per second | I/O database reads average latency (msec)
Instance1 | 134 | 12.7
Instance2 | 134 | 12.6
Instance3 | 134 | 12.6
Instance4 | 134 | 12.6
Instance5 | 134 | 12.6
Instance6 | 134 | 12.6
Instance7 | 134 | 12.7
Instance8 | 134 | 12.6

Transactional I/O performance: database writes

Database | I/O database writes per second | I/O database writes average latency (msec)
Instance1 | 59 | 10.6
Instance2 | 59 | 10.6
Instance3 | 59 | 10.6
Instance4 | 59 | 10.4
Instance5 | 59 | 10.2
Instance6 | 59 | 9.6
Instance7 | 59 | 10.3
Instance8 | 59 | 10.4

Transactional I/O performance: log writes

Database | I/O log writes per second | I/O log writes average latency (msec)
Instance1 | 41 | 6.0
Instance2 | 41 | 5.9
Instance3 | 41 | 6.0
Instance4 | 41 | 6.0
Instance5 | 41 | 6.0
Instance6 | 41 | 5.9
Instance7 | 41 | 5.9
Instance8 | 41 | 6.0


The following sections summarize the server design validation results for the test cases.

The first test case represents peak workload during normal operating conditions. Normal operating conditions refer to a state where all of the active and passive databases reside on the servers they were planned to run on. Because this test case doesn't represent the worst case workload, it isn't the key performance validation test. It provides a good indication of how this environment should run outside of a server failure or maintenance event. In this case, each Mailbox server is running three active, three passive, and three lagged databases.

The message delivery rate verifies that the tested workload matched the target workload. The actual message delivery rate is slightly higher than the target.

 

Counter | Target | Tested result
Message Delivery Rate / Mailbox Server | 20.8 | 22.2

The following tables show the validation of multiple role servers.

Processor

Processor utilization is well under the threshold.

 

Counter | Target | Tested result
Processor(_Total)\% Processor Time | <80% | 52

Storage

The storage results are acceptable with read and write latencies at or below 20 msec. The database page fault stalls and log record stalls are both 0, as expected.

 

Counter | Target | Tested result
MSExchange Database\I/O Database Reads (Attached) Average Latency | <20 msec | 16
MSExchange Database\I/O Database Writes (Attached) Average Latency | <20 msec, <Reads average | 20
Database\Database Page Fault Stalls/sec | 0 | 0
MSExchange Database\IO Log Writes Average Latency | <20 msec | 9
Database\Log Record Stalls/sec | 0 | 0
MSExchange Database\I/O Database Reads (Recovery) Average Latency | <200 msec | 48
MSExchange Database\I/O Database Writes (Recovery) Average Latency | <200 msec | 43
MSExchange Database\IO Log Read Average Latency | <200 msec | 29
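The comparison of tested results with targets in the storage table above can be sketched in Python as follows. The parser handles only the target formats that appear in this document ("<N msec", "<N%", and a bare number meaning an exact value), counter names are shortened for readability, and the strict reading of "<" is an assumption. Under that strict reading, the database write latency of 20 msec is reported as sitting at the target rather than under it, which matches the "at or below 20 msec" wording in the text.

```python
import re

def meets_target(target, value):
    """Evaluate a target such as '<20 msec' or '0' against a measured value."""
    match = re.fullmatch(r"\s*(<)?\s*([\d.]+)\s*(?:msec|%)?\s*", target)
    if match is None:
        raise ValueError(f"unrecognized target: {target!r}")
    limit = float(match.group(2))
    # '<N' means strictly below N; a bare number means the value must equal it.
    return value < limit if match.group(1) else value == limit

# (counter, target, tested result) copied from the storage table above,
# with counter names shortened.
ROWS = [
    ("I/O Database Reads (Attached) Average Latency", "<20 msec", 16),
    ("I/O Database Writes (Attached) Average Latency", "<20 msec", 20),
    ("Database Page Fault Stalls/sec", "0", 0),
    ("IO Log Writes Average Latency", "<20 msec", 9),
    ("Log Record Stalls/sec", "0", 0),
    ("I/O Database Reads (Recovery) Average Latency", "<200 msec", 48),
    ("I/O Database Writes (Recovery) Average Latency", "<200 msec", 43),
    ("IO Log Read Average Latency", "<200 msec", 29),
]

for counter, target, value in ROWS:
    status = "meets target" if meets_target(target, value) else "not under target"
    print(f"{counter}: {value} vs {target} -> {status}")
```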

Application Health of Mailbox Server

Exchange is very healthy, and all of the counters used to determine application health are well under target values.

 

Counter | Target | Tested result
MSExchangeIS\RPC Requests | <70 | 19
MSExchangeIS\RPC Averaged Latency | <10 msec | 8
MSExchangeIS Mailbox(_Total)\Messages Queued for Submission | 0 | 5.5
MSExchange Replication(*)\CopyQueueLength | <1 | 0.2
MSExchange Replication(*)\ReplayQueueLength | <2 | 1.525

Application Health of Client Access Server

The low RPC averaged latency values confirm a healthy Client Access server with no impact on client experience.

 

Counter | Target | Tested result
MSExchange RpcClientAccess\RPC Averaged Latency | <250 msec | 17
MSExchange RpcClientAccess\RPC Requests | <40 | 12.5

Application Health of Hub Transport Server

The Transport Queue counters are all well under target, confirming that the Hub Transport server is healthy and able to process and deliver the required messages.

 

Counter | Target | Tested result
MSExchangeTransport Queues(_total)\Aggregate Delivery Queue Length (All Queues) | <3000 | 41
MSExchangeTransport Queues(_total)\Active Remote Delivery Queue Length | <250 | 0
MSExchangeTransport Queues(_total)\Active Mailbox Delivery Queue Length | <250 | 40
MSExchangeTransport Queues(_total)\Submission Queue Length | <100 | 0
MSExchangeTransport Queues(_total)\Retry Mailbox Delivery Queue Length | <100 | 1

The second test case represents peak workload during a failure or maintenance event, where two of the four servers are no longer operational, and six databases are active on the surviving server in the pair. This test case represents the worst case workload, and therefore is considered the key performance validation test.

The message delivery rate verifies that the tested workload matched the target workload. The actual message delivery rate is slightly higher than the target.

 

Counter | Target | Tested result
Message Delivery Rate / Mailbox Server | 41.6 | 44.4

The following tables show the validation of Mailbox servers.

Processor

Processor utilization is well under the threshold, as expected.

 

Counter | Target | Tested result
Processor(_Total)\% Processor Time | <80% | 68

Storage

The storage results are acceptable. The I/O database read average latency (21 msec) is right at the 20 msec target, while the I/O database write average latency (32 msec) is above it. The database page fault stalls and log record stalls are both 0, as expected.

 

Counter | Target | Tested result
MSExchange Database\I/O Database Reads (Attached) Average Latency | <20 msec | 21
MSExchange Database\I/O Database Writes (Attached) Average Latency | <20 msec | 32
Database\Database Page Fault Stalls/sec | 0 | 0
MSExchange Database\IO Log Writes Average Latency | <20 msec | 9
Database\Log Record Stalls/sec | 0 | 0

Application Health of Mailbox Server

Exchange is healthy, and all the counters used to determine application health are well under target values.

 

Counter | Target | Tested result
MSExchangeIS\RPC Requests | <70 | 32
MSExchangeIS\RPC Averaged Latency | <10 msec | 9
MSExchangeIS Mailbox(_Total)\Messages Queued for Submission | 0 | 9.5
MSExchange Replication(*)\CopyQueueLength | <1 | 0.3

Application Health of Client Access Server

The low RPC Averaged Latency values confirm a healthy Client Access server with no impact on client experience.

 

Counter | Target | Tested result
MSExchange RpcClientAccess\RPC Averaged Latency | <250 msec | 18
MSExchange RpcClientAccess\RPC Requests | <40 | 20

Application Health of Hub Transport Server

The Transport Queues counters are all well under target, confirming that the Hub Transport server is healthy and able to process and deliver the required messages.

 

Counter | Target | Tested result
MSExchangeTransport Queues(_total)\Aggregate Delivery Queue Length (All Queues) | <3000 | 15.5
MSExchangeTransport Queues(_total)\Active Remote Delivery Queue Length | <250 | 0
MSExchangeTransport Queues(_total)\Active Mailbox Delivery Queue Length | <250 | 15.5
MSExchangeTransport Queues(_total)\Submission Queue Length | <100 | 0
MSExchangeTransport Queues(_total)\Retry Mailbox Delivery Queue Length | <100 | 0

Hyper-Threading Technology (HTT) is a technology that enables a single CPU to act like multiple CPUs. Without hyperthreading, a single CPU can only handle one instruction (or thread) from one program at any specific point in time. With hyperthreading, the operating system handles a single physical CPU as two logical CPUs that can process two independent threads at the same time. This increases the efficiency of the physical CPU and results in a lower percentage processor time.

Although hyperthreading can increase the workload that can be run on the server, it also presents some capacity planning and monitoring challenges for Exchange deployments. Performance benefits are not linear and will change depending on the workload being executed. We recommend that all capacity planning be based on systems with hyperthreading disabled. If absolutely necessary, hyperthreading can be enabled as a temporary measure to increase CPU capacity until additional hardware can be obtained.

Many customers ask for data comparing Exchange running on systems with hyperthreading enabled and disabled. In this test case, hyperthreading was enabled on the servers in this environment, and tests were run to determine how this affects processor utilization during normal operating conditions. In the tests, processor utilization dropped from 52 percent to 28 percent with hyperthreading enabled. Results may vary with different workloads and different processors, so we recommend that you conduct your own tests to determine the impact of hyperthreading on your environment.

 

Counter | Target | Tested result without hyperthreading | Tested result with hyperthreading
Processor(_Total)\% Processor Time | <80% | 52 | 28
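The size of the change in the table above can be expressed as a relative reduction, as in the Python sketch below. The function name relative_reduction is illustrative, and the result describes only this test run; it is not a general estimate of the benefit of hyperthreading.

```python
def relative_reduction(before, after):
    """Fractional drop from `before` to `after`, as a percentage of `before`."""
    return (before - after) / before * 100.0

WITHOUT_HT = 52  # % Processor Time with hyperthreading disabled
WITH_HT = 28     # % Processor Time with hyperthreading enabled

print(f"Reported utilization fell by {relative_reduction(WITHOUT_HT, WITH_HT):.1f}% of its original value")
```

For these two measurements, the reported utilization fell by a little under half of its original value.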


This white paper provides an example of how to design, test, and validate an Exchange Server 2010 solution for customer environments with 16,000 mailboxes in a single site deployed on IBM and Brocade hardware. The step-by-step methodology in this document walks through the important design decision points that help address key challenges while ensuring that the customer's core business requirements are met.


For the complete Exchange 2010 documentation, see Exchange Server 2010. For more information about the solutions in this document, see:

This document is provided "as-is." Information and views expressed in this document, including URL and other Internet Web site references, may change without notice. You bear the risk of using it.

This document does not provide you with any legal rights to any intellectual property in any Microsoft product. You may copy and use this document for your internal, reference purposes.

 
© 2014 Microsoft. All rights reserved.