Storage Design for Exchange Server 2007

How Microsoft IT Exceeds High-Availability Targets with Large Mailboxes at Low Costs Based on New Storage Designs

Technical White Paper

Published: April 7, 2008

Situation

Microsoft IT ran Exchange Server 2003 mailbox servers in a clustered configuration to achieve high availability of 99.99 percent, but the Exchange databases remained a critical single point of failure, and the high costs associated with shared SAN storage hindered Microsoft IT from supporting employees with mailbox quotas of greater than 200 MB. The future Mailbox server design required a cost-efficient solution to support mailbox sizes between 500 MB and 2 GB. These increased mailbox sizes made it necessary to provide 10 times more data capacity on mailbox servers and implement redundancy at the storage level to eliminate the need for restores from backup as the primary recovery mechanism after a storage failure.

Solution

By implementing Exchange Server 2007 with a storage architecture based on CCR, Microsoft IT eliminated shared storage as a critical single point of failure by maintaining separate up-to-date copies of mailbox data at all times. This provided an opportunity to replace SAN technology with DAS equipment and support employee productivity with substantially larger mailboxes that accommodated higher redundancy levels.

Benefits

  • Lower costs in comparison to previous SAN-based solutions
  • Improved reliability by eliminating storage as a single point of failure
  • Simplified maintenance and troubleshooting
  • Higher performance and lower I/O requirements
  • Larger mailbox size capacities and more mailboxes per server
  • End-to-end messaging service ownership from server and storage design to operations procedures

Products & Technologies

  • Windows Server 2003
  • Microsoft Exchange Server 2007
  • Microsoft System Center Data Protection Manager 2007
  • Storage Area Networks (SANs)
  • Direct-attached storage (DAS)
  • Cluster continuous replication (CCR)
On This Page

Executive Summary
Introduction
Business Requirements
Advantages of DAS-Based Storage Designs
Microsoft IT Storage Design
Best Practices
Conclusion
For More Information

Executive Summary

More than 18 months after the first Microsoft® Exchange Server 2007 deployment in the corporate messaging environment and more than 12 months after completing the full production rollout across the entire company, the Microsoft Information Technology (Microsoft IT) group is able to report significant benefits such as:

  • Messaging service levels exceeding high-availability targets of 99.99 percent.
  • Cost reductions in excess of $10 million per year.
  • Increased mailbox quotas by up to a factor of 10.
  • Consolidation of the initial Exchange Server 2007 base by nearly a factor of two.

Microsoft IT was able to achieve these results by taking full advantage of new storage features and input/output (I/O) improvements in Exchange Server 2007, the latest advancements in 64-bit processor technology, and direct-attached storage (DAS)-based storage solutions.

One key strategy that accounts for more than $5 million in annual cost savings involved eliminating the need for backups to tape by relying on new high-availability features in Exchange Server 2007 such as cluster continuous replication (CCR) as the first level of protection, and Microsoft System Center Data Protection Manager 2007 as the second level of protection. Microsoft IT is not required to keep data on tape for archiving or other purposes. Moreover, according to an internal study conducted in 2006, Microsoft IT realized a 74 percent reduction of storage costs per gigabyte by replacing Storage Area Network (SAN) technology with DAS technology in the Mailbox server design. CCR enabled Microsoft IT to switch from SAN to DAS, which improved Microsoft IT's ability to support employee productivity by means of large mailboxes with quotas between 500 megabytes (MB) and 2 gigabytes (GB).

Microsoft IT pursued another key strategy that focused on driving down total cost of ownership (TCO) through server consolidation. Microsoft IT has already reduced the initial Mailbox server base in the corporate messaging environment by more than 45 percent, from 62 servers (124 cluster nodes) to 34 Mailbox servers (68 cluster nodes), and consolidation efforts continue. Before and after consolidation, Microsoft employees enjoy large mailbox capacities, fast server response times, and messaging services that exceed the required high-availability level of 99.99 percent and frequently reach 99.999 percent with no extra effort.
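
The consolidation figures above can be checked with a few lines of arithmetic. The following is a minimal sketch only; the function and variable names are hypothetical, and the only inputs are the server and cluster node counts quoted in this paper.

```python
def percent_reduction(before: int, after: int) -> float:
    """Return the percentage by which `after` is smaller than `before`."""
    return (before - after) / before * 100.0


# Counts quoted in this paper (each CCR-based Mailbox server has two nodes).
servers_before, servers_after = 62, 34
nodes_before, nodes_after = 124, 68

server_cut = percent_reduction(servers_before, servers_after)
node_cut = percent_reduction(nodes_before, nodes_after)
consolidation_factor = servers_before / servers_after

print(f"Mailbox server reduction: {server_cut:.1f} percent")
print(f"Cluster node reduction:   {node_cut:.1f} percent")
print(f"Consolidation factor:     {consolidation_factor:.2f}")
```

The result is a reduction of slightly more than 45 percent and a consolidation factor of about 1.8, which matches the "nearly a factor of two" stated in the list of benefits.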

Exchange Server 2007 enables Microsoft IT to not only lower storage costs and increase mailbox quotas, but also decrease storage complexities, regain full control over all aspects of the Mailbox server design (including the storage subsystem), eliminate maintenance overhead, and increase high availability of Mailbox servers. All storage-related issues that Microsoft IT encountered since the initial production rollout of Exchange Server 2007 were recoverable without the need for backups. There have been no critical storage-related incidents affecting Mailbox server availability across the entire corporate messaging environment for more than 18 months.

The purpose of this white paper is to share Microsoft IT knowledge, experiences, and recommendations related to the architecture and design of Exchange Server 2007 Mailbox servers. This paper is not intended to serve as a procedural guide. Although many organizations have similar requirements, each enterprise environment also has unique requirements, making it necessary to adapt the information discussed in this paper.

This white paper assumes that readers are IT architects and technical decision makers who are already familiar with Windows Server® 2003, the Active Directory® directory service, and Exchange Server. Specifically, knowledge about SAN and DAS technologies, server clustering, and the high-availability features of Exchange Server 2007 is helpful. Detailed product information is available in the Microsoft Exchange Server 2007 Technical Library at http://technet.microsoft.com/en-us/library/bb124558.aspx.

Note: For security reasons, the sample names of forests, domains, organizations, and other internal resources mentioned in this paper do not represent real resource names used within Microsoft and are for illustration purposes only.

Introduction

"We learned through bitter experience that SAN redundancies cannot fully compensate for the critical single point of failure that shared storage represents in the clustered Mailbox server architecture. Exchange Server 2007 is the first product version that eliminates this critical single point of failure through CCR. We are further advancing this technology to provide our customers with even more flexibility in Exchange Server 2007 Service Pack 1 and future releases to continue the effort to decrease costs and increase service levels."

Perry Clarke
Product Unit Manager
Exchange Server Product Group
Microsoft Corporation

Perry Clarke, the Product Unit Manager in the Exchange Server product group who is responsible for the technologies of the Mailbox server role, still remembers the time when CCR development began. The product team held CCR as a cornerstone in the vision for Exchange Server 2007 because this technology provides compelling answers to some of the most pressing enterprise customer needs, such as supporting significantly larger mailboxes at substantially lower costs. The development team was excited about this new technology and its potential to support larger mailboxes at lower storage costs, provide shorter failover times, reduce the need for restores from backup, and noticeably decrease storage complexity and eliminate maintenance overhead. Yet to the surprise of many in the product group, Microsoft IT did not share the enthusiasm. Failover clustering was never a question, but in the beginning of 2006, Microsoft IT was skeptical about the possibilities of using CCR on DAS in the Mailbox server design.

Microsoft IT hesitated to embrace CCR on DAS primarily for the following concerns:

  • Need to protect existing IT investments: The SAN environment at Microsoft IT represents considerable investment in technology that is not easily abandoned just because new technology emerges. Microsoft IT initially did not consider the shift to CCR on DAS as inevitable. In the beginning of 2006, Microsoft IT had not yet completed its plans to increase mailbox quotas from 200 MB up to 2 GB. Therefore, it was not immediately apparent that a properly designed SAN environment that accommodates these quotas requires approximately 30 times the existing storage capacity to hold 10 times more messaging data, including corresponding hardware Volume Shadow Copy Service (VSS)-based backups. The costs to increase SAN capacities by a factor of 30 would have been prohibitive, especially when taking ongoing costs for capacity and performance management into consideration. In the absence of concrete numbers, it seemed more prudent to preserve the existing investment.
  • Desire to capitalize on existing expert knowledge: It is a strong Microsoft recommendation to use dedicated storage for Exchange Server to ensure a high transaction rate with low latencies and avoid unpredictable performance behavior, yet accommodating this requirement in a shared SAN environment poses complex configuration and performance optimization challenges. In close collaboration with storage vendors, Microsoft IT engineers developed best practices and actively helped enterprise customers with their SAN optimizations for Exchange Server. Microsoft IT engineers who had gained expert knowledge in the field of SAN optimizations for Exchange Server wanted to capitalize on that knowledge.
  • Perception that DAS was not an enterprise storage technology: Prior to Microsoft Exchange 2000 Server and SANs, Parallel Small Computer System Interface (Parallel SCSI) was state of the art in its various standards, with thick cables; 50-, 68-, or 80-pin connectors; and performance, compatibility, scalability, and reliability issues. Serial Attached SCSI (SAS) began to replace Parallel SCSI by 2006, but for many at Microsoft IT, DAS was still synonymous with fragile connectors, bent pins, loose electrical contacts, and thick cables connecting a maximum of only 16 devices. It was considered impossible to install 100 or 200 DAS drives in a Mailbox server to achieve high scalability. It was likewise unthinkable that a DAS hard disk drive could be more reliable than a SAN hard disk drive. At the end of 2007, SAS technology, surpassing Fibre Channel with higher interface speeds and lower failure rates, had come to market. In the beginning of 2006, while the SAS interface was still an emerging DAS technology, Microsoft IT started the Exchange Server 2007 production rollout.
  • Concerns that DAS would create storage silos and hidden operational costs: Another obstacle that prevented Microsoft IT from initially seeing CCR on DAS as a viable solution for Mailbox servers was the fact that DAS attaches directly to each cluster node, which creates individual storage silos. From a SAN point of view, it is an overwhelming proposition to create a large number of individual storage locations in the corporate messaging environment. In a SAN environment, ongoing costs for storage allocation, capacity management, performance management, and troubleshooting can quickly exceed the initial investment in hardware and installation. By assuming that this issue of hidden ongoing costs would also apply to DAS, Microsoft IT saw any initial DAS savings potential dwindle rapidly. Today, with the benefit of more than 18 months of operating CCR on DAS in production, it is easy to say that DAS storage is "designed once and never touched again." However, in early 2006, Microsoft IT was unable to verify that there is truly no need for DAS capacity and performance management beyond the initial storage design. Replacing broken disks, cables, or redundant array of independent disks (RAID) controllers is merely a part of standard hardware maintenance. Downtime due to storage or other node failures is less than two minutes of failover time in a properly designed, CCR-based Mailbox server, and data loss is greatly reduced due to redundant copies of messaging databases on individual cluster nodes. In fact, when CCR on DAS is compared with shared-storage clusters on SAN, it is noticeable that there is less chance for data loss and less need for database restores from backup because CCR eliminates the data instance used by the active node as a critical single point of failure. CCR on DAS also does not create new storage silos. It merely moves the existing storage silos (which dedicated, exclusive Exchange Server storage represents in a shared SAN environment) out of the high-maintenance, high-cost environment into a low-maintenance, low-cost alternative. Microsoft IT doubted these facts because they were unverifiable at the time.
  • Belief that CCR was not enterprise ready: It is an interesting proposal for an IT organization to commit fully to an emerging technology. However, this was the case for CCR in the beginning of 2006. CCR was a cornerstone in the Exchange Server 2007 vision as a key enabler of employee productivity through large mailboxes. Yet, Microsoft IT was concerned about possible implementation difficulties and delays because CCR was still in an early beta stage. Even without delays, Microsoft IT engineers did not readily take to the idea of relying on new technology with unknown scalability and reliability characteristics that would be at the very core of large Mailbox servers in the corporate messaging environment. CCR has proved its enterprise readiness over the past 18 months, enabling Microsoft IT to maintain 10 times more data on Mailbox servers with higher service availability levels. In early 2006, the development team could not yet prove the enterprise readiness of CCR for the simple reason that there was no enterprise deployment of CCR in existence.
  • Fear that replication latencies would introduce the potential for data loss: Microsoft IT concerns also revolved around the asynchronous nature of CCR, which can result in replication latencies and potential data loss. The scenario for data loss is straightforward: If the primary node receives a message and fails before Exchange Server 2007 replicates the data to the passive node, failover occurs, the passive node becomes active, and the Mailbox server has lost that e-mail message. When the development team suggested that the transport dumpster queue on Hub Transport servers addresses this issue by retaining and redelivering recent messages as needed, Microsoft IT insisted that the product group treat this feature as an intrinsic part of CCR. Microsoft IT did not want to take any chances with lost messages. After a failover, the newly active node must be able to request redelivery from all Hub Transport servers in the local Active Directory site, and the Hub Transport servers must redeliver promptly so that no messages are lost.
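
The data-loss scenario in the last concern, and the way redelivery from the transport dumpster closes the gap, can be illustrated with a toy simulation. This is a conceptual sketch only and does not use or reflect any Exchange Server API: the classes `MailboxNode` and `HubTransport`, the `replicate` function, and the message IDs are all hypothetical, and deduplication is modeled simply by storing message IDs in a set.

```python
class MailboxNode:
    """Hypothetical stand-in for one CCR cluster node and its database copy."""

    def __init__(self, name):
        self.name = name
        self.messages = set()  # message IDs present in this node's copy


class HubTransport:
    """Hypothetical stand-in for a Hub Transport server with a transport dumpster."""

    def __init__(self):
        self.dumpster = []  # recently delivered message IDs, kept for redelivery

    def deliver(self, node, message_id):
        self.dumpster.append(message_id)
        node.messages.add(message_id)

    def redeliver(self, node):
        # Resubmit everything still in the dumpster; the set drops duplicates.
        for message_id in self.dumpster:
            node.messages.add(message_id)


def replicate(active, passive, shipped_ids):
    """Asynchronous replication: only already-shipped messages reach the passive copy."""
    passive.messages |= active.messages & set(shipped_ids)


active, passive = MailboxNode("node-a"), MailboxNode("node-b")
hub = HubTransport()

for message_id in ("m1", "m2", "m3"):
    hub.deliver(active, message_id)

# Replication lags: m3 has not been copied when the active node fails.
replicate(active, passive, shipped_ids=["m1", "m2"])
missing_after_failover = active.messages - passive.messages

# Failover: the passive node becomes active and requests redelivery.
hub.redeliver(passive)
missing_after_redelivery = active.messages - passive.messages

print("Missing after failover:  ", sorted(missing_after_failover))
print("Missing after redelivery:", sorted(missing_after_redelivery))
```

In this sketch, the message that was not yet replicated is missing immediately after failover and is restored once the Hub Transport stand-in redelivers its retained messages.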

During January and February of 2006, emotions ran high between the Exchange Server product group and Microsoft IT. Decisions changed literally every day. One day Microsoft IT would agree to deploy CCR on DAS, and the next day it would revert to its plans for SAN-based single copy clusters (SCCs). In the end, the question was not settled through debate. In the middle of the heated debate, a SAN storage array failure occurred, taking down multiple Mailbox servers and causing an outage and the loss of 8,000 production mailboxes. It took three days to bring the systems back online, and the worst news was yet to come. Through a combination of unfortunate circumstances, the most recent tape-based backups were also irretrievably lost. Microsoft IT was unable to restore the most recent data, and 8,000 users, including employees, partners, contractors, and vendors, lost e-mail data. It was a horrible week for Microsoft IT and the Exchange Server product group alike. It showed not only the critical nature of shared storage as a single point of failure in the Mailbox server architecture, but also the vulnerability of an IT organization if it has to depend on tape-based backups as the primary means to recover from storage failures.

Konstantin Ryvkin is a Senior Technology Architect at Microsoft IT and a member of the Exchange Messaging Engineering team responsible for the design of the corporate messaging environment. Looking back at that time, he says that the disaster and the painful recovery made Microsoft IT a stronger IT organization. It highlighted areas where previous technology and designs were no longer meeting business needs, and it opened doors for more rapid innovation and adoption of new technology. It also renewed the spirit of Microsoft IT to be the first and foremost customer of Microsoft, deploying new technologies at full scale in the corporate messaging environment to provide real-time feedback to the product teams and then to deliver solid proof of the product's enterprise readiness to customers. Microsoft IT did not commit to CCR on DAS merely because of a storage failure on SAN-based Mailbox servers, but because it was important to demonstrate the enterprise readiness of CCR.

Despite a lingering sense of trepidation, a consensus was reached about the need to move forward. In the initial Exchange Server 2007 Mailbox server designs, Microsoft IT cautiously used CCR on DAS at a moderate scale of 2,000 mailboxes with 500-MB quotas. Six months later, that scale increased to 6,000 mailboxes on most servers, representing a total of approximately 5 terabytes of data. It has now reached more than 12 terabytes on Mailbox servers for 4,000 heavy users with 2-GB quotas. CCR on DAS is an absolute success at Microsoft IT, and the Exchange Messaging Engineering team continues to explore product capabilities with new Mailbox server designs. So far, Exchange Server 2007 has not reached its limits, yet there are factors, such as recovery time objectives (RTOs) and recovery point objectives (RPOs), that require Microsoft IT to revamp backup and disaster recovery procedures before placing more than 12 terabytes of messaging data on a Mailbox server.

Business Requirements

While preparing for the Exchange Server 2007 production rollout in 2006, Microsoft IT analyzed internal messaging statistics and trends, assessed user demographics, clarified legal and regulatory requirements, and reviewed existing service level agreements (SLAs) for messaging in order to identify important business requirements. The messaging statistics and trends clearly showed a need for increased storage capacities in the messaging environment. The mailbox count grew by approximately 75 percent over the five years prior to the deployment of Exchange Server 2007, as indicated in Table 1. It was clear that this trend would continue after the rollout. Furthermore, corporate management began to encourage employees to stop using personal folders to archive messages. Based on data gathered from surveys and during pilot projects, Microsoft IT determined that depending on job responsibilities, users would require mailbox capacities of up to 2 GB.

Table 1. Microsoft Messaging Statistics and Projections

| Category | 2002/2003 | 2003/2004 | 2004/2005 | 2005/2006 | 2006/2007 | 2007/2008 |
| --- | --- | --- | --- | --- | --- | --- |
| Total mailboxes | 71,000 | 80,000 | 95,000 | 110,000 | 130,000 | 147,000 |
| Microsoft Exchange ActiveSync® users per month | Not applicable | 6,000 | 13,000 | 21,000 | 31,000 | 48,000 |
| Outlook® Anywhere users per month | Not applicable | 20,000 | 25,000 | 60,000 | 60,000 | 100,000 |
| Internet message submissions per day (unfiltered) | 6,000,000 | 9,000,000 | 11,300,000 | 13,000,000 | 13,500,000 | 30,000,000 |
| Blocked message submissions per day | 2,500,000 | 7,500,000 | 10,000,000 | 10,500,000 | 11,000,000 | 28,000,000 |
| Maximum message size | 2 MB | 5 MB | 10 MB | 10 MB | 10 MB | 10 MB |
| E-mail volume per user per calendar day | 10 MB | 15 MB | 15 MB | 20 MB | 20 MB | 26 MB |
| Number of clustered Mailbox servers | 113* | 38 | 34 | 30 | 62 | 34 |
| Typical mailbox quota | 100 MB | 200 MB | 200 MB | 200 MB | 500 MB or 2 GB | 500 MB or 2 GB |
| Total mailbox data | 7 terabytes | 17 terabytes | 19 terabytes | 22 terabytes | 60 terabytes | Up to 300 terabytes |

* Mostly non-clustered Mailbox servers

Note: This table is an updated version of Table 2 from the Microsoft IT Showcase Note on IT "Going 64-bit with Microsoft Exchange Server 2007," available at http://www.microsoft.com/downloads/details.aspx?FamilyID=f31e7541-f63a-4b7d-b8d2-3794c4dc3329&DisplayLang=en.

Increased Mailbox Capacities

Facing the requirement to increase mailbox capacities by up to a factor of 10 while maintaining existing SLAs, Microsoft IT looked closely at messaging demographics to assess individual user needs more accurately. Figure 1 shows the demographic profile of the corporate messaging environment based on the categories defined in the Exchange Server 2007 product documentation. The product documentation uses an average message size of 50 kilobytes (KB), which corresponds to the typical message size in the corporate messaging environment (43 KB to 50 KB). The user profiles enabled Microsoft IT to make reasonable assumptions regarding mailbox capacity requirements.

Figure 1. Demographic messaging profile at Microsoft

The demographic profile reveals that not all Microsoft users require the largest mailboxes to maintain their productivity levels. In fact, only about one-third of the users are very heavy and heaviest e-mail message recipients with the largest mailbox requirements, about 15 percent of the users are heavy users that can benefit from 2-GB mailboxes, and the majority are light and medium users with moderate messaging needs. For example, most Microsoft partners and vendors with internal accounts belong to the light and medium user groups. Accordingly, as a starting point, Microsoft IT determined that 33 percent of the users require 2-GB mailbox quotas and 66 percent require 500-MB quotas. Full-time employees with 500-MB mailboxes can request 2-GB mailbox quotas if needed. This arrangement directly influenced the Mailbox server designs that Microsoft IT created for the initial production rollout in 2006.
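
To give a sense of what this quota split means per mailbox, the following minimal sketch computes the blended quota implied by the two shares quoted above. The names are hypothetical, the shares are used exactly as stated in this paper (they sum to 99 percent), and the conversion of 1 GB to 1,024 MB is an assumption.

```python
MB_PER_GB = 1024  # assumption: binary units

# (share of users, quota in MB), using the shares quoted in this paper.
quota_mix = [
    (0.33, 2 * MB_PER_GB),  # users with 2-GB quotas
    (0.66, 500),            # users with 500-MB quotas
]


def blended_quota_mb(mix):
    """Weighted sum of quotas across the user groups."""
    return sum(share * quota_mb for share, quota_mb in mix)


average_quota_mb = blended_quota_mb(quota_mix)
growth_vs_previous = average_quota_mb / 200  # previous typical quota: 200 MB

print(f"Blended quota per mailbox: {average_quota_mb:.0f} MB")
print(f"Relative to a 200-MB quota: {growth_vs_previous:.1f}x")
```

Under these assumptions, the blended quota is roughly 1 GB per mailbox, about five times the previous typical quota of 200 MB, while the heaviest users see the full factor of 10.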

During the initial rollout in the production environment, Microsoft IT used three different Mailbox server designs to accommodate the user need for larger mailboxes. The most common server type, based on CCR on DAS, supported 2,000 users with mailbox quotas of 500 MB. The other two server types supported 2,400 and 3,600 users with 2-GB quotas. Microsoft IT deployed the server type based on CCR on SAN for 3,600 users with 2-GB quotas only during the Beta 1 stage and switched entirely to the CCR-on-DAS design for 2,400 users with the availability of Exchange Server 2007 Beta 2. For details about the three original Exchange Server 2007 Mailbox server designs, see the Note on IT "Going 64-bit with Microsoft Exchange Server 2007," available at http://www.microsoft.com/downloads/details.aspx?FamilyID=f31e7541-f63a-4b7d-b8d2-3794c4dc3329&DisplayLang=en.

Service Level Agreements

Several factors caused Microsoft IT to approach the initial server design cautiously. Most importantly, CCR was still in beta and therefore not a trusted technology yet. Additionally, the only backup solution available at the time for Microsoft IT Mailbox servers based on CCR on DAS relied on Windows® Backup and the streaming backup API, which has limited throughput capabilities and requires online backup operations to run on the active node. Performing online backups on the passive node requires VSS-based technology, such as System Center Data Protection Manager 2007, but that product was not available yet. Furthermore, the dual-core processors and server models that Microsoft IT used during the initial rollout limited server scalability. Microsoft IT was not yet able to place 6,000 or more mailboxes on a server without jeopardizing SLAs.

Table 2 summarizes the organization-wide SLAs with a business importance level (BIL) of Important that influence the design of Mailbox servers in the corporate production environment.

Table 2. Organization-Wide SLAs with Impact on Mailbox Server Designs

| Service level definition | Resolution target | Comments |
| --- | --- | --- |
| End-to-end availability of messaging services | 99.99 percent or greater | This SLA gives an end-to-end view of messaging as a managed service and includes Mailbox server availability as well as the availability of Client Access servers, Active Directory, and the network infrastructure. On Mailbox servers, Microsoft IT measures the availability of system services based on stop and start events and the availability of messaging databases based on events generated by the Exchange Information Store service. |
| End-to-end client availability | 99.5 percent or greater | This SLA defines client availability as the percentage of successful remote procedure call (RPC) activity in relationship to failed RPC activity between Microsoft Office Outlook clients or Client Access servers and Mailbox servers. |
| End-to-end client performance | 95 percent or greater | This SLA requires RPC client/server operations between Outlook clients or Client Access servers and Mailbox servers to finish in less than two seconds. |
| Business will continue with messaging service | One hour or less | This SLA requires individual database restores through reseeding or from backup to finish in less than one hour. |
| Retention of mailbox database backups | 14 days or more | This SLA enables Microsoft IT to discard Exchange Server backups after 14 days. Microsoft IT does not use database backups for archiving purposes. |
| Retention of deleted items | 14 days or more | This SLA influences the calculation of server capacity needs, as explained later in this white paper. |
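
The percentage targets in Table 2 are easier to reason about when converted into concrete quantities. The following sketch is illustrative only. The function names are hypothetical, the client availability formula (successful RPCs divided by all RPCs) is one plausible reading of the SLA wording rather than the documented Microsoft IT calculation, and the sample latencies are made up.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years


def downtime_budget_minutes(availability_percent):
    """Yearly downtime allowed by an availability target."""
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)


def client_availability(successful_rpcs, failed_rpcs):
    """Assumed reading of the SLA: successful RPCs as a share of all RPCs."""
    total = successful_rpcs + failed_rpcs
    return 100.0 * successful_rpcs / total if total else 100.0


def client_performance(latencies_seconds, threshold_seconds=2.0):
    """Share of RPC operations that finish in less than the threshold."""
    if not latencies_seconds:
        return 100.0
    fast = sum(1 for s in latencies_seconds if s < threshold_seconds)
    return 100.0 * fast / len(latencies_seconds)


budget_9999 = downtime_budget_minutes(99.99)
budget_99999 = downtime_budget_minutes(99.999)

print(f"99.99 percent allows  {budget_9999:.2f} minutes of downtime per year")
print(f"99.999 percent allows {budget_99999:.2f} minutes of downtime per year")
print(f"Client availability: {client_availability(995, 5):.1f} percent")
print(f"Client performance:  {client_performance([0.5, 1.0, 2.5, 0.1]):.1f} percent")
```

A target of 99.99 percent leaves a little under 53 minutes of downtime per year for the whole end-to-end service, which is why recovery through failover in minutes, rather than restore from backup in hours, matters for this SLA.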

It is important to note that Microsoft IT also maintains RPOs and RTOs as a best-effort commitment. Both the current RPO and the current RTO for complete server restorations are less than 24 hours. However, these RPOs and RTOs remain from the Exchange Server 2003 time frame and were defined long before continuous-replication technology became a reality. They do not address the new capabilities. For example, the RPOs and RTOs do not clearly define a maximum time for reseeding an entire server to re-enable full resiliency after a total node failure. This aspect currently falls on an informal basis under the complete server RTO of less than 24 hours.

The development of new RPOs and RTOs is a work in progress. Microsoft IT is also creating new disaster recovery procedures to take full advantage of CCR and possibly standby continuous replication (SCR). Microsoft IT is targeting an RTO of 12 hours and an RPO of less than one hour in case of a complete data-center loss. Although these targets seem very achievable, Microsoft IT technology architects still need to verify the new disaster recovery procedures within and across data centers with more than 12 terabytes of messaging data on a Mailbox server.

Advantages of DAS-Based Storage Designs

"With more than 18 months of production use, I can say that CCR on DAS is a great solution for a Mailbox server platform running Exchange Server 2007, especially for environments looking for highly available and cost-efficient designs. Through the deployment at Microsoft IT, CCR on DAS proved to perform exceptionally well in high-availability scenarios at the impressive scale far exceeding our initial expectations. Now nobody in the Microsoft IT Messaging team would consider reverting back to the previous SAN-based architecture—we get everything that the business demands from e-mail on the DAS-based CCR platform."

Konstantin Ryvkin
Senior Technology Architect
Microsoft IT Exchange Messaging
Microsoft Corporation

From the business point of view, the single most important advantage of CCR on DAS over any other storage solution for Mailbox servers is the possibility to have substantially increased mailbox sizes with substantially decreased storage costs while maintaining or exceeding existing high-availability levels. For example, DAS enables Microsoft IT to support mailbox quotas of up to 2 GB with a TCO that is comparable to maintaining 200-MB mailbox sizes on SAN. Microsoft employees can use large mailbox capacities in conjunction with optimized, fast, and reliable search capabilities that are built in to Exchange Server 2007. Microsoft employees can store more than a year's worth of e-mail messages in their mailboxes and do not have to use personal folder stores (in .pst files) to move out messages as data volumes approach mailbox limits. Having all messaging data directly on the Mailbox server facilitates data maintenance, backups, and server-based content indexing and search, and reduces security risks and support costs. Having all data on the server also means Microsoft IT can apply centrally maintained policies to ensure compliance with government and company regulations, and users can access all of their messages from any capable device in the office, at home, and on the road. Outlook Anywhere, Microsoft Office Outlook Web Access, and Exchange ActiveSync are preferred in locations with Internet connections, and Unified Messaging provides access in all other locations, as long as at least a stationary or mobile phone is available.

For the technical decision makers at Microsoft IT, one of the most important initial concerns about CCR on DAS is now one of the most compelling reasons for CCR on DAS: maintaining large data volumes on Mailbox servers with high availability. With 10 times more data on the servers and an ever-increasing number of users, it becomes increasingly difficult to rely on backup and restore operations as the primary means to recover from storage failures and maintain existing high-availability SLAs, specifically the Microsoft IT RTOs that demand restoring a Mailbox server with all messaging databases in less than 24 hours. CCR shifts the focus from backups to failovers as the primary recovery mechanism after a server or storage failure and is a key element in Microsoft IT's long-term strategy around large mailboxes.
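
A rough calculation shows why restores become harder to fit into the RTO as data volumes grow. The sketch below estimates the sustained throughput needed to restore a full server within a time window. It is a back-of-the-envelope illustration under stated assumptions: decimal units (1 terabyte = 10^12 bytes), a constant transfer rate, and no time spent on transaction log replay or hardware repair. The function name is hypothetical; the 12-terabyte and 24-hour figures come from this paper.

```python
def required_throughput_mb_per_s(data_terabytes, window_hours):
    """Sustained rate (MB/s, decimal) needed to move the data within the window."""
    total_bytes = data_terabytes * 10**12
    window_seconds = window_hours * 3600
    return total_bytes / window_seconds / 10**6


full_server = required_throughput_mb_per_s(12, 24)   # 12 TB server, 24-hour RTO
half_window = required_throughput_mb_per_s(12, 12)   # same data, 12-hour target

print(f"12 TB in 24 hours: {full_server:.1f} MB/s sustained")
print(f"12 TB in 12 hours: {half_window:.1f} MB/s sustained")
```

Under these assumptions, a 12-terabyte server needs roughly 139 MB/s sustained for an entire day to meet a 24-hour RTO, and twice that for a 12-hour target, with no margin for retries or verification.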

CCR on DAS provides Microsoft IT with the following advantages in the Mailbox server design (explained in more detail in subsequent sections):

  • Increased Exchange data resilience: Failover is the main method to recover from storage failures on the primary node. The primary node has a built-in application-level mechanism that replicates and maintains synchronized copies of messaging databases on separate server nodes and provides failover times of less than two minutes. The Mailbox server can continue to run on the second node with only a short service interruption, whereas none of the nodes on an SCC-based server cluster can keep the Mailbox server running after a storage failure.
  • Ongoing backup operations during regular business hours: Maintaining separate synchronized data copies on the active and the passive node implies that it is possible to perform ongoing backup operations for a Mailbox server on the passive node without affecting users accessing their mailboxes on the active node. Microsoft IT uses Data Protection Manager 2007, which is fully CCR aware and supports passive-node backup operations every 15 minutes.
  • Reduced reliance on traditional backups to restore data: With CCR-based clusters, Microsoft IT uses backup as a secondary tool to ensure the recoverability of data in the event that storage on the second node also fails. This is an unlikely scenario because Microsoft IT attends to any primary node failure immediately. This is possible because Microsoft IT continuously monitors all Mailbox servers in the corporate messaging environment by using Microsoft System Center Operations Manager 2007. In the event of a node failure, Operations Manager 2007 automatically alerts front-line operators, who take prompt action. With the Mailbox server running on the second node, Microsoft IT can repair the failed node and then reseed the messaging databases if necessary without having to rely on restoring from backups. Reseeding is seldom necessary because the Extensible Storage Engine (ESE) supports automatic recovery mechanisms based on transaction log file replay and can sustain most node failures.
  • Simplicity in the storage design and low maintenance overhead: Unlike SANs, which are complex storage systems maintained by highly specialized engineers, DAS is very straightforward and requires only minimal skills that every Exchange administrator can master. Microsoft IT uses external enclosures for hard disk drives and dedicated RAID controllers in each cluster node. It is not necessary to use identical hardware between cluster nodes. Only the drive letter assignments must match. In fact, Microsoft supports CCR-based Mailbox servers with any appropriate components and server models from the standard Hardware Compatibility List (HCL), up to the point that the cluster nodes can be from different server vendors. However, it is important to note that Microsoft IT uses identical configurations so that failover to the passive node can occur without sacrificing performance.
  • Predictable Mailbox server performance: The local nature of the DAS solution ensures optimal performance because the local cluster node can utilize 100 percent of the storage resources. There is no need for shared-capacity management because a single server application, such as Exchange Server 2007, uses the storage resources exclusively. The DAS-based Mailbox server design implicitly conforms to the Microsoft recommendation to provide dedicated storage for Exchange Server operations.

Increased Exchange Data Resilience through Redundancies

Figure 2 shows the architecture of the SAN-based server cluster configuration that Microsoft IT used prior to Exchange Server 2007 in the corporate messaging environment. Four active nodes correspond to four Exchange Server 2003 Mailbox servers, and one primary passive node is available to run one of these Mailbox servers without performance penalties after a failover. The remaining two passive nodes were less powerful and served primarily as systems to perform tape-based backup operations. As illustrated, the SAN environment is fully redundant at the hardware level from the cluster nodes all the way down to the storage media. Any single component can fail along the path to the data without interrupting the service. After a timeout of the pending I/O operation, the SAN software automatically switches to the second path and the Mailbox server continues to run on the same cluster node.

Figure 2. Microsoft IT server cluster configuration for Exchange Server 2003 Mailbox servers

Despite the complexity and redundancy of SANs, Exchange databases remain critical single points of failure. The SAN-based cluster can provide a high level of availability for the messaging databases and the server cluster can recover from hardware failures on up to three cluster nodes through a failover, but this server cluster cannot recover from a critical storage failure. When SAN storage fails for any reason, none of the seven cluster nodes can run the affected Mailbox server or servers until Microsoft IT repairs the system and restores the messaging databases from backup. Although it is unlikely that both disks in a particular mirror set simultaneously break to cause a RAID 10 failure, hardware failures and firmware issues can lead to SAN outages. Human error is also a possibility. In fact, human error is the biggest risk factor due to the complex nature of SANs. It might take a long time until a storage failure happens, but when it happens, the best strategy is to have a second, synchronized data copy readily available. CCR provides this solution.

As illustrated in Figure 3, the CCR-based architecture is very straightforward and fully redundant at the Exchange database level. Microsoft IT connects multiple storage enclosures to each cluster node and creates RAID 10 drives with mirror sets across the enclosures, as explained in more detail later in this paper. At the Exchange database level, this results in twice the redundancy in comparison to the SAN-based Mailbox server configuration illustrated in Figure 2.

Figure 3. CCR-based Mailbox server configuration

Notably, CCR on DAS delivers twice the redundancy at the Exchange database level with less storage sophistication and complexity on the individual cluster nodes than SCC on SAN. As a Microsoft IT engineer put it, two simple storage systems can be better than one complex system. In a SAN-based configuration, every cluster node has two host bus adapters connected to separate Fibre Channel fabrics, switches, and controller pairs, and the Fibre Channel disks are dual ported. With CCR on DAS, by contrast, Microsoft IT uses SAS disks with a single port and only a single RAID controller in each cluster node per storage unit, as shown in Figure 3. At the Exchange database level in the overall Mailbox server design, the RAID controllers are redundant, but at the level of an individual cluster node, the RAID controller is a local single point of failure. Eliminating this local single point of failure requires a CCR on SAN configuration with two host bus adapters in each cluster node connecting each node to a separate SAN storage array. Microsoft IT used this configuration in very early Exchange Server 2007 Mailbox server designs, but switched entirely to CCR on DAS with subsequent deployments during the initial production rollout for cost reasons.

Because recovery requires a failover, CCR on DAS cannot sustain a controller failure on a cluster node without service interruption. However, Microsoft IT finds a service interruption of less than two minutes of failover time acceptable, considering the low failure rates at the RAID controller level and the low impact on the end-to-end availability of messaging services: Microsoft IT deployed 90 SAS RAID controller cards and had zero controller failures during the past 12 months. For Microsoft IT, the costs to deploy CCR on SAN far outweigh the benefit of eliminating these two minutes in the unlikely event that a RAID controller fails. SCC on SAN is not an alternative because SCC on SAN cannot recover from storage failures and requires Microsoft IT to restore data from backup, whereas CCR on DAS can sustain a storage failure because the data is readily available on the second node.

Note: Microsoft IT initiates the vast majority of failovers manually by using the Move-ClusteredMailboxServer cmdlet during planned and unplanned maintenance, such as to install mandatory security patches or update driver versions on cluster nodes. CCR on SAN has no advantage over CCR on DAS in this scenario because the failovers are unavoidable and the duration of the service interruption is comparable.
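To put the two-minute failover time into perspective, the following Python sketch converts an availability target into a yearly downtime budget. It assumes a 99.99 percent target (the availability level cited at the beginning of this paper for the previous environment) and treats two minutes as the worst-case failover duration. The function names are illustrative only and are not part of any Microsoft tool.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # non-leap year


def downtime_budget_minutes(availability_percent: float) -> float:
    """Return the minutes of downtime per year allowed by an availability target."""
    return MINUTES_PER_YEAR * (100 - availability_percent) / 100


def failovers_within_budget(availability_percent: float, failover_minutes: float) -> int:
    """Return how many failovers of the given duration fit into the yearly budget."""
    return int(downtime_budget_minutes(availability_percent) // failover_minutes)


if __name__ == "__main__":
    target = 99.99          # assumed availability target, in percent
    failover_minutes = 2.0  # assumed worst case; the paper states "less than two minutes"
    budget = downtime_budget_minutes(target)
    count = failovers_within_budget(target, failover_minutes)
    print(f"Downtime budget at {target} percent: {budget:.2f} minutes per year")
    print(f"Two-minute failovers that fit into the budget: {count}")
```

Under these assumptions, the budget works out to about 52.6 minutes per year, which leaves room for 26 two-minute failovers if nothing else consumes the budget.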

Ongoing Backup Operations During Regular Business Hours

CCR technology does not automatically eliminate the need for backups. There is a lack of redundancy if the storage subsystem on the primary node of a clustered Mailbox server experiences a total failure because only one node remains with the data until Microsoft IT repairs and reseeds the databases on the affected node. Backups provide the required additional layer of protection, and the recommended approach is to perform VSS-based backups on the passive node, as illustrated in Figure 4.

Figure 4. Data Protection Manager 2007-based backup on passive nodes

Microsoft IT switched from streaming backups on the active node to software-based VSS backups on the passive node with the release of Data Protection Manager 2007. By minimizing the backup-related performance impact on the active node, Microsoft IT can accommodate more users per Mailbox server while at the same time performing backup operations at much more frequent intervals. Microsoft IT configures the Data Protection Manager server to receive transaction logs every 15 minutes. The server performs an express full backup once a day to maintain a complete and consistent image of the data on the Data Protection Manager server. The express full backup relies on block-level synchronization in conjunction with the Exchange VSS Writer to identify and replicate only those data blocks that have changed in the production databases since the last express full backup.
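The general idea behind block-level synchronization can be sketched in a few lines of Python. This is not how Data Protection Manager 2007 is implemented, because this paper does not describe its internal change tracking. The sketch simply assumes that changed blocks are found by comparing per-block digests, and the block size and function names are hypothetical.

```python
import hashlib

BLOCK_SIZE = 4096  # hypothetical block size in bytes


def block_digests(data: bytes, block_size: int = BLOCK_SIZE) -> list[bytes]:
    """Return one SHA-256 digest per fixed-size block of the data."""
    return [
        hashlib.sha256(data[offset:offset + block_size]).digest()
        for offset in range(0, len(data), block_size)
    ]


def changed_blocks(previous: bytes, current: bytes, block_size: int = BLOCK_SIZE) -> list[int]:
    """Return the indexes of blocks in `current` that are new or differ from `previous`.

    Shrinking files are not handled; this sketch only covers in-place changes and growth.
    """
    old = block_digests(previous, block_size)
    new = block_digests(current, block_size)
    return [
        index for index, digest in enumerate(new)
        if index >= len(old) or digest != old[index]
    ]


if __name__ == "__main__":
    last_backup = bytes(4 * BLOCK_SIZE)      # four blocks of zeros
    production = bytearray(last_backup)
    production[2 * BLOCK_SIZE] = 0xFF        # modify one byte in the third block
    print(changed_blocks(last_backup, bytes(production)))
```

Only the blocks reported by `changed_blocks` would need to travel to the backup server, which is why an express full backup can be much cheaper than copying the complete database.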

For backup storage on Data Protection Manager servers, Microsoft IT also uses DAS technology, specifically RAID 10 on Serial Advanced Technology Attachment (SATA) disks with an individual disk capacity of more than 500 GB. In comparison to hardware VSS backups that Microsoft IT used in conjunction with SAN-based Mailbox servers, the solution based on Data Protection Manager 2007 on DAS helps Microsoft IT reduce the complexity of the backup environment, eliminate third-party dependencies, and achieve further storage cost savings while maintaining fast recovery objectives.

Reduced Need for Restores from Backup

Data Protection Manager 2007 enables Microsoft IT to restore mailbox data from any 15-minute point in time onto the original server or to a different server. However, there is practically no need to perform restores onto the original server for disaster recovery or any other purposes, as the more than 18 months of Microsoft IT experience with CCR suggest. For Microsoft IT, restoring files from backup is a tool that can be helpful in software testing and in the validation of disaster recovery plans. Microsoft IT performs these restores to a different server to avoid affecting the Mailbox servers in the corporate messaging environment.

CCR entirely transformed the Microsoft IT approach to fast recovery. The previous method for SAN-based Mailbox servers relied on VSS clones. At midnight, Microsoft IT cloned the logical unit numbers (LUNs) for the Mailbox server to a new set of clone LUNs. Although this solution provided fast recovery of large amounts of data from backup in a matter of minutes, it required two additional LUNs for each Mailbox server LUN with associated high SAN costs and highly specialized storage engineers to perform the recovery procedures in the SAN environment. By switching to CCR for fast recovery with Exchange Server 2007, Microsoft IT eliminated these costs and dependencies. Restoring from backup is no longer the primary recovery mechanism. The fast recovery mechanism of a CCR-based Mailbox server is a straightforward failover to the passive node.

CCR effectively eliminates the need for restores from backup onto the original Mailbox servers because the cluster nodes in a CCR-based Mailbox server are mutual hot-standby systems. Active and passive nodes can change their roles at any time. CCR automatically reverses the replication direction to keep the messaging databases synchronized by means of transaction log shipping and replay on the passive node. This implies that any node can fail and be rebuilt by using the messaging databases that are still available on the other node. The cluster nodes share no hardware components. It is therefore unlikely that a storage failure on one node will affect the messaging databases on the other node. Messaging databases that are unaffected, mounted, and available online on the second node are the basis for recovery scenarios without backups.

The typical CCR-based recovery scenario includes the following four phases (depicted in Figure 5 after the list):

  1. Normal operation The clustered Mailbox server is available, and new transactions, such as those caused by Hub Transport servers delivering messages, result in new transaction log files on the active node. Through file system notifications, the Microsoft Exchange Replication Service on the passive node learns that new transaction logs are pending replication. The NTFS file system generates these notifications on the active node whenever the ESE closes and renames the current transaction log file with a sequence number to make room for the next transaction log. The Exchange Replication Service on the passive node copies the new transaction log files via a security-enhanced file share from the active node into the local transaction log inspection folder. Exchange Server 2007 inspects these logs and moves them to the target storage group's transaction log folder for replay into the destination mailbox database. This asynchronous process of having the passive node perform the transaction log replication helps to keep CPU and I/O load on the active node at a minimum, but it also introduces a chance for a lossy failover.
  2. Lossy failover and recovery In the worst-case scenario where the active node fails right after a Hub Transport server delivered messages and before CCR had a chance to replicate the current transactions, failover occurs and the passive node becomes active without the most recent messages. To retrieve the missing messages, the Mailbox server requests redelivery from all Hub Transport servers in the local Active Directory site as part of the recovery procedure after a lossy failover. Hub Transport servers maintain a transport dumpster queue for each continuous replication-enabled storage group in order to retain recently delivered messages and redeliver these messages promptly to bring the Mailbox server up to date. Microsoft IT uses the Set-TransportConfig cmdlet to configure this feature on all Hub Transport servers with the MaxDumpsterSizePerStorageGroup parameter set to 15 MB and a MaxDumpsterTime value of 07.00:00:00, which corresponds to seven days. This is explained in more detail in the Microsoft IT Showcase Note on IT "Going 64-Bit with Microsoft Exchange Server 2007," available at http://www.microsoft.com/downloads/details.aspx?FamilyID=f31e7541-f63a-4b7d-b8d2-3794c4dc3329&DisplayLang=en.
  3. Node repair At this point, the Mailbox server is up to date and running on the remaining cluster node while Microsoft IT performs the necessary repair activities on the failed node, such as replacing the public network interface card (NIC) or updating a faulty driver. The key issue is whether the node failure affected the messaging databases on the failed node. Typically, all messaging data is still intact and Microsoft IT only needs to restart the failed cluster node and catch up on transaction logs. In rare cases, node failures corrupt messaging data. In this situation, Microsoft IT only needs to reseed the affected databases from the remaining copy, which is an uncomplicated procedure that uses the ESE streaming backup API to perform online copy operations. Through reseeding, Microsoft IT can copy messaging databases individually from the currently active node to the repaired node while users are online and accessing their mailboxes. Users might notice slower server response times during the reseeding process, but the Mailbox server with all the messaging data is available and the server performance meets the Microsoft IT SLAs.
  4. Normal operation The Mailbox server resumes normal operations after the failed node is repaired, and any affected messaging databases are reseeded. On the repaired node, Exchange Server 2007 automatically recognizes that it is now running in passive context, and CCR reverses to update the repaired passive node with all transactions that occur on the active node. There is no need to reconfigure the system. CCR simply continues to replicate the messaging data from the active node to the passive node. Furthermore, Microsoft IT does not need to perform a failback of the clustered Mailbox server to the original node because both nodes have an identical hardware and software configuration. The Mailbox server can continue to run on the second node without performance penalties, Data Protection Manager 2007 automatically switches backup processes to the current passive node, and Microsoft IT saves two minutes of failback time, which would otherwise count against the high-availability SLAs.

Figure 5. Recovering from a storage failure in a CCR-based Mailbox server
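The interplay between asynchronous log shipping, a lossy failover, and redelivery from the transport dumpster can be illustrated with a small Python model. This is a deliberately simplified sketch and not Exchange Server code. It assumes one message per transaction log, ignores the dumpster size and time limits, and uses hypothetical class and function names throughout.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """A cluster node holding the message IDs contained in its transaction logs."""
    name: str
    logs: list[str] = field(default_factory=list)


@dataclass
class HubTransport:
    """A Hub Transport server that retains delivered messages in a dumpster."""
    dumpster: list[str] = field(default_factory=list)

    def deliver(self, message_id: str, active: Node) -> None:
        active.logs.append(message_id)    # a new log appears on the active node
        self.dumpster.append(message_id)  # the message is retained for redelivery


def replicate(active: Node, passive: Node) -> None:
    """The passive node pulls the closed logs that it does not have yet."""
    passive.logs.extend(active.logs[len(passive.logs):])


def lossy_failover(new_active: Node, hub: HubTransport) -> list[str]:
    """Bring the surviving node up to date by redelivering missing messages."""
    present = set(new_active.logs)
    missing = [m for m in hub.dumpster if m not in present]
    new_active.logs.extend(missing)
    return missing


if __name__ == "__main__":
    node_a, node_b, hub = Node("A"), Node("B"), HubTransport()
    hub.deliver("m1", node_a)
    hub.deliver("m2", node_a)
    replicate(node_a, node_b)            # phase 1: normal operation
    hub.deliver("m3", node_a)            # delivered just before node A fails
    redelivered = lossy_failover(node_b, hub)   # phase 2: lossy failover
    print("Redelivered:", redelivered)
    print("Node B now holds:", node_b.logs)
```

In this model, only the message that arrived after the last replication pass is redelivered, which mirrors the reason the transport dumpster needs to retain only recently delivered messages.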

Simplicity in the Storage Design and Low Maintenance Overhead

The question of who manages the storage subsystem is an important non-technical aspect of CCR on DAS in comparison to any SAN-based Mailbox server design. In SAN-based environments, dedicated and highly specialized storage engineers perform the necessary installation, configuration, optimization, and troubleshooting tasks. In DAS-based environments, regular Exchange administrators are typically capable of performing these storage-related tasks. Having full control over all aspects of the Mailbox server design, end to end from the disks up to the client connections, is an important advantage for the Exchange Messaging team at Microsoft IT. Yet, it also requires Microsoft IT to design the storage subsystem with simplicity in mind so that Exchange administrators can perform all configuration and maintenance tasks without the help of highly specialized storage experts.

Microsoft IT achieves simplicity in the storage design primarily through the following approaches:

  • Taking advantage of SAS technology Unlike SAN technology that requires strict configuration, SAS offers impressive configuration flexibility at the hardware layer. SAS uses common electrical SATA connections, and an SAS enclosure can accept different types of disks. The order of the disks in an array is not important, and it is possible to expand an SAS storage subsystem by daisy-chaining storage enclosures together. Microsoft IT recently tested the flexibility of SAS in a lab environment by shutting down a cluster node after moving the clustered Mailbox server to the other node, changing the position of the SAS disks in the storage enclosures, and replacing the RAID controller. During the restart of the cluster node, the system fully recognized the RAID configuration and drive letter assignments and the cluster node was immediately operational again. According to Seagate data sheets, SAS technology has emerged as an enterprise technology with small form factor (SFF) 2.5-inch disks surpassing Fibre Channel disks with an annualized failure rate (AFR) of 0.55 percent (versus 0.62 percent) while disk capacities continue to increase and hardware prices continue to fall. New generations of SAS RAID controllers appear every 12 to 18 months and typically come along with new server generations, whereas the product cycle of SAN counterparts is about three to four years. The SAN innovation rate is slower due to the higher technical complexities in comparison to DAS technology. For example, SFF disks appeared on the market more than two years ago, yet SAN vendors are still not able to integrate this technology into their systems. With CCR on DAS, Microsoft IT takes full advantage of the most dynamic products and technologies in the storage market.
  • Standardizing the storage layout by creating universal storage building blocks To further minimize hardware maintenance complexities, help ensure reliability, and provide scalability, Microsoft IT developed a standardized storage layout that uses universal storage building blocks for scaling production Mailbox servers. A universal storage building block, or USBB as Microsoft IT calls it for short, is a self-contained unit of two physical storage enclosures, combined to provide database and transaction log drives for the specified set of Exchange databases. The number of drives—identified through individual LUNs at the SCSI level—that Microsoft IT can use in a USBB depends on the capacity of the storage enclosures and the required number of LUNs to provide the desired data capacity and I/O performance. The USBB count per server depends on the number of mailboxes and the mailbox quotas that Microsoft IT wants to maintain on the server. However, the conceptual storage layout per USBB remains unchanged. Among other things, this makes it easy to monitor the drives and replace failing disks. An operator can easily identify each disk through its position in the storage subsystem, as indicated in Figure 6, which shows a USBB configuration with 25 disks per enclosure. This particular USBB provides three RAID 10 LUNs for messaging databases, and one RAID 10 LUN for all transaction logs combined.

Figure 6. A universal storage building block for CCR on DAS

Predictable Mailbox Server Performance

An issue that Microsoft IT occasionally notices in large customer deployments on SAN-based Mailbox servers concerns the use of shared storage for messaging databases. Whereas Microsoft IT Mailbox server designs for Exchange Server 2003 always followed product recommendations to use dedicated storage arrays, customers occasionally ignore this recommendation and share the storage hardware between different types of server applications, such as Microsoft SQL Server® and Exchange Server, in an attempt to maximize capacity utilization. However, if Microsoft SQL Server databases are stored on the same physical media as Exchange Server databases, running large SQL Server jobs can lead to non-Exchange load surges and impaired Exchange Server performance. Microsoft IT engineers call this issue hot-spot contention to indicate poor performance of the storage subsystem resulting from different usage patterns caused by different server applications accessing the same storage media, as illustrated in Figure 7.

Figure 7. Different usage patterns on shared storage

The database I/O of Exchange Server 2007 is composed of large numbers of random page requests using a page size of 8 KB, whereas other server applications might access data more sequentially and in larger contiguous blocks. If all this data resides on the same physical media, the heads of the hard disk drive must frequently move out of the Exchange data region to service the non-Exchange data requests. The result is an unpredictable sharp decrease in Exchange Server performance due to increased response times at random intervals. For example, an Exchange read request that might have taken 8 milliseconds (ms) might now take 108 ms because the disk heads spend 100 ms retrieving non-Exchange data between the individual page requests of Exchange Server.
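The effect of this contention on average response times can be estimated with simple arithmetic. The following Python sketch uses the 8 ms and 100 ms figures from the example above and assumes, purely for illustration, that a fixed fraction of Exchange read requests has to wait for non-Exchange I/O. The function names and the contended fractions are hypothetical.

```python
def average_latency_ms(base_ms: float, penalty_ms: float, contended_fraction: float) -> float:
    """Return the mean read latency when a fraction of requests pays a seek penalty."""
    if not 0.0 <= contended_fraction <= 1.0:
        raise ValueError("contended_fraction must be between 0 and 1")
    return base_ms + penalty_ms * contended_fraction


def fraction_reaching(limit_ms: float, base_ms: float, penalty_ms: float) -> float:
    """Return the contended fraction at which the mean latency reaches the limit."""
    return (limit_ms - base_ms) / penalty_ms


if __name__ == "__main__":
    base_ms, penalty_ms = 8.0, 100.0  # figures from the example in the text
    for fraction in (0.0, 0.05, 0.25, 1.0):  # assumed contention levels
        latency = average_latency_ms(base_ms, penalty_ms, fraction)
        print(f"{fraction:.0%} contended: {latency:.1f} ms on average")
    print(f"20 ms reached at {fraction_reaching(20.0, base_ms, penalty_ms):.0%} contention")
```

With these numbers, the average response time passes 20 ms once roughly 12 percent of the Exchange read requests are delayed, so even occasional load surges from another application on the same disks can be enough for users to notice.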

Exchange Server administrators cannot analyze this problem because the non-Exchange LUN is not visible in the Exchange server configuration. Likewise, the Exchange Server LUN is invisible on the computer running SQL Server. Exchange Server administrators and SQL Server administrators have no idea that they share the same physical storage media. Furthermore, the storage engineer who created the LUNs is unaware of the Exchange Server and SQL Server requirements. As far as the SAN environment is concerned, the storage engineer followed common best practices to counterbalance SAN costs. All systems show an optimized configuration, and yet users occasionally complain about slow Office Outlook clients in online mode.

Hot-spot contention is hard to locate, and it keeps reappearing in customer environments because there is no certainty in a SAN that the system configuration does not change over time. All too often, unaware storage engineers cannot resist the temptation to optimize available storage capacities. CCR on DAS puts an ultimate end to this problem by taking Exchange Server storage out of the shared SAN environment. It puts full control over the storage design into the hands of Exchange Server experts and eliminates unpredictable system behavior due to the side effects of SAN capacity optimization.

"Microsoft IT is our ultimate touchstone of enterprise readiness. From Microsoft IT we learn not only how our technology performs under real-world conditions, but also what issues and constraints system engineers face when designing production-grade Mailbox servers. No test lab can deliver this valuable insight. It fuels our development activities and ensures that we stay focused on actual current and future needs of our customers."

Matt Gossage
Sr. Program Manager
Exchange Server Product Group
Microsoft Corporation

Microsoft IT Storage Design

In comparison to previous product versions, Exchange Server 2007 provides increased design flexibility because this 64-bit messaging system takes full advantage of available processor and memory resources and includes numerous architectural advancements that help to lower the I/O footprint. Among other things, Exchange Server 2007 includes a database buffer cache that can substantially reduce the need for data reads from disk during normal operation. The size of the database buffer cache depends on the physical amount of random access memory (RAM) present in the system; therefore, the physical amount of memory directly influences the I/O footprint. This relationship between memory and I/O footprint opens new opportunities to achieve optimal server response times by balancing memory with disk performance. This design flexibility is especially noticeable when comparing Microsoft IT Mailbox server designs with the results of the Exchange 2007 Mailbox Server Role Storage Requirements Calculator. The storage requirements calculator performs calculations based on product recommendations, while Microsoft IT takes advantage of excess I/O performance in the storage subsystem to go beyond product recommendations. Microsoft IT arrives at acceptable response times with less memory per Mailbox server than recommended because the storage subsystem includes far more disks for capacity reasons than required for performance and can therefore compensate for the increased I/O footprint that results from having less RAM per user on the server. The Exchange 2007 Mailbox Server Role Storage Requirements Calculator and detailed information about its use are available on the Microsoft Exchange Team Blog at http://msexchangeteam.com/archive/2007/01/15/432207.aspx.

Kyryl Perederiy, Senior Systems Engineer in the Microsoft IT Exchange Messaging team, is responsible for the Mailbox server designs. He emphasizes that customers evaluating the Microsoft IT designs should keep in mind that Microsoft IT currently exceeds recommended configurations in its Mailbox server designs in an effort to help the product group verify performance capabilities of Exchange Server 2007 under real-world conditions. Specifically, Microsoft IT Mailbox servers use 1 MB to 2 MB of RAM per user instead of the recommended 3.5 MB to 5 MB, but Microsoft IT Mailbox servers still perform well because the DAS-based storage design of Microsoft IT provides plenty of headroom to make up for the difference, as explained later in this paper. By taking advantage of excess I/O capabilities, Microsoft IT was able to optimize memory capacities and switch from expensive high-end enterprise hardware to mainstream enterprise server models, such as a dual socket quad-core Intel Xeon processor X5355 server model with only eight slots for fully buffered dual inline memory modules (FB-DIMMs) to support 6,000 users per server on average. With a maximum available module size of 4 GB, this mainstream server model has a maximum capacity of 32 GB of memory, but 4-GB DIMMs currently do not offer an attractive capacity/price ratio for Microsoft IT. For cost reasons, Microsoft IT uses 2-GB memory modules—in other words, a total of 16 GB of memory.

Microsoft IT emphasizes the following aspects in the storage design for Exchange Server 2007 Mailbox servers:

  • Reliability Microsoft IT uses enterprise storage equipment in Mailbox servers and specifically pays attention to unrecoverable errors per bits read, AFRs, and manufacturer warranties. It is also noteworthy that Microsoft IT prefers SFF 2.5-inch disks to the large form factor (LFF) 3.5-inch disks because of the lower power requirements, lower cost per gigabyte, and higher performance and reliability, as well as less heat emission and less vibration. Less heat and less vibration result in less degradation over time. Hard disk drives of choice are SFF SAS disks with one bad sector per 10E16 bits read, an AFR of 0.55 percent, and a three-year warranty. SATA disks are still not as reliable as SAS disks. However, Microsoft IT is beginning to consider SATA because reliability and performance are increasingly meeting enterprise standards, the capacity/price ratio is attractive, and CCR deemphasizes the need for highly reliable disks in the storage subsystem on each cluster node.
  • Availability Redundancies are a primary means in the Mailbox server design to help ensure high availability. At the hardware layer, Microsoft IT uses multiple external storage enclosures with redundant power supplies and controller connections. As mentioned earlier in this paper, Microsoft IT mirrors the hard disk drives across enclosures and then includes these mirrors in stripe sets to implement a RAID 10 configuration. In this configuration, an entire enclosure can fail without affecting the availability of the storage subsystem. Microsoft IT prefers RAID 10 to RAID 5 because RAID 10 provides a higher level of redundancy and can tolerate multiple disk failures, whereas RAID 5 can only tolerate a single disk failure. RAID 5-based Mailbox server designs are still in an experimental stage at Microsoft IT. With the exception of the RAID controller, Microsoft IT ensures redundancies for all storage components on each cluster node, including enclosures, power supplies, hard disk drives, and cables. CCR provides the necessary redundancies at the Exchange database level and compensates very well for the missing RAID controller redundancy at the individual cluster nodes.
  • Performance Microsoft IT also uses RAID 10 because RAID 10 performs better than RAID 5. With RAID 10, write operations are free of parity calculations, and RAID 10 uses more disks than RAID 5 to deliver the same capacity, which benefits I/O performance when the storage subsystem is being designed for capacity. For example, it takes six 146-GB disks to build a RAID 10 drive with 438 GB of raw capacity. RAID 5 only requires four 146-GB disks to accomplish the same. Assuming SFF enterprise disks with a spindle speed of 10,000 revolutions per minute (RPM), capable of performing approximately 160 I/O operations per second (IOPS), the RAID 10 drive can handle 960 read IOPS (6 * 160 = 960), whereas the RAID 5 drive can only reach 640 read IOPS (4 * 160 = 640). Disk throughput and server response times are important design criteria for Microsoft IT because slow Mailbox servers affect employee productivity. Response times above 20 ms are tolerable only for short periods of time on a highly loaded system, but not on an ongoing basis because users will notice slow Office Outlook client behavior in online mode, such as when switching folders or composing new messages. Especially with large mailboxes, users need to be able to work fast and perform reliable searches to locate information quickly.
  • Capacity Performance requirements determine the minimum number of disks in the storage subsystem, yet more disks might be required to meet capacity needs. To ensure adequate storage capacity, Microsoft IT calculates maximum database sizes based on the number of mailboxes and the desired quotas, and then adds additional capacity for database overhead, content indexing overhead, and unexpected database growth. If the number of disks required to meet capacity needs exceeds the minimum number of disks required to ensure I/O performance, Microsoft IT calls the storage design capacity bound. On the other hand, if fewer disks are required to provide the necessary capacities, but more disks are needed to reach the required I/O performance, the design is performance bound. Due to the low I/O requirements of Exchange Server 2007 and the high performance of SFF SAS disks, Microsoft IT Mailbox server designs are currently capacity bound, and Microsoft IT uses any excess I/O performance to counterbalance low memory conditions, as mentioned earlier. This picture might change in the future if Microsoft IT decides to switch to RAID 5 with SFF SAS or use larger and slower SATA disks.
  • Costs For Microsoft IT, cost efficiency is an integral part of enterprise readiness. Multiple solutions may meet Microsoft IT's reliability, performance, and capacity requirements, include similar support options and tools (such as management packs), and be on the standard data center hardware list. Microsoft IT chooses the least expensive technology for the corporate messaging environment to demonstrate the cost savings potential of Exchange Server 2007. High-end enterprise server models provide the flexibility to match product recommendations in the Mailbox server design, yet Microsoft IT deliberately selects mainstream enterprise hardware to build scaled-up servers with the lowest possible budget. CCR on DAS is a key enabler to drive down costs to unprecedented levels, and RAID 5 with SFF SAS disks or RAID 10 on SATA disks might offer further opportunities to continue this effort.
  • Simplicity Microsoft IT capitalizes on the potential of CCR on DAS to simplify the storage design through straightforward RAID configurations and a standardized storage layout based on USBBs. Among other things, simplicity helps to keep maintenance overhead and TCO at a minimum. Exactly the same operations procedures apply to all Mailbox servers in the corporate messaging environment regardless of mailbox numbers and quotas. Simplicity also helps to ensure stable Mailbox server performance in the event of a component failure. For example, Microsoft IT does not use a hot spare in the RAID 10 configuration, so the RAID controller cannot automatically phase out a failed disk and rebuild the RAID 10 array. Disk failures decrease the RAID 10 performance to a certain degree, because fewer disks remain to handle the I/O load; yet rebuilding a RAID array affects I/O performance even more noticeably. By scheduling hardware maintenance during non-business hours, Microsoft IT minimizes the impact of disk failures on server performance. Manually replacing hard disk drives is not a problem for Microsoft IT. Most Microsoft data centers are staffed 24 hours per day, seven days per week. In the event of a disk failure, an IT specialist is available to replace the affected disk at the next appropriate maintenance window.
  • Recoverability Hardware and database redundancies enable Microsoft IT to ensure system recoverability in the event of component failures as well as in the event of entire node failures. Furthermore, Microsoft IT relies on Data Protection Manager 2007 to help ensure recoverability in the event of a loss of both cluster nodes. RTO objectives demand individual database restoration in less than one hour. Accordingly, Microsoft IT distributes the mailbox data over a large number of messaging databases per server to keep the maximum file size of each individual messaging database below 200 GB. The maximum file size corresponds to the number of mailboxes and the desired quotas plus database overhead, but excluding content indexing overhead and reserve for unexpected database growth. Keeping individual database files below 200 GB also helps Exchange Server complete online maintenance cycles regularly, which is critical to keep the Mailbox server healthy and the system performance stable.
  • Scalability All messaging trends point upward at Microsoft. Message volumes are continuously rising, and the number of users has almost doubled since the Exchange Server 2003 time frame. Mailbox sizes are steadily growing, and even 2-GB quotas might soon be too small for agile users. CCR on DAS enables Microsoft IT to keep pace with these trends by increasing the scalability of the Mailbox servers in the corporate messaging environment. For example, hardware maintenance on the passive node does not affect online users. It is possible to replace server hardware when new processor technology becomes available or upgrade the storage subsystem on one node while the other node keeps the Mailbox server available to users. A typical SAS-based RAID controller can support up to 100 disks. Mainstream 2U enterprise server models can support up to three RAID controllers, and large 4U high-end enterprise servers can include up to seven RAID controllers, putting the ceiling at 700 disks. It is unlikely that Microsoft IT will reach this scalability limit in the near future.
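
The RAID 10 versus RAID 5 comparison and the capacity-bound versus performance-bound distinction from the list above can be sketched in a few lines of Python. This is a minimal illustration, not a Microsoft IT tool: the function names are hypothetical, the 146-GB and 160-IOPS figures come from the text, and treating a tie between the two disk counts as performance bound is an assumption.

```python
import math

DISK_GB = 146     # raw capacity per SFF SAS disk (from the text)
DISK_IOPS = 160   # approximate IOPS per 10,000-RPM disk (from the text)


def raid10_disks(raw_capacity_gb, disk_gb=DISK_GB):
    """RAID 10: every data disk needs a mirror partner."""
    data_disks = math.ceil(raw_capacity_gb / disk_gb)
    return data_disks * 2


def raid5_disks(raw_capacity_gb, disk_gb=DISK_GB):
    """Single RAID 5 set: data disks plus one disk's worth of parity."""
    data_disks = math.ceil(raw_capacity_gb / disk_gb)
    return data_disks + 1


def read_iops(disks, iops_per_disk=DISK_IOPS):
    """Aggregate read IOPS when all disks can serve reads."""
    return disks * iops_per_disk


def design_bound(disks_for_capacity, disks_for_performance):
    """Classify a design; a tie is treated as performance bound (assumption)."""
    if disks_for_capacity > disks_for_performance:
        return "capacity bound"
    return "performance bound"


if __name__ == "__main__":
    for name, disks in (("RAID 10", raid10_disks(438)), ("RAID 5", raid5_disks(438))):
        print(name, disks, "disks,", read_iops(disks), "read IOPS")
```

For 438 GB of raw capacity the sketch reproduces the figures in the Performance bullet: six disks and 960 read IOPS for RAID 10, four disks and 640 read IOPS for RAID 5.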

Note: Microsoft IT system engineers have contributed to the Exchange 2007 Mailbox Server Role Storage Requirements Calculator and strongly recommend this design tool to customers. In addition, third-party vendors and original equipment manufacturers (OEMs) who want to develop and test storage solutions for Exchange Server 2007 should consider participating in the Microsoft Exchange Solution Reviewed Program (ESRP) - Storage v2.0. Detailed information about ESRP is available at http://technet.microsoft.com/en-us/exchange/bb412164.aspx.

Mailbox Server Performance

"The ultimate test of the storage design is the failover. When thousands of concurrent users hit that fresh active node, all read requests must go to disk because a cold database buffer cache cannot deliver the data. That decisive moment shows how much our Mailbox servers benefit from the high performance available with DAS."

Kyryl Perederiy
Sr. Systems Engineer
Microsoft IT Exchange Messaging
Microsoft Corporation

In a reliable network environment with sufficient net-available bandwidth, processor, memory, and storage subsystem are the main components that influence Mailbox server performance. Processor capabilities have the most significant impact. Microsoft IT used dual-core processors during the initial production rollout, which limited server scalability to 2,000-3,000 mailboxes. Currently, Microsoft IT uses server models with two quad-core Intel Xeon X5355 processors (a total of eight processor cores) to realize an increased density of 6,000 mailboxes per server for heavy users. Microsoft IT continues to monitor the processor market to take advantage of new models as soon as they become available at reasonable pricing.

Memory capacities are not as critical for Microsoft IT because Exchange Server 2007 Service Pack 1 (SP1) includes further ESE enhancements and it is possible to counterbalance memory deficiency with disk I/O performance. According to the product recommendation of 3.5 MB to 5 MB of RAM per mailbox plus 2 GB of RAM per server, a server with 6,000 mailboxes requires approximately 24 GB to 32 GB of memory (6,000 * 3.5 MB / 1024 + 2 GB = 22.51 GB and 6,000 * 5 MB / 1024 + 2 GB = 31.30 GB, rounded up). Memory requirements increase further with more mailboxes, yet even the product group does not recommend more than 32 GB per Mailbox server to remain cost efficient. As mentioned earlier, Microsoft IT optimized memory capacities by taking advantage of excess I/O performance and supports 6,000 heavy users and up to twice as many medium users in one scalability pilot with 16 GB of RAM. It is important to note, however, that these server designs are for users with 500-MB quotas. For heavy users that require 2-GB quotas, mostly full-time employees, Microsoft IT uses server designs for up to 4,000 mailboxes on the same server platform with two quad-core Intel Xeon X5355 processors.
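
The memory recommendation quoted above is a simple linear formula. The following Python sketch restates it; the function name and parameter names are illustrative only, and the per-mailbox and per-server values are the ones given in the text.

```python
def recommended_ram_gb(mailboxes, mb_per_mailbox, base_gb=2.0):
    """RAM per the quoted recommendation: per-mailbox cache plus a fixed 2 GB per server."""
    return mailboxes * mb_per_mailbox / 1024 + base_gb


if __name__ == "__main__":
    low = recommended_ram_gb(6000, 3.5)
    high = recommended_ram_gb(6000, 5)
    print(f"6,000 mailboxes: {low:.2f} GB to {high:.2f} GB")
```

For 6,000 mailboxes the result lies between roughly 22.5 GB and 31.3 GB, which is why the 16 GB used in the scalability pilot counts as a deliberately memory-constrained configuration.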

Figure 8 illustrates how Exchange Server 2007 uses the available physical memory to lower I/O requirements by caching frequently accessed data, such as the Inbox folder view for the top-most messages, calendar information, and any message-processing rules for each user. The less often the storage engine needs to reload this information, the lower the I/O footprint. In combination with further ESE enhancements, such as delayed write operations to avoid repeated writes to the same object on disk, Exchange Server 2007 puts less I/O demand on the storage subsystem than any previous Exchange Server version. Microsoft IT measured performance requirements by monitoring the performance counters Disk Transfers/sec, Disk Reads/sec, and Disk Writes/sec. Microsoft IT determined during the initial production rollout that Microsoft employees typically generate approximately 0.27 IOPS to 0.4 IOPS in a read/write mix of 1:1 on Mailbox servers with 5 MB of RAM per user.

Figure 8. Database buffer cache and database I/O

Now, more than 12 months after the initial production rollout, Microsoft IT uses significantly less memory in Mailbox servers. It follows that I/O requirements increase because Exchange Server 2007 cannot cache as much data for database read and write operations. This is not a linear increase, however. The database buffer cache only optimizes database I/O generated through client requests, online maintenance, and other processes that communicate with the Exchange Information Store service to access messaging data. Other I/O activities do not increase, such as non-transactional I/O due to transaction log replication by means of CCR. Accordingly, Microsoft IT assumes I/O requirements of 0.7 IOPS per user with a change of the read/write mix to 2:1. In other words, the majority of the Mailbox servers in the corporate messaging environment supporting 6,000 users must be able to perform 2,800 read operations and 1,400 write operations per second (6,000 * 0.7 * 2/3 = 2,800 and 6,000 * 0.7 * 1/3 = 1,400). Corresponding to the RAID 10 configuration, each write request equals two write operations behind the RAID controller, so the storage subsystem must be able to handle 2,800 read requests and 2,800 write requests per second, or 5,600 IOPS in total. In contrast, the largest Mailbox server that Microsoft IT currently operates in a scalability experiment with 12,000 mailboxes for light and medium users and 1 MB to 2 MB of RAM per user must be able to perform 5,600 read operations and 2,800 write operations per second (12,000 * 0.7 * 2/3 = 5,600 and 12,000 * 0.7 * 1/3 = 2,800). On this server, the RAID 10-based storage subsystem must be able to handle 11,200 IOPS in total. With 160 IOPS per 10,000-RPM SAS disk, 70 disks are required to ensure the necessary performance at the highest server scale (11,200 / 160 = 70). For comparison, an SATA-based Mailbox server (spindle speed 7,200 RPM, 50 IOPS) would require 224 disks (11,200 / 50 = 224) to meet performance requirements.
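
The I/O arithmetic in the preceding paragraph can be reproduced with a short Python sketch. It assumes, as the text does, 0.7 IOPS per user, a 2:1 read/write mix, and two back-end writes per write request on RAID 10. The function names are hypothetical, and exact fractions are used only to avoid floating-point rounding in the illustration.

```python
import math
from fractions import Fraction


def backend_iops(mailboxes, iops_per_user="0.7", read_share=Fraction(2, 3),
                 raid_write_penalty=2):
    """Return (reads, writes, total back-end IOPS) for a RAID 10 subsystem."""
    total = mailboxes * Fraction(iops_per_user)
    reads = total * read_share
    writes = total - reads
    # Each host write becomes two disk writes because of mirroring.
    return reads, writes, reads + writes * raid_write_penalty


def disks_for_performance(iops, iops_per_disk):
    """Minimum number of disks that can sustain the given back-end IOPS."""
    return math.ceil(Fraction(iops) / iops_per_disk)


if __name__ == "__main__":
    for mailboxes in (6000, 12000):
        reads, writes, total = backend_iops(mailboxes)
        print(mailboxes, "mailboxes:", int(reads), "reads/s,", int(writes),
              "writes/s,", int(total), "back-end IOPS")
    print("SAS disks at 12,000 mailboxes:", disks_for_performance(11200, 160))
    print("SATA disks at 12,000 mailboxes:", disks_for_performance(11200, 50))
```

Changing `iops_per_user` or `read_share` shows how sensitive the disk count is to the usage profile, which is the reason Microsoft IT measures these values in production rather than assuming them.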

Mailbox Store Capacity

The quick way to determine total database capacity needs is to multiply the number of mailboxes by the desired mailbox quota and add an estimated data amount for retained deleted items, with 20 percent database overhead and up to 20 percent reserve for content indexing and unexpected database growth. Thus, a Mailbox server with 6,000 mailboxes and quotas of 500 MB requires a database capacity of nearly 5 terabytes. Meeting this capacity requirement with 36 SAS disks in a RAID 10 configuration according to performance needs (5,600 IOPS / 160 IOPS per disk = 35 disks, rounded up to an even number for mirroring) requires an individual disk capacity of approximately 300 GB (5 terabytes * 1024 / [36 / 2] = ~284 GB, or roughly 300 GB). Yet, 300-GB or 400-GB SAS drives do not offer an attractive price per gigabyte for Microsoft IT. The 10,000-RPM SAS disks that Microsoft IT currently uses are only available with a maximum raw capacity of 146 GB, which corresponds to a usable formatted capacity of 137.2 GB. Accordingly, Microsoft IT needs more than twice as many disks (5 terabytes * 1024 / 137.2 GB = ~38 disks, and 38 * 2 = 76 disks with mirroring).

On Mailbox servers with 4,000 mailboxes and 2-GB quotas, capacity needs for database drives more than double in comparison to the standard Mailbox server design for 6,000 users with 500-MB quotas. In this server design, 180 10,000-RPM SAS disks are necessary to provide the required 12 terabytes of mailbox storage (12 terabytes * 1024 / 137.2 GB = ~90 disks, and 90 * 2 = 180 disks with mirroring) and further disks are necessary for the transaction log drives. Figure 9 illustrates how Microsoft IT determines total capacity needs for databases and transaction logs based on Microsoft-specific mailbox numbers, quotas, and deleted items retention times.
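
The disk counts for capacity follow directly from the usable capacity per disk. Here is a minimal Python sketch of that calculation, assuming RAID 10 and the 137.2-GB usable capacity stated above; the function names are illustrative.

```python
import math

USABLE_GB_PER_DISK = 137.2  # formatted capacity of a 146-GB SAS disk (from the text)


def raid10_disks_for_capacity(capacity_tb, usable_gb_per_disk=USABLE_GB_PER_DISK):
    """Data disks needed for the capacity, doubled for the RAID 10 mirror."""
    data_disks = math.ceil(capacity_tb * 1024 / usable_gb_per_disk)
    return data_disks * 2


def required_disk_gb(capacity_tb, total_disks):
    """Per-disk capacity needed if a fixed number of RAID 10 disks must hold the data."""
    return capacity_tb * 1024 / (total_disks / 2)


if __name__ == "__main__":
    print("5 TB on 137.2-GB disks:", raid10_disks_for_capacity(5), "disks")
    print("12 TB on 137.2-GB disks:", raid10_disks_for_capacity(12), "disks")
    print(f"5 TB on 36 disks needs {required_disk_gb(5, 36):.0f} GB per disk")
```

Comparing these counts with the disk counts required for performance shows why the designs described here are capacity bound.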

Figure 9. Storage capacity needs for 4,000 mailboxes with 2-GB quotas at Microsoft IT

Microsoft IT calculates capacity needs for Mailbox servers based on the following factors:

  • Number of mailboxes and desired quotas By multiplying the number of mailboxes by the maximum possible mailbox size, Microsoft IT estimates the total amount of data that the users can store in the mailbox databases. In the server design for 4,000 mailboxes with 2-GB capacity, Microsoft IT applies the following quotas:
    • Issue warning: 1.8 GB
    • Prohibit send: 1.9 GB
    • Prohibit send and receive: 2.0 GB
  • Database overhead This factor depends on the average message traffic per user, the desired deleted items retention time, and internal database overhead. At Microsoft, the average message traffic per user for both sent and received items is approximately 20 MB per day. The required deleted items retention time according to SLAs is 14 days, which provides Microsoft users with sufficient time to recover any deleted items directly within their Office Outlook clients. In addition, Microsoft IT takes into account 20 percent internal overhead based on the maximum mailbox and dumpster size to accommodate database structures, table indexes, and so forth. Accordingly, the total database overhead is approximately 36 percent of the maximum mailbox size (20 MB per day * 14 days = 280 MB and [280 MB + (2048 MB + 280 MB) * 20% / 100%] * 100% / 2048 MB = 36.41%).
  • Content indexes Content indexing enables users to perform full-text searches of messages and message attachments. Although Exchange Server 2007 does not store content indexes directly in the messaging databases, Microsoft IT considers an overhead of 5 percent of the corresponding messaging databases because the search catalog resides in the same folder on the same drive as the corresponding messaging database.
  • Unexpected database growth Microsoft IT reserves at least 10 percent of space for unexpected growth that might result from massive mailbox moves or other technical and non-technical reasons. The extra space also enables Microsoft IT to support up to 10 percent more users per server than defined in the mailbox server design.
  • Transaction logs According to Exchange Server best practices, Microsoft IT uses separate physical drives for transaction logs and messaging databases. Although transaction log drives require substantially less capacity than database drives, Microsoft IT provisions a capacity of about 10 percent of the database drives. This capacity ensures that the transaction log drives do not run out of disk space during peak load, such as when moving up to 1,500 mailboxes per day to a new Mailbox server. It is important to note that Microsoft IT does not enable circular logging during mailbox moves to delete transaction logs automatically because circular logging negatively affects recoverability. Instead, Microsoft IT ensures that CCR replication works and performs incremental backups in 15-minute intervals on the passive node by using Data Protection Manager 2007. Exchange Server 2007 deletes transaction logs that have been replicated to the passive node at the end of every incremental or full online backup.

    Note: The Microsoft IT approach to determine Mailbox server capacity needs differs slightly from product recommendations because capacity needs depend on not only quotas and other factors related to mailbox databases but also the desired physical storage layout. For example, in a USBB with 25 disks per enclosure, an efficient physical layout uses 21 disks per enclosure for three database LUNs and the remaining 4 disks per enclosure for one transaction log LUN. With a usable capacity of 137.2 GB per disk, the resulting log LUN capacity is 1,097.60 GB regardless of capacity calculations based on estimated usage profiles. However, Microsoft IT follows product recommendations to verify that the desired physical layout is capable of handling the expected data volumes and I/O load.
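
The database overhead percentage derived in the list above can be checked with a small Python sketch. The defaults are the Microsoft-specific values from the text (2-GB quota, 20 MB of message traffic per day, 14 days of deleted items retention, 20 percent internal overhead); the function and parameter names are hypothetical.

```python
def database_overhead_percent(quota_mb=2048, traffic_mb_per_day=20,
                              retention_days=14, internal_overhead=0.20):
    """Overhead relative to the mailbox quota, in percent."""
    # Deleted items retained in the dumpster for the retention period.
    dumpster_mb = traffic_mb_per_day * retention_days
    # Internal structures are sized on mailbox plus dumpster.
    internal_mb = (quota_mb + dumpster_mb) * internal_overhead
    return (dumpster_mb + internal_mb) / quota_mb * 100


if __name__ == "__main__":
    print(f"Overhead for 2-GB quotas: {database_overhead_percent():.2f}%")
```

Because the dumpster size does not depend on the quota, the same calculation with a smaller quota yields a noticeably higher overhead percentage.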

Standardized Storage Layout

Using SAS disks with a capacity of 137.2 GB means that each cluster node of a Mailbox server for 4,000 users with 2-GB mailboxes requires fewer than 36 SAS disks for performance but more than 180 disks for capacity. Microsoft IT can rest assured that the storage subsystem has ample performance capabilities to compensate for the higher I/O footprint resulting from optimized memory capacities. Specifically, it takes 200 disks in a RAID 10 configuration to provide the required total capacity for database drives and transaction log drives. Coincidentally, 200 disks fit perfectly into the standardized Microsoft IT storage layout based on USBBs to keep the server configuration as straightforward and as modular as possible. As mentioned earlier in this paper, a single USBB consists of two mirrored enclosures. Each enclosure can contain 25 disks. A USBB therefore has 50 disks. Assuming 200 disks for capacity, the Mailbox server requires four USBBs.

As illustrated in Figure 10, each cluster node has two RAID controllers. A single RAID controller has two channels and is capable of connecting two storage enclosures per channel. Per USBB, Microsoft IT connects the first storage enclosure to channel A and the second storage enclosure to channel B. In this way, Microsoft IT mitigates the impact of individual channel failures, such as due to a broken cable, on the overall availability of the storage subsystem. For example, if channel A on the first RAID controller fails for any reason, the two USBBs connected to that RAID controller are still operational because the mirrored disks in the second storage enclosure of each USBB are still available and ensure a functioning stripe set according to the RAID 10 configuration.

Figure 10. Microsoft IT Mailbox server design for 4,000 users with 2-GB quotas

In the depicted Mailbox server design, each USBB stores 1,000 mailboxes, which Microsoft IT distributes across three database LUNs. Each database LUN has a raw capacity of 1,022 GB ([14 / 2] * 146 = 1,022), which is sufficient for 667 GB (2 GB * 1,000 / 3 = ~667 GB) of mailbox data plus overhead.
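
The USBB-based layout arithmetic can be summarized in a short Python sketch. It assumes the layout described in this section (25 disks per enclosure, two mirrored enclosures per USBB, 14 disks per database LUN); the function names are illustrative and not part of any Microsoft tool.

```python
import math

DISKS_PER_ENCLOSURE = 25
ENCLOSURES_PER_USBB = 2


def usbbs_needed(total_disks):
    """Number of USBBs required to house the given number of disks."""
    return math.ceil(total_disks / (DISKS_PER_ENCLOSURE * ENCLOSURES_PER_USBB))


def lun_raw_capacity_gb(disks_in_lun, disk_gb=146):
    """Raw RAID 10 LUN capacity: half of the disks hold mirror copies."""
    return disks_in_lun // 2 * disk_gb


def mailbox_data_per_lun_gb(mailboxes_per_usbb, quota_gb, database_luns=3):
    """Mailbox data, excluding overhead, that lands on each database LUN."""
    return mailboxes_per_usbb * quota_gb / database_luns


if __name__ == "__main__":
    print("USBBs for 200 disks:", usbbs_needed(200))
    print("Raw capacity of a 14-disk LUN:", lun_raw_capacity_gb(14), "GB")
    print(f"Mailbox data per LUN: {mailbox_data_per_lun_gb(1000, 2):.0f} GB")
```

The gap between the raw LUN capacity and the mailbox data per LUN is the headroom available for database overhead, content indexes, and unexpected growth.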

To ensure timely completion of database maintenance routines, backup processes, and restore operations according to SLA requirements, Microsoft IT keeps the size of individual mailbox databases at approximately 200 GB. At a data transfer rate of 60 MB per second, it takes about 57 minutes to restore 200 GB. This complies with Microsoft IT RTOs requiring individual databases to be restored through reseeding or from backup in less than one hour. If multiple databases are affected, Microsoft IT can double the bandwidth utilization to 120 MB per second by performing multiple restores or reseeds in parallel. Accordingly, Microsoft IT distributes the mailbox data per database LUN across four mailbox databases. Each database contains approximately 84 mailboxes.
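
The relationship between database size, transfer rate, and the one-hour RTO can be expressed as a one-line calculation. The Python sketch below assumes a sustained transfer rate and 1 GB = 1,024 MB; the function name is hypothetical.

```python
def restore_minutes(database_gb, mb_per_second):
    """Time to move a database of the given size at a sustained transfer rate."""
    return database_gb * 1024 / mb_per_second / 60


if __name__ == "__main__":
    minutes = restore_minutes(200, 60)
    print(f"200 GB at 60 MB/s: about {minutes:.0f} minutes")
    print("Within one-hour RTO:", minutes < 60)
```

At 60 MB per second, a database much larger than about 210 GB would no longer fit into the one-hour window, which is consistent with the 200-GB target.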

Using four mailbox databases per database LUN and 12 database LUNs per Mailbox server implies that Microsoft IT must configure 48 databases, each placed in a separate storage group according to CCR requirements. However, Microsoft IT uses only one RAID 10 LUN per USBB for all transaction logs combined. Due to the sequential nature of transactional I/O in the checkpoint advancement mechanism, there is no performance improvement in spreading the transaction logs of the storage groups per USBB over multiple drives.
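
The database count and the number of mailboxes per database follow from the layout parameters. A minimal Python sketch, assuming four USBBs, three database LUNs per USBB, and four databases per LUN as described above (the function name is illustrative):

```python
import math


def database_layout(mailboxes, usbbs=4, luns_per_usbb=3, dbs_per_lun=4):
    """Return (number of databases, mailboxes per database rounded up)."""
    databases = usbbs * luns_per_usbb * dbs_per_lun
    return databases, math.ceil(mailboxes / databases)


if __name__ == "__main__":
    databases, per_db = database_layout(4000)
    print(databases, "databases with about", per_db, "mailboxes each")
```

Because CCR requires one database per storage group, the first value is also the number of storage groups to configure.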

Note: For further information regarding the design of the corporate messaging environment, including Mailbox server designs for 6,000 users with 500-MB mailbox quotas and all other Exchange server roles, see the Microsoft IT Showcase technical white paper "Exchange Server 2007 Design and Architecture at Microsoft," available at http://www.microsoft.com/downloads/details.aspx?FamilyID=98C522BC-814A-421A-99C0-D964ED119C0D&displaylang=en.

Microsoft IT Scalability Experiment

The Mailbox server design for 4,000 heavy users with 2-GB quotas represents the largest server configuration in terms of data capacity that Microsoft IT currently deploys in the corporate production environment. Although there is an effort to increase the number of mailboxes per server for this user profile further, the corresponding server designs are not final yet. Microsoft IT must evaluate several dependencies and SLAs, including RPOs, RTOs, and server scalability, before Mailbox server capacities can exceed 12 terabytes of messaging data.

For the majority of users, Microsoft IT continues to provision mailboxes with 500-MB quotas, primarily on servers with 6,000 mailboxes. The server design corresponds to the design for 2-GB mailboxes, yet with half the number of USBBs, a single RAID controller, and 100 SAS disks per cluster node according to the lower capacity requirements, as illustrated in Figure 11.

Figure 11. Microsoft IT Mailbox server design for 6,000 users with 500-MB quotas

Analyzing performance data gathered in the production environment enabled Microsoft IT to determine that the scalability limit of the current server models with two quad-core processors is at 6,000 heavy Microsoft users with a concurrency ratio of 100 percent. Beyond 6,000 concurrent heavy users, processor resources become a bottleneck and RPC communication issues can occur. However, light and medium users, such as vendors, contractors, and part-time employees, do not show the same usage patterns and concurrency ratios at Microsoft. For these users, Microsoft IT recently developed a Mailbox server design for 12,000 mailboxes. The design corresponds exactly to the configuration shown in Figure 10 for 4,000 mailboxes with 2-GB quotas. Only the number of mailboxes per database differs (250 instead of 84 mailboxes) and the quotas are at 500 MB. Microsoft IT operates one Mailbox server in this configuration.

Microsoft IT server architects call the 12,000-mailbox server deployment a scalability experiment, although it is in fact a production server maintained according to SLAs. Yet, at a scale of more than 6,000 users, new issues and risks arise that must be taken into consideration. According to Matt Gossage, Senior Program Manager and responsible for the development of ESE technologies, Mailbox server scalability is not only a question of storage engine capabilities. Messaging clients also communicate with Active Directory, and if users work over Internet connections, firewalls and Exchange Server 2007 Client Access servers can affect system scalability as well. Network communication issues can arise when a very large number of concurrent users, possibly working with multiple client instances, access their mailboxes through a Client Access server. Although Microsoft IT plans to go beyond 12,000 mailboxes per Mailbox server in the near future, server architects remain cautious because of the scalability limits of client/server communication.

When scaling up Mailbox servers, it is important to keep in mind that Microsoft Office Outlook 2007 clients establish multiple simultaneous TCP/IP connections and RPC sessions to the server. TCP/IP connections and RPC sessions vary based on client use. TCP/IP connections temporarily increase based on certain client actions such as public folder access, Address Book lookups, and accessing shared calendars. Both Office Outlook client modes (Cached Mode/Online Mode) and the server's connectivity configuration (direct or Outlook Anywhere) have a significant impact on the number of TCP/IP and RPC sessions established to the server. For example, when users communicate with a Mailbox server through Outlook Anywhere, there may be more RPC sessions to the Exchange System Attendant process for Active Directory connections via DSProxy than RPC sessions to the Exchange Information Store process for access to mailbox data. For every inbound directory connection, DSProxy must establish an outbound connection to Active Directory. There are TCP/IP and RPC scalability limits in Windows Server 2003, Windows Server 2008, and Exchange Server 2007. By default, the Microsoft TCP/IP implementation cannot support more than 65,535 outbound TCP/IP connections and Exchange Server 2007 is limited to 60,000 RPC sessions. These limits are set for security and performance reasons. Office Outlook client actions will begin to fail when one or both of these limits are exceeded, resulting in a poor user experience.

Due to the TCP/IP connection and RPC session limits and the variability of the number of connections and sessions due to client configuration and use, Microsoft recommends that customers do not place more than 10,000 mailboxes per server without clarifying usage profiles (including client mode, configuration, and concurrency) and network communication requirements. Table 3 summarizes the maximum number of concurrent Office Outlook clients supported per Mailbox server based on the TCP/IP connection and RPC session scalability limits.

Table 3. Maximum Number of Concurrent Office Outlook Clients per Mailbox Server

Client configuration                 Office Outlook 2007
Direct Access in Online Mode         60,000
Direct Access in Cached Mode         30,000
Outlook Anywhere in Cached Mode      10,000
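
As a simple illustration, the limits in Table 3 can be encoded as a lookup to check a planned server against the client configuration in use. The dictionary keys and the function name in this Python sketch are hypothetical; only the numeric limits come from the table.

```python
# Maximum concurrent Office Outlook 2007 clients per Mailbox server (Table 3).
MAX_CONCURRENT_CLIENTS = {
    "direct_online": 60_000,
    "direct_cached": 30_000,
    "outlook_anywhere_cached": 10_000,
}


def within_table3_limit(client_config, concurrent_clients):
    """True if the planned concurrency stays within the Table 3 limit."""
    return concurrent_clients <= MAX_CONCURRENT_CLIENTS[client_config]


if __name__ == "__main__":
    for config in MAX_CONCURRENT_CLIENTS:
        print(config, within_table3_limit(config, 12_000))
```

A real deployment mixes client modes, so a check like this is only a first approximation of the usage profile analysis recommended above.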

Note: For more information about Mailbox server scalability and design recommendations, visit the Exchange Team Blog at http://msexchangeteam.com.

Best Practices

Microsoft IT continuously evaluates emerging new server and storage technologies and their potential to achieve improvements in the design of Mailbox servers. Specifically, CCR enables Microsoft IT to capitalize on recent DAS innovations to meet business and technical requirements better than with previous technology. During the continuous process of optimizing the Exchange Server 2007 Mailbox server design to increase mailbox density per server, consolidate the corporate messaging environment, lower TCO, and exceed high-availability SLAs, Microsoft IT developed the following general best practices that might also be useful to other IT organizations that want to improve their Mailbox server designs by using Exchange Server 2007:

  • Eliminate storage as a critical single point of failure Storage is the most crucial component in the Mailbox server design. Microsoft IT strongly recommends using CCR to eliminate this component as a critical single point of failure to reduce the need for restores from backup in the event of a storage failure. It is important to note that CCR is storage platform agnostic. Microsoft IT deployed CCR on non-shared SAN storage during the very early stages of the internal production rollout (Beta 1 time frame), but soon switched to CCR on DAS for cost reasons.
  • Use enterprise storage hardware Regardless of the underlying storage technology (Fibre Channel, SAS, or SATA), disk performance and reliability are important considerations, especially in environments with power users and a concurrency level on Mailbox servers of almost 100 percent. For this reason, Microsoft IT prefers to use SFF SAS disks with an interface speed of 3 gigabits per second (Gbps) and AFR values of 0.55 percent. If the manufacturer of the hard disk drive does not provide AFR ratings, Microsoft IT evaluates the mean time between failures (MTBF). MTBF values above 1 million hours are an indicator of a disk suitable for Exchange Server 2007 Mailbox servers.
  • Focus on simplicity When shifting from a SAN environment to DAS technology, it is particularly important to focus on straightforward storage configurations that IT administrators can master with familiar configuration tools and without the involvement of storage experts. SAS technology provides a high level of flexibility and external storage enclosures that help to separate the storage subsystem from the remaining server hardware. It is also a good idea to design DAS-based storage subsystems generously to provide sufficient capacity for future needs. When DAS-based storage is properly designed, it is deployed once and never touched again for the lifetime of the Mailbox server.
  • Standardize the storage layout In enterprise environments with a large number of Mailbox servers, it is advantageous to standardize the Mailbox server design and storage layout. This results in consistent configuration, maintenance, and operations procedures that are applied to all servers, regardless of the number of users a particular server supports. The standardized design helps to lower operations costs, reduces the risk of human errors, and facilitates the automation of recurring tasks.
  • Design the storage subsystem for both capacity and performance needs Large Mailbox servers require a storage subsystem that provides sufficient capacities. Delivering data to clients with fast response times means taking into account the expected data volume and sufficient I/O performance according to the expected concurrent workload. The primary choices to achieve these goals are RAID 10 and RAID 5. Because RAID 10 provides a higher level of redundancy and better performance than RAID 5, it is the preferred choice for the storage subsystem of Mailbox servers. For example, RAID 10 enables Microsoft IT to mirror the disks across enclosures. In this configuration, an entire storage enclosure can fail without affecting the availability of the storage subsystem. RAID 5 cannot reach this high level of resilience to hardware failures. However, CCR reduces the resiliency requirement on individual cluster nodes, making it possible to use RAID 5 for better disk capacity utilization. In any case, on a healthy Mailbox server, the LogicalDisk/Avg. Disk sec/Read performance counter of the database drives should be below 10 ms on average. An average value above 20 ms typically indicates a hardware or performance problem.
  • Perform backups on the passive node Although CCR reduces the need for restores from backup, it does not eliminate the need for backups. Backups still provide the last line of recoverability in the event that storage fails on both nodes of a Mailbox server. Microsoft IT uses System Center Data Protection Manager 2007, which is fully CCR aware and can perform software VSS backups on the passive node without affecting users accessing their mailboxes on the active node and without requiring specialized skills as a backup administrator or storage expert.
  • Keep the size of individual messaging databases small Maximizing the number of storage groups and mailbox databases on a Mailbox server provides Exchange Server 2007 with a maximum database checkpoint depth to optimize database write operations, and it helps to keep the size of individual mailbox databases small. Small databases facilitate timely restores through reseeding or by using backups. Small databases also facilitate online maintenance routines that Exchange Server 2007 performs regularly to keep the databases in a consistent and defragmented state and the Mailbox server healthy.
  • Continuously monitor Mailbox servers Although CCR on DAS enables IT organizations to achieve such high availability levels as 99.99 percent, it does not do so single-handedly. It takes the right mix of technology, people, and processes and a proactive approach to service management. By using System Center Operations Manager 2007 to monitor all systems in the corporate messaging environment, including Mailbox servers, Microsoft IT is able to resolve the vast majority of all incidents before users notice them. For example, Operations Manager alerts front-line operators if average response times of a Mailbox server exceed 20 ms, which is usually an indicator of hardware issues because the Microsoft IT Mailbox server design is free from performance issues such as hot-spot contention. Operations Manager 2007 also alerts front-line operators if Exchange Server services, such as the Exchange Replication Service on the passive node, are stopped without being restarted for any reason, so that front-line operators can take action promptly and restart the services. The Exchange Server services must run on both nodes for CCR to be able to replicate transaction logs.
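
The read-latency guidance in the list above (below 10 ms on a healthy server, above 20 ms as an indicator of a problem) can be expressed as a simple classification. This Python sketch is illustrative only: the function name and the labels are hypothetical, and the middle band between 10 ms and 20 ms is an assumption, because the text does not name it.

```python
def classify_read_latency(avg_read_ms):
    """Classify an average LogicalDisk Avg. Disk sec/Read value given in milliseconds."""
    if avg_read_ms < 10:
        return "healthy"
    if avg_read_ms <= 20:
        # Assumed intermediate band; the source defines only the two outer thresholds.
        return "elevated"
    return "investigate"


if __name__ == "__main__":
    for sample in (6.5, 14.0, 27.0):
        print(sample, "ms:", classify_read_latency(sample))
```

In practice, a monitoring product such as Operations Manager evaluates such thresholds over a sampling window rather than on single values.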

Conclusion

CCR is a groundbreaking Exchange Server technology that enables enterprise IT organizations, such as Microsoft IT, to achieve significant improvements in the Mailbox server design. It provides new design options to balance processor, memory, and storage resources in order to achieve optimal system performance while remaining cost efficient. It substantially simplifies the storage design so that the IT organization can realize new levels of efficiency in maintenance and operational processes. Most importantly, it provides the opportunity to support notably increased mailbox sizes at drastically decreased storage costs while maintaining or exceeding existing SLAs. In conjunction with optimized search capabilities, SLAs of 99.99 percent and above help to increase employee productivity, facilitate data maintenance and backups, and reduce security risks and support costs.

For Microsoft IT, designing Mailbox servers with CCR on DAS has proven to be a sound strategic decision. The benefits are clear and measurable. CCR provides Microsoft IT with the ability to eliminate the Exchange databases as a critical single point of failure in the Mailbox server design, use low-cost DAS technology instead of high-cost SAN equipment, and exceed existing SLAs demanding 99.99 percent availability. CCR reduces the need for restores from backup. A failover to the passive node is now the primary recovery mechanism for all active node failures. CCR on DAS eliminates maintenance overhead. There is no need for storage capacity management or performance management beyond the initial storage design. CCR on DAS is designed once and never touched again.
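As a rough illustration of what a 99.99 percent availability target allows, the following sketch converts an availability percentage into a yearly downtime budget. The function name and the 365-day year are assumptions made for this example; how Microsoft IT actually measures availability against its SLA is not described here.

```python
def downtime_budget_minutes(availability_percent: float, days: int = 365) -> float:
    """Minutes of downtime per period that still meet the availability target."""
    total_minutes = days * 24 * 60
    return total_minutes * (1 - availability_percent / 100)


if __name__ == "__main__":
    budget = downtime_budget_minutes(99.99)
    # 525,600 minutes in a 365-day year, of which 0.01 percent may be downtime.
    print(f"99.99 percent allows about {budget:.2f} minutes of downtime per year")
```

Under these assumptions the budget is a little under an hour per year, which is why failover to the passive node, rather than a restore from backup, has to be the primary recovery mechanism.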

Other important advantages include streamlined backup procedures and predictable Mailbox server performance. By using Data Protection Manager 2007, Microsoft IT can perform online backups every 15 minutes on the passive node without affecting users accessing their mailboxes on the active node. Backups are still important as a last line of recoverability in the event that storage fails on both cluster nodes. Mailbox server performance is stable and predictable because Exchange Server 2007 can use 100 percent of the storage resources directly attached to the cluster node without sharing I/O operations or disk space with other applications. The DAS-based Mailbox server design implicitly conforms to the Microsoft recommendation to provide dedicated storage for Exchange Server.

CCR on DAS is enterprise ready and provides Microsoft IT with a solid foundation to keep pace with the trends in the corporate messaging environment. At Microsoft, mailboxes steadily increase in number and size. During the Exchange Server 2003 time frame, Microsoft IT supported the user base with 4,000 mailboxes per Mailbox server and 200-MB quotas—a data volume per server of about 1 terabyte, including database overhead. Today, Microsoft IT Mailbox servers support up to 12 terabytes of messaging data. In the future, this data volume will grow even more as Microsoft IT continues the server consolidation effort in the corporate messaging environment by increasing mailbox numbers and mailbox quotas per server. A few details remain to be examined before Microsoft IT Mailbox servers can go beyond 12 terabytes, yet the key question regarding storage technology is settled. CCR on DAS is the only cost-efficient option to support massive amounts of messaging data without sacrificing reliability or high availability on Mailbox servers running Exchange Server 2007.
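The Exchange Server 2003 figure quoted above can be checked with simple arithmetic. The sketch below assumes decimal units (1 terabyte = 1,000,000 MB) and treats the gap between the raw quota total and the stated "about 1 terabyte" as database overhead; the function name and the derived overhead ratio are illustrative and are not figures published by Microsoft IT.

```python
MB_PER_TB = 1_000_000  # decimal units, an assumption for this example


def raw_mailbox_data_tb(mailboxes: int, quota_mb: int) -> float:
    """Raw mailbox data in terabytes if every mailbox is filled to its quota."""
    return mailboxes * quota_mb / MB_PER_TB


if __name__ == "__main__":
    raw_2003 = raw_mailbox_data_tb(4000, 200)   # Exchange Server 2003 design
    stated_2003_tb = 1.0                        # "about 1 terabyte" from the text
    implied_overhead = stated_2003_tb / raw_2003 - 1

    print(f"Raw mailbox data: {raw_2003:.2f} TB")
    print(f"Implied overhead: {implied_overhead:.0%}")
    print(f"Growth to 12 TB:  {12 / stated_2003_tb:.0f}x")
```

With these assumptions, 4,000 mailboxes at 200 MB amount to 0.8 terabytes of raw data, so the stated 1 terabyte implies roughly a quarter more for database overhead, and the current 12-terabyte servers hold about twelve times the data of the earlier design.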

For More Information

For more information about Microsoft products or services, call the Microsoft Sales Information Center at (800) 426-9400. In Canada, call the Microsoft Canada Information Centre at (800) 563-9048. Outside the 50 United States and Canada, please contact your local Microsoft subsidiary. To access information through the World Wide Web, go to:

http://www.microsoft.com

http://www.microsoft.com/technet/itshowcase

The information contained in this document represents the current view of Microsoft Corporation on the issues discussed as of the date of publication. Because Microsoft must respond to changing market conditions, it should not be interpreted to be a commitment on the part of Microsoft, and Microsoft cannot guarantee the accuracy of any information presented after the date of publication.

This White Paper is for informational purposes only. MICROSOFT MAKES NO WARRANTIES, EXPRESS, IMPLIED, OR STATUTORY, AS TO THE INFORMATION IN THIS DOCUMENT.

Complying with all applicable copyright laws is the responsibility of the user. Without limiting the rights under copyright, no part of this document may be reproduced, stored in or introduced into a retrieval system, or transmitted in any form or by any means (electronic, mechanical, photocopying, recording, or otherwise), or for any purpose, without the express written permission of Microsoft Corporation.

Microsoft may have patents, patent applications, trademarks, copyrights, or other intellectual property rights covering subject matter in this document. Except as expressly provided in any written license agreement from Microsoft, the furnishing of this document does not give you any license to these patents, trademarks, copyrights, or other intellectual property.

Unless otherwise noted, the example companies, organizations, products, domain names, e-mail addresses, logos, people, places, and events depicted herein are fictitious, and no association with any real company, organization, product, domain name, e-mail address, logo, person, place, or event is intended or should be inferred.

© 2008 Microsoft Corporation. All rights reserved.

Microsoft, Active Directory, ActiveSync, Outlook, SQL Server, Windows, and Windows Server are either registered trademarks or trademarks of Microsoft Corporation in the United States and/or other countries.

All other trademarks are property of their respective owners.
