Messaging Backup and Restore at Microsoft

Technical Case Study

Published: June 16, 2003

Microsoft's internal IT department, part of the Operations and Technology Group (OTG), requires a reliable backup and restore solution to support messaging server operations. Using the improved features in Microsoft Exchange 2003 and the Microsoft Windows Server System, OTG created a new, more efficient design for servers running Exchange and a two-step backup solution that minimizes user downtime and makes backup and restore operations more flexible.

Situation

As part of Microsoft's internal deployment of Microsoft Exchange 2003 and the Microsoft Windows Server System, Microsoft IT began to consolidate messaging servers to reduce operating costs. High availability requirements for the clustered server redesign prompted Microsoft IT to examine its backup and restore solution.

Solution

Microsoft IT implemented a two-step backup solution to improve server performance, and support many more mailboxes per server. Microsoft IT also implemented the Exchange 2003 Recovery Storage Group (RSG) feature that minimizes user downtime during a database recovery. The new solution supports server consolidation with high availability while providing greater backup and restore flexibility.

Products & Technologies

  • Microsoft Exchange Server 2003
  • Microsoft Windows Server System
  • Veritas Network Storage Executive (NSE)
  • HP StorageWorks Enterprise Virtual Array 5000 storage area network (SAN) solution
  • Microsoft Operations Manager (MOM) Exchange 2003 Management Pack (MP)

Benefits

  • Increased server availability during single node failures
  • Reduced operating costs from server consolidation
  • Minimized impact to users during backup using an alternate passive node in the clustered configuration
  • Flexible recovery solution allowed users to continue working while restoration of mailbox/database was done offline and merged later

This case study is intended for enterprise technical decision makers and messaging operations personnel. A clear understanding of Exchange, backup procedures, and cluster technology is assumed.

Introduction

E-mail has become a vital part of today's communication infrastructure because it connects people with knowledge. However, IT departments are challenged to support this growing technology in a timely and cost-effective manner. Like any other enterprise IT department, OTG requires a solid backup and restore solution to keep messaging services up and running.

As of this writing, OTG is in the process of consolidating regional mailbox servers and locations. More users per server and much more data on disk increase the risk in the event of failure. The messaging group in OTG measures mailbox service availability as the product of downtime and the number of users affected. For example, a one-minute outage affecting 1,000 users is measured as 1,000 minutes of downtime.
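The downtime measure described above can be written as a small calculation. This is a minimal sketch only; the function names are hypothetical and are not taken from any OTG tooling.

```python
def user_minutes_of_downtime(outage_minutes, users_affected):
    """Downtime weighted by the number of users who experienced it."""
    return outage_minutes * users_affected


def total_user_minutes(outages):
    """Sum user-minutes over (outage_minutes, users_affected) pairs."""
    return sum(user_minutes_of_downtime(m, u) for m, u in outages)


# The example from the text: a one-minute outage affecting 1,000 users.
example = user_minutes_of_downtime(1, 1000)
```

Weighting by users affected means that a short outage on a heavily consolidated server can count for more than a long outage on a small one, which is why consolidation raises the stakes for backup and restore.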

OTG uses a two-stage backup process (disk-to-disk and disk-to-tape) to manage this risk. This eliminates server performance impact from tape backup, and provides greater flexibility. The solution is based on a combination of:

  • Exchange Server 2003
  • Microsoft Windows Server™ 2003, Enterprise Edition
  • Windows NT® Backup for disk-to-disk backup
  • Veritas Storage Management solution for disk-to-tape backup
  • Hewlett-Packard's StorageWorks Storage Area Network (SAN) solution
  • Microsoft Operations Manager Management Pack for Exchange Server 2003

The two-step backup operation and new clustered design offer many benefits, including:

  • Flexibility to support an increase in the number of supported mailboxes per server
    • From 1,500 to 2,700 mailboxes per virtual server in the regional locations
    • From 3,000 to 4,000 mailboxes per virtual server in the headquarters data center implementation
  • Supports doubling of user mailbox limits, from 100 megabytes (MB) to 200 MB, while maintaining a one-hour service level agreement (SLA) per database for both backup and restore operations
  • Reduces operating costs by consolidating 115 smaller regional mailbox servers to 29 mailbox servers, and eliminating dedicated servers for database restore operations
  • Helps improve existing service uptime SLA requirement
    • 99.9% for legacy standalone servers
    • 99.99% for new clustered servers

Situation

To put the scale of the situation in context, OTG is responsible for maintaining more than 150,000 computers worldwide. At the time of this writing, OTG supports 190 servers running Exchange Server 2003 on Windows Server 2003 Enterprise Edition, distributed over 75 locations worldwide. 115 of these servers are mailbox servers hosting more than 82,000 mailboxes.

The corporate e-mail infrastructure at Microsoft, as of June 2003, is comprised of:

  • Global mail flow of 6,500,000 messages per day, with 2,500,000 average Internet mail messages per day
  • 20 databases per server, with 50 gigabyte (GB) maximum database size on new clustered deployments
  • Global service availability of 99.9% with a target to achieve 99.99% on clustered designs
  • Worldwide mail delivery in less than 5 minutes, 90% of the time
  • Backup and restore operation SLA of less than one hour per database

In the past, it was challenging to maintain the one-hour backup and restore SLA on direct-attached SCSI server implementations. These server designs used a one-step backup process (disk-to-tape), in which backups were written to tape devices over the 100 megabit-per-second (Mbps) LAN, which limited throughput (averaging 300-400 MB per minute). Backups were limited to non-business hours to minimize the impact on clients with mailboxes hosted on these servers. Corruption of a single mailbox store could affect up to 1,000 mailboxes, causing an extended loss of productivity for the users hosted on that server, at a cost of $60-$80 per hour per user. Single mailbox restore operations required dedicated servers. This configuration is shown in Figure 1.

Figure 1: Old regional messaging environment
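To see why the one-hour SLA was hard to meet on the old design, it helps to work out how much data can move in an hour at the quoted LAN-based tape rates. The sketch below is illustrative only: the function name is hypothetical, and it assumes 1 GB = 1,000 MB for simplicity.

```python
SLA_MINUTES = 60  # one-hour backup and restore SLA per database


def data_movable_within_sla_gb(rate_mb_per_min, sla_minutes=SLA_MINUTES):
    """Largest amount of data (GB) that fits in the SLA window at a given rate.

    Assumes 1 GB = 1,000 MB, which is a simplification.
    """
    return rate_mb_per_min * sla_minutes / 1000


# Quoted LAN-based tape throughput: 300-400 MB per minute.
low_gb = data_movable_within_sla_gb(300)
high_gb = data_movable_within_sla_gb(400)
```

At these rates only about 18 to 24 GB can be backed up or restored within the hour, so any database larger than that could not meet the SLA over the LAN.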

Backup Solution

To solve these problems and support server consolidation, OTG uses a flexible two-stage process to back up data within a multinode clustered configuration: disk-to-disk (stage 1) and disk-to-tape (stage 2).

The first stage backup runs on all active nodes within the cluster to complete an online backup to a dedicated pool of disks with capacity to support two-day online retention. With the clustered implementation, the data from the first stage is then moved using cluster functionality to an alternate passive node. This node transfers the backup files to a pool of fiber-attached tape devices. This eliminates the performance impact to users by offloading the second stage backup processing (copying the files to tape) to the alternate passive node. This process is shown in Figure 2.

Figure 2: Two-step backup process

Recovery Solution

The method of recovery is based on the type of failure and a business decision between restoring service and restoring data.

For example, if a single database must be restored on a cluster, up to 200 people are affected. Because up to 2 days of backup data is available on disk and can be restored online in less than an hour (restore rates of up to 2 GB/min can be achieved), regular Exchange restore procedures are used to get users quickly back online with their data.
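The restore-time claim can be checked with simple arithmetic. The sketch below assumes the 50 GB maximum database size quoted earlier for clustered deployments and the best-case 2 GB/min disk restore rate; the function names are hypothetical, and real restores also include log replay time, which is not modeled here.

```python
SLA_MINUTES = 60  # one-hour restore SLA per database


def restore_minutes(db_size_gb, restore_gb_per_min):
    """Time to copy a database back from disk at a steady restore rate."""
    return db_size_gb / restore_gb_per_min


def meets_sla(db_size_gb, restore_gb_per_min, sla_minutes=SLA_MINUTES):
    """True if the copy phase alone fits within the SLA window."""
    return restore_minutes(db_size_gb, restore_gb_per_min) <= sla_minutes


# A full 50 GB database at the best-case disk restore rate of 2 GB/min.
disk_restore = restore_minutes(50, 2.0)
```

At 2 GB/min a full 50 GB database is copied back in about 25 minutes, which leaves headroom within the hour. The same database at the older 0.4 GB/min LAN-based tape rate would take roughly two hours.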

If an entire storage group (5 databases) fails, as many as 1,000 users may be affected. Restoring large amounts of data during business hours would prevent user access to email for the period of restoration. To quickly restore service, OTG can use a new Exchange 2003 feature called the Recovery Storage Group (RSG). OTG mounts an empty "stub" mailbox that enables users to receive and send new mail, but without access to old mail. Service is quickly restored because users work from this stub mailbox while OTG restores the databases to the RSG. When the restoration is complete, OTG can merge the old and new data from the "stub" using the Microsoft Exchange Mailbox Merge Wizard (also called Exmerge). Because their restore speed is restricted to LAN-based tape, this method is also used for the legacy non-clustered servers that are currently in the process of being consolidated.

Server Design

OTG's clustered design supports a significant increase in the number and size of mailboxes per Exchange server. It helps eliminate the performance impact on users during the second-stage backup process by offloading that process to non-active servers within the cluster, which helps maintain the existing SLA.

OTG's design goal was to support 8,000 mailboxes per SAN, with 200 MB mailbox limits, 99.99% cluster server availability, and less than one hour per database backup and restore time.

Regional Design

The server specification for the regional cluster implementation consists of one Enterprise Virtual Array (EVA), three active nodes, one primary passive node, and one alternate passive node (AAAPp).

Headquarters Design

The headquarters clustered implementation is similar in design. It consists of two EVAs per cluster, with four active nodes, one primary passive node, and two alternate passive nodes for streaming backups from disk to tape (AAAAPpp). These configurations are shown in Table 1.

Table 1. Cluster Design Specifications

                                 Regional    Headquarters
  4 CPU Active Nodes             3           4
  4 CPU Passive Nodes            1           1
  2 CPU Alternate Passive Nodes  1           2
  EVA                            1           2
  Mailboxes per cluster          8,000       16,000

The passive nodes provide two benefits: a failover solution and an independent resource for the backup process. The primary passive node supports peak-period failover. The alternate passive nodes are primarily used to transfer backups to tape, which allows the active nodes to focus solely on serving users. The alternate passive nodes also provide off-peak failover support, which speeds up rolling upgrades and patch application. The 16,000 mailbox headquarters implementation design is shown in Figure 3.

Figure 3: Clustered Exchange 2003 design

Best Practices

Two-Step Backup

Two-step backup has proved beneficial, especially during the second stage (disk-to-tape). In the past, both stages were managed by the same server running Exchange, which degraded server performance during the disk-to-tape phase. Now, using clustering improvements in Exchange and Windows®, OTG is able to offload stage two from an active node to an alternate passive node, which minimizes user impact and allows second-stage backups to occur at any time, if necessary.

Backup Throughput Adjustment

OTG discovered a way to more than double disk-to-disk backup rates with the Windows Backup utility through a registry adjustment. The adjustment increased average throughput from 600 MB/min to 1,200 MB/min per storage group. The adjustment is set in the profile of the user account that runs the backup script (the HKEY_CURRENT_USER registry hive).

OTG runs two concurrent backup jobs per active Exchange instance, providing an aggregate rate of around 2.4 GB per minute per server, with two to three servers per SAN enclosure (depending on the headquarters or regional design). OTG has observed a maximum throughput, without excessive read and write disk latencies, of approximately 6.3 GB/min per SAN enclosure. Throughput depended on Logical Unit Number (LUN) distribution across controllers, with the Data, Log, and Backup LUNs for each storage group assigned to the same controller.
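The aggregate figures above follow from the per-storage-group rate. The sketch below reproduces that arithmetic; treating the observed 6.3 GB/min figure as a hard ceiling per enclosure is an assumption made for illustration, and the function names are hypothetical.

```python
PER_JOB_MB_PER_MIN = 1200          # tuned disk-to-disk rate per storage group
JOBS_PER_SERVER = 2                # concurrent backup jobs per active instance
ENCLOSURE_CEILING_MB_PER_MIN = 6300  # observed maximum without excess latency


def server_rate(per_job=PER_JOB_MB_PER_MIN, jobs=JOBS_PER_SERVER):
    """Nominal aggregate backup rate for one active server, in MB/min."""
    return per_job * jobs


def enclosure_rate(servers, ceiling=ENCLOSURE_CEILING_MB_PER_MIN):
    """Nominal rate for all servers on one enclosure, capped at the ceiling.

    The cap is an assumption: it treats the observed maximum as a limit.
    """
    return min(servers * server_rate(), ceiling)


two_server_enclosure = enclosure_rate(2)    # two servers on one enclosure
three_server_enclosure = enclosure_rate(3)  # three servers on one enclosure
```

With two servers the nominal demand (4.8 GB/min) sits below the observed maximum, while three servers would nominally ask for 7.2 GB/min, so under this assumption the enclosure rather than the servers becomes the limiting factor.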

The mode used for optimized throughput is:

  • Storage Group 1 and 2 - Data, Log and Backup on Controller one
  • Storage Group 3 and 4 - Data, Log and Backup on Controller two
  • Job concurrency limited to two per server with SG1 and SG3 run concurrently followed by SG2 and SG4
  • RAID
    • Target LUNs for backup were Vraid5
    • All Vraid5 LUNs with write-back cache disabled
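The job ordering in the list above pairs one storage group from each controller in every wave. A minimal sketch of that scheduling idea follows; the mapping mirrors the list, but the function and variable names are hypothetical and this is not OTG's backup script.

```python
# Storage group -> controller, as described in the list above.
CONTROLLER_OF = {"SG1": 1, "SG2": 1, "SG3": 2, "SG4": 2}


def backup_waves(controller_of, max_concurrent=2):
    """Group storage groups into waves that spread load across controllers.

    Each wave takes at most one storage group per controller and at most
    max_concurrent jobs in total.
    """
    by_controller = {}
    for sg, ctrl in sorted(controller_of.items()):
        by_controller.setdefault(ctrl, []).append(sg)
    queues = [list(sgs) for _, sgs in sorted(by_controller.items())]

    waves = []
    while any(queues):
        wave = []
        for queue in queues:
            if queue and len(wave) < max_concurrent:
                wave.append(queue.pop(0))
        waves.append(wave)
    return waves


waves = backup_waves(CONTROLLER_OF)
```

For the mapping shown, this yields SG1 with SG3 in the first wave and SG2 with SG4 in the second, so the two concurrent jobs never compete for the same controller.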

Vraid1 targets would provide better throughput and, as of this writing, are under consideration as an option with the 146 GB disks for the first-stage backup (disk-to-disk).

Server Design

OTG implemented a multi-node cluster design in which the active and primary passive servers each have four Hyper-Threading 1.9 gigahertz (GHz) processors and 4 GB of RAM. These servers run Windows Server 2003, Enterprise Edition and Exchange Server 2003 with the following modifications:

  • /3GB switch set in the Boot.ini file
  • /USERVA=3030 parameter set in the Boot.ini file
  • SystemPages set to 0
  • Mount Points are used to mitigate drive letter limitations for supporting the log, SMTP and backup drives
  • Backup disks are maintained in a separate cluster resource group to allow independent LUN movement between cluster nodes

Each Exchange Virtual Server (EVS) hosts four storage groups, for a total of twenty databases of about 50 GB each per EVS. Each database is configured with a 200 MB mailbox limit. This design allows a maximum of 200 mailboxes per database, which equals 4,000 mailboxes per EVS and 16,000 mailboxes in total across the four active nodes of the headquarters cluster.
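The capacity figures in this design can be cross-checked with a few lines of arithmetic. The sketch below uses the numbers quoted in this case study (five databases per storage group, four active nodes at headquarters); the constant names are hypothetical, and 1 GB is taken as 1,000 MB for simplicity.

```python
STORAGE_GROUPS_PER_EVS = 4
DATABASES_PER_STORAGE_GROUP = 5
MAILBOXES_PER_DATABASE = 200
MAILBOX_LIMIT_MB = 200
MAX_DATABASE_GB = 50
ACTIVE_NODES_HEADQUARTERS = 4

databases_per_evs = STORAGE_GROUPS_PER_EVS * DATABASES_PER_STORAGE_GROUP
mailboxes_per_evs = databases_per_evs * MAILBOXES_PER_DATABASE
mailboxes_per_cluster = mailboxes_per_evs * ACTIVE_NODES_HEADQUARTERS

# Worst case: every mailbox in a database is at its limit (1 GB = 1,000 MB).
full_database_gb = MAILBOXES_PER_DATABASE * MAILBOX_LIMIT_MB / 1000
fits_in_database = full_database_gb <= MAX_DATABASE_GB
```

Even with every mailbox at its 200 MB limit, a database holds about 40 GB of mailbox data, which stays under the 50 GB maximum database size.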

The alternate passive node configuration is a less expensive server with dual 2.4 GHz CPUs, and 2 GB RAM. These nodes are used to transfer the backup files from step one to a fiber-connected LTO-1 Tape Silo. This part of the process is managed with Veritas Network Storage Executive.

Manage Transaction Logs

Increasing the number of mailboxes per server also increases the number of transaction logs per server. The time it takes to replay transaction logs significantly impacts the time it takes to restore a server. Calculate the time to replay the logs, monitor the average number of logs per day, and then adjust recovery plans accordingly.
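One way to fold log replay into a recovery plan is sketched below. The case study gives no replay rates, so every input here (logs per day, seconds per log) is a hypothetical value that you would replace with your own measurements; the function names are also illustrative.

```python
def estimated_replay_minutes(avg_logs_per_day, days_to_replay, seconds_per_log):
    """Estimate log replay time from a measured per-log replay cost."""
    return avg_logs_per_day * days_to_replay * seconds_per_log / 60


def total_restore_minutes(db_size_gb, restore_gb_per_min, replay_minutes):
    """Database copy time plus log replay time."""
    return db_size_gb / restore_gb_per_min + replay_minutes


# Hypothetical inputs: 1,200 logs per day, one day of logs, 0.5 s per log.
replay = estimated_replay_minutes(1200, 1, 0.5)
total = total_restore_minutes(50, 2.0, replay)
```

Tracking the average number of logs per day makes it possible to re-run this estimate as mailboxes are consolidated onto a server, and to see when replay time starts to threaten the restore SLA.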

Using MOM to Monitor Backup and Restore

The MOM Management Pack for Exchange Server 2003 provides OTG with a complete monitoring solution to help maintain servers running Exchange. MOM enables OTG operators to manage large numbers of servers running Exchange from a central console, facilitating rapid failure detection and reducing time to resolution. The management pack helps detect possible Exchange service outages and alerts operators to them.

For example, OTG runs a custom processing rule that checks the application event logs each evening between 8 P.M. and 8:10 P.M. for the events that indicate scheduled backups have started. If an expected event is missing, operators are immediately alerted. OTG also has a consolidation rule that counts successful backups over a given period of time. If the number is lower than expected, an alert is issued to the operator to investigate. A standard MOM rule sends an alert if a transaction log is older than 24 hours, indicating a possible backup failure.
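The logic of these three checks can be sketched outside MOM as plain functions. This is not MOM rule syntax: the event labels "BACKUP_STARTED" and "BACKUP_COMPLETED" are hypothetical stand-ins for whatever application log events your backup job writes, and events are assumed to be (timestamp, label) pairs.

```python
from datetime import datetime, time, timedelta

WINDOW_START = time(20, 0)   # 8 P.M.
WINDOW_END = time(20, 10)    # 8:10 P.M.


def backup_start_missing(events, day):
    """True if no backup-start event was logged in the evening window on day."""
    for stamp, label in events:
        if (label == "BACKUP_STARTED" and stamp.date() == day
                and WINDOW_START <= stamp.time() <= WINDOW_END):
            return False
    return True


def too_few_backups(events, since, expected):
    """True if fewer successful backups than expected were logged since a time."""
    completed = sum(1 for stamp, label in events
                    if label == "BACKUP_COMPLETED" and stamp >= since)
    return completed < expected


def oldest_log_is_stale(oldest_log_time, now, max_age=timedelta(hours=24)):
    """True if the oldest transaction log is older than the allowed age."""
    return now - oldest_log_time > max_age
```

The first check alerts on the absence of an event rather than its presence, which catches backups that never started and therefore never logged a failure.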

When using the MOM MP to monitor the cluster, OTG recommends that you disable event log replication so that you can prevent duplicate alerts from the physical cluster nodes.

OTG recommends that you monitor disk-to-disk backup completion times to ensure that any scheduled user moves are not adversely impacted, especially during server consolidation. Do not move users from a database while it is being backed up.

Future Plans

A new feature of Windows Server 2003 called Volume Shadow Copy service provides data snapshot functionality for file servers, based on either the local file system or vendor-specific storage. Shadow Copy enables end users to recover file share data themselves, which reduces help-desk calls. As of this writing, OTG uses Shadow Copy primarily for file servers.

OTG is testing, in the Exchange environment, the operational benefits of Volume Shadow Copy service as a possible solution for "snap and clone" integration. This will support recovery of large amounts of data in minutes. Testing of "snap and clone" requestor and provider functionality for Exchange Server 2003 databases at scale is ongoing as of this writing.

Benefits

Through a combination of Exchange Server 2003, the Windows Server System, third-party SAN technology, and faster hardware, OTG created a clustered design with two-stage backups that enabled a greater level of flexibility to back up and restore data when needed, but which had minimal impact on users. OTG experienced the following additional benefits:

  • Reduced the impact to users (formerly 6 or more hours per user) of a database restore operation
  • Eliminated the need for dedicated restore servers to handle single mailbox restore operations
  • Helped maintain the 99.9% service availability requirement
  • Improved backup and restore times to less than one hour
  • Doubled the user mailbox limit (to 200 MB)
  • Enabled server consolidation by hosting many more mailboxes per server — from 1,500 to 2,700 for the regional design, and from 3,000 to 4,000 for the headquarters design

These benefits, in turn, enabled OTG to reduce the number of smaller regional mailbox servers by roughly 75% (from 115 to 29), which reduced operational costs.

For More Information

For more information about Microsoft products or services, call the Microsoft Sales Information Center at (800) 426-9400. In Canada, call the Microsoft Canada Information Centre at (800) 563-9048. Outside the 50 United States and Canada, please contact your local Microsoft subsidiary. To access information via the World Wide Web, go to:

http://www.microsoft.com/

http://www.microsoft.com/technet/itsolutions/msit/default.mspx/

© 2003 Microsoft Corporation. All rights reserved.

This case study is for informational purposes only. MICROSOFT MAKES NO WARRANTIES, EXPRESS OR IMPLIED, IN THIS SUMMARY. Microsoft, Windows, Windows NT, Windows Server, and Windows Server System are either registered trademarks or trademarks of Microsoft Corporation in the United States and/or other countries. The names of actual companies and products mentioned herein may be the trademarks of their respective owners.
