Messaging Backup and Restore at Microsoft
Technical Case Study
Published: June 16, 2003
Microsoft's internal IT department, part of the Operations and Technology Group
(OTG), requires a reliable backup and restore solution to support messaging server
operations. With the improved features in Microsoft Exchange 2003 and the Microsoft
Windows Server System, OTG created a new, more efficient design for servers running
Exchange and a two-step backup solution that minimizes user downtime and provides
more flexible backup and restore operations.
Situation
As part of Microsoft's internal deployment of Microsoft Exchange 2003 and the Microsoft
Windows Server System, Microsoft IT began to consolidate messaging servers to reduce
operating costs. High availability requirements for the clustered server redesign
prompted Microsoft IT to examine its backup and restore solution.
Solution
Microsoft IT implemented a two-step backup solution to improve server performance,
and support many more mailboxes per server. Microsoft IT also implemented the Exchange
2003 Recovery Storage Group (RSG) feature that minimizes user downtime during a
database recovery. The new solution supports server consolidation with high availability
while providing greater backup and restore flexibility.
Products & Technologies
- Microsoft Exchange Server 2003
- Microsoft Windows Server System
- Veritas Network Storage Executive (NSE)
- HP StorageWorks Enterprise Virtual Array 5000 storage area network (SAN) solution
- Microsoft Operations Manager (MOM) Exchange 2003 Management Pack (MP)
Benefits
- Increased server availability during single node failures
- Reduced operating costs from server consolidation
- Minimized impact to users during backup using an alternate passive node in
the clustered configuration
- Flexible recovery solution allowed users to continue working while restoration
of mailbox/database was done offline and merged later
This case study is intended for enterprise technical decision makers and messaging
operations personnel. A clear understanding of Exchange, backup procedures, and
cluster technology is assumed.
Introduction
E-mail has become a vital part of today's communication infrastructure because it
connects people with knowledge. However, IT departments are challenged to support
this growing technology in a timely and cost-effective manner. Like any other enterprise
IT department, OTG requires a solid backup and restore solution to keep messaging
services up and running.
As of this writing, OTG is in the process of consolidating regional mailbox servers
and locations. Increased numbers of users per server and much more data on disks
present increased risks in the event of failure. The messaging group in OTG measures
mailbox service availability as the product of downtime and the number of users affected.
For example, a one-minute outage affecting 1,000 users is measured as 1,000 minutes
of downtime.
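The user-weighted downtime measure described above can be sketched in a few lines of Python. The function name is an assumption made for illustration; only the one-minute, 1,000-user example comes from the text.

```python
def downtime_user_minutes(outage_minutes: float, users_affected: int) -> float:
    """Return downtime weighted by the number of users affected."""
    return outage_minutes * users_affected


# The example from the text: a one-minute outage affecting 1,000 users.
print(downtime_user_minutes(1, 1000))  # 1000
```

Weighting by users makes an outage on a consolidated server count for more than the same outage on a small server, which is why consolidation raises the stakes for backup and restore.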
OTG uses a two-stage backup process (disk-to-disk and disk-to-tape) to manage this
risk. This eliminates server performance impact from tape backup, and provides greater
flexibility. The solution is based on a combination of:
- Exchange Server 2003
- Microsoft Windows Server™ 2003, Enterprise Edition
- Windows NT® Backup for disk-to-disk backup
- Veritas Storage Management solution for disk-to-tape backup
- Hewlett-Packard's StorageWorks Storage Area Network (SAN) solution
- Microsoft Operations Manager Management Pack for Exchange Server 2003
The two-step backup operation and new clustered design offer many benefits, including:
- Flexibility to support an increase in the number of supported mailboxes per server
  - From 1,500 to 2,700 mailboxes per virtual server in the regional locations
  - From 3,000 to 4,000 mailboxes per virtual server in the headquarters data center
    implementation
- Supports doubling of user mailbox limits, from 100 megabytes (MB) to 200 MB, while
maintaining a one-hour service level agreement (SLA) per database for both backup
and restore operations
- Reduces operating costs by consolidating 115 smaller regional mailbox servers to
29 mailbox servers, and eliminating dedicated servers for database restore operations
- Helps improve the existing service uptime SLA requirement
  - 99.9% for legacy standalone servers
  - 99.99% for new clustered servers
Situation
To put the scale of the situation in context, OTG is responsible for maintaining
more than 150,000 computers worldwide. At the time of this writing, OTG supports
190 servers running Exchange Server 2003 on Windows Server 2003 Enterprise Edition,
distributed over 75 locations worldwide. Of these servers, 115 are mailbox servers
hosting more than 82,000 mailboxes.
The corporate e-mail infrastructure at Microsoft, as of June 2003, consists
of:
- Global mail flow of 6,500,000 messages per day, with 2,500,000 average Internet
mail messages per day
- 20 databases per server, with 50 gigabyte (GB) maximum database size on new clustered
deployments
- Global service availability of 99.9% with a target to achieve 99.99% on clustered
designs
- Worldwide mail delivery in less than 5 minutes, 90% of the time
- Backup and restore operation SLA of less than one hour per database
In the past, it was challenging to maintain the one-hour backup and restore SLA on
direct-attached SCSI server implementations. These server designs used a one-step backup
process (disk-to-tape), where backups were performed to tape devices over the 100
megabit-per-second (Mbps) LAN, which limited throughput (averaging 300-400 MB per minute).
Backups were limited to non-business hours to minimize any impact to clients with
mailboxes hosted on these servers. Recovering a corrupted mailbox store could affect
up to 1,000 mailboxes, causing an extended loss of productivity for the users hosted
on that server, at an estimated cost of $60-$80 per hour per user. Single mailbox
restore operations required dedicated servers.
This configuration is shown in Figure 1.
Figure 1: Old regional messaging environment
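To see why the one-hour SLA was hard to meet on the legacy design, the following Python sketch computes how much data fits into a backup window at the quoted tape throughput. It assumes 1 GB = 1,000 MB for simplicity; the function name is illustrative.

```python
def max_data_in_window_gb(rate_mb_per_min: float, window_minutes: float = 60) -> float:
    """Return the amount of data (in GB) that can be moved within the window.

    Assumes 1 GB = 1,000 MB, which is a simplification for this sketch.
    """
    return rate_mb_per_min * window_minutes / 1000


# Legacy LAN-based tape throughput quoted in the text: 300-400 MB per minute.
print(max_data_in_window_gb(300))  # 18.0
print(max_data_in_window_gb(400))  # 24.0
```

At these rates a one-hour window covers roughly 18 to 24 GB, well below the 50 GB maximum database size used on the new clustered deployments.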
Backup Solution
To solve these problems and support server consolidation, OTG uses a flexible two-stage
process to back up data within a multinode clustered configuration: disk-to-disk
(stage 1) and disk-to-tape (stage 2).
The first stage backup runs on all active nodes within the cluster to complete an
online backup to a dedicated pool of disks with capacity to support two-day online
retention. With the clustered implementation, the data from the first stage is then
moved using cluster functionality to an alternate passive node. This node transfers
the backup files to a pool of fiber-attached tape devices. This eliminates the performance
impact to users by offloading the second stage backup processing (copying the files
to tape) to the alternate passive node. This process is shown in Figure 2.
Figure 2: Two-step backup process
Recovery Solution
The method of recovery is based on the type of failure and a business decision between
restoring service and restoring data.
For example, if a single database must be restored on a cluster, up to 200 people
are affected. Because up to 2 days of backup data is available on disk and can be
restored online in less than an hour (restore rates of up to 2 GB/min can be achieved),
regular Exchange restore procedures are used to get users quickly back online with
their data.
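As a rough check of the single-database case, the following Python sketch estimates restore time from database size and restore rate and compares it with the one-hour SLA. The 50 GB size and the 2 GB/min rate come from this case study; the function names are illustrative, and the rate is the quoted best case, so real restores may take longer.

```python
SLA_MINUTES = 60  # one-hour backup and restore SLA per database


def restore_minutes(database_gb: float, rate_gb_per_min: float) -> float:
    """Estimate the time to copy a database back from disk, ignoring log replay."""
    return database_gb / rate_gb_per_min


def meets_sla(database_gb: float, rate_gb_per_min: float) -> bool:
    """Return True if the estimated restore time fits within the SLA."""
    return restore_minutes(database_gb, rate_gb_per_min) <= SLA_MINUTES


# A 50 GB database at the best-case disk restore rate of 2 GB/min.
print(restore_minutes(50, 2))  # 25.0
print(meets_sla(50, 2))        # True
```

The estimate leaves out transaction log replay, which adds to the total recovery time.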
If an entire storage group (5 databases) fails, as many as 1,000 users may be affected.
Restoring large amounts of data during business hours would prevent user access
to e-mail for the period of restoration. To quickly restore service, OTG can use
a new Exchange 2003 feature called the Recovery Storage Group (RSG). OTG mounts
an empty "stub" mailbox that enables users to receive and send new mail, but without
access to old mail. Service is quickly restored because users work from this stub
mailbox while OTG restores the databases to the RSG. When the restoration is complete,
OTG can merge the restored data with the new data from the "stub" using the Microsoft
Exchange Mailbox Merge Wizard (also called ExMerge). This method is also used for
the legacy non-clustered servers that are currently being consolidated, because their
restore speed is restricted to LAN-based tape.
Server Design
OTG's clustered design supports a significant increase in the number and size of
mailboxes per Exchange server. It helps eliminate performance impact to users during
the second stage backup process because it offloads the process to non-active servers
within the cluster, thereby maintaining the existing SLA.
OTG's design goal was to support 8,000 mailboxes per SAN, with 200 MB mailbox limits,
99.99% cluster server availability, and less than one hour per database backup and
restore time.
Regional Design
The server specification for the regional cluster implementation consists of one
Enterprise Virtual Array (EVA), three active nodes, one primary passive node, and
one alternate passive node (AAAPp).
Headquarters Design
The headquarters clustered implementation is similar in design. It consists of two
EVAs per cluster, with four active nodes, one primary passive node, and two alternate
passive nodes for streaming backups from disk to tape (AAAAPpp). These configurations
are shown in Table 1.
Table 1. Cluster Design Specifications
| Specification | Regional | Headquarters |
| 4 CPU Active Nodes | 3 | 4 |
| 4 CPU Passive Nodes | 1 | 1 |
| 2 CPU Alternate Passive Nodes | 1 | 2 |
| EVA | 1 | 2 |
| Mailboxes per cluster | 8,000 | 16,000 |
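The figures in Table 1 can be captured in a small data structure to derive the node layout strings and the mailbox load per active node. This is a Python sketch; the class and field names are assumptions made for illustration, while the numbers are taken from the table.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ClusterDesign:
    name: str
    active_nodes: int
    passive_nodes: int
    alternate_passive_nodes: int
    evas: int
    mailboxes: int

    @property
    def layout(self) -> str:
        """Return the node layout in the A/P/p notation used in this case study."""
        return (
            "A" * self.active_nodes
            + "P" * self.passive_nodes
            + "p" * self.alternate_passive_nodes
        )

    @property
    def mailboxes_per_active_node(self) -> float:
        """Return the average number of mailboxes hosted by each active node."""
        return self.mailboxes / self.active_nodes


REGIONAL = ClusterDesign("Regional", 3, 1, 1, 1, 8000)
HEADQUARTERS = ClusterDesign("Headquarters", 4, 1, 2, 2, 16000)

for design in (REGIONAL, HEADQUARTERS):
    print(design.name, design.layout, round(design.mailboxes_per_active_node))
```

The regional figure works out to about 2,667 mailboxes per active node, which matches the rounded 2,700 per virtual server quoted earlier, and the headquarters figure is 4,000.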
The passive node(s) provide two benefits: a failover solution and an independent
resource for the backup process. The primary passive node supports peak period failover.
The alternate passive node(s) are primarily used to transfer backups to tape, which
allows the active nodes to focus solely on serving users. The alternate passive
nodes also provide off-peak failover support to speed up the process for rolling
upgrades or patch application. The 16,000 mailbox headquarters implementation design
is shown in Figure 3.
Figure 3: Clustered Exchange 2003 design
Best Practices
Two-Step Backup
Two-step backup has proved beneficial, especially during the second stage (disk-to-tape).
In the past, both stages were managed by the same server running Exchange, which
impacted server performance during the disk-to-tape phase. Now, using clustering
improvements in Exchange and Windows®, OTG is able to offload stage two from
an active node to an alternate passive node, which minimizes user impact and allows
second stage backups to occur at any time, if necessary.
Backup Throughput Adjustment
OTG discovered a way to more than double disk-to-disk backup rates using the Windows
Backup utility through a registry adjustment. The adjustment increased the throughput
on average from 600 MB/min to 1,200 MB/min per storage group. The adjustment is
set in the registry hive (HKEY_CURRENT_USER) of the user profile that is used to
execute the backup script.
OTG runs two concurrent backup jobs per active Exchange instance, providing an aggregate
rate of around 2.4 GB per minute per server, with two to three servers per SAN enclosure
(depending on headquarters or regional design). OTG has observed a maximum throughput,
without excessive read and write disk latencies, of approximately 6.3 GB/min per
SAN enclosure. Throughput depended on Logical Unit Number (LUN) distribution
across controllers, with the Data, Log, and Backup LUNs for each storage group assigned
to a single controller.
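The aggregate rates quoted above follow from simple multiplication, as this Python sketch shows. It assumes 1 GB = 1,000 MB, and the constant and function names are illustrative.

```python
PER_STORAGE_GROUP_MB_PER_MIN = 1200  # tuned disk-to-disk rate per storage group
CONCURRENT_JOBS_PER_SERVER = 2       # two backup jobs per active Exchange instance


def nominal_enclosure_rate_gb_per_min(servers: int) -> float:
    """Return the nominal combined backup rate for the servers on one SAN enclosure.

    Assumes 1 GB = 1,000 MB and that every server sustains the full tuned rate.
    """
    total_mb = servers * CONCURRENT_JOBS_PER_SERVER * PER_STORAGE_GROUP_MB_PER_MIN
    return total_mb / 1000


print(nominal_enclosure_rate_gb_per_min(1))  # 2.4 (one server)
print(nominal_enclosure_rate_gb_per_min(2))  # 4.8 (two servers per enclosure)
print(nominal_enclosure_rate_gb_per_min(3))  # 7.2 (three servers per enclosure)
```

The nominal three-server figure of 7.2 GB/min is higher than the observed maximum of about 6.3 GB/min per enclosure, so the per-server rate cannot be assumed to scale linearly when all three servers back up at once.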
The configuration used for optimized throughput is:
- Storage Groups 1 and 2 - Data, Log and Backup on Controller one
- Storage Groups 3 and 4 - Data, Log and Backup on Controller two
- Job concurrency limited to two per server, with SG1 and SG3 run concurrently, followed
by SG2 and SG4
- RAID
  - Target LUNs for backup were Vraid5
  - All Vraid5 LUNs with write-back cache disabled
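The job ordering in the list above pairs one storage group from each controller, so that the two concurrent jobs never compete for the same controller. A minimal Python sketch of that pairing follows; the mapping and function name are assumptions for illustration and do not reproduce OTG's backup script.

```python
# Controller assignment from the list above: SG1 and SG2 on controller one,
# SG3 and SG4 on controller two.
CONTROLLER_OF = {"SG1": 1, "SG2": 1, "SG3": 2, "SG4": 2}


def backup_waves(controller_of: dict[str, int]) -> list[list[str]]:
    """Group storage groups into waves with one job per controller in each wave."""
    by_controller: dict[int, list[str]] = {}
    for storage_group, controller in controller_of.items():
        by_controller.setdefault(controller, []).append(storage_group)
    # zip pairs the n-th storage group of each controller into the n-th wave.
    return [list(wave) for wave in zip(*by_controller.values())]


print(backup_waves(CONTROLLER_OF))  # [['SG1', 'SG3'], ['SG2', 'SG4']]
```

Each wave contains two jobs, which matches the concurrency limit of two per server.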
Vraid1 targets would provide better throughput and are under consideration at
the time of this writing as an option with the 146 GB disks for the first stage
backup (disk to disk).
Server Design
OTG implemented a multi-node cluster design based on active and primary passive
servers with four Hyper-Threading 1.9 gigahertz (GHz) processors and 4 GB of RAM.
These servers run Windows Server 2003, Enterprise Edition and Exchange Server 2003
with the following modifications:
- /3GB switch set in the Boot.ini file
- /USERVA=3030 parameter set in the Boot.ini file
- SystemPages set to 0
- Mount Points are used to mitigate drive letter limitations for supporting the log,
SMTP and backup drives
- Backup disks are maintained in a separate cluster resource group to allow independent
LUN movement between cluster nodes
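A quick way to confirm that a server carries the Boot.ini modifications listed above is to scan the operating system entry for the required switches. The Python sketch below does this for a single entry. The sample entry is illustrative only and is not copied from an OTG server; the function name is also an assumption.

```python
REQUIRED_SWITCHES = ("/3GB", "/USERVA=3030")


def missing_boot_switches(boot_entry: str, required=REQUIRED_SWITCHES) -> list[str]:
    """Return the required switches that are absent from a Boot.ini entry."""
    # Boot.ini switches are separated by spaces and are not case-sensitive.
    present = {token.upper() for token in boot_entry.split()}
    return [switch for switch in required if switch.upper() not in present]


# Illustrative Boot.ini operating system entry (hypothetical path and title).
entry = (
    r'multi(0)disk(0)rdisk(0)partition(1)\WINDOWS='
    r'"Windows Server 2003, Enterprise" /fastdetect /3GB /USERVA=3030'
)
print(missing_boot_switches(entry))                    # []
print(missing_boot_switches("/fastdetect /3gb"))       # ['/USERVA=3030']
```

The SystemPages setting is a registry value rather than a Boot.ini switch, so it is outside the scope of this check.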
Each Exchange Virtual Server (EVS) hosts four storage groups of five databases each,
for a total of twenty databases of about 50 GB each per EVS. Each database is configured
with a 200 MB mailbox limit. This design allows a maximum of 200 mailboxes per database,
which equals 4,000 mailboxes per EVS and 16,000 mailboxes in total across the four
active nodes of the headquarters cluster.
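The sizing arithmetic in this paragraph can be checked with a short Python calculation. All input values come from this case study (four storage groups, five databases per storage group, 200 mailboxes per database, 200 MB mailbox limit, four active nodes at headquarters); the variable names are illustrative, and 1 GB = 1,000 MB is assumed.

```python
STORAGE_GROUPS_PER_EVS = 4
DATABASES_PER_STORAGE_GROUP = 5
MAILBOXES_PER_DATABASE = 200
MAILBOX_LIMIT_MB = 200
ACTIVE_EVS_IN_HQ_CLUSTER = 4

databases_per_evs = STORAGE_GROUPS_PER_EVS * DATABASES_PER_STORAGE_GROUP
mailboxes_per_evs = databases_per_evs * MAILBOXES_PER_DATABASE
mailboxes_per_cluster = mailboxes_per_evs * ACTIVE_EVS_IN_HQ_CLUSTER

# Size of a database in which every mailbox is at its limit (1 GB = 1,000 MB).
# This ignores any database overhead beyond the mailbox contents themselves.
full_database_gb = MAILBOXES_PER_DATABASE * MAILBOX_LIMIT_MB / 1000

print(databases_per_evs)      # 20
print(mailboxes_per_evs)      # 4000
print(mailboxes_per_cluster)  # 16000
print(full_database_gb)       # 40.0
```

With every mailbox at its limit, a database holds about 40 GB of mailbox data, which leaves headroom under the 50 GB maximum database size.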
The alternate passive node configuration is a less expensive server with dual 2.4
GHz CPUs and 2 GB of RAM. These nodes are used to transfer the backup files from step
one to a fiber-connected LTO-1 Tape Silo. This part of the process is managed with
Veritas Network Storage Executive.
Manage Transaction Logs
Increasing the number of mailboxes per server also increases the number of transaction
logs per server. The time it takes to replay transaction logs significantly impacts
the time it takes to restore a server. Calculate the time to replay the logs, monitor
the average number of logs per day, and then adjust recovery plans accordingly.
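The planning step described above can be sketched as a simple estimate that adds log replay time to the database copy time. In this Python sketch the replay rate and log counts are hypothetical placeholders that would have to be measured on the actual hardware; only the 50 GB database size and the 2 GB/min restore rate come from this case study.

```python
def replay_minutes(log_count: int, logs_replayed_per_minute: float) -> float:
    """Estimate the time needed to replay the given number of transaction logs."""
    return log_count / logs_replayed_per_minute


def total_recovery_minutes(
    database_gb: float,
    restore_gb_per_min: float,
    log_count: int,
    logs_replayed_per_minute: float,
) -> float:
    """Estimate total recovery time: database copy plus transaction log replay."""
    copy_time = database_gb / restore_gb_per_min
    return copy_time + replay_minutes(log_count, logs_replayed_per_minute)


# Hypothetical inputs: 2,000 logs to replay at a measured 100 logs per minute.
print(total_recovery_minutes(50, 2, log_count=2000, logs_replayed_per_minute=100))  # 45.0
```

Tracking the average number of logs generated per day makes it possible to rerun the estimate as mailbox counts grow and to see when the total approaches the one-hour SLA.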
Using MOM to Monitor Backup and Restore
The MOM Management Pack for Exchange Server 2003 provides OTG with a complete monitoring
solution to help maintain servers running Exchange. MOM enables OTG operators to
manage large numbers of servers running Exchange from a central console, facilitating
rapid failure detection and reducing time to resolution. The management pack helps
detect possible Exchange service outages and alert operators to them.
For example, OTG runs a custom processing rule that checks the application event
logs each evening between 8 P.M. and 8:10 P.M. for the events that indicate
scheduled backups have started. If an expected event is missing, operators are immediately
alerted. OTG also has a consolidation rule that counts the number of successful backups
over a given period of time. If the number is lower than expected, an alert is issued
to the operator to investigate. A standard MOM rule sends an alert if a transaction
log is older than 24 hours, indicating a possible backup failure.
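The logic behind two of these rules can be illustrated in plain Python. This is not MOM rule syntax, and it does not read real event logs; the function names and the representation of events as timestamps are assumptions made for this sketch.

```python
from datetime import date, datetime, time, timedelta


def backup_start_missing(
    event_times: list[datetime],
    day: date,
    window_start: time = time(20, 0),
    window_end: time = time(20, 10),
) -> bool:
    """Return True if no backup-start event was logged in the evening window."""
    start = datetime.combine(day, window_start)
    end = datetime.combine(day, window_end)
    return not any(start <= logged <= end for logged in event_times)


def log_is_stale(
    oldest_log_time: datetime, now: datetime, max_age: timedelta = timedelta(hours=24)
) -> bool:
    """Return True if the oldest transaction log is older than the allowed age."""
    return now - oldest_log_time > max_age


# Hypothetical timestamps for illustration.
events = [datetime(2003, 6, 16, 20, 3)]
print(backup_start_missing(events, date(2003, 6, 16)))  # False: backup started on time
print(backup_start_missing(events, date(2003, 6, 17)))  # True: raise an alert
print(log_is_stale(datetime(2003, 6, 15, 9, 0), datetime(2003, 6, 16, 20, 0)))  # True
```

A transaction log older than 24 hours suggests that the last backup did not complete, because a successful full backup normally truncates the logs.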
When using the MOM MP to monitor the cluster, OTG recommends that you disable event
log replication so that you can prevent duplicate alerts from the physical cluster
nodes.
OTG recommends that you monitor disk-to-disk backup completion times to ensure that
any scheduled user moves are not adversely impacted, especially during server consolidation.
Do not move users from a database while it is being backed up.
Future Plans
A new feature of Windows Server 2003 called Volume Shadow Copy service provides
data snapshot functionality for file servers, based on either the local file system
or vendor-specific storage. Shadow Copy enables end users to perform self-help recovery
of file share data, which saves on help-desk calls. As of this writing, OTG uses Shadow
Copy primarily for file servers.
OTG is testing, in the Exchange environment, the operational benefits of Volume
Shadow Copy service as a possible solution for "snap and clone" integration. This
would support recovery of large amounts of data in minutes. Testing of "snap and
clone" requestor and provider functionality for Exchange Server 2003 databases at
scale is ongoing as of this writing.
Benefits
Through a combination of Exchange Server 2003, the Windows Server System, third-party
SAN technology, and faster hardware, OTG created a clustered design with two-stage
backups that enabled a greater level of flexibility to back up and restore data
when needed, but which had minimal impact on users. OTG experienced the following
additional benefits:
- Reduced the impact to users (formerly 6 or more hours per user) of a database restore
operation
- Eliminated the need for dedicated restore servers to handle single mailbox restore
operations
- Helped maintain the 99.9% service availability requirement
- Improved backup and restore times to less than one hour
- Doubled the user mailbox limit (to 200 MB)
- Enabled server consolidation by hosting many more mailboxes per server: from 1,500
to 2,700 for the regional design, and from 3,000 to 4,000 for the headquarters design
These benefits, in turn, enabled OTG to reduce the number of smaller regional mailbox
servers by about 75% (from 115 to 29), which reduced operational costs.
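The consolidation figure can be reproduced from the server counts given in the Introduction (115 regional mailbox servers consolidated to 29). A short Python check, with an illustrative function name:

```python
def reduction_percent(before: int, after: int) -> float:
    """Return the percentage reduction from the original count to the new count."""
    return (before - after) / before * 100


# 115 regional mailbox servers consolidated to 29.
print(round(reduction_percent(115, 29), 1))  # 74.8
```

The result is a reduction of roughly three quarters in the number of regional mailbox servers.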
For More Information
For more information about Microsoft products or services, call the Microsoft Sales
Information Center at (800) 426-9400. In Canada, call the Microsoft Canada Information
Centre at (800) 563-9048. Outside the 50 United States and Canada, please contact
your local Microsoft subsidiary. To access information via the World Wide Web, go
to:
http://www.microsoft.com/
http://www.microsoft.com/technet/itsolutions/msit/default.mspx/
© 2003 Microsoft Corporation. All rights reserved.
This case study is for informational purposes only. MICROSOFT MAKES NO WARRANTIES,
EXPRESS OR IMPLIED, IN THIS SUMMARY. Microsoft, Windows, Windows NT, Windows Server,
and Windows Server System are either registered trademarks or trademarks of Microsoft
Corporation in the United States and/or other countries. The names of actual companies
and products mentioned herein may be the trademarks of their respective owners.