Best Practices for Using Volume Shadow Copy Service with Exchange Server 2003

Article
01/24/2008

Microsoft® Exchange Server 2003 uses the Volume Shadow Copy Service (VSS) that is included in the Microsoft Windows Server™ 2003 operating system to take volume shadow copies of Exchange Server 2003 databases and transaction log files. By using VSS, you might be able to restore databases within minutes, regardless of the database size. This fast restore capability depends primarily on the capabilities of the provider component of the VSS solution.

Because many VSS strategies are available, you must understand and test the capacity, performance, and recovery implications of the solutions in order to make sure that you have the data that you need to succeed in your deployment. You must also make sure that any potential solution is operating in the VSS framework. This article provides information about choosing, testing, deploying, and monitoring a VSS solution for Exchange Server 2003.

What Is VSS?

VSS is a set of COM APIs that implements a framework that enables volume backups to be performed while applications on a system continue to write to the volumes. Requestors, writers, and providers communicate in the VSS framework to create and restore volume shadow copies. A shadow copy of a volume duplicates all the data held on that volume at one well-defined instant in time.

The backup process includes the following steps:

The requestor initiates the backup process. The requestor instructs the writer to prepare a data set for backup.
The writer prepares the data for backup. Exchange Server 2003 and other applications implement writers that prepare data according to the specific requirements of the application. After the data set is ready, the writer signals the requestor to back up the data set.
The provider interacts with the disk system and manages shadow copies. When instructed by the requestor, the provider creates a shadow copy.
The requestor signals backup success or failure to the writer, and completes the backup process.

By separating the functionality of requestors, writers, and providers, the VSS framework makes each component independent of the others. A single requestor can interact with different providers or with multiple writers. For more information about requestors, writers, and providers, see Basic VSS Concepts.

The Exchange writer is automatically installed with Exchange Server 2003. Requestors can access the Exchange writer only if Exchange Server 2003 is installed on the Windows Server 2003 operating system. VSS backups are not available for Exchange Server 2003 if Exchange Server 2003 is installed on Microsoft Windows® 2000 Server.

When instructed to do so by a requestor, the Exchange writer prepares Exchange databases for backup. The writer does this by suspending all disk write I/O to the databases for up to 20 seconds. This is referred to as freezing the databases. The provider must be able to complete the shadow copy within this window or the backup will be aborted. After backup finishes, the writer thaws the databases and resumes regular I/O operations.
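The sequence described above can be sketched as a small simulation. This is a toy model only and does not call the real VSS COM APIs. The function name, the event strings, and the idea of passing the provider's copy time as a number are assumptions made for illustration. The 20-second limit is the freeze window cited in the preceding paragraph.

```python
# Toy model of the backup sequence; it does not use the real VSS COM API.
FREEZE_WINDOW_SECONDS = 20  # freeze limit cited in the text above


def run_shadow_copy_backup(provider_copy_seconds):
    """Return (succeeded, events) for one simulated shadow copy backup."""
    events = [
        "requestor: ask writer to prepare the data set",
        "writer: freeze databases (suspend write I/O)",
    ]
    # The provider must finish while the databases are still frozen.
    if provider_copy_seconds <= FREEZE_WINDOW_SECONDS:
        events.append("provider: shadow copy created")
        succeeded = True
    else:
        events.append("provider: exceeded freeze window, backup aborted")
        succeeded = False
    # The writer always thaws, whether or not the shadow copy succeeded.
    events.append("writer: thaw databases (resume write I/O)")
    events.append("requestor: report " + ("success" if succeeded else "failure"))
    return succeeded, events
```

For example, run_shadow_copy_backup(5) models a provider that finishes inside the window, and run_shadow_copy_backup(25) models one that does not.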

Note

Windows Backup for Windows Server 2003 can use the default software-based Windows VSS provider to perform generic VSS backups of disk volumes and files. However, Windows Backup cannot communicate with the Exchange writer and should not be used to make VSS backups of Exchange database files. Several non-Microsoft backup applications implement requestors that work with the Exchange writer.

VSS Backup Methods

A provider can execute shadow copy requests in many different ways. Although the Exchange writer is not aware of how the provider is creating a shadow copy, make sure that you understand how the provider for your solution works so that you can plan for performance and capacity. Although no industry-standard definition or naming convention for shadow copy backup methods exists, the large majority of backup methods can be broadly categorized as either clone shadow copies or snapshot shadow copies.

Clone Shadow Copies

A clone shadow copy is a full copy of the volumes in a shadow copy set. A shadow copy set is a group of volume shadow copies that are synchronized at the same point in time.

Like an ordinary copy, a clone is independent of the original data. If all the original data is lost, the clone still persists intact. This differs from a snapshot, which is not fully independent of the original data. For more information about snapshots, see "Snapshot Shadow Copies" later in this article.

You must consider capacity planning when you use clones. To make sure that a restorable copy is available if a failure occurs during a backup, you must use an N+1 scheme, where N is the number of backup clones that you want to have available to restore at any time. For example, if you decide to have only one backup, you still need two (1+1) target clones to rotate between to prevent data loss if a failure occurs during a backup.
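The N+1 rule can be expressed as a short calculation. The following is a minimal sketch. The function name and the assumption that every clone is the same size as the production LUN are illustrative and might not match how a particular provider allocates space.

```python
def clone_capacity_gb(restorable_clones, lun_size_gb):
    """Return (target clones, total capacity in GB) for an N+1 rotation.

    restorable_clones is N, the number of clones that must always be
    available to restore. Each clone is assumed to be a full copy of
    the production LUN.
    """
    if restorable_clones < 1:
        raise ValueError("at least one restorable clone is required")
    target_clones = restorable_clones + 1  # the extra clone absorbs a failed backup
    return target_clones, target_clones * lun_size_gb
```

With one restorable backup of a 500 GB LUN, clone_capacity_gb(1, 500) gives two target clones and 1,000 GB of clone capacity.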

The provider vendor determines how a particular solution creates a clone.

Mirror Some solutions prepare mirror copies in advance. These mirror copies are then split off to take a backup, providing you with a read-only copy and a live production volume. This strategy has almost no effect on the production logical unit numbers (LUNs) during the backup and checksum integrity process. However, it does create significant I/O load on production LUNs before the backup.

You must make sure that you schedule time to resynchronize a clone that is no longer needed with the production LUN when you rotate through multiple clones. For restoration, the solution might resynchronize the read-only copy to the production LUN, which affects any other online storage groups that are using the same production LUN until all the data is copied. During restore, some storage arrays just change pointers to the read-only copy. This makes it writable.

Clone Some solutions create a clone at the time of backup, where all the data in the LUN must be copied to another LUN. That data is then marked read-only. This strategy can consume less capacity up front, but requires all data to be copied at the time of backup. With this strategy, you must know how many gigabytes per hour a particular storage controller can sustain, in addition to the effect on the production database LUNs during the copy. This enables you to design your LUNs correctly for maximum throughput, and to plan the timing of this operation to minimize the effect on the production LUNs.
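As a rough planning aid, the copy time for a clone follows directly from the LUN size and the sustained controller throughput. The following sketch assumes a constant throughput figure that you have measured under the production load that you expect. The function and parameter names are hypothetical.

```python
def clone_copy_hours(lun_size_gb, sustained_gb_per_hour):
    """Estimate the hours needed to copy a whole LUN to a clone."""
    if sustained_gb_per_hour <= 0:
        raise ValueError("throughput must be positive")
    return lun_size_gb / sustained_gb_per_hour


def fits_backup_window(lun_size_gb, sustained_gb_per_hour, window_hours):
    """Return True if the clone copy finishes inside the backup window."""
    return clone_copy_hours(lun_size_gb, sustained_gb_per_hour) <= window_hours
```

For example, a 600 GB LUN on a controller that sustains 200 GB per hour needs about three hours, so it fits a four-hour window but not a two-hour window.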

Snapshot Shadow Copies

The most important difference between a clone and a snapshot is that a snapshot is not fully independent of the original data. Generically, snapshots are created by defining a marker at a point in time and making sure that the data can be rolled back to that point in time. You can keep multiple snapshots, and snapshots typically require much less additional disk space than clones.

Snapshots can be created in several different ways. The most common method is called copy-on-write.

The copy-on-write method defines a snapshot at a point in time, and then monitors the original dataset for changes. If a change is made, the change is recorded or tracked in a separate location. Over time, therefore, the size of a snapshot can continue to grow, especially when a snapshot is made of a quickly changing dataset.

The snapshot manager presents different views of the dataset, typically as if they were different full backups of the data. The snapshot manager can also switch to any available view of the data on demand, thus, in a sense, restoring the data.

Remember that a snapshot is not actually an independent copy of the data. If the original data is destroyed, the snapshot data is useless because it contains only the recent changes to the data.

This backup method gives you a rollback mechanism, but not an actual backup of the data. The advantage to this backup method is that you are only writing the changes, instead of all the data, to disk, so that the actual creation of the snapshot can occur very quickly. A disadvantage is that you do not have a recoverable backup if your original data is corrupted.
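The dependency of a snapshot on the original data can be illustrated with a toy block device. This sketch assumes one common copy-on-write variant, in which the previous content of a block is saved the first time that block changes after the snapshot. Real providers differ in what they track and where they store it, and the class and method names here are hypothetical.

```python
class CopyOnWriteVolume:
    """Toy block device that supports a single copy-on-write snapshot."""

    def __init__(self, blocks):
        self.blocks = list(blocks)  # live (production) data
        self.preserved = None       # block index -> content at snapshot time

    def take_snapshot(self):
        # Nothing is copied at this point, which is why creation is fast.
        self.preserved = {}

    def write(self, index, data):
        # Save the old content the first time a block changes after the snapshot.
        if self.preserved is not None and index not in self.preserved:
            self.preserved[index] = self.blocks[index]
        self.blocks[index] = data

    def snapshot_view(self):
        if self.preserved is None:
            raise RuntimeError("no snapshot has been taken")
        # Unchanged blocks are read from the live volume, so the view
        # cannot be rebuilt if the live volume is lost.
        return [self.preserved.get(i, block) for i, block in enumerate(self.blocks)]
```

In this model the snapshot area grows only as blocks change, and the snapshot view reads every unchanged block from the live volume.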

Because a snapshot backup does not provide a true backup, most solutions implement an additional step that streams the snapshot backup to tape. Streaming to tape adds a significant sequential I/O load to your production database LUNs.

During regular operation, the I/O pattern for a disk hosting Exchange databases is very random, but the I/O pattern for a streaming backup is very sequential. Mixing sequential workloads with random workloads makes it harder for caching to continue to be efficient and can lead to excessive latency and reduce peak I/O throughput considerably. This is an important factor to consider if you plan to depend completely on snapshots as your backup source for Exchange Server 2003.

Many Exchange Server administrators plan to reduce the effect of this problem by performing streaming backups during off-peak hours. Although this can be an effective strategy, it might not be obvious what the off-peak hours actually are.

Besides responding to client requests, an Exchange database also requires time to perform online maintenance. This maintenance can be scheduled by the administrator, but frequently requires several hours a day to finish. Even if end-user load is low, the database might be busy with maintenance tasks. You must also consider the additional server load required to prepare or perform backups. As a best practice, avoid overlapping backup windows with online maintenance or peak user demand intervals.

To determine what peak hours are for database activity, you have to actually profile the load on your databases over a baseline period of at least several days.

Exchange Requestors and Checksum Integrity Verification

An Exchange database file is divided into a series of pages of equal size. Each of these pages contains a checksum that verifies the integrity of the Exchange data on that page. If any data on the page is changed outside the control of the Exchange server, for example, by a disk or controller error, verification of the checksum will detect this problem. Exchange transaction log files also implement a checksumming scheme, but one that is not page-based. Therefore, damage to transaction log files can also be detected.
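The idea of per-page checksums can be shown with a simplified model. The page layout, the 4,096-byte page size, and the use of CRC-32 in the following sketch are assumptions chosen for illustration. They are not the actual Exchange on-disk format or checksum algorithm.

```python
import zlib

PAGE_SIZE = 4096    # illustrative page size
CHECKSUM_BYTES = 4  # illustrative header: a CRC-32 of the rest of the page


def make_page(payload: bytes) -> bytes:
    """Build one page: a checksum header followed by the zero-padded payload."""
    body_size = PAGE_SIZE - CHECKSUM_BYTES
    if len(payload) > body_size:
        raise ValueError("payload does not fit in one page")
    body = payload.ljust(body_size, b"\0")
    return zlib.crc32(body).to_bytes(CHECKSUM_BYTES, "big") + body


def find_bad_pages(data: bytes) -> list:
    """Return the indexes of pages whose stored checksum does not match."""
    bad = []
    for offset in range(0, len(data), PAGE_SIZE):
        page = data[offset:offset + PAGE_SIZE]
        stored = int.from_bytes(page[:CHECKSUM_BYTES], "big")
        # A short trailing page is also reported as damaged.
        if len(page) != PAGE_SIZE or zlib.crc32(page[CHECKSUM_BYTES:]) != stored:
            bad.append(offset // PAGE_SIZE)
    return bad
```

A byte that is changed outside the control of the code that wrote the page no longer matches the stored checksum, so the page is reported as damaged.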

Microsoft supports a streaming backup API for performing backups of Exchange databases when they are running. The streaming API is implemented in Windows Backup on all versions of Windows and Exchange Server and in many non-Microsoft backup applications.

Note

You must install the Exchange administrator program on a computer that is running Windows Server in order to enable Windows Backup to perform Exchange online streaming API backups.

Essentially, a streaming backup copies a database to the backup media one page at a time, in order. During backup, the checksum of each page is verified, and the backup is completed successfully only if all pages in the database pass verification. Transaction log files that are part of the backup are also verified. This guarantees that the last backup of a particular database is a good one.

There is no opportunity to verify page integrity of a database or transaction logs as a shadow copy is created. Therefore, checksum integrity verification must be performed after the shadow copy has been created. Microsoft guidelines put responsibility for performing this verification on the requestor.

The requestor, or backup application, runs checksum integrity verification against the database and log files after the backup is completed. This is a very heavy streaming I/O load against the database and transaction log logical unit numbers (LUNs). The requestor performs the checksum integrity verification by running Exchange Server Database Utilities (Eseutil.exe). This reads the whole set of backup files to verify the individual integrity of each database page and transaction log file.

By default, Eseutil.exe runs as fast as the storage can read the data, and this is optimal for the typical clone, which is independent of the production LUNs. However, not all VSS backup sets are independent of the original data. For more information about different types of VSS backups, see “VSS Backup Methods” earlier in this article.

Sometimes, it may help to throttle the I/O rate of the checksum integrity verification by adding an artificial pause after a set number of I/Os. By using Exchange Server 2003 with Service Pack 2 (SP2), you can add the following switch to add a 1 second pause after a set number of I/Os:

/p<x>

Where x indicates the number of I/Os after which the pause occurs. For example, the following command adds an artificial 1 second pause after every 100 I/Os:

eseutil /K /p100

This I/O throttling is implemented only for database file verification, not for transaction log file verification.
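To judge how much a given throttle value lengthens verification, you can estimate the total pause time. The following sketch assumes that you already know approximately how many I/Os the verification issues for your database. That number depends on the database size and on the read size that Eseutil.exe uses, so treat it as a value to measure, not a known constant. The function name is hypothetical.

```python
def throttle_pause_seconds(total_ios, ios_per_pause):
    """Estimate the seconds added by a 1-second pause after every ios_per_pause I/Os."""
    if ios_per_pause < 1:
        raise ValueError("ios_per_pause must be at least 1")
    return total_ios // ios_per_pause
```

For example, if verification issues about 1,000,000 I/Os, a throttle value of 100 adds roughly 10,000 seconds (almost three hours) of pauses, and a value of 1000 adds roughly 1,000 seconds.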

You must carefully consider and plan for the I/O load created by checksum integrity verification when devising your backup procedures. This verification is an important part of the backup process and cannot be disregarded. However, you can defer the verification temporarily, subject to strict guidelines described in the Microsoft Knowledge Base article 822896, Exchange Server 2003 Data Backup and Volume Shadow Copy Services. This article provides a detailed description of the checksum integrity verification requirements that must be met by a backup requestor for it to be in compliance with Microsoft supportability recommendations.

A snapshot cannot be fully independent of the production LUNs. Therefore, running checksum verification for a snapshot must have an effect on the production LUNs. Checksum verification on a clone might or might not affect the production system, depending on where the clone is stored and how it is accessed.

You must carefully monitor the I/O load and effect of the verification process on both your end users and on ordinary database maintenance. Careful use of the Eseutil.exe throttling mechanism might also let you better balance verification performance with other I/O demands.

Considerations for Using VSS with Exchange Server 2003

For most administrators, the most important benefit of a VSS-based backup solution is that it allows for very rapid restoration of lots of data. VSS solutions are most useful for deployments that include large databases that require a restoration time of less than 60 minutes. This requirement is beyond the capabilities of current streaming or tape-based backup solutions. A VSS solution provides the following benefits:

Faster restore time
The ability to back up and restore larger amounts of data in a typical backup window than you can back up by using a traditional streaming online backup solution

A common misconception about VSS solutions is that they allow for backups to occur almost instantaneously and without an effect on a production server. This may be true from the point of view of an application; however, a VSS backup can require just as much underlying preparation and generate as much load as a streaming backup, especially when you are using clones. Backing up to disk and restoring to disk may give you more throughput and performance than using a tape-based solution. However, this does not change the fact that data must be copied from one location to another, regardless of the backup method chosen. With a VSS solution, this copy process can be optimized and scheduled, but the process must occur and copying lots of data necessarily consumes system resources.

Most production Exchange Server I/O involves many small, random I/O transactions to the databases. During backup and restore, the I/O throughput of your storage subsystem can become a bottleneck that artificially throttles your backup and restore speed. Make sure that you have sufficient throughput and load balancing to guarantee that you can meet your backup and restore needs.

Each Exchange Server 2003 storage group consists of up to five databases, transaction log files, and a checkpoint file. VSS considers both the database (*.edb) and streaming (*.stm) files as the database component, whereas the transaction logs (*.log) and checkpoint file (*.chk) are part of the log component.

If you use VSS for your backup solution, we recommend that you run the Windows Server 2003 operating system with Service Pack 1 (SP1). Contact your storage vendor to determine whether Windows Server 2003 with SP1 is supported. For information about a VSS update package that is available if you cannot upgrade to Windows Server 2003 with SP1, see the Microsoft Knowledge Base article 833167, A Volume Shadow Copy Service (VSS) Update Package Is Available for Windows Server 2003. For a list of additional hotfixes that you must apply if you are not running Windows Server 2003 with SP1, see “Appendix” later in this article.

You must make sure that any potential VSS solution for Exchange Server 2003 falls within the VSS framework and is a supported solution. For information about supported VSS solutions, see the Microsoft Knowledge Base article 822896, Exchange Server 2003 Data Backup and Volume Shadow Copy Services.

Running checksum integrity verification is an I/O-intensive and memory-intensive operation. We recommend that for stand-alone and clustered Exchange servers, you offload this work to a backup server that mounts and runs checksum integrity verification on the read-only shadow copy. When you can, it is always best to run the checksum integrity verification against shadow copies that are not hosted on the same physical disks as the production LUNs.

VSS Backup Types

You can use a full, copy, differential, or incremental backup type for your entire server or single storage group. For more information about VSS backup types, see Backup Operations.

Full Backup Use the full backup type for Exchange Server deployments. This backup type performs a backup of all the databases, transaction log files, and checkpoint files in a storage group, and after the backup is complete, truncates the log files.

Log file truncation is the process of deleting excess transaction log files that are not necessary to restore or roll forward the most recent backup. You must verify the checksum integrity of the most recent backup before log file truncation occurs. Truncation removes log files that are required to roll the system forward from a backup previous to the most recent backup. Although truncation does not invalidate previous backups, after truncation, you can restore a previous backup only to the point in time at which that backup was taken; you cannot roll it forward.

Copy Backup A copy backup performs the same steps as a full backup, but it does not truncate the transaction log files. You can use a copy backup to create a copy of the database for testing or analysis purposes.

Incremental Backup You must be running Exchange Server 2003 with Service Pack 1 (SP1) or a later version to use an incremental backup type. The incremental backup backs up the transaction logs to record changes that occurred since the last incremental or full backup, and then truncates the transaction logs. To restore from an incremental backup, you must first restore the last full backup, and then restore all the incremental backups. The incremental backup can give you a faster backup window, but it can increase the restore time and log replay time.

Differential Backup A differential backup type requires Exchange Server 2003 with SP1 or later. A differential backup backs up the transaction logs to record changes that occurred since the last full backup, and does not truncate the transaction logs. To restore from a differential backup, you must first restore the last full backup, and then the most current differential backup. The differential backup can give you a faster backup window, at the expense of capacity and restore time.
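The restore rules for the backup types above can be sketched as a function that selects which backups to restore. The list-of-strings representation and the function name are assumptions for illustration. The sketch deliberately rejects schedules that mix incremental and differential backups after the same full backup, because the text above does not describe that case.

```python
def restore_chain(history):
    """Return the indexes in history (oldest first) to restore, in order.

    history is a list of "full", "copy", "incremental", or "differential".
    """
    fulls = [i for i, kind in enumerate(history) if kind == "full"]
    if not fulls:
        raise ValueError("no full backup to restore from")
    last_full = fulls[-1]
    later = list(enumerate(history))[last_full + 1:]
    incrementals = [i for i, kind in later if kind == "incremental"]
    differentials = [i for i, kind in later if kind == "differential"]
    if incrementals and differentials:
        raise ValueError("mixed schedules are outside the scope of this sketch")
    if differentials:
        # Only the most current differential is needed after the full backup.
        return [last_full, differentials[-1]]
    # Every incremental since the full backup is needed, in order.
    return [last_full] + incrementals
```

The result shows the tradeoff described above: an incremental schedule restores every backup since the last full backup, whereas a differential schedule restores only two backup sets.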

A shadow backup typically involves the following stages, managed by the requestor and writer:

Synchronize Removes the previous shadow copy set from the backup server and synchronizes with the production LUN.
Fracture Freezes writes on the source LUNs when the shadow copies are synchronized, fractures the shadow copy synchronization, and resumes writes to the source LUN.
Transport and Checksum Transports and exposes shadow copy data and transaction log LUNs to the mount host. Runs checksum integrity verification against the shadow copy set. For more information about checksum integrity verification, see "Exchange Requestors and Checksum Integrity Verification" earlier in this article.
Log Truncation Completes backup by truncating storage group transaction logs on success and flags the full backup as complete.

VSS Restore Process

You can choose to restore an entire storage group, or, if the databases are hosted on separate LUNs, which is not a best practice, you can restore one or more databases in the storage group.

To restore even a single database, you must first take all databases in the storage group offline. Then, after the restore has finished, automatic database recovery (transaction log file replay) is invoked for the entire storage group by mounting any database in the storage group.

For this automatic recovery to succeed, the following minimum conditions must be met:

The database file names and logical file paths must be the same as when the backup was done. For example, if the file names were Priv1.edb and Priv1.stm, and the files were stored in the path D:\Databases, the restore location must also be D:\Databases and you must not change the file names.
The storage group prefix must match the file names of any transaction log files that are to be replayed.

In cases where you are restoring to the original server, these conditions are automatically met unless you have changed database paths since the backup was taken.

Some VSS requestors allow restoration to alternate servers. This might be useful for mounting databases on laboratory servers or for advanced recovery scenarios in which the original server is unavailable. For more information about backing up and restoring Exchange Server 2003, see the Exchange 2003 Disaster Recovery Operations Guide.

Recovery occurs in one of two ways:

Roll-forward recovery A roll-forward recovery is a recovery to the time of failure. A roll-forward recovery can be done if the current log LUN is available. In this case, you can restore the database files from backup, but not the transaction log files, and use the current logs on the server to roll the database forward. Assuming that all log files that were generated since the time of backup are available, no data is lost by restoring from backup.
Point-in-time recovery A point-in-time recovery is a recovery only of the data in the last backup. All newer data is lost. When you use a point-in-time recovery, only the transaction log files that are part of the backup set are used. Additional log files generated since the time of backup are not used, and those databases are recovered only to the point of the backup.

Exchange Server Clustering

Many enterprise solutions take advantage of Windows Clustering to increase server availability. When you run Windows Server 2003 with SP1 and Exchange Server 2003 in a cluster, a new feature named maintenance mode is available to help with some restoration methodologies. Clustering adds some unique challenges to VSS that you must understand and plan for in order to be successful. Make sure that you are aware of the backup and restore implications of your clustering solution.

During a backup, the checksum integrity verification is run against the shadow copy. Checksum integrity verification is a memory-intensive and disk-intensive operation that most administrators do not want to run on a cluster node hosting a production Exchange Virtual Server. During checksum integrity verification, the LUN is presented as read-only. This can cause problems with the disk signature of the original LUN and cause it to go offline. That is why most cluster solutions implement a backup server that mounts the backed-up LUNs to run the checksum integrity verification.

During a restore, the cluster physical disk resources are monitored with IsAlive and LooksAlive heartbeat requests. Restore solutions that dismount the production LUN and mount the backup LUN might encounter a timing problem: if the cluster service sends these heartbeat requests to the physical disk during the switch between the production and backup LUN, the cluster physical disk resource can fail, causing a cluster failover. Solutions that resynchronize the backup LUN to the production LUN are not at risk for cluster failover.

If you are running Exchange in a clustered environment, and you use a shadow copy backup or restore provider that causes LUNs to become temporarily unavailable to the cluster, we strongly recommend that you use the Microsoft Windows Server 2003 operating systems with Service Pack 1 (SP1) and that the provider takes advantage of the disk resource maintenance mode feature. For more information about the disk resource maintenance mode feature, see Microsoft Knowledge Base Article 903650, Extended Maintenance Mode Functionality for Cluster Physical Disk Resources in Windows Server 2003.

Alternatively, if you cannot run Windows Server 2003 SP1 or your VSS provider does not yet support disk resource maintenance mode, you can reduce, but not eliminate, the possibility of a cluster failover during critical operations by increasing the IsAlive and LooksAlive values for the resource to 5 minutes. Note that you should not leave these values at 5 minutes; revert them to typical values for regular operation. For information about increasing the IsAlive and LooksAlive values, see Frequently Asked Questions.

Planning Your VSS Backup Strategy

When planning your backup strategy, you must create a service level agreement (SLA) that defines the required backup and restore window. This enables you to accurately determine the number of databases, storage groups, and Exchange servers that you require in order to achieve your required backup window. Most administrators also define an additional window for online database maintenance, online database defragmentation, and operating system maintenance.

When you create a VSS solution, you can use one of two strategies:

Upgrade your current infrastructure to support VSS.
Design a new highly available Exchange Server 2003 VSS solution.

For information about detailed strategies for high availability for Exchange Server 2003, see the Exchange 2003 High Availability Guide. For information about how to plan for disaster recovery, see Worksheet: Disaster Recovery Preparation for Microsoft Exchange Server 2003.

Evaluate Your Current Infrastructure

Regardless of whether you design a new solution or upgrade an existing solution, the first step is to evaluate your current backup and restore method and time window, your database and storage group size, and your current storage capacity consumption and available space. You should measure the performance metrics of the storage solution during both production and backup windows.

It is also important to understand your mailbox profile, which includes the individual mailbox size and the number of database I/Os per user. This is a factor for VSS, because the act of backing up and restoring puts additional load on your storage subsystem, which you must carefully design in order to guarantee low latencies. For information about the steps for defining your mailbox profile, see Optimizing Storage for Exchange Server 2003.

You must also determine whether you can use a streaming backup solution or a VSS backup solution. Streaming online backup uses an Exchange backup API to back up databases and storage groups that are mounted. During the streaming backup, checksum integrity verification is run on each page so that you know that when backup succeeds, you have a reliable backup. The streaming I/O load on your production LUNs during backup is as severe as during the checksum integrity verification of a shadow copy, and your storage infrastructure must be appropriately sized to meet your backup and restore SLA.

You cannot mix Exchange-aware streaming backup with Exchange-aware VSS backups in the same storage group because of transaction log file management conflicts. One backup type might truncate log files that are required by the other backup. However, you can perform a generic streaming file backup of a VSS shadow copy backup set in order to preserve the set permanently before it is overwritten by a succeeding backup.

In a streaming backup environment, the bandwidth of the physical disks or network, physical disk isolation, and tape speed are all considerations and bottlenecks that generally help prevent one Exchange server’s backup from affecting the backup of another Exchange server.

Generally, with the VSS requirements for Exchange Server 2003, your storage enclosure and storage controller handle all backup sequential I/O in addition to handling typical random I/O demands. Therefore, you must make sure that you test your throughput during backup and restore. Make sure that you understand how many gigabytes per hour a controller can sustain when synchronizing, while under the production load you expect during the time of day of the backup operation, in order for you to meet your SLA.

Define New SLAs

When you work with VSS, you must create an SLA that defines your backup window and the acceptable downtime for a particular outage. This will affect your storage design. You must also consider creating an SLA for different scenarios that cause you to restore. When defining your backup strategy, you must balance the need for short backup and restore time windows against the associated costs; for example, a 10 minute restore SLA costs more in hardware and technical expertise than a 72 hour restore SLA.

For more information about SLAs and availability management, see Microsoft Solutions for Management: Availability Management. For more information about SLAs for Exchange Server 2003, see the Exchange 2003 High Availability Guide.

As part of your SLA, define the following:

Recovery Point Objective (RPO) The RPO is the amount of data that you can tolerate losing. For example, if you cannot tolerate losing any data, your RPO is zero.
Recovery Time Objective (RTO) The RTO is the period of time from outage to a return of service. In order to meet the RTO, some solutions require a more frequent backup window, so that during the restore, log replay time will meet the restore SLA. Generally, you want to specify the RTO for the following:
- Mailbox We recommend that you use built-in features such as the Recover Deleted Items feature in Microsoft Office Outlook® 2003 or item retention policies in Exchange Server 2003 for restoring a single mailbox or data in a mailbox. For more information about the Recover Deleted Items feature, see Microsoft Office Assistance: Recover Deleted Items from Any Folder. For more information about item retention policies, see HOW TO: Use System Policies to Configure Mailbox Storage Limits in Exchange Server 2003.
- Storage Group To restore corrupted data or databases, or log file data on the storage group, you must restore from backup. Additionally, your solution must meet your defined SLA and restore window. Although you can use VSS to back up and restore individual databases in a storage group, we recommend as a best practice that you back up and restore whole storage groups together. Backing up and restoring on a per database basis is more complex and limits you to storing only a single database on a particular LUN. Backing up individual databases is supported, and there may be overriding considerations in your environment that make this an appropriate option, but you should consider the drawbacks too.
- Server In the event of a server failure, an alternative server restore of your VSS backup is a good solution. Some storage vendors can restore VSS backups that have been asynchronously replicated to a different site. To do this, the requestor must be able to use VSS to restore to a different server that uses the same path and host name.
- Site In the event of a site failure, the Exchange Server data must be available at another site. You can use replication, copy your VSS backups to tape and store the tapes offsite, or use both approaches, so that the data is available for a restore at the alternative site.
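
The relationship between backup frequency, log replay, and the RTO can be sketched with some simple arithmetic. The following Python example is only an illustration: the function names and all the rates (restore throughput, logs generated per hour, logs replayed per hour) are hypothetical, and you must measure the real values in your own environment.

```python
def worst_case_logs(backup_interval_hours, logs_generated_per_hour):
    """Logs accumulated just before the next backup runs (the most you must replay)."""
    return backup_interval_hours * logs_generated_per_hour


def estimated_restore_hours(data_gb, restore_gb_per_hour,
                            logs_to_replay, logs_replayed_per_hour):
    """Time to copy the data back plus time for log replay during hard recovery."""
    copy_hours = data_gb / restore_gb_per_hour
    replay_hours = logs_to_replay / logs_replayed_per_hour
    return copy_hours + replay_hours


def meets_rto(rto_hours, data_gb, restore_gb_per_hour, backup_interval_hours,
              logs_generated_per_hour, logs_replayed_per_hour):
    """Compare the worst-case restore estimate with the RTO."""
    logs = worst_case_logs(backup_interval_hours, logs_generated_per_hour)
    total = estimated_restore_hours(data_gb, restore_gb_per_hour,
                                    logs, logs_replayed_per_hour)
    return total <= rto_hours


# Hypothetical figures: 100 GB storage group, 200 GB/hr restore rate,
# 500 logs generated per hour, 10,000 logs replayed per hour, 1.5-hour RTO.
daily = meets_rto(1.5, 100, 200, 24, 500, 10000)        # 0.5 h copy + 1.2 h replay
twice_daily = meets_rto(1.5, 100, 200, 12, 500, 10000)  # 0.5 h copy + 0.6 h replay
```

With these assumed rates, a daily backup misses the 1.5-hour RTO, and a backup every 12 hours meets it.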

Define a Replication Strategy

Defining a replication strategy is an important part of implementing a VSS solution because some replication methodologies affect the latencies of the production LUNs and must be carefully designed to meet your SLA. Exchange Server 2003 does not provide an application-level replication mechanism for mailbox databases. We recommend Windows Clustering for server resiliency but defer to our storage partners for their storage and site resiliency solutions.

Replication is becoming more important as businesses change their view of messaging from a “nice to have” to a mission-critical application. You can implement replication in several ways, although most solutions can be classified as either synchronous or asynchronous. For more information about Exchange Server 2003 replication support, see Deployment Guidelines for Exchange Server Multi-Site Data Replication and the Microsoft Knowledge Base article 895847, Multi-Site Data Replication Support for Exchange 2003 and Exchange 2000.

Designing Your VSS Solution

After evaluating your current infrastructure and defining new SLAs, the next step is to work closely with your storage vendor to design a storage solution that meets your SLA. Provide your storage vendor with information about accurate storage group size and I/O performance, backup and restore windows, acceptable performance levels during production and backup, and how frequently you expect to back up your data. The storage vendor can then give you a suggested solution, which you can then validate.

When you design your storage infrastructure, you must verify that the whole end-to-end solution is qualified by Microsoft and listed in the Windows Catalog. The strategy that your solution uses to back up your data can heavily influence the storage design.

RAID Levels

Most storage enclosures are purchased by capacity. Although you must have enough storage capacity for future growth, it is also critical to the success of your storage solution that the solution can deliver enough I/Os with low latency to be perceived as successful by your end users. Your disk subsystem is performing poorly if the average read or write latency exceeds 20 milliseconds, or if latency spikes above 50 milliseconds last for more than several seconds.
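
The latency guideline can be expressed as a simple check over a series of latency samples. The following Python sketch is illustrative only. The function name, the one-second sampling interval, and the interpretation of "several seconds" as five seconds are assumptions, not values defined by Exchange Server 2003. Apply the check to read latency and write latency separately.

```python
def disk_latency_is_poor(samples_ms, interval_s=1.0, avg_limit_ms=20.0,
                         spike_limit_ms=50.0, max_spike_s=5.0):
    """Return True if the samples break the average or sustained-spike guideline.

    samples_ms: latency readings in milliseconds, taken every interval_s seconds.
    max_spike_s: assumed meaning of "several seconds" (hypothetical value).
    """
    if not samples_ms:
        return False
    if sum(samples_ms) / len(samples_ms) > avg_limit_ms:
        return True
    run = 0  # number of consecutive samples above the spike limit
    for value in samples_ms:
        run = run + 1 if value > spike_limit_ms else 0
        if run * interval_s > max_spike_s:
            return True
    return False
```

A short burst above 50 milliseconds does not fail the check; only a sustained run of high samples or a high overall average does.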

Currently, most Exchange Server architects put Exchange Server databases and transaction log files on RAID10 LUNs, both for performance and to help protect the database. You can consider using other RAID levels if the solution is thoroughly tested and the SLA is met. Many storage vendors have specific recommendations for deploying their products with Exchange Server 2003. Customers should ask their storage vendors about Exchange-specific disk configuration and fault tolerance recommendations.

After the storage is purchased, many administrators try to obtain every byte of available capacity by using RAID5. You can use RAID5 if enough spindles are allocated to guarantee expected performance and latency, and sometimes the performance can be better than RAID10. However, delivering the required performance frequently takes more physical disks with RAID5 than with RAID10. Additionally, you should test the decrease in performance suffered during disk rebuild operations for various RAID levels. Jetstress, discussed in Testing Your VSS Solution, should be used to test your actual LUN configuration to make sure that it satisfies Exchange I/O requirements, regardless of RAID level chosen. For more information about RAID levels, see Optimizing Storage for Exchange Server 2003.
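
To see why RAID5 can require more spindles, you can apply the commonly used write-penalty rule of thumb, in which each host write costs about two disk I/Os on RAID10 and about four on RAID5. The following Python sketch uses that rule. The function name, the workload figures, and the per-disk I/O rate are hypothetical, and controller cache and vendor optimizations can change the real result, so this is a starting point for a Jetstress test, not a replacement for one.

```python
import math

# Rule-of-thumb disk I/Os per host write (an approximation, not a vendor figure).
WRITE_PENALTY = {"RAID10": 2, "RAID5": 4}


def spindles_needed(host_iops, read_fraction, raid_level, iops_per_disk):
    """Estimate the number of physical disks needed to serve a host I/O load."""
    reads = host_iops * read_fraction
    writes = host_iops * (1 - read_fraction)
    backend_iops = reads + writes * WRITE_PENALTY[raid_level]
    return math.ceil(backend_iops / iops_per_disk)


# Hypothetical workload: 1,000 host I/Os per second, 75 percent reads,
# and disks that each sustain 100 I/Os per second.
raid10_disks = spindles_needed(1000, 0.75, "RAID10", 100)
raid5_disks = spindles_needed(1000, 0.75, "RAID5", 100)
```

For this assumed workload the estimate is 13 disks for RAID10 and 18 disks for RAID5.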

In summary, disk capacity is only one of the factors you should consider when planning storage for Exchange Server 2003. You must also balance these critical factors:

Fault tolerance. Does the solution provide a high degree of redundancy and resilience to drive and media failures?
I/O profile. Do the RAID level and number of spindles support both the required I/O load and the actual I/O mix (read vs. write, random vs. sequential)?
Recovery profile. After a failure, is there a significant decrease in performance while the drive set recovers?

Basic Design

Consider the following when you develop a basic design for your VSS solution:

Is circular logging enabled on the storage groups?
We recommend that you disable circular logging. When circular logging is enabled, you cannot roll forward the logs, and single database restores are not possible; only point-in-time restores of the whole storage group are possible. Any data written after the backup was taken is lost in such a restore, which can affect the RPO of your SLA.
Does the VSS solution back up and restore Exchange Server data exclusively by using the Exchange writer?
Exchange Server 2003 requires that the Exchange writer exclusively backs up and restores Exchange Server data.

Cluster Design

Consider the following when you develop a cluster design for your VSS solution:

Is the restore fully automated, requiring no manual intervention to adjust cluster resource dependency?
We recommend that the requestor handle all required cluster resource dependency changes.
Does recovery affect physical disk resource health?
We recommend that you use Exchange Server 2003 with SP1 on clusters, and a requestor that is aware of cluster maintenance mode. This prevents resource failures during the restore by temporarily disabling IsAlive and LooksAlive checks.

Provider Design

Consider the following when you select a provider for your VSS solution:

Does your storage array have a VSS provider with snapshot or clone functionality, or both snapshot and clone functionality?
Exchange Server 2003 requires that you have a VSS-aware provider. The provider does the work of communicating with the storage device to create and delete shadow copies.
Does the provider support clustered Exchange configurations?

Requestor Design

Consider the following when you select a Requestor for your VSS solution:

Does the requestor validate the checksum integrity of the shadow copy backup set?
Exchange Server 2003 requires that checksum integrity verification be run on the shadow copy to determine whether the backup is good. Restoration of data for which checksum integrity verification has not been run is not supported.
Does the requestor run a single checksum integrity verification process at a time against a single LUN?
When multiple databases are on the same LUN, it is likely to be more efficient to run the checksum integrity verification serially for each database. This prevents excessive head movement and preserves sequential read operations.
Does the requestor automatically import the current shadow copy to a backup server for checksum integrity verification?
We recommend that you offload the Eseutil.exe checksum integrity verification to a backup server.
Does the requestor support clustered Exchange configurations?
Does the requestor support scheduling and queuing?
Some solutions might have different performance characteristics. One solution might perform best when shadow copies of all LUNs are created together; others might perform better if shadow copies are made serially. The solution should offer enough scheduling flexibility to optimize both performance and administrator convenience.
Does the requestor scan for potential corruption before it starts the backup, and terminate the backup if corruption is found?
Some requestors look for database corruption events (-1018, -1019, -1022) to make sure that a corrupted database does not overwrite a previous good backup. If your requestor lacks this functionality, you can use Microsoft Operations Manager (MOM) or other event scanners, or manually examine the event logs to detect corruption.
Monitoring the event logs for these errors is not a substitute for backup checksum integrity verification. This is because events are logged only for pages in the database that are actually accessed. Errors on infrequently accessed pages are not reliably detected by monitoring the event logs. Monitoring the event logs provides you additional and earlier warning of damage to the database.
Does the requestor use events to signal failure and success?
We recommend that the requestor use events that can be monitored by scripts, and tools such as MOM. These events help you in proactively monitoring your Exchange Server VSS solution.
Does the requestor, by using CDOEXM, fully dismount the storage group before it restores?
We recommend that the requestor dismount the storage group before restore. If the requestor does not do this, you must manually dismount the storage group before it starts the restore.
Does the requestor support Eseutil.exe checksum integrity verification I/O throttling?
Does the requestor manage shadow copy retention and deletion without requiring manual administrator intervention?
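
If your requestor does not scan for corruption before it starts a backup, the check described in the requestor considerations above can be approximated by a script. The following Python sketch shows only the decision logic. The record layout (a dictionary with an "error_code" field) and the function names are hypothetical, and the example does not read the Windows event log; you would have to supply the records from MOM or another event scanner. As noted earlier, a check like this supplements checksum integrity verification and does not replace it.

```python
# Database corruption error codes named in this article.
CORRUPTION_ERROR_CODES = {-1018, -1019, -1022}


def corruption_events(events):
    """Return the event records that report a database corruption error."""
    return [e for e in events if e.get("error_code") in CORRUPTION_ERROR_CODES]


def should_start_backup(events):
    """Allow the backup only if no corruption event was logged since the last good backup."""
    return not corruption_events(events)


# Hypothetical records collected since the last good backup.
recent = [
    {"source": "ESE", "error_code": 0},
    {"source": "ESE", "error_code": -1018},
]
start = should_start_backup(recent)
```

In this example the backup is not started, so the previous good backup is not overwritten.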

Storage Design

Consider the following when you develop a storage design for your VSS solution:

Does restoring the storage group of the VSS solution affect other storage groups or Exchange servers?
We recommend that you design the LUN configuration for your storage so that the restoration of a storage group does not affect your other production storage groups or Exchange servers. It is best to isolate the physical disks on a per storage group basis. Where that is not possible, test the solution under production workload combined with restore I/O workload to make sure that the impact is acceptable to your users.
Does the solution synchronize the shadow copy data from the backup to the production LUN during a restore?
A solution that supports synchronization, whether by using a snapshot or a clone, has to copy data from the shadow copy to the original LUN. How long this takes depends on the amount of data that must be copied. The length of time to restore is also affected by the number of log files that must be replayed during database hard recovery when the storage group mounts.
Is the VSS solution compatible with your site resiliency plan?
Exchange Server 2003 requires that a solution that replicates VSS backups to a separate site uses VSS to restore the data at the second site. For more information about replication, see the Exchange 2003 High Availability Guide.

Clone Design

Use of a clone backup involves copying all your data. This copy requires resources and takes time, depending on the size of the LUN to be copied. Therefore, you must understand the effect that this procedure has on your production LUNs, and whether your storage vendor provides features that enable you to minimize this effect. Storage controllers have a limit to how fast they can clone data. If you understand this limit, you can increase your total throughput by positioning your LUNs and Exchange servers in such a way as to take advantage of the storage controllers.

Consider the following when you design for a clone backup VSS solution:

Does the clone target use a different set of physical disks than the source production LUNs?
We recommend that the clone use physical disks that are separate from the source production LUNs. If the clone uses the same physical disks, the checksum integrity check significantly affects the latency on the production LUNs, and you must schedule the backup to occur at a time of low activity to minimize the impact to users.
If the clone target is a different set of physical disks, does it use the same RAID type?
If your production LUNs are RAID10 and your clone target is RAID5, performance can be sufficient for backup. If during restore, the RAID5 LUN is made available as the new production LUN, you must take the performance implications of possibly slower storage into account when you design your storage solution.
Does the requestor support multiple clone targets?
We recommend that the solution support at least two clone targets to cycle between in order to prevent data loss if a disaster occurs during backup. This will allow for quick recovery from the last known good backup.
Does the requestor wait until the clone is fractured or fully synchronized before running checksum integrity verification?
We recommend that the requestor wait until the clone is fractured or fully synchronized before running checksum integrity verification. Waiting removes any block dependency on the production LUNs and prevents the verification from causing high latency on the production LUNs.
Does the solution provide a mechanism to recover the clone, if the production LUN hardware fails?
If you clone to separate physical disks, but have a disk or enclosure failure, you must be able to restore the clone. If the VSS solution does not provide a mechanism to restore the clone, your SLA must document alternative ways to restore if a disk or enclosure failure occurs, and you must test that alternative method.
Is the clone target using near line storage?
Disk types with slower rotational speed and head seek time (such as SATA) are not an optimal choice for a random workload. Reduced cost storage can perform adequately for most sequential workloads or environments with low production I/O workloads. Many storage companies are adding SATA and FATA devices to their storage enclosures; these devices can perform well as backup targets in VSS environments in which the data is accessed sequentially on these devices during backup and restore. You must make sure that the time that is required to complete the checksum operation on slower storage meets your SLA. The danger is with solutions that present that slower storage to the host as the production LUNs during a restore. Solutions that present reduced speed storage as the production LUNs must be sized to make sure that the solution can handle the production workload.
Does the solution swap the clone and production LUN during a restore?
A solution that supports LUN swapping generally uses a clone backup, and restoration occurs very quickly, regardless of the size of the data, by replacing the production LUN with the backed-up LUN. This strategy is also affected by the number of log files that must be replayed during database hard recovery when the storage group mounts. Considerations for different RAID types (RAID10 and RAID5) are important when a restore is required.
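
The recommendation to cycle between at least two clone targets can be sketched as a selection rule: never overwrite the most recent verified backup. The following Python example is a minimal illustration. The function name and the bookkeeping structure (a dictionary that records the last backup time and verification status for each target) are hypothetical and do not correspond to any particular requestor.

```python
from datetime import datetime


def next_clone_target(targets):
    """Pick the clone target to overwrite while keeping the last known good backup.

    targets: dict of name -> {"last_backup": datetime or None, "verified": bool}
    """
    if len(targets) < 2:
        raise ValueError("at least two clone targets are needed to keep a known good backup")
    verified = [name for name, t in targets.items()
                if t["verified"] and t["last_backup"] is not None]
    # The newest verified backup is protected from being overwritten.
    protected = (max(verified, key=lambda n: targets[n]["last_backup"])
                 if verified else None)
    candidates = [name for name in targets if name != protected]
    # Unused targets sort first; otherwise the oldest backup is overwritten.
    return min(candidates, key=lambda n: targets[n]["last_backup"] or datetime.min)


# Hypothetical state: both targets hold verified backups, B is newer.
state = {
    "CloneA": {"last_backup": datetime(2008, 1, 1), "verified": True},
    "CloneB": {"last_backup": datetime(2008, 1, 2), "verified": True},
}
chosen = next_clone_target(state)
```

Here CloneA is chosen, so that CloneB remains available if a disaster occurs while the new backup is running.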

Snapshot Design

Consider the following when you design for a snapshot backup VSS solution:

Are there provisions for making fully independent backups in addition to snapshots?
Does the snapshot allocate space as needed?
Most solutions consume capacity as it is needed, as changes are made. Some solutions allocate the whole production LUN size with each snapshot, to prepare for the event that every bit of data has changed. You must consider this when planning for capacity if more than one snapshot per day is required. If space is not preallocated, you must still consider the additional space that might be required as copy-on-write snapshots grow in proportion to the number of changes in the live dataset.
What are the performance implications of multiple snapshots?
The requirement to keep track of multiple snapshots at the same time can create performance overhead. You must measure this performance impact so that you can determine the number of snapshots that it is realistic to have at any time. Deleting a snapshot can also create a performance impact for some solutions, because indexes must be updated, and, sometimes, data must be reorganized on the physical disks before the space can be reallocated to the array.
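
The capacity difference between preallocated and allocate-as-needed snapshots can be estimated with a rough model. The following Python sketch assumes that a copy-on-write snapshot grows by the amount of data that changes before the next snapshot is taken, and that changes are spread evenly through the day. Both are simplifications, and the function name and all figures are hypothetical. Ask your storage vendor how its snapshots actually consume space.

```python
def snapshot_space_gb(lun_gb, snapshots_retained, daily_change_fraction,
                      snapshots_per_day, preallocated):
    """Rough estimate of the extra capacity needed to retain snapshots of one LUN."""
    if preallocated:
        # Each snapshot reserves the whole production LUN size.
        return lun_gb * snapshots_retained
    # Copy-on-write: each snapshot holds roughly the data changed in one interval,
    # and never more than the size of the LUN.
    change_per_interval = lun_gb * daily_change_fraction / snapshots_per_day
    return snapshots_retained * min(lun_gb, change_per_interval)


# Hypothetical LUN: 500 GB, 25 percent daily change, 2 snapshots per day, 4 retained.
reserved = snapshot_space_gb(500, 4, 0.25, 2, preallocated=True)
as_needed = snapshot_space_gb(500, 4, 0.25, 2, preallocated=False)
```

Under these assumptions the preallocated design needs 2,000 GB and the allocate-as-needed design needs about 250 GB.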

Validating Your VSS Solution

VSS will have an effect on your storage infrastructure. After you design your VSS solution, you must validate your solution by measuring that effect and making sure that you meet your SLA. You must validate all the restore scenarios that you expect to support in your SLA by using a proof-of-concept approach.

After you outline your solution requirements, you must validate the whole end-to-end solution. This includes the following details:

How many GB/hr the storage controller can back up under the expected Exchange Server load during the backup window.
How quickly those databases can be restored.

Use the validation period to determine whether your solution meets the following requirements:

The typical read and write latency on the database LUNs is under 20 milliseconds and peaks that are higher than 50 milliseconds last no longer than several seconds.
You can perform a backup and checksum integrity verification in the defined backup window.
Your restore meets your SLA for recovery, without affecting other storage groups or Exchange servers.
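
The backup window requirement can be checked with the throughput figures that you measure during validation. The following Python sketch assumes that the backup and the checksum integrity verification run one after the other and that both scale with the amount of data. The function names and the sample figures are hypothetical.

```python
def backup_and_verify_hours(data_gb, backup_gb_per_hour, verify_gb_per_hour):
    """Hours needed to back up the data and then verify the checksums."""
    return data_gb / backup_gb_per_hour + data_gb / verify_gb_per_hour


def fits_backup_window(data_gb, backup_gb_per_hour, verify_gb_per_hour, window_hours):
    """True if backup plus verification completes inside the backup window."""
    total = backup_and_verify_hours(data_gb, backup_gb_per_hour, verify_gb_per_hour)
    return total <= window_hours


# Hypothetical measurements: 400 GB of data, 200 GB/hr backup, 100 GB/hr verification.
hours = backup_and_verify_hours(400, 200, 100)
fits_8_hours = fits_backup_window(400, 200, 100, 8)
fits_4_hours = fits_backup_window(400, 200, 100, 4)
```

With these figures the job takes 6 hours, so it fits an 8-hour window but not a 4-hour window.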

You must test the actual end-to-end solution in the way that you expect to deploy it in your production environment. Testing enables you to make sure that your solution is using the VSS framework and meets Exchange Server 2003 requirements, and gives you an opportunity to understand some of the performance implications of the solution. You must design your end-to-end testing around your own SLA. Deploying an Exchange Server 2003 VSS solution without taking the performance impact into account may cause poor performance and create unhappy users.

Make sure that you test the following:

The performance of the production LUNs during a backup
The performance of checksum integrity verification. This includes performance of the production LUNs during verification. Checksum integrity verification must complete quickly enough so that it does not extend past the time of the next backup.
Restore
Log replay
Replication

Note that the number of users per storage group should match your expected deployment numbers.

You can perform deployment testing by using the following tools:

Microsoft Exchange Server 2003 Load Simulator (LoadSim) LoadSim simulates Outlook MAPI users who are using Exchange Server 2003. You can use LoadSim to create users and initialize mailboxes with mail. This creates databases that are similar in size to those in your production environment. LoadSim requires Outlook 2003. To download LoadSim, see Microsoft Exchange Server 2003 Load Simulator (LoadSim).
Exchange Server 2003 Jetstress Tool Jetstress simulates disk I/O load to verify the performance and the stability of your storage array, and has an easy-to-use graphical user interface. To download Jetstress, see the Exchange Server 2003 JetStress Tool.

Phase One – Burn In

The goal of the first phase of testing is to burn in the solution as a means to quickly identify VSS and storage configuration and stability issues.

The following is an example of the execution of phase one of the testing process:

Run a 24-hour Jetstress stress test against each database and log LUN.
Use LoadSim to create production-sized databases, and run a backup job against the server every 2 hours for a 48-hour period.
You should adjust the 2-hour backup window to your proposed production backup window. Use a stand-alone or clustered server based on the configuration of your production environment.
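
When you plan the burn-in phase, it can help to list the backup start times in advance so that you can compare them with the jobs that actually ran. The following Python sketch generates such a list. The function name and the start date are hypothetical; the 2-hour interval and 48-hour duration are the example values from the steps above, and you should replace them with your own.

```python
from datetime import datetime, timedelta


def burn_in_schedule(start, interval_hours=2, duration_hours=48):
    """Return the planned backup start times for the burn-in period."""
    count = duration_hours // interval_hours
    return [start + timedelta(hours=i * interval_hours) for i in range(count)]


# Hypothetical start time for the test run.
schedule = burn_in_schedule(datetime(2008, 1, 24, 8, 0))
```

With the example values, the schedule contains 24 backup jobs, the last of which starts 46 hours after the first.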

Phase Two – Production Simulation

The goal of the second phase of testing is to make sure that your VSS solution can back up and restore production-sized and stressed databases under production load within the backup and restore windows defined in your SLA.

The following is an example of the execution of phase two of the testing process:

Run nightly backups after an 8-hour LoadSim run by using the MAPI Messaging Benchmark 3 (MMB3) profile.
Run restore in the morning before kicking off the next LoadSim test. Make sure that you test the restore cases that you plan to use in your production environment, as defined in your SLA. The three most common restore cases include the following:
- Roll-forward recovery, recovering the database(s) and rolling the logs forward
- Point-in-time recovery, recovering the database(s) and logs to the time of the backup without rolling the logs forward
- A full restore, whereby the whole storage group must be restored

Make sure that you monitor the performance impact of the checksum integrity verification. If the checksum integrity verification causes unacceptable latency, first determine whether a bottleneck exists. Storage controllers, processors, cache, and bandwidth to the storage can all create a bottleneck. If disk performance is creating the bottleneck, the best solution may be to add more physical spindles to support the LUN. Contact your storage vendor to identify strategies for improving unacceptable latency.

Monitoring Your VSS Solution

You must monitor your solution’s health so that you can take proactive steps to manage growth and prevent problems as your production environment evolves, for example as your user processes change, you add more users, or your mailboxes become larger. For the purposes of backup and restore, you must make sure that you monitor the following:

Whether your latencies are changing.
Whether you are achieving your expected backup and checksum integrity verification rates.
Notifications of problems that can occur during the backup or restore process.

The first step in monitoring is to establish a baseline for healthy performance characteristics. Over time, continue to monitor for deviations from the established baseline. For more information about monitoring, see the Exchange 2003 High Availability Guide.
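
A baseline comparison can be as simple as recording the mean and standard deviation of a healthy period and flagging later values that fall far outside that range. The following Python sketch shows one way to do this. The function names, the sample values, and the choice of three standard deviations as the tolerance are assumptions for illustration; MOM 2005 and vendor management packs provide their own threshold mechanisms.

```python
from statistics import mean, stdev


def build_baseline(samples):
    """Summarize measurements taken while the solution is known to be healthy."""
    return {"mean": mean(samples), "stdev": stdev(samples)}


def deviates(value, baseline, tolerance=3.0):
    """True if a new measurement is more than `tolerance` standard deviations from the mean."""
    return abs(value - baseline["mean"]) > tolerance * baseline["stdev"]


# Hypothetical average write latencies (milliseconds) from a healthy week.
baseline = build_baseline([10, 12, 11, 9, 13])
normal_day = deviates(12, baseline)
slow_day = deviates(20, baseline)
```

You can apply the same comparison to backup rates and checksum integrity verification rates by building a separate baseline for each measurement.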

Microsoft Operations Manager 2005

Microsoft Operations Manager (MOM) 2005 and the Exchange Management Pack provide a central way to monitor Exchange Server 2003 performance and availability. The Exchange Management Pack provides a knowledge base for alerts together with suggestions and links to information related to the alerts. By using the Exchange Management Pack, you can easily keep track of the following:

Database size
Number of mailboxes
Configuration
Availability
Client monitoring
Mail traffic analysis

The Exchange Management Pack also enables you to receive alerts when particular thresholds are met. To download the Exchange Management Pack, see the Exchange Server Management Pack Guide for MOM 2005. For information about best practices for monitoring by using MOM 2005 with Exchange Server 2003, see the Exchange 2003 Management Pack Configuration Guide. For information about troubleshooting Exchange Server 2003 performance issues, see Troubleshooting Exchange Server 2003 Performance.

The management packs provided by your storage vendor are also useful. These management packs can alert you when the storage exceeds capacity, performance, and fault tolerance thresholds.

Appendix

Some VSS solutions require that you apply the following hotfixes:

For Windows Server 2003 with SP1: 891957, 898790
For Exchange Server 2003 with SP1: 892514

To determine whether you require these hotfixes, contact your storage vendor.