A Guide to Exchange Disaster Recovery Planning
At a Glance:
- Basic Exchange Server disaster recovery strategies
- Using Exchange on a Windows cluster
- Recovering Exchange databases
- New recovery features in Exchange Server 2007
In all but the smallest organizations, Exchange Server 2003 runs as a distributed system. This architecture provides Exchange with a great degree of resiliency. In fact, the great majority of an Exchange server's configuration information is stored away from the server, in Active Directory. This distributed design makes rebuilding a damaged Exchange server nearly automatic. In addition, Exchange works well with the Microsoft® Windows® Clustering Service. Running Exchange in a Windows cluster (see Figure 1) can provide you with even faster recovery and greatly reduce maintenance downtime for tasks like installing service packs and security upgrades.
Figure 1 Running Exchange Servers as Clusters
For Exchange to work, the network, DNS, and Active Directory® must all be working properly. Increasing the reliability of these vital infrastructure components is often the best thing you can do to increase Exchange reliability and availability. A detailed discussion of architecting and recovering your network and Active Directory is beyond the scope of this article, which, for the most part, deals with how Exchange servers can be affected by a disaster.
The most fundamental aspect of preparing for and recovering from an Exchange disaster is making sure you have good backups of your databases, and I'll spend a good bit of time here on Exchange database recovery. I'll also discuss recovery of server installations and configuration information, and re-assignment of special Exchange server roles after a disaster (see the sidebar "Special Server Roles").
Redundancy is the first principle of disaster recovery. Redundant network paths, DNS and Active Directory servers, and Exchange front-end (client access) and clustered servers turn failures into inconveniences instead of disasters.
Some failures can be automatically handled with little apparent effect on the system. For example, an emergency generator can monitor electrical power. In case normal power fails, the system automatically fails over to the generator with a negligible interruption in service. Other failures can be mitigated by having hardware backups or spares. If a single light bulb burns out, for example, you simply replace it with a new one.
Some components of an Exchange system can be designed for recovery by failover: Active Directory, DNS, front-end servers, public folder databases, message routing and connectors, clustered Exchange virtual servers, and general network infrastructure. Other components typically rely on backups, manual reconfiguration, or failover to an alternate server or application after a disaster: mailbox databases, special server roles, standalone Exchange server configuration information, in-transit messages, and client applications.
Several third-party vendors offer replication capabilities for Exchange Server 2003 mailbox data, so you can devise a failover-based recovery strategy for mailbox data, but it requires purchasing add-on products and services. Exchange does not currently provide this capability out of the box. (Later in this article, I'll preview new mailbox database replication features coming in the next major version, Exchange Server 2007.)
For reduced downtime and system impact, it is always better to design for failover instead of restoration or reconfiguration. But failover strategies typically cost more, not only in monetary terms but also in complexity and management. What's most important is to have a strategy for protecting and recovering each of the components noted previously. Regardless of the details of your strategy, you should take into account the following:
Recognize the scope and limits of your strategy. What disasters won't your strategy handle? What if there is no electric power for more than a day? What resources are you assuming will be available after a disaster? What if an entire building is destroyed or the entire Internet is down? Understanding what your strategy can't handle is critical for designing Service Level Agreements (SLAs) and for assessing risks. The bigger the scope of a disaster, the less effective are detailed plans for dealing with it. That doesn't mean you can't plan effectively, but you need to build in flexibility and ensure that you have the necessary resources to solve unexpected problems. Step-by-step procedures save time and reduce mistakes, but they are no substitute for people who can creatively handle unexpected situations.
Test your strategy regularly. When you devise a plan on paper, you'll inevitably miss certain elements that you'll discover only during a fire drill. For example, if your strategy for replacing a burned out light bulb is to replace it with another one, it's a good idea to find out before the light goes out that you'll need a ladder to reach the fixture as well as a flashlight to find the replacement bulbs and to help make sure you don't fall off the ladder in the dark.
Update the strategy periodically. As conditions change, parts of your strategy will become obsolete. Maybe an updated version of your backup program requires different restoration procedures. People change jobs or phone numbers. Other groups you depend on may reorganize or disappear entirely. You should schedule periodic reviews to evaluate and update your plan. Quarterly reviews are common. Your first quarterly review may reveal whether that is often enough.
Have a backup plan for your backup plan. Do you know how to quickly find a different ISP or hardware provider? If key personnel are unavailable, do you have the budget and contacts to get substitute expertise quickly from a vendor or system integrator? Redundant strategies are as important as redundant hardware.
Train more than one person. It's not just technical knowledge that matters. What about passwords? Account numbers? Telephone numbers? Don't let the absence of a key person or a changed cell phone number become your most important single point of failure.
Nearly all the configuration information for an Exchange 2000 or 2003 server is stored in the Active Directory database. If Active Directory replicas are kept on servers separate from Exchange (which is the case in all but the smallest Exchange systems), then the configuration for an Exchange server can survive the loss of the entire server.
Of course, this means that it is very important to take good care of Active Directory. You should have multiple replicas of the Active Directory database for each domain in your organization or do frequent backups of the database. If Active Directory is entirely destroyed but an Exchange server survives, you can still recover but it is not a simple process.
The Exchange Setup.exe program can be run in three different modes: initial setup, maintenance, and disaster recovery. Initial server setup runs automatically if the setup detects that Exchange has never been installed on the server and there's no configuration information already in Active Directory. If you are installing Exchange to a standalone server, setup will install all necessary Exchange program files, configure default registry and Internet Information Service (IIS) metabase entries, and then store the server's configuration information in Active Directory. If you are installing Exchange into a Windows cluster, setup will install all necessary Exchange program files, configure default registry and IIS metabase entries, and stop. Configuration of Exchange virtual servers inside a cluster is done by the administrator after setup has completed. I'll explain how this works and other differences between Exchange on a standalone server and in a cluster shortly.
Maintenance mode setup runs automatically if setup detects that Exchange has already been installed on the server. This mode is useful for adding new Exchange components (such as foreign connectors) or for reinstalling Exchange program files and default settings to try to repair a damaged Exchange installation.
Disaster recovery mode setup (see Figure 2) must be run manually by starting Setup with the /Disasterrecovery command-line option. For this mode to work, Setup must find intact configuration information for this server in Active Directory. Use this mode when an Exchange server installation has been damaged badly enough that maintenance mode setup does not detect that Exchange is already installed on the server. Setup /Disasterrecovery will install Exchange, configuring the server automatically with the server information already in Active Directory. This mode can also be used to transfer ("forklift") an Exchange server to different or upgraded hardware.
Figure 2 Disaster Recovery Setup
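The way Setup arrives at one of its three modes can be summarized as a small decision function. This is only a sketch of the rules described above for a standalone server. The function name and its two boolean inputs are illustrative assumptions and are not part of Setup.exe.

```python
def choose_setup_mode(detected_on_server: bool, config_in_ad: bool) -> str:
    """Return the Setup mode that applies, following the rules in the text.

    detected_on_server: Setup can detect an existing Exchange installation.
    config_in_ad: intact configuration for this server exists in Active Directory.
    """
    if detected_on_server:
        # Setup finds Exchange on the machine: maintenance mode runs automatically.
        return "maintenance"
    if config_in_ad:
        # Nothing detectable locally, but Active Directory still describes the
        # server: the administrator must start Setup with /Disasterrecovery.
        return "disaster recovery"
    # Never installed and nothing in Active Directory: initial setup.
    return "initial setup"


print(choose_setup_mode(detected_on_server=False, config_in_ad=True))
```

The second branch is the one that never runs on its own: it corresponds to an administrator deliberately starting Setup with the /Disasterrecovery option.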
The /Disasterrecovery option for Setup.exe is not available for an Exchange cluster, but there is equivalent functionality in a clustered environment. To understand how this works, you must first know something about how Exchange running in a cluster differs from Exchange running on a standalone server. And for that, you need to know the basics of Windows clustering.
Windows Clustering Basics
A Windows Server™ 2003-based cluster is formed by defining a cluster name in the Cluster Administrator console and then configuring two or more computers (or nodes) as members of the cluster. Note that it is possible to create a single-node cluster on any computer running Windows Server 2003 for testing and learning purposes. You can even install Exchange on a single-node cluster to see how it works. But a single-node cluster isn't much good for anything except testing and learning.
Cluster nodes share a quorum disk and a cluster database. (Quorum types other than shared disk are now supported by Windows Server 2003. They won't be discussed here because they don't affect basic clustering principles.) These resources store information about the cluster state and about the configuration of applications running in the cluster. At any given time, one node in a cluster is responsible for running a particular application that's under cluster control. If this node fails, or at administrator discretion, another node can take over running the application. Switching an application to run on a different node is called failing over. Other nodes can also run the application because the application has been installed in a cluster-aware configuration on each node, and all information needed to run the application is replicated continuously to every node.
In general, clustered applications run as virtual servers. This means that the application appears in the Cluster Administrator as a set of resources. For example, an Exchange virtual server typically has four resource types in it: a network name, an IP address, physical disks, and services.
The network name is the computer name that clients on the network see when connecting to the Exchange virtual server. This name is different from the cluster name and from the computer names for each node in the cluster. Typically, only cluster administrators connect directly to the cluster or its nodes. All other clients connect to virtual servers running in the cluster which appear to them to be full-fledged standalone computers.
The IP address is the address associated with the network name. It is the IP address clients can ping for the virtual server. In DNS, there will be an entry for the network name associating it with this IP address.
Typically, there will be multiple physical disks in an Exchange virtual server to support multiple databases and transaction log file streams. These disks share a physical bus of some kind among all the computers that belong to the cluster. When the virtual server is failed over to a node, the node can acquire exclusive access to the disks that belong to the virtual server.
All of the services you typically see on a standalone Exchange server are configured as resources in an Exchange virtual server. Instead of starting and stopping Exchange services from the Services console, you do it in the Cluster Administrator console. (Stopping services incorrectly in a cluster will probably make the cluster think the service has failed and cause the entire virtual server to fail over to another node.)
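To make the idea of a virtual server that floats between nodes more concrete, here is a toy model in Python. It is a simplified sketch and not how the Cluster service is implemented. The class, the function, and the node and server names are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class VirtualServer:
    network_name: str      # name that clients connect to
    ip_address: str        # address registered in DNS for that name
    possible_owners: list  # nodes that have Exchange program files installed
    owner: str             # node currently running the virtual server


def fail_over(vs: VirtualServer, healthy_nodes: set) -> str:
    """Move the virtual server to the first healthy possible owner."""
    if vs.owner in healthy_nodes:
        return vs.owner  # current owner is fine, nothing to do
    for node in vs.possible_owners:
        if node in healthy_nodes:
            vs.owner = node  # the disks and services follow the virtual server
            return node
    raise RuntimeError("no healthy node can own " + vs.network_name)


vs = VirtualServer("EXVS1", "10.0.0.50", ["NODE1", "NODE2"], owner="NODE1")
fail_over(vs, healthy_nodes={"NODE2"})
print(vs.owner, vs.network_name, vs.ip_address)
```

After the failover, only the owner changes. The network name and IP address that clients use stay the same, which is why clients do not need to be reconfigured.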
Setting up an Exchange Cluster
Recall that when you set up Exchange on a cluster, only the Exchange program files, default system registry, and the IIS metabase entries are configured. No configuration information is put in Active Directory by Exchange cluster setup. The reason for this is that while you must install Exchange program files on a specific cluster node, you will run Exchange in a virtual server that floats among multiple nodes in the cluster. You set up Exchange virtual servers through the Cluster Administrator program, not Exchange setup. In a cluster, all Exchange setup does is prepare the node to be a possible owner of an Exchange virtual server. It gets the program files needed to run Exchange installed on each cluster node.
To create an Exchange virtual server, you must create a new virtual server in Cluster Administrator and assign it an IP address, a network name, and one or more physical disk resources (see Figure 3). These basic resources are available and configurable from Cluster Administrator's menus. After configuring them, you're ready to add a Microsoft Exchange System Attendant resource to the virtual server.
Figure 3 Configuring Exchange in a Windows Cluster
When you add this resource to a virtual server, Exchange checks Active Directory for an Exchange virtual server with the same network name. If a virtual server with the same name already exists, the rest of the setup proceeds in much the same way as it would during a disaster recovery setup on a standalone Exchange server. Information already in Active Directory is used to complete the configuration of the virtual server.
If there is no matching virtual server in Active Directory, initial setup is completed and configuration information for the virtual server is added to Active Directory. In all cases, the other services required by an Exchange virtual server are automatically created when you create the System Attendant resource.
Advantages of an Exchange Cluster
In a cluster, Exchange virtual server configuration information is shared and replicated among all cluster nodes. If a particular node is destroyed, the impact on Exchange is minimal; another node simply takes ownership of the virtual server and keeps running the app. You can add or remove nodes from the cluster without disturbing other nodes or apps already running in the cluster. If a cluster node is destroyed, it can be replaced simply by adding another Windows server to the cluster and installing Exchange program files on it. You don't have to run a disaster recovery setup or incur any user downtime.
If you need to install service packs, firmware upgrades, or security updates, you can move Exchange to a different node while you complete maintenance at your leisure—even during peak usage hours. Typically, moving Exchange to a different node takes only a minute or so. If your end users run Outlook® 2003 in cached mode, they are unlikely to even notice the failover. In a standalone configuration, maintenance or upgrades that require a server reboot usually mean you have to stay late or work weekends so you can take a prolonged outage during off peak hours.
So why doesn't everyone run Exchange in a cluster? There are two important issues to consider when deciding whether to cluster Exchange. The first is hardware. To be fully supported by Microsoft, you must run Exchange on Windows cluster-certified hardware and configurations. Consult the Windows Catalog for a list of systems that are cluster certified. Cluster certification goes beyond ordinary Windows Hardware Compatibility List (HCL) approval. Because cluster machines must work together as a single system, cluster hardware is certified as a full working system, not just component by component. This means that you may have to replace or upgrade hardware to run Exchange in a cluster.
The second factor is training. As you've just seen, Exchange behaves differently in a cluster than on a standalone machine. Exchange administrators must be trained to understand not only how Exchange behaves in a cluster, but also how to administer clusters in general. Well-managed standalone Exchange servers in a stable environment can easily achieve 99 percent or better availability. If you are running Exchange on standalone machines with less than 99 percent uptime, switching to clusters is unlikely to improve things. In fact, it may make them worse because you're adding extra complexity to an already unstable environment.
If you are currently running below 99 percent Exchange uptime, improvements other than clustering are likely to be more effective. Very often, these improvements do not even cost anything, but are achieved through more effective management and operational procedures.
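It helps to translate an availability percentage into hours of downtime before deciding whether clustering is worth the cost. The short calculation below is plain arithmetic over a 365-day year. The function name is only illustrative.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours in a non-leap year


def allowed_downtime_hours(availability_percent: float) -> float:
    """Hours per year a server may be down at the given availability."""
    return HOURS_PER_YEAR * (100 - availability_percent) / 100


for target in (99.0, 99.9, 99.99):
    print(f"{target}% availability allows about "
          f"{allowed_downtime_hours(target):.1f} hours of downtime per year")
```

At 99 percent, the budget is roughly 87.6 hours a year, or a little more than three and a half days. Each additional nine cuts that budget by a factor of ten.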
If you are considering clustering for the first time, choose Windows Server 2003 instead of Windows 2000 Server. Clustering is simpler and more robust in Windows Server 2003, and the learning curve for new administrators is greatly reduced. Clustering is installed by default in Windows Server 2003, and initial cluster setup has been streamlined and improved.
Finally, there is one important respect in which running Exchange in a cluster offers no additional protection over running Exchange on a standalone server: protecting databases. If something damages the databases, failing over to a different node will not help you. The damaged database disk is transferred to the new node. It will fail there in just the same way as on the previous node.
The next version of Exchange is expected to provide Cluster Continuous Replication (CCR) of databases, which will remove this single point of failure. You will be able to fail over to a completely independent copy of the databases. There will also be Local Continuous Replication (LCR) available for standalone servers.
Special Server Roles
There are several functions all Exchange servers need that can be delegated to a single server. These roles are easily transferable to different servers in case of a disaster, but it is important to document which server carries each role in case you need to reassign it after a disaster. Many an Exchange administrator has gone through hours of unnecessary troubleshooting after forgetting something like which server is the Routing Group Master—and that this server is down. Here are the main roles to document:
Domain Recipient Update Service (RUS): There must be at least one Exchange server designated for this role per Windows domain with Exchange recipients. This server is responsible for managing and updating mailbox-enabling attributes for all Exchange users in the domain. Without it, new mailboxes can’t be enabled. To change the Domain RUS Server, change the designated server on each Domain RUS object in Exchange System Manager.
Enterprise RUS: There is one Exchange server designated for this role per Windows forest. This server is responsible for managing and updating mailbox-enabling attributes for all Exchange system mailboxes in the Active Directory Configuration container. Without it, System Attendant functions on new servers will fail and critical system messages may not be delivered. To change the Enterprise RUS Server, change the designated server on the Enterprise RUS object in Exchange System Manager.
Routing Group Master: There is one Exchange server designated for this role per Exchange Routing Group. This server is responsible for coordinating and updating routing information between all servers in the routing group and between routing groups. Without it, mail may get stuck in queues or be returned as undeliverable when routing topologies change. To change the Routing Group Master, open the Routing Group in Exchange System Manager and select a different server by right-clicking its object and setting it as the new Master.
Offline Address Book Generation Server: There is one Exchange server designated for this role per Offline Address Book. This server is responsible for creating the Offline Address Book that offline Outlook clients use to address messages. Without it, updates and changes to the address book will not be available to clients. To change the Offline Address Book Generation Server, open the Offline Address Lists container in Exchange System Manager and select a different server in the properties of each list.
Free/Busy Folder: There can be multiple Exchange servers that have replicas of this folder, but typically there’s only one. Clients connect to servers with replicas of this folder to upload Calendar information that allows others to check when they are free or busy. If a replica of this folder is unavailable, scheduling suffers, and the problem is usually very visible to end users and executives. If the original folder is unavailable and you configure a new replica on a different server, clients must all make changes to their calendars or run Outlook for a period of time before their information will be up to date.
Front-End Servers: Front-end or Client Access servers (CAS) do not host mailboxes but are used for load balancing and convenience in client access and firewall configuration. Typically, a downed front-end server can be simply replaced with another one that is similarly available to the same clients.
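One lightweight way to document these roles is a simple lookup table kept alongside your recovery plan. The sketch below shows how such a record answers the question "which roles do I need to reassign?" when a server goes down. The server names and assignments are hypothetical placeholders, and nothing here queries Exchange or Active Directory.

```python
# Hypothetical record of which server holds each special role.
ROLE_HOLDERS = {
    "Domain RUS": "EXCH01",
    "Enterprise RUS": "EXCH01",
    "Routing Group Master": "EXCH02",
    "Offline Address Book Generation Server": "EXCH02",
    "Free/Busy Folder replica": "EXCH03",
}


def roles_to_reassign(down_server: str, role_holders: dict = ROLE_HOLDERS) -> list:
    """List the documented roles held by a server that is down."""
    return sorted(role for role, holder in role_holders.items()
                  if holder == down_server)


print(roles_to_reassign("EXCH02"))
```

The record is only as useful as it is current, so it belongs in the same periodic review as the rest of the recovery plan.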
When disaster strikes, there may be messages "in flight" within the system. A disaster may prevent the server from functioning, but it may leave messages intact that have been queued for delivery. Depending on the scope of the disaster, these messages may be stranded somewhere, but still be recoverable. In such cases, many administrators simply instruct users to resend recent messages that may not have gotten through before the disaster. But that is not the only option. You may be able to locate and save the messages for later delivery after the system has been recovered.
Exchange supports several different message transport mechanisms. Each transport protocol queues messages in a different way. The default transport for messages in Exchange 2000 and Exchange 2003 is SMTP. If this is the only message transfer method you use, then saving and replaying messages for later delivery is simple. If you copy a properly formatted SMTP message into the Pickup directory for the SMTP service, SMTP will grab the message and try to deliver it. This works even if you copy a message that was stranded on one server to the Pickup folder on another SMTP server. See the Microsoft Knowledge Base article "Supported Methods to Replay Outgoing SMTP Messages in Exchange 2003 or in Exchange 2000" for more details.
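As an illustration of the Pickup mechanism, the following sketch builds a simple message with Python's standard email library and writes it as a file into a directory. It is a minimal example under stated assumptions: the directory is passed in as a parameter because the actual Pickup location depends on your installation, the addresses and file name are placeholders, and the exact formatting requirements are those in the Knowledge Base article, which this sketch does not attempt to reproduce.

```python
import tempfile
import uuid
from email import policy
from email.message import EmailMessage
from pathlib import Path


def build_message(sender: str, recipient: str, subject: str, body: str) -> bytes:
    """Serialize a plain-text message with CRLF line endings."""
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = recipient
    msg["Subject"] = subject
    msg.set_content(body)
    return msg.as_bytes(policy=policy.SMTP)


def drop_in_pickup(raw_message: bytes, pickup_dir: Path) -> Path:
    """Write the message as a uniquely named file in the given directory."""
    target = pickup_dir / f"{uuid.uuid4().hex}.eml"  # file name is arbitrary here
    target.write_bytes(raw_message)
    return target


# Demonstration against a temporary directory standing in for a Pickup folder.
with tempfile.TemporaryDirectory() as tmp:
    raw = build_message("sender@example.com", "recipient@example.com",
                        "Resent after outage", "This message was replayed.")
    written = drop_in_pickup(raw, Path(tmp))
    print(written.name, len(raw), "bytes")
```

Replaying a stranded message amounts to the second step alone: copying the existing message file into the Pickup directory of a working SMTP server.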
Third-party connectors and other message transfer protocols may be more difficult to preserve and replay for delivery. If in-transit messages are of concern, refer to Microsoft or third-party documentation for descriptions of how messages are queued and transmitted in each connector.
Recovering Exchange Databases
The information presented so far should give you a good foundation for understanding the various elements of an Exchange system you need to protect and get running again after a disaster. It may seem like there are a lot of moving parts to keep track of, but if you know how to install Windows and Exchange and are reasonably familiar with the Exchange System Manager program, getting the system back up and running depends more on having spare parts than anything else. The actual tasks are relatively simple.
After an Exchange server is back up and configured correctly, you come to what is often the most difficult—or at least time-consuming—part of disaster recovery: restoring user data. More and more information is being kept and sent through e-mail. Exchange databases are frequently 50GB or more in size and there are often 20 of these databases on a single server. Without specialized (and still very expensive) disk hardware, backing up and restoring that amount of data just takes time.
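A quick estimate shows why restore time dominates recovery planning. Twenty databases of 50 GB each come to 1,000 GB. The throughput figure below is purely an assumption for illustration; substitute the rate you actually measure during a test restore.

```python
def restore_hours(database_count: int, database_size_gb: float,
                  throughput_gb_per_hour: float) -> float:
    """Estimate hours needed to restore all databases one after another."""
    total_gb = database_count * database_size_gb
    return total_gb / throughput_gb_per_hour


# 20 databases x 50 GB at an assumed 100 GB per hour.
print(restore_hours(20, 50, 100), "hours")
```

At the assumed rate, restoring a full server takes ten hours before any transaction log replay begins, which is worth comparing against the downtime your SLA allows.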
Exchange includes many features to help you back up, restore, manage, and repair Exchange data efficiently and flexibly. In most cases, recovering Exchange data after a disaster is simple: restore your last backup and let automatic transaction log replay roll the database forward. Things get complicated, though, if you discover that your backups haven't run for weeks—and you forgot to check whether they were working. Loss of transaction logs also makes recovery more difficult, but you're not out of luck even then. Exchange has a robust suite of tools and utilities for repairing databases and salvaging and merging mailboxes between databases. Still, your best recovery strategy is to do the things you should to avoid needing to use such tools.
Many companies don't think of Exchange as a Tier 1 or database application. They don't provide the same quality of hardware or protection they do for other mission critical applications. Sometimes it takes a disaster before companies realize that e-mail is mission critical and they should be more serious about backing it up and protecting it. Ask yourself (and perhaps your boss) these questions: How would people react if the mail system was shut off for a day? What would the impact on your business be if all the Exchange databases were suddenly deleted? If you don't use a reliable method to back up your Exchange data, you're just one disaster away from finding out the answers to those questions.
The first line of defense in recovering an Exchange database is its transaction log files. Not only are the transaction log files a good crash-recovery mechanism, they also play an important role in recovery using backups. Because every change to the database is recorded in a transaction log file, you can use the logs to bring a restored copy of the database completely up to date. Essentially, the transaction logs allow you to restore from backup and then fast-forward through everything that happened since the backup. If you have all the transaction logs generated after the backup, the result of restoration will be a database that is logically identical to the one you just lost, right up to the moment that disaster struck.
A critical best practice is to store the transaction log files away from the database files on a separate physical disk or array. This decreases the chance that both the database files and the log files will be damaged or lost in the same failure. If you keep the database and log files on the same drive, you are storing all your eggs in one basket.
There is no need to take a database down to back it up. Microsoft provides two application programming interfaces (APIs) for backing up Exchange databases while they are running and available to clients. Depending on hardware capabilities, the Windows Volume Shadow Copy Service (VSS) API in Windows Server 2003 is capable of making a backup copy of a database of any size in a backup window of only a few seconds. This API is implemented by several third-party hardware providers and backup applications. The Windows Backup streaming backup API has been available since the first release of Exchange. It is also implemented by many third-party applications. To make Exchange streaming backups available in Windows Backup, all you have to do is install Exchange administration tools on the backup server.
Database Recovery Strategies
If a disaster leaves Exchange database and transaction log files intact on disk, database recovery will be automatic. Exchange will replay necessary transaction logs, bring the databases to a logically consistent state, and then allow you to start them. This works even if you have to move disks to different hardware.
The success of transaction logging as a crash recovery mechanism depends on the database files and the transaction log files surviving a crash with no damage. If a crash causes file corruption or loss of an entire disk, the automatic recovery mechanism can't work. What you should do to recover after that depends on whether you have lost a database drive or a transaction log drive.
Strangely enough, if you have to choose between losing the Exchange transaction logs and losing the database files, lose the databases—if you have a good backup. After the loss of database files, you can restore from backup and roll the database forward with the transaction log files for a zero-loss restoration. But losing transaction log files that haven't been backed up means you lose all changes recorded in them.
In short, if no transaction logs are missing after a disaster and you have a good backup, full recovery is simple. If you are missing transaction logs, you have four recovery choices. First, if the database files are logically inconsistent but otherwise intact, the database can be examined with the Eseutil and Isinteg utilities to detect and correct any transactional inconsistencies. Because all data in the files must be carefully examined, this repair can take several hours for a large database. Repair is typically very successful in this circumstance and may result in zero or very minimal data loss.
However, many Exchange administrators follow a policy of not leaving a repaired database indefinitely in production. After repair, they move all mailboxes or folders to a different database.
Second, you can restore from backup. If none of the transaction log files generated since the backup are available, you'll lose all changes to the database since the backup. Note that it is possible to do partial replay of transaction log files. Suppose only the last several files were lost in the crash. You can still replay changes into the database up to the point of the first damaged or lost transaction log file.
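Partial replay can be pictured as walking the log sequence in order and stopping at the first gap. The sketch below models logs as plain integer generation numbers, which is a simplification made for illustration. It does not read real log files, and the function name is hypothetical.

```python
def replayable_generations(first_needed: int, available: set) -> list:
    """Return the unbroken run of log generations that can be replayed.

    first_needed: the first generation required after the restored backup.
    available: generations that survived the disaster.
    """
    replayable = []
    generation = first_needed
    while generation in available:
        replayable.append(generation)
        generation += 1  # replay must proceed in strict sequence
    return replayable


# Generation 13 was lost, so 14 and 15 cannot be used even though they survive.
print(replayable_generations(10, {10, 11, 12, 14, 15}))
```

Logs after the gap are unusable because each one depends on the changes recorded in the logs before it.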
The third strategy is to restore, repair, and merge. This is a hybrid of the first two strategies with an additional twist—merging the contents of two databases together. You restore a previous database and allow clients access to it. They don't have immediate access to recent data, but do have access to everything from before the last backup. At the same time, you repair the inconsistent copy of the database in a Recovery Storage Group (RSG) or on a separate lab server. Prior to Exchange 2003, it was necessary to build a separate Exchange server to perform parallel repairs. The Recovery Storage Group feature removes this requirement.
There are a number of ways you can merge the contents of two databases or mailboxes together. The most basic method is to open the two mailboxes in Outlook, then drag items from one mailbox and drop them into the other. This is impractical, though, if you need to recover a large number of mailboxes. You can do bulk mailbox merges using either the Exchange Mailbox Merge Wizard (ExMerge.exe) or the Recover Mailbox Data wizard, which was first introduced in Exchange 2003 Service Pack 1 (SP1).
The ExMerge utility, shown in Figure 4, works for any version of Exchange. ExMerge is very powerful, with multiple filtering and selection options. It is useful in many scenarios other than disaster recovery and has been an Exchange administrator favorite for years. You can download ExMerge from go.microsoft.com/fwlink/?linkid=68533.
Figure 4 Using ExMerge to Recover Mailbox Data
The Recover Mailbox Data wizard lets you do bulk mailbox merges right from within Exchange System Manager (see Figure 5). It does not have the advanced filtering and selection capabilities of ExMerge, but administrators typically don't use those features in a disaster recovery situation. There is no setup needed to use this tool, and merging mailboxes is as simple as selecting multiple mailboxes running in an RSG and right-clicking to merge their contents back to the production databases.
Figure 5 Recover Mailbox Data Wizard
There is no downtime required with either tool. Both work while databases are online and both have excellent duplicate-item detection and suppression. The client experience is that lost data suddenly reappears as they are working.
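To make the idea of merging with duplicate suppression concrete, here is a minimal Python sketch. It is not how ExMerge or the Recover Mailbox Data wizard is implemented. The `Item` type, the fields used to decide whether two items are "the same," and the `merge_mailboxes` function are all assumptions made purely for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Item:
    # Hypothetical stand-in for a mailbox item; real items carry many more properties.
    folder: str
    subject: str
    sender: str
    sent_time: str


def merge_mailboxes(production, recovered):
    """Return the production items plus any recovered items not already present."""
    seen = set(production)          # items the user can already see
    merged = list(production)
    for item in recovered:
        if item not in seen:        # suppress duplicates instead of copying them again
            merged.append(item)
            seen.add(item)
    return merged


# Usage: one item exists in both copies, one exists only in the recovered copy.
production = [Item("Inbox", "Order 17", "customer@example.com", "2006-05-01T09:00")]
recovered = [
    Item("Inbox", "Order 17", "customer@example.com", "2006-05-01T09:00"),
    Item("Inbox", "Order 16", "customer@example.com", "2006-04-30T17:30"),
]
merged = merge_mailboxes(production, recovered)
```

Because duplicates are suppressed, running the same merge a second time adds nothing, which is why a merge can safely be repeated if it is interrupted.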
The fourth strategy is to provide send/receive functionality immediately and restore the data later (also known as dialtone recovery). This strategy is essentially the same as the previous hybrid strategy, except that instead of restoring a backup for clients to use, you start over with a new database as soon as possible after a disaster. Clients see an empty mailbox, but they have send and receive capability more quickly than if they had to wait for restoration or repair to complete.
From the administrator point of view, this strategy is the most complex. At some point, it may be necessary to merge the contents of three separate copies of the database: the "dialtone" database, a restored backup, and a repaired copy of the database. This is the best strategy for use when send and receive capability is mission critical, such as for users who handle very time-sensitive customer orders. (Though even for these users, it may be more efficient to set up temporary mailboxes and reroute mail to them.) You might also use this strategy in situations where it will take a very long time to restore or repair previous data, or in single storage group outages. There can be only one RSG on a server, and it can be used for only one production storage group at a time. This means that repair and merge operations will have to be done serially after a full server outage or that separate recovery servers will be needed to handle multiple storage group recoveries in parallel.
The Notorious Error -1018
Experienced Exchange administrators know and dread error -1018, an example of which is shown in Figure 6. In most cases, this error will not cause an immediate outage or other obvious problems, but it is still critical because it warns of the possibility of imminent hardware failure. Error -1018 means that a page in an Exchange database has been damaged at the file system level. An Exchange database is divided into pages, and each page has a checksum on it that validates the integrity of the page. When Exchange reads a page and checksum verification fails, a -1018 error is the result. Error -1018 is recorded in the server's Application log each time Exchange tries to touch the bad page.
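The page-and-checksum mechanism can be sketched in a few lines of Python. This is an illustration of the general technique only: the 4,096-byte page size, the page layout (a 4-byte checksum followed by the data), and the use of CRC-32 are assumptions for the example and do not describe the actual Exchange database format or checksum algorithm.

```python
import zlib

PAGE_SIZE = 4096      # assumed page size for this sketch
CHECKSUM_BYTES = 4    # assumed layout: checksum stored at the start of each page


def make_page(payload: bytes) -> bytes:
    """Build a page: checksum of the body, followed by the zero-padded body."""
    body_size = PAGE_SIZE - CHECKSUM_BYTES
    if len(payload) > body_size:
        raise ValueError("payload does not fit in one page")
    body = payload.ljust(body_size, b"\x00")
    return zlib.crc32(body).to_bytes(CHECKSUM_BYTES, "little") + body


def verify_page(page: bytes) -> bool:
    """Recompute the checksum over the body and compare it with the stored value."""
    stored = int.from_bytes(page[:CHECKSUM_BYTES], "little")
    return zlib.crc32(page[CHECKSUM_BYTES:]) == stored


def scan_database(data: bytes) -> list:
    """Read every page and return the page numbers that fail verification."""
    bad_pages = []
    for offset in range(0, len(data), PAGE_SIZE):
        if not verify_page(data[offset:offset + PAGE_SIZE]):
            bad_pages.append(offset // PAGE_SIZE)
    return bad_pages


# Usage: build a three-page "database", then damage one byte in the middle page.
db = make_page(b"page zero") + make_page(b"page one") + make_page(b"page two")
damaged = bytearray(db)
damaged[PAGE_SIZE + 100] ^= 0x01
bad = scan_database(bytes(damaged))
```

Note that the damage is only noticed when the affected page is actually read and verified; a page that is never read can stay bad without anyone knowing.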
Figure 6 The Dreaded Error -1018
A single appearance of the error may happen on perfectly good hardware—even the best hardware will have a glitch now and then. But if you see it more than once on the same server, it is likely that the hardware is defective or failing.
A great deal of the information in a database is client data that is seldom accessed (such as items in the Deleted or Sent folders). Because checksum errors are only detected when a page needs to be read, a -1018 problem could lurk undetected in a database for a very long time on a page that is seldom retrieved. Online backup is therefore used to ensure timely detection of -1018 problems. During backup, the entire database must be read or copied and so this is a good time to check the whole database.
Exchange will not let you complete an online backup of a database with a bad page in it. This ensures that your last backup is good and allows you to expunge a -1018 error by restoring and rolling forward. You can also repair a database to remove a -1018 page, but this will result in some data loss: at a minimum, you lose the data that was on the bad page.
Because you can't do a new online backup of a database with a -1018 error in it, there is some urgency in resolving such errors. To back up the database before correcting the problem, you'll have to take it offline and back it up as a set of raw files. No transaction log files will be removed from disk until you do another successful backup. This ensures you will be able to roll forward all the way from the last backup, but it also means that if you neglect dealing with the problem you will eventually run out of space on the transaction log drive.
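A quick back-of-the-envelope estimate shows how much time you have before the transaction log drive fills up. The sketch below assumes 5MB log files (the Exchange Server 2003 size); the function name and the free-space and log-generation figures are hypothetical and should be replaced with numbers measured on your own server.

```python
def days_until_log_drive_full(free_mb: float, logs_per_day: float,
                              log_size_mb: float = 5.0) -> float:
    """Estimate the days until the log drive fills if no logs are removed."""
    if logs_per_day <= 0 or log_size_mb <= 0:
        raise ValueError("log rate and log size must be positive")
    return free_mb / (logs_per_day * log_size_mb)


# Hypothetical server: 20,000MB free, generating 800 log files per day.
days_left = days_until_log_drive_full(free_mb=20000, logs_per_day=800)
```

With these example figures the drive would fill in about five days, so the estimate is worth running as soon as backups start failing.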
Moving all mailboxes from a database with a -1018 page error is also an effective strategy and causes minimal downtime. Moving mailboxes will result in the loss of whatever was on the bad page. Restoring and rolling forward is the only strategy that works to recover what was on the page before it was damaged.
As of Exchange Server 2003 SP1, there is a new defense against -1018 errors. A new Error Correcting Code (ECC) checksum has been added to each database page. The previous page checksum reliably detects bad pages but can't correct them. The new ECC checksum lets Exchange fix the page when the problem is a single-bit error (also called a "bit flip"). Bit flips are a common hardware glitch, and historical analysis of -1018 database errors revealed that approximately 40 percent are caused by bit flips.
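To see how a checksum can correct a single flipped bit rather than merely detect it, consider the following Python sketch. It is not the ECC algorithm Exchange uses. It pairs a detection checksum (CRC-32) with a second value, the XOR of the positions of all set bits, which is one simple way to locate a single bit flip. All function names here are hypothetical.

```python
import zlib


def position_xor(data: bytes) -> int:
    """XOR together the 1-based positions of every set bit in data."""
    acc = 0
    for byte_index, byte in enumerate(data):
        for bit in range(8):
            if byte & (1 << bit):
                acc ^= byte_index * 8 + bit + 1
    return acc


def protect(data: bytes):
    """Return (crc, pos_xor), the two values stored alongside the page."""
    return zlib.crc32(data), position_xor(data)


def check_and_repair(data: bytes, crc: int, pos_xor: int):
    """Return (status, data); status is 'ok', 'corrected', or 'uncorrectable'."""
    if zlib.crc32(data) == crc:
        return "ok", data
    # A single flipped bit changes the position XOR by exactly its own position.
    candidate = position_xor(data) ^ pos_xor
    if 1 <= candidate <= len(data) * 8:
        repaired = bytearray(data)
        index = candidate - 1
        repaired[index // 8] ^= 1 << (index % 8)
        # Accept the fix only if the detection checksum now matches.
        if zlib.crc32(bytes(repaired)) == crc:
            return "corrected", bytes(repaired)
    return "uncorrectable", data


# Usage: a 4,096-byte page with one bit flipped, then with two bits flipped.
page = bytes(range(256)) * 16
crc, pos = protect(page)

one_flip = bytearray(page)
one_flip[1000] ^= 0x10
status_one, fixed = check_and_repair(bytes(one_flip), crc, pos)

two_flips = bytearray(one_flip)
two_flips[2000] ^= 0x01
status_two, _ = check_and_repair(bytes(two_flips), crc, pos)
```

The single flip is located and reversed, while the two-bit case is reported as uncorrectable. This mirrors the point in the text: only single-bit damage can be fixed, and larger damage still requires a restore, a repair, or a mailbox move.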
The ECC checksum is a good thing for administrators because it means that 40 percent of the time it's not necessary to restore, repair, or abandon a damaged database. But there is a new danger too: it is now easier to ignore a -1018 error. Just because Exchange can fix up the damage doesn't mean it is less serious. The implications of a -1018 error remain the same, whether or not the error can be fixed. A -1018 indicates the possibility of progressive hardware failure. Ignoring these errors is just as dangerous as before, even if Exchange helps you recover more easily.
Coming Up: Easier Recovery
Many of the data recovery features that are found in Exchange Server 2003 will be carried over and incrementally improved in the next major revision, Exchange Server 2007, and exciting new features are being developed as well. The most significant new features are Cluster Continuous Replication (CCR) and Local Continuous Replication (LCR). CCR and LCR provide application-level replication of mailbox databases in near real time.
In Exchange Server 2003, it's possible to replicate Exchange mailbox databases using third-party solutions. Many such solutions work very well, but they can be expensive and they are all resource and bandwidth intensive. This isn't the fault of the solution vendors: there is a lot going on in an Exchange database, with hundreds or thousands of users constantly updating their mailboxes and thousands of new messages coming in and going out. Vendors who implement file- or disk-based replication for Exchange have to replicate every single change, in correct order, made to both the database and transaction log files. This imposes severe limits on how far away the replicated databases can be located, and it means that you need a very high-speed network between the locations. Replicating an Exchange mailbox database can reduce the number of mailboxes you can put on it by half or more because you often must scale down to allow the replication engine to keep up. There is currently no way for disk- or file-based replication solutions to be efficient about which changes are replicated because Exchange Server 2003 has no application-aware replication mechanism for mailbox databases. (Exchange has always had application-aware replication for public folders.)
In Exchange Server 2007, transaction logging will again be taken advantage of to provide additional functionality, this time for replication. You will be able to copy a full mailbox database once to a different disk or remote location and then update it by subsequently copying only the transaction logs. The transaction logs will be applied to the replica database soon after they've been copied, and you'll thus be able to fail over to the copy with minimal startup delay. This kind of incremental log file replay can't be done in Exchange Server 2003. A number of very difficult technical problems had to be solved to make it possible in Exchange Server 2007.
CCR will be fully integrated with Windows clustering. You'll have two copies of the mailbox database that are closely synchronized. During a controlled failover, you'll be able to completely synchronize the two copies and fail back and forth between them as you wish. In case of a disaster that requires an uncontrolled failover, data loss will be minimal to zero (administrators can define individual tolerances for data loss and failover triggers). CCR will also be able to query SMTP bridgehead servers for messages delivered during the data loss window and can ask to have them redelivered. This message redelivery capability will not be available for LCR.
LCR will be available to every Exchange administrator, from Small Business Server installations with half a dozen users to the largest standalone Exchange server with thousands of mailboxes. LCR failovers must be manually initiated and you must set up replication again after a failover. Still, LCR can get you back in business in seconds or minutes, even after catastrophic loss of both databases and transaction log files on your primary disks.
There are several advantages to using LCR or CCR compared to file- and disk-based replication methods. First, there is a dramatic reduction in the amount of data that must be transmitted across the network. Second, replication is simplified because rapid changes to the open database files no longer need to be replicated, and it is no longer necessary for transaction log and database changes to be perfectly synchronized and write-ordered as they are replicated. Third, LCR and CCR are more tolerant of network glitches and interruptions. Resynchronization is automatic, even after prolonged interruptions in replication.
CCR and LCR replicate transaction log files that have been filled and closed. They will not replicate the current transaction log file. Thus, the data in that last log file is at risk in case of a disaster. To reduce the amount of data at risk, Exchange Server 2007 log files will be reduced from the current 5MB to 1MB.
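The log shipping model described above (seed a copy of the database once, then ship and replay only closed transaction logs, in order) can be sketched as a small in-memory simulation. This is a conceptual illustration, not the Exchange Server 2007 implementation. The `Replica` type, the generation numbering, and the `ship_and_replay` function are hypothetical, and each "log" is modeled simply as a list of operations.

```python
from dataclasses import dataclass, field


@dataclass
class Replica:
    # Hypothetical replica state: the database copy is modeled as a list of applied operations.
    database: list = field(default_factory=list)
    last_applied: int = 0   # generation number of the last log replayed


def ship_and_replay(closed_logs: dict, replica: Replica) -> int:
    """Replay closed logs in strict generation order; return how many were applied.

    closed_logs maps a generation number to the operations in that log.
    The log that is currently open is never part of closed_logs, so whatever
    it contains is the data at risk if the primary is lost.
    """
    applied = 0
    while replica.last_applied + 1 in closed_logs:
        generation = replica.last_applied + 1
        replica.database.extend(closed_logs[generation])
        replica.last_applied = generation
        applied += 1
    return applied


# Usage: the replica is seeded once, then catches up from closed logs only.
replica = Replica(database=["seed"])
closed = {1: ["a"], 2: ["b"], 4: ["d"]}        # log 3 has not arrived yet
first_pass = ship_and_replay(closed, replica)   # applies logs 1 and 2, stops at the gap

closed[3] = ["c"]                               # the missing log arrives later
second_pass = ship_and_replay(closed, replica)  # resumes and applies logs 3 and 4
```

Replay stops at a missing generation and resumes when it arrives, so an interruption delays the replica without corrupting it.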
Neither LCR nor CCR replaces backups. You still need backups for point-in-time snapshots of data, for legal and archival purposes, and as a fallback strategy in case a large-scope disaster destroys the LCR or CCR database copies. But LCR and CCR make it much less likely that you'll ever have to restore a backup to recover from a disaster. They enable dramatically quicker data recovery in the most common disaster scenarios and will let you select backup cycles and strategies that focus on archival features instead of being driven by the need for fast restoration.
A nice side effect of using LCR or CCR is that you can do your backups from the replica instead of from the primary copy of the database. While you have always been able to back up Exchange while it is online and serving users, Exchange administrators know that performing backups during peak hours has a significant impact on performance. By backing up from the replica, you remove nearly all of the I/O impact on the primary copy. As another bonus, you no longer have to balance your online database maintenance window with your backup window. Backup from the replica will require using the VSS API, not the legacy streaming backup API. You won't be able to do it with NTBackup; you will need a third-party VSS solution.
This article has been a whirlwind tour through multiple strategies for making your Exchange deployment more reliable and available. The goal has been to give you an overview of how the system works and what to think about before things go horribly wrong.
Hopefully, you have never experienced an Exchange-related disaster. Perhaps things have been humming along for years, and so disaster planning for Exchange has always been a low priority. That's fine: doing good backups and calling Microsoft when things go wrong is a respectable default strategy for many organizations.
But if e-mail is mission critical, you have work to do. Because Exchange offers such a wide variety of recovery mechanisms, you need to understand in advance how to choose the right one and how to make all the parts work together smoothly.
Database recovery is the most critical "last mile" of disaster recovery. If you have preserved your databases, you can lose everything else and still recover completely. An entire Exchange organization can be rebuilt from scratch, if necessary. But end users will be after you with pitchforks if you lose their irreplaceable mailbox contents. So, if you remember only one thing, it should be: back up and protect your Exchange databases. If you do that right, you can recover from almost any other kind of disaster.
Michael Lee has worked on Exchange Server at Microsoft for eight years, both in technical support and, for the last few years, in the Exchange product group. He is currently a Program Manager with the Customer Experience Team.
© 2008 Microsoft Corporation and CMP Media, LLC. All rights reserved; reproduction in part or in whole without permission is prohibited.