Recovering a Data Center with Database Backups
Despite the best-laid plans, you cannot prevent every disaster that might befall a data center. You must therefore ensure that you can recover the data center from backups after a catastrophic failure, such as a user error or a hardware-induced database corruption. This chapter discusses using database backups to ensure a complete recovery of the data center in the event of a disaster. It focuses on techniques to optimize the Microsoft SQL Server 2000 recovery process and reduce the total recovery time to meet data center availability goals. A basic understanding of SQL Server backup and recovery procedures is assumed. For more information about SQL Server backup and recovery procedures, see "Backing Up and Restoring Databases" on the MSDN Web site at http://msdn2.microsoft.com/library/aa196685.aspx.
This chapter also discusses third-party solutions that extend the SQL Server backup and recovery technology to provide additional performance and functionality to meet the availability requirements of large databases.
Organizations that experience total data loss and do not have an adequate disaster-recovery strategy experience severe disruption to their business. An organizations ability to recover quickly from an outage or a disaster — from a component failure to the complete destruction of a site — directly contributes to its survival. Although other strategies should be implemented to minimize the need for recovery from backup, recovery from backup is the only solution to some types of failures and errors.
On This Page
Solving High-Availability Barriers Using Backups
Recovery from database backups resolves data-center availability problems caused by storage subsystem failures and application or user errors. Some storage subsystem failures can also be resolved by using component redundancy, as discussed in Chapter 4, "Preventing Downtime by Using Redundant Components." Application or user errors can frequently be resolved only by using backups. In addition, catastrophic disasters, such as fires or floods, may require restoring from backup. Although server redundancy strategies can also be employed to minimize the need for recovery from backup when these types of disasters occur, do not overlook the importance of a good backup and recovery plan. When all other mitigation efforts fail, recovery from backup is the only solution.
Resolving data-center availability barriers caused by storage subsystem failures and application or user error are discussed later in this chapter.
Storage Subsystem Failure
A storage subsystem failure can result in data being unreadable or damaged. Using backups is the simplest and least expensive way to recover from unreadable or damaged data. The most common database problems caused by storage subsystem failures are inability to access data because of a complete hard-disk failure and torn pages caused by controller failure or a partial hard-disk failure.
Hard-disk failure — A hard-disk failure is generally reported to the operating system as read or write errors. In the absence of redundant drives or servers, media failure causes the data center to become partially or completely unavailable. Recovery from this type of error, in the absence of redundancy, requires you to replace the failed drive. If the failed drive contains a data file, you must recover it. If the failed drive contains a transaction log file, you may have lost the most recent transactions. If these transactions are lost, users have to manually resubmit them.
Torn Pages — A torn page is an incomplete write of a page caused by the operating system's inability to write to all of the disk sectors comprising a page. A page with incomplete data is a corrupted page. Torn pages are caused by disk-drive or disk-controller failures. A page is 8 KB and must always be read or written as a unit by the operating system on behalf of SQL Server. If an incomplete write occurs, SQL Server 2000 detects this as a torn page.
If you experience torn pages, you must replace the device causing the error and repair the damage. You can repair the damage directly, recover data from backup, or fail over to an alternative server with a copy of the database that does not have the error. If you choose to repair the damaged data directly, run DBCC CHECKDB using the physical_only option to determine the extent of the damage. If the damage is to only one or more indexes, you can rebuild the indexes rather than restoring them. This reduces the length of the unavailability. If the damage affects data pages, you need to recover from backup. When restoring from backup, identify the data files containing errors and recover only those data files containing errors. It is faster to recover one or more data files in a database than to recover the entire database.
Note: When recovering from torn pages, back up the tail of the transaction log before you begin recovery. This ensures that you can recover all committed transactions.
Although you can use redundant components and redundant servers to reduce or eliminate downtime caused by storage-system failures, sometimes database backups are the only way to recover from this barrier to availability.
Application or User Errors
Application and user errors can result in dropped tables, deleted rows, and incorrect updates. The error type, extent of the damage, and amount of work done before the error was detected determine the difficulty and type of repair. Repairing application or user errors manually and merging corrected data into the production database is time consuming and prone to errors.
When serious errors occur or when damage is extensive, restoring the database to a point in time before the error occurred is frequently the only solution. The SQL Server 2000 recovery tools perform recovery to any specific point in time if the database is using the full recovery model. If the database is using the bulk-logged recovery model, you can recover only to the end of a transaction log backup. If you are using the bulk-logged recovery model, performing frequent transaction log backups enables you to recover to a point that is close to the time the error occurred.
The primary disadvantage of recovering a database to an earlier point in time is that all work done since the error occurred is overwritten and therefore must be redone (or accepted as lost). The second big disadvantage is that the database will be unavailable for the duration of the recovery. When recovering to an earlier point, you must recover the entire database to that point. With a large database, recovering the entire database to an earlier point can take a long time unless you are using one of the third-party backup solutions discussed later in this chapter.
To avoid the problems associated with recovering the entire database to an earlier point, you can recover the entire database or just the affected data files to an alternative server, specifying an earlier point for the recovery. This leaves the production database available and preserves the work performed since the error occurred. In addition, restoring just the file or filegroups affected by the error is much faster than restoring an entire database. On the alternative server, you can then manually extract the deleted or damaged data or tables from the new database and merge or insert it into the production database. Merging and inserting the damaged data into the production database without causing further damage to the production database is time consuming and risky. Often this type of repair is undertaken only in cases in which it is imperative that no committed work be lost or in which the time it takes to recover to an earlier point is too long.
Performing frequent full and differential database and transaction log backups helps reduce the time required to recover from application or user error. Redundant components and redundant servers do little to protect against this type of barrier to high availability. Adequate application testing and user training is the best defense against application and user errors.
Microsoft works closely with the following vendors, which offer tools that facilitate recovery from application or user errors:
These third-party tools include log analyzers that provide detailed information on the history of a database. Such tools can be used to locate the problem and identify a point in time for recovery. In addition, these tools can generate SQL queries that can undo an error. Other third-party tools can extract information directly from backups, reducing the amount of data that must be restored and merged with the current database to repair the problem.
Developing a Backup-and-Recovery Solution
Understanding your availability goals and the types of problems resolved by database restorations helps you develop a backup-and-recovery solution for your data center. After you have determined the acceptable length of downtime from each of the barriers that you intend to resolve by using database backups, use the following steps to develop a backup-and-recovery solution:
Determine whether you can meet these availability requirement by using the SQL Server backup-and-recovery tools. Measure the time required to recover from backup by measuring the total time to completely or partially recover a database from backup. If you cannot reduce the total recovery time to meet the availability requirement using the SQL Server 2000 recovery tools, you must either implement one of the third-party solutions discussed later in this chapter or revise the availability requirement.
After you have determined the technology solution, develop a database backup regimen consisting of full, differential, and transaction log backups that enables you to recover the database in the allotted recovery time.
Develop recovery procedures for each type of barrier. Include these procedures in the run book and in the associated scripts in your script library.
Your backup-and-recovery solution must be able to recover a damaged database in the data center in the allotted recovery time. To determine whether the SQL Server 2000 recovery tools can meet the availability requirement, use an identical test environment to measure the time required to recover the data center from database backups. Test combinations of full, differential, and transaction log backups. The longer the time between each full database backup, the longer the total recovery time if a database fails just before the next full database backup is scheduled to occur. Determine the best use of differential database backups between full database backups to reduce the total recovery time.
Measuring Total Recovery Time
Total recovery time is the time required to completely restore a database from backup, not just the time required to restore a single backup file. There are three methods you can use to obtain the total throughput for the recovery process:
Use the message displayed by the SQL Server recovery process. The recovery process displays a message indicating the total number of pages transferred and the time in seconds for the transfer. Knowing that a page is 8 KB, you can calculate an average data rate. For small databases, the overhead involved in starting the process can account for a significant amount of the total time. This overhead can be ignored as insignificant for large databases, however.
Use the length of the data-transfer phase by using messages recorded in the SQL Server error log. Using the times reported by these messages and the number of pages transferred, you can compute the average data rate excluding overhead.
Use the backup throughput performance counter in System Monitor. This counter gives the instantaneous throughput while the recovery process is running.
Use this total recovery time calculation to optimize your backup schedules to achieve the best possible total recovery time. If this is still not sufficient to meet your availability requirements, use a third-party solution or revise your availability requirement.
Improving Total Recovery Time
You can take a number of steps to improve the recovery time by using the SQL Server 2000 recovery tools. These steps include improving the write performance of the data disk, striping the backup devices, backing up to disk rather than to tape, and adjusting the frequency and type of database backups.
During recovery from backup, SQL Server writes data from the backup device to the disks containing the database data files. To improve the performance of the data disks, follow these guidelines:
Use multiple data files — To increase recovery write performance, create multiple data files for the database and spread these data files evenly across multiple matched disks. SQL Server automatically writes to multiple data files in parallel for increased performance. It is approximately three times faster to recover to three data files spread across three disks than to recovery to a single large data file on one disk.
Use a disk array — To improve recovery write performance even further, place the data files on a disk array configured for either striped mirroring or striping with parity. Striped mirroring provides the best performance for database recovery. Striping with parity provides better performance than no array, but it has significantly slower write performance than striped mirroring. For more information on disk arrays, see Chapter 4, "Preventing Downtime by Using Redundant Components."
Use one of the above methods to increase the performance of the data disks and to reduce the total recovery time.
During recovery from backup, SQL Server reads data from the backup device and streams the data to SQL Server, which then places it on the data disks. To improve the performance of the backup device, follow these guidelines:
Use a disk backup device—To increase recovery read performance, recover from a disk backup device rather than a tape backup device. Recovering from backup files on a disk array is faster than recovering from a backup file on a single disk. Keeping the most recent database backups on a disk backup device can drastically reduce the total recovery time. For fault tolerance, be sure backups to disk are also archived to tape.
Use multiple backup devices — Another way to improve recovery read performance is to stripe the backup file across multiple backup devices. SQL Server automatically reads from multiple backup devices in parallel for increased performance.
Use separate SCSI controllers — To improve read throughput, do not share SCSI controllers between data disks and backup tape devices. This degrades recovery performance and can stop the recovery process entirely. To maximize read throughput, use a dedicated SCSI controller for each tape device.
Enable hardware compression — To maximize read throughput, enable hardware compression on tape drives.
Use above methods to increase the performance of the backup device and to reduce the total recovery time.
Adjusting the Frequency and Type of Database Backup
To recover a database to the time of failure or to a specific point in time, you must restore the most recent full database backup, restore the most recent differential database backup, restore all transaction log file backups that are more recent than the last full or differential database backup, and manually initiate recovery. The time required to fully recover a database depends on the number and size of these backup files. To decrease the number and total size of these files and improve the total recovery time, follow these guidelines:
Perform more frequent full database backups — Perform a full database backup as often as is practical to reduce the amount of data that must be recovered in addition to the most recent full-database backup. The practical frequency of a full database backup depends on the time required to perform a full backup and the impact of performing it on the data center. The longer the time between each full database backup, the more data has to be recovered from differential and transaction log backup files if the failure occurs just before the next full database backup. The more data that has to be recovered from differential and transaction log backup files, the longer the total recovery time.
Perform differential database backups — Perform one or more differential database backups between each full database backup to reduce the number of transaction log backups that must be recovered to restore the database to the selected point in time. You need to recover only the transaction log backups that are more recent than the differential backup that is restored. The time required to perform a differential database backup is proportional to the number of extents modified since the most recent full database backup rather than the size of the actual database. As a result, performing a differential database backup on a large database can be much faster than performing a full database backup. The less frequently you perform full database backups, the more useful it is to perform frequent differential database backups.
Perform frequent transaction log backups — Perform frequent transaction log backups to make it easier to recover to a specific point in time. If you use the bulk-logged recovery model for the database, you can recover only to the end of a transaction log file. More frequent transaction log backups give you more points in time to choose from. If you use the full recovery model for the database, you can recover to any point in time, but you can recover only to a read-only mode to the end of a transaction log file. Recovering to a read-only mode helps you determine when an error occurred.
Perform file or filegroup backups — To assist in recovering rapidly to an alternative server, to resolve an application or a user error, or to recover to the time of a failure resulting from a disk failure, perform file or filegroup backups in addition to full database backups. File or filegroup backups enable you to more quickly recover a single data file or a filegroup to the production system or to an alternative system. Although you can recover a file or filegroup from within a full database backup, this takes longer than recovering from a file or filegroup backup because the recovery process must find the appropriate backup from within the full backup file. Performing file or filegroup backups provides the most performance benefit when used with medium and large databases. With large databases, also use differential file or filegroup backups between each file or filegroup backup. For more information about the benefits of differential backups, see "Perform Differential Database Backups" earlier in this chapter.
Use the above methods to decrease the number and total size of the backup files required to recover a database and reduce the total recovery time.
Using a Third-Party Backup-and-Recovery Solution
If you need additional performance or functionality beyond that provided by SQL Server 2000, consider implementing a third-party backup-and-recovery solution. Third-party solutions include support for extremely quick backups (snapshots) and platform backups, heterogeneous operating system backups, tape silos, tape RAID, and remote tape backups.
Using Snapshots to Improve the Recovery Time
In addition to the conventional backup technology used by the SQL Server 2000 recovery tools, SQL Server 2000 provides a high-performance backup interface called the Virtual Device Interface for Backup (VDI), for performing high-speed backups (snapshots) using third-party hardware and software. VDI is based on a shared memory design, which allows data to be transferred extremely rapidly with little overhead between SQL Server and a third-party snapshot solution. Third-party vendors offer two types of snapshot technologies, split mirror and copy-on-write, which accept the VDI data stream to back up a database at a very high speed. During restore, SQL Server receives the VDI stream from the snapshot to recover the database at a very high speed.
You can combine snapshot backups with differential, file differential, and transaction log backups to recover a database to the point of failure or to a specific point in time. Snapshots and conventional backups are recorded in msdb, so you can use SQL Server Enterprise Manager to identify the backups required for a particular restore sequence.
There are many technology variations, and the technologies are evolving. Consult with your storage and backup vendors to determine the exact features and benefits of their solutions. All vendors offer a variation of either split mirror snapshots or copy-on-write snapshots. The general features and functionality of these two types of snapshots are discussed below.
Split-mirror snapshots are based on a striped-mirror storage solution. A split-mirror snapshot is an exact copy of a database at the time the mirror is split. This split-mirror snapshot is referred to differently by different vendors. The most common names are clone or business continuance volume (BCV). The backup of the database is a split-mirror snapshot plus a small amount of descriptive data created by SQL Server describing the snapshot.
You can use split-mirror snapshots to recover from application and user errors. You can rapidly recover the production database to a specific point in time or recover the snapshot to an alternative server to resolve application or user errors. You generally do not use split-mirror snapshots to recover from storage subsystem failures because mirroring itself, along with duplexed controllers, generally protects the database from hard-disk and controller failures. You can also use split-mirror snapshots to initialize a standby server for log shipping, a subscriber for transactional replication, or a test server for the QA environment.
Split mirror snapshots work as follows:
Splitting the mirror — You perform a backup by splitting one drive from the mirrored array of drives. The split is a very fast operation. SQL Server suspends writes briefly to prevent torn pages; this is the only impact on the availability of the database. Typical split times are measured in seconds.
Note: Use a three-way mirror so that the production volumes remain fault-tolerant after the mirror is split.
Remirroring — In preparation for another backup, the clone volume must be rejoined to the mirror of the production volume. During this remirroring operation, the vendor software reconciles the clone volume with the data on the mirror. This reconciliation process occurs in the background, and the database remains fully available for updates. Some vendors copy only changed data from the production volume to the clone volume, while other vendors copy the entire contents of the production volume to the clone volume.
Recovering — During recovery, the clone volume is the master and the SQL Server reconciles the production volume with the clone volume. As part of the recovery, the vendor software supplies SQL Server with the description of the backup being restored. With many vendors, the data is presented to SQL Server immediately, and the reconciliation occurs in the background. In this case, the restoration occurs in seconds, and the database can be recovered and made available immediately. Users can actually use the data on the clone volume while the reconciliation occurs. To recover to a time more recent than the time of the snapshot, the database can be restored without recovery, and then you can apply differential, file differential, and/or transaction log backups before you initiate recovery.
Split-mirror snapshots eliminate the single most time-consuming phase of the recovery process — the restoration of a full database backup. This is the single most important benefit of split-mirror snapshots. Recovery to a different server can also be accomplished very quickly, providing a fast way to create a copy of the database for maintenance and reporting tasks.
The major disadvantage of this technology is its cost. Because split-mirror snapshots are based on mirroring, the number of separate physical disks required increases the cost of this solution. Other disadvantages include the following:
Volume-based backup — Split mirrors work at the volume level. You must back up or restore an entire volume at one time. This means that you cannot store more than one database on a volume, or you will not be able to back up or restore the databases independently. Restoring one database corrupts the mirror of the other database.
Time to remirror limits frequency of backup — Remirroring can take a significant amount of time for large databases. Although the database is available to users while the reconciliation occurs in the background, you cannot perform another backup until the reconciliation is complete. In addition, the length of time between backups increases the amount of data that must be rolled forward using transaction log backups after a failure. Transaction log backups must be applied to a snapshot to recover the database to a time more recent than the time of the snapshot.
Performance during remirror and recovery — There is some performance degradation during remirroring and recovery. This degradation is typically small.
Maintaining a backup during remirror — Remirroring destroys the snapshot, eliminating your backup. To solve this problem, you can:
Add a fourth mirror and create two clones. In this case, to preserve the most recent backup, always remirror the oldest clone when preparing to make another backup.
Copy the snapshot to tape or to other media. If you choose this method, consider the additional restore time required to get the copy of the clone back on disk if it is needed in an emergency.
No reusability of a backup after restoration — Recovering a split-mirror backup converts that backup into a database. The backup no longer exists. Consider multiple clones or copying a clone to tape.
Microsoft works closely with the following storage vendors that offer split-mirror backup technology:
Hitachi Data Systems
These storage vendors work with a number of third-party backup software products, such as CommVault Systems and Veritas, to provide support for split-mirror snapshots.
A copy-on-write snapshot is a copy of all the original disk blocks in a database that changed since the snapshot was created. Copy-on-write snapshots are implemented entirely in software and are available on all classes of storage. The copy-on-write snapshot and the production database share the majority of their disk blocks. They differ only in those disk blocks that have been modified since the snapshot was created. The older the snapshot is, the more likely it is that more blocks will change. An older snapshot also requires a larger space to maintain the snapshot. Ultimately, the snapshot can occupy as much storage as the original if a new snapshot is not created. Some vendors support multiple snapshots of the same volumes at different times.
You can use copy-on-write snapshots to recover from application and user errors. While copy-on-write snapshots are not as fast as split mirror snapshots, recovery using copy-on-write snapshots is significantly faster than recovery using conventional backups if the snapshot is recovered in place. Copy-on-write snapshots provide no protection against storage subsystem failures. To protect the data center against media failure, use a disk array (mirror, striping with parity, or striped mirror) or back up the copy-on-write snapshot to a disk or to a tape backup device.
Copy-on-write snapshots work as follows:
Creating the snapshot — To create the snapshot, the snapshot software sets up an empty catalog of changed blocks, which takes little time. Initially, the snapshot shares all disk blocks with the production database.
Maintaining the snapshot — When a write to the production data occurs, the change catalog is checked to determine whether the disk block that SQL Server is writing has changed since the snapshot was created. This read of the change catalog on every write slightly degrades write performance. The action taken by SQL Server and the performance impact depends on whether the disk block was previously changed.
If the disk block SQL Server is writing to has not been changed since the snapshot was created, the original disk block is copied from the production disk and saved to the snapshot. SQL Server then writes the change to the production disk. Copying this original block from the production disk to the snapshot seriously degrades write performance.
If the disk block that SQL Server is writing to has changed, SQL Server simply writes the change to the production disk because the original block has already been written to the snapshot.
Backing up the snapshot — When copying the snapshot to a backup device, the backup process reads the change catalog to determine where to find each block of data in the snapshot. This is a relatively slow process. Data blocks that have not changed are read from the production disk, and data blocks that have changed are read from the snapshot. When you back up the snapshot, ensure that a description of the snapshot is included in the backup. Whether this occurs automatically depends on the vendor's implementation of this technology.
Recovering — Recovering the production database directly from the snapshot is the fastest restoration method. The original saved blocks are copied back to the production database, restoring it to its state at the time of the snapshot. The time to accomplish the restoration is proportional to the number of blocks that were modified since the snapshot was created. To recover to a time more recent than the time of the snapshot, the database can be restored without recovery, and then you can apply differential, file differential, and/or transaction log backups before you initiate recovery. Restoring the snapshot from a backup of the snapshot to the same or to an alternative server is identical to restoring a database using conventional SQL Server 2000 recovery tools.
The major advantage of copy-on-write snapshots over split-mirror snapshots is cost. Copy-on-write snapshots are less expensive because they require no special hardware and relatively little hard-disk space. The majority of the space in a copy-on-write snapshot is shared with the production database. In addition, with some vendors, you can create and maintain multiple backups simultaneously.
The major disadvantage of copy-on-write snapshots compared to split-mirror snapshots is the time required for recovery. Restoring a copy-on-write snapshot takes longer than a split-mirror snapshot, although it is faster than a conventional backup. Another disadvantage of copy-on-write snapshots is the performance degradation on disk writes. Copy-on write snapshots also provide no protection against media failure.
Using Third-Party Solutions to Increase Functionality
Third-party backup solutions also offer additional backup functionality, including support for the following:
Full platform backup — Although SQL Server backup supports only SQL Server databases, third-party solutions provide the ability to back up the entire operating system when the SQL Server databases are backed up.
Tape silos — Third-party backup solutions support media changers, which automatically change tapes as needed.
Backup device sharing — Third-party backup solutions support remote tape devices and shared central libraries. Some third-party solutions support backup streams from many nodes to be multiplexed onto a single backup device.
Rich media management and backup catalog — Third-party backup solutions offer sophisticated media management and tracking, including media life cycle management.
Tape RAID—Some third-party party backup solutions support tape RAID for additional reliability.
The additional functionality provided by third-party solutions increases availability of the entire platform and the availability of the data center.
For more information, see:
"Backup and Restore Solution" on the Microsoft Web site at http://www.microsoft.com/technet/prodtechnol/windows2000serv/maintain/backuprest/default.mspx.
"Windows Server System Reference Architecture" on the Microsoft Web site at http://www.microsoft.com/technet/itsolutions/wssra/default.mspx.