Corruption Recovery Tips, Database Shrinkage Advice and More
Paul S. Randal
Question: My backup strategy is a daily full backup at 1 a.m. and a log backup every hour. A DBCC CHECKDB also runs every day at 4 a.m. If I get to work at 8 a.m. and discover that the nighttime consistency checks found extensive corruption, how can I recover without losing a lot of data?
Answer: That depends on when the corruption occurred, what is corrupted, and what backups you have. Ideally, you’d have a disaster recovery plan worked out in advance that would guide you through your next steps, but I’ll assume this is a hypothetical question.
In the scenario you describe, the corruption is discovered by the DBCC CHECKDB that runs after the full database backup, but there’s no easy way to tell whether the corruption occurred before the database backup or after it. If the corruption occurred sometime before the database backup, then the backup contains a corrupt version of the database and recovery will be more complicated.
The first thing I’d do is restore the most recent database backup to a different location and run DBCC CHECKDB against it. If no corruption is found, you should be able to restore without losing any data using the following procedure:
- Take a tail-of-the-log backup of the corrupt database (to capture the most recent transactions).
- Restore the most recent full database backup, specifying WITH NORECOVERY.
- Restore all transaction log backups since the full database backup and the tail-of-the-log backup, in sequence and using WITH NORECOVERY.
- Finish the restore sequence using the command RESTORE databasename WITH RECOVERY.
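The steps above can be sketched in T-SQL as follows. The database name and backup file paths are hypothetical placeholders; substitute your own.

```sql
-- 1. Capture the tail of the log from the damaged database.
BACKUP LOG MyDB TO DISK = N'D:\Backups\MyDB_tail.trn' WITH NO_TRUNCATE, INIT;

-- 2. Restore the most recent full backup, leaving the database in the RESTORING state.
RESTORE DATABASE MyDB FROM DISK = N'D:\Backups\MyDB_full.bak'
    WITH NORECOVERY, REPLACE;

-- 3. Restore each log backup in sequence, ending with the tail-of-the-log backup.
RESTORE LOG MyDB FROM DISK = N'D:\Backups\MyDB_log_0200.trn' WITH NORECOVERY;
RESTORE LOG MyDB FROM DISK = N'D:\Backups\MyDB_log_0300.trn' WITH NORECOVERY;
-- ...remaining hourly log backups in order...
RESTORE LOG MyDB FROM DISK = N'D:\Backups\MyDB_tail.trn' WITH NORECOVERY;

-- 4. Bring the database online.
RESTORE DATABASE MyDB WITH RECOVERY;
```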
Finally, you should run another DBCC CHECKDB to verify that no corruption exists, perform some root-cause analysis to figure out what caused the corruption in the first place, and then take steps to correct the problem.
In the event that corruption does exist after following the restore sequence above, something may be corrupt in one of the transaction log backups, or possibly something was corrupted in memory and then included in the transaction log. In that case, you may need to perform a point-in-time restore to find the time at which the corruption occurred, and stop the restore just before that. The procedure is beyond the scope of this column but it is covered in detail in Books Online.
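For reference, a point-in-time restore uses the STOPAT option during the log-restore phase. A minimal sketch, with a hypothetical database name, file path, and timestamp:

```sql
-- Stop the restore just before the suspected corruption time.
RESTORE LOG MyDB FROM DISK = N'D:\Backups\MyDB_log_0300.trn'
    WITH NORECOVERY, STOPAT = '2009-06-15 02:45:00';

-- Bring the database online at that point in time.
RESTORE DATABASE MyDB WITH RECOVERY;
```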
If the most recent database backup does contain the corruption, you may need to follow the above procedure but starting with the next most recent full database backup. This assumes that you still have it available, and all the intervening log backups too.
An alternative strategy (that you may be forced to use to satisfy a maximum-allowable downtime—or RTO—constraint) may be to delete the corrupt data, either manually or using the repair functionality in DBCC CHECKDB, and then try to recover some of it from an older set of backups.
Recovering from corruption can sometimes be very easy and sometimes very challenging, depending on what went wrong and what options you have available. I’ll be covering this in a couple of articles over the next few months.
Question: Our development team is going to build a solution that involves the change-tracking feature of SQL Server 2008. From reading the documentation, it seems we should enable snapshot isolation on the database involved—but I’m concerned about the performance impact. Can you comment on this?
Answer: I discussed change tracking back in the November 2008 issue (“Tracking Changes in Your Enterprise Database”), but yes, you need to enable row versioning. This is because the mechanism to retrieve changed data is generally as follows:
- Query the change-tracking system to find which table rows have changed.
- Query the table itself to retrieve the changed rows.
Without any row-versioning mechanism, the first query may return invalid results if the change-tracking cleanup task runs while the query is executing. And the second query may fail if some table rows referenced in the results of the first query are deleted before the second query executes.
One way to provide this stabilization is to lock the change-tracking data and the necessary user tables, but that leads to poor concurrency (through blocking) and reduced workload throughput. The other way is to use snapshot isolation.
There are two kinds of snapshot isolation—one that provides consistency at the transaction level (database option: allow_snapshot_isolation) and the other at the T-SQL statement level (database option: read_committed_snapshot). The transaction-level option is the one required for using change tracking correctly, and this is simply called snapshot isolation.
Snapshot isolation maintains versions of table records so that if, for example, an explicit transaction starts, the transaction is guaranteed to see a consistent, point-in-time view of the database, as of the point at which the transaction began. To use change tracking, the two queries above should be wrapped inside a single explicit transaction and the isolation level should be set to snapshot; this combination will guarantee consistency.
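As an illustration, here is a minimal sketch of that pattern. The database name, table name, and column names are hypothetical, and the retention settings are just examples:

```sql
-- One-time setup: enable snapshot isolation and change tracking.
ALTER DATABASE MyDB SET ALLOW_SNAPSHOT_ISOLATION ON;
ALTER DATABASE MyDB SET CHANGE_TRACKING = ON
    (CHANGE_RETENTION = 2 DAYS, AUTO_CLEANUP = ON);
ALTER TABLE dbo.MyTable ENABLE CHANGE_TRACKING;
GO

-- Retrieval: wrap both queries in a single snapshot transaction.
DECLARE @last_sync_version bigint;
SET @last_sync_version =
    CHANGE_TRACKING_MIN_VALID_VERSION(OBJECT_ID(N'dbo.MyTable'));

SET TRANSACTION ISOLATION LEVEL SNAPSHOT;
BEGIN TRANSACTION;
    -- Query 1: which rows have changed since the last sync?
    SELECT CT.MyKey, CT.SYS_CHANGE_OPERATION
    FROM CHANGETABLE(CHANGES dbo.MyTable, @last_sync_version) AS CT;

    -- Query 2: fetch the current data for those rows,
    -- e.g. by joining CHANGETABLE back to dbo.MyTable on MyKey.
COMMIT TRANSACTION;
```

Because both queries run inside one snapshot transaction, they see the same point-in-time view of the database, so the cleanup task or concurrent deletes cannot invalidate the results between the two steps.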
Snapshot isolation is explained in great detail in my wife Kimberly’s white paper, “SQL Server 2005 Snapshot Isolation.”
There are two possible performance problems with snapshot isolation. The first is that all updates to all tables in the database must generate versions of records as they change them—even if the version is never used. The pre-change version of the record must be copied into the version store, and the new record has a link to the older one—just in case another transaction begins before this completes and needs the correct version of the record. This adds some processing overhead to all update operations.
The version store is inside the tempdb database, and this is the second possible performance problem. Tempdb can be the busiest database on some SQL Server instances as it is shared among all connections and databases. In general, tempdb can become a performance bottleneck—even without row versioning. Adding row versioning means putting even more pressure on tempdb—in terms of space used and I/O operations—which can lead to general workload throughput degradation.
You can read a lot more detail about this in the white paper, “Working with Tempdb in SQL Server 2005.” Although both white papers mentioned here were written for SQL Server 2005, they apply equally to SQL Server 2008.
Question: Is DBCC CHECKDB a completely comprehensive check of everything in the database? Someone told me it isn’t. Also, can repair fix everything? Again, I’ve been told it can’t. Is there something else I can do if DBCC CHECKDB isn’t comprehensive?
Answer: The answer is yes and no! DBCC CHECKDB is an extremely comprehensive set of consistency checks, and the set of checks it performs has grown from version to version. You are right, however, that there are a few things it does not validate.
Very simply, here is what it does:
- Check the system catalogs for consistency
- Check the allocation metadata for consistency
- Check all user tables for consistency
A deeper description of the checks that are run is beyond the scope of this answer (you can find more info on my blog or in the recent SQL Server 2008 Internals book), but every database page that is in use is at least read into memory and validated. This will catch common corruptions caused by faults in the I/O subsystem (roughly 99.99% of all corruptions are caused this way).
The two most well-known items not checked in any version of SQL Server are the contents of column and index key statistics stored in the database (although this may be added in a future release) and the validity of constraints (for instance, foreign-key constraints between tables). Constraint validity can be checked separately using the DBCC CHECKCONSTRAINTS command. In fact, if you are forced to run a repair operation on a database that contains constraints, it's a good idea to validate the constraints afterward, because repairs do not take constraints into account and may inadvertently invalidate them. This is documented in Books Online.
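A short sketch of running that constraint validation after a repair; the table name is a hypothetical example:

```sql
-- Validate every enabled (and disabled) constraint in the current database.
DBCC CHECKCONSTRAINTS WITH ALL_CONSTRAINTS;

-- Or validate the constraints on a single table.
DBCC CHECKCONSTRAINTS (N'dbo.MyTable');
```

Any rows that violate a constraint are returned in the result set, so an empty result means all checked constraints still hold.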
The repair system cannot fix everything. There are some corruptions that are impossible to fix with a guarantee of correctness within a reasonable time. The list of such corruptions is very small and is documented in my blog post “CHECKDB From Every Angle: Can CHECKDB Repair Everything?” For example, consider a corrupt page in a system catalog—the only possible repair may be to delete the page. But what if the page is storing the metadata for some user tables in the database? Deleting that page would effectively delete those user tables—so that repair cannot be performed.
Most repairs involve data loss in some way (as this is the only way to guarantee correctness in a reasonable time), so repair should only be viewed as a last resort when performing disaster recovery. Using the backups from a comprehensive backup strategy is the only way to avoid data loss (unless some form of synchronous copy of the database is maintained).
DBCC CHECKDB is comprehensive enough to detect damaging corruptions, and it should be run periodically as part of a database maintenance strategy (see my blog post “Importance of Running Regular Consistency Checks”) to ensure that corruptions are found as soon as possible. There is nothing that does a more comprehensive job, but you can increase the effectiveness of DBCC CHECKDB by making sure you have page checksums enabled on all databases. This allows SQL Server to detect when something has changed a database page outside of SQL Server’s memory.
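Enabling page checksums is a one-line database option; the database name below is a placeholder. Note that CHECKSUM is already the default page-verify setting for databases created on SQL Server 2005 and later, but databases upgraded from earlier versions may still be using TORN_PAGE_DETECTION or NONE:

```sql
-- Enable page checksums on a database.
ALTER DATABASE MyDB SET PAGE_VERIFY CHECKSUM;

-- Check the current page-verify setting for all databases.
SELECT name, page_verify_option_desc FROM sys.databases;
```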
Question: I’m confused about shrinking. In one article I read that shrinking data files is good, and in other places I’ve read that it’s bad. The same thing happens when I try to find information about shrinking log files. What’s the answer?
Answer: Shrinking is a very misunderstood operation, and the difference between data file shrinking and log file shrinking is a great cause of confusion.
A shrink operation on a data file tries to move the database page nearest the end of the file down toward the start of the file. This creates “empty” space at the end of the data file that can be returned to the OS. In other words, the data file is made physically smaller.
A shrink operation on a transaction log file, on the other hand, doesn’t move anything around—it simply removes the empty portion of the transaction log at the end of the file, as long as the transaction log records are not being retained for any reason. If the operation is successful, the log file is made physically smaller.
The confusion comes from the side effects of the two operations, and when they should be performed.
People are advised (or decide) to shrink data files to reclaim space. It may be that their index maintenance job causes the data files to grow, or their drive is getting full, and the natural reaction is to reclaim some of this “wasted” space. However, it is likely this space will be needed again and often it’s better to leave the remaining free space available to the data file rather than repeatedly shrink and auto-grow a file.
Shrinking a data file should be a very rare operation because it has a nasty side effect. Shrinking a data file causes massive index fragmentation, which may affect query performance. My blog post “Why You Should Not Shrink Your Data Files” includes a simple script that shows this.
The blog post also describes when it’s acceptable to shrink a data file (almost never) and an alternative method that avoids the fragmentation side effect. I’ve unfortunately seen many cases where data file shrinking is recommended without any discussion of the side effects.
Shrinking a log file should be an even rarer operation than shrinking a data file. Commonly I'll see people wanting to shrink a log file that has grown disproportionately large compared to the data file(s), either through improper size management or because they see it grow and want to keep it as small as possible. The log file for an active database does need to be a reasonable size, but the log should be managed so that it does not need to be shrunk and grown to accommodate activity in the database.
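As a one-off correction for an over-grown log (not a recurring maintenance task), the pattern is to shrink the file and then set it to a sensibly sized fixed allocation. The database name, logical file name, and sizes below are purely illustrative:

```sql
USE MyDB;

-- Shrink the log file down (target size in MB).
DBCC SHRINKFILE (MyDB_log, 1024);

-- Then grow it once to an appropriate working size,
-- so it does not have to auto-grow repeatedly under load.
ALTER DATABASE MyDB
    MODIFY FILE (NAME = MyDB_log, SIZE = 8192MB);
```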
You can find out more about the transaction log in the article "Understanding Logging and Recovery" I wrote for the February 2009 issue of the magazine. I also have a blog post that discusses managing the size of the transaction log—see "Importance of Proper Transaction Log Size Management."
The bottom line is that any kind of shrink operation should be exceedingly rare and only undertaken when the ramifications are fully understood.
Paul S. Randal is the Managing Director of SQLskills.com, a Microsoft Regional Director and a SQL Server MVP. He worked on the SQL Server Storage Engine team at Microsoft from 1999 to 2007. Paul wrote DBCC CHECKDB/repair for SQL Server 2005 and was responsible for the Core Storage Engine during SQL Server 2008 development. Paul is an expert on disaster recovery, high availability and database maintenance and is a regular presenter at conferences worldwide. He blogs at SQLskills.com/blogs/paul, and you can find him on Twitter at Twitter.com/PaulRandal.