SQL Q&A: Checksum Problems, Choosing the Correct Recovery Model and More

Paul S. Randal

Q: I'm a SharePoint administrator and I've recently discovered that one of my Content databases became corrupt when my monthly consistency checks found errors. We traced it back to a faulty RAID controller and, after doing some research, we've turned on checksums on the database pages. My question is: how can I tell when there's a checksum problem without waiting for the monthly maintenance plan?

A: There are a few things you can do. First, you can add the WITH CHECKSUM option to your backups. For full and differential backups, this option causes the backup operation to test every page checksum it reads and fail with an error if it finds corruption. (I covered this in more detail in last month's article, "Understanding SQL Server Backups.")
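
Here's a minimal sketch of what that looks like; the database name and backup path are placeholders.

BACKUP DATABASE [MyContentDB]
TO DISK = N'D:\Backups\MyContentDB_Full.bak'
WITH CHECKSUM;  -- test every page checksum read; by default (STOP_ON_ERROR) the backup fails on the first corruption found
GO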

Second, consider running your consistency checks more frequently than monthly. I recommend running some form of consistency check at least weekly, whether that's DBCC CHECKDB on the database itself or on a restored copy of it. That, of course, is going to depend on your comfort level with your I/O subsystem.
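
For example, a weekly check could run something like the following sketch (the database name is a placeholder); the options limit the output to actual error messages.

DBCC CHECKDB (N'MyContentDB') WITH NO_INFOMSGS, ALL_ERRORMSGS;
GO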

Third, add some SQL Agent alerts. An alert can be set to fire on a variety of factors, such as a particular error number being raised by SQL Server, an error with a particular severity being raised or a performance counter crossing a threshold. This capability provides a really powerful mechanism to monitor for server problems.

When an alert fires, a message is sent to a predefined "Operator" using one or more of these options: a pager message, e-mail or NET SEND. You define an Operator with the sp_add_operator stored procedure, and then use sp_add_notification to tie an alert to that Operator.
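
Here's a minimal sketch of defining an Operator; the operator name and e-mail address are placeholders.

USE msdb;
GO
EXEC msdb.dbo.sp_add_operator
    @name = N'DBA Team',                       -- placeholder operator name
    @enabled = 1,
    @email_address = N'dba-team@example.com';  -- placeholder address
GO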

As far as I/O subsystem problems are concerned, the errors you're interested in are 823, 824, and 825. Error 823 is raised when the operating system returns an error for a read or write, 824 is raised when the I/O completes but the page fails a consistency check (for instance, a broken page checksum), and 825 is raised when SQL Server has to retry a read operation multiple times before it succeeds. These are all problems that you want to know about as soon as possible in order to limit further damage to the database (and possibly the amount of downtime to recover).

823 and 824 are both severity-level 24 errors but 825 is only a severity-level 10 "informational" message (for more information, see my blog post, "A little-known sign of impending doom: error 825.") To alert for these errors, you should define an alert for all severity-level 24 errors, and one specifically for error 825 (in fact, it's a best practice to have an alert for each severity-level from 19 to 25).

To define the actual alerts, you can use T-SQL or Management Studio. Below is an example of the T-SQL code to add an alert for the 825 error.

USE msdb;
GO 
EXEC msdb.dbo.sp_add_alert @name = N'825 - Read-Retry Required', 
    @message_id = 825,
    @severity = 0,
    @enabled = 1,
    @delay_between_responses = 0,
    @include_event_description_in = 1,
    @category_name = N'IO Subsystem Error';
GO
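
A few follow-on pieces, as a sketch: the alert category used above doesn't exist by default and must be created before sp_add_alert can reference it, you'll also want the severity-based alerts mentioned earlier, and each alert has to be tied to an Operator before any notification is sent. The category, alert and operator names below are placeholders.

USE msdb;
GO
-- Create the alert category referenced above.
EXEC msdb.dbo.sp_add_category
    @class = N'ALERT',
    @type = N'NONE',
    @name = N'IO Subsystem Error';
GO
-- A severity-based alert; repeat for severity levels 19 through 25.
EXEC msdb.dbo.sp_add_alert
    @name = N'Severity 24 Error',
    @message_id = 0,
    @severity = 24,
    @enabled = 1,
    @delay_between_responses = 0,
    @include_event_description_in = 1;
GO
-- Tie the 825 alert to the placeholder Operator defined earlier (1 = e-mail).
EXEC msdb.dbo.sp_add_notification
    @alert_name = N'825 - Read-Retry Required',
    @operator_name = N'DBA Team',
    @notification_method = 1;
GO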

You can find more information on defining and adding alerts, including a step-by-step walk-through on adding alerts using Management Studio, in my SQL Agent blog category (see, for example, "SQL 2005 SP2 maintenance plan bug masking corruption").

Q: I'm a developer-DBA who's responsible for some code and the database it runs in. I've been arguing with some other database developers about how to get a unique value to identify table rows. I'd like to use a GUID as the clustered index key, but the others are arguing that it can lead to performance issues with indexes. Is this true and, if so, can you explain why?

A: Yes, it's true. GUIDs are one of the leading causes of index fragmentation in SQL Server databases.

A GUID is a Globally Unique Identifier. In SQL Server, this is a 16-byte value that is either generated in SQL Server or somewhere else (such as through .NET in the client- or mid-tier). GUIDs generally have a random value unless generated with the NEWSEQUENTIALID function introduced in SQL Server 2005.

This function produces ranges of GUIDs, which can help to alleviate some of the problems I'll describe here. But it requires generating a server-side GUID, which doesn't work in many application environments because the unique identifier must be generated before the data is sent down to the data tier.
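
As a sketch of the server-side approach (the table and column names are hypothetical): NEWSEQUENTIALID can only be used in a DEFAULT constraint, which is exactly why it doesn't help when the identifier has to be generated in the client or mid-tier.

CREATE TABLE dbo.Orders
(
    OrderID UNIQUEIDENTIFIER NOT NULL
        CONSTRAINT DF_Orders_OrderID DEFAULT NEWSEQUENTIALID(),  -- server-side, roughly ever-increasing GUIDs
    OrderDate DATETIME NOT NULL,
    CONSTRAINT PK_Orders PRIMARY KEY CLUSTERED (OrderID)
);
GO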

No matter where a non-sequential GUID is generated, having it as the leading key in an index means that, because the key value is essentially random, the insertion point of a new record in the index is also random (the key value of a record determines its placement in the index). Having it as the clustered index key also means the 16-byte GUID is present in every row of every non-clustered index, as part of the link that allows the Storage Engine to navigate from a non-clustered index record to the clustered index record to obtain column values in the query's select list that aren't in the non-clustered index being used (commonly known as a bookmark lookup).

As more records are inserted into the index, the pages storing the records fill up. If a record is being inserted on a page that's already full (remember, the key value determines where the new record goes), then the page must split, with some records moving to a newly allocated page. A page split is an expensive operation to perform (a new page is allocated and linked into the index, and records are moved onto the new page), and it causes fragmentation.

Page splits cause logical fragmentation in the index (which affects range-scan performance) by making the physical and logical ordering of pages in the index differ. They also cause poor page density (unused space on pages), which leads to wasted disk space and poor I/O and memory utilization.
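
To see both effects on an existing database, a query like the following sketch (run it in the context of the database you're interested in) reports logical fragmentation and page density per index.

SELECT
    OBJECT_NAME(ips.object_id) AS table_name,
    i.name AS index_name,
    ips.avg_fragmentation_in_percent,    -- logical fragmentation
    ips.avg_page_space_used_in_percent   -- page density
FROM sys.dm_db_index_physical_stats (DB_ID(), NULL, NULL, NULL, 'SAMPLED') AS ips
JOIN sys.indexes AS i
    ON i.object_id = ips.object_id AND i.index_id = ips.index_id
WHERE ips.index_id > 0;  -- skip heaps
GO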

For more information about index fragmentation and how to detect and remove it, see my August 2008 article, "Top Tips for Effective Database Maintenance." Picking a good clustered index key is beyond the scope of this article; I'll leave it to my wife, Kimberly L. Tripp, to explain. See her great blog post on the topic, "GUIDs as PRIMARY KEYs and/or the clustering key," which also goes into more detail about GUIDs and clustered index structures.

Q: Our high-availability strategy is to use log shipping to a couple of secondary servers. Our management team is pressuring me to make some use of the redundant servers to save on capital expenditures. My idea is to use the secondary servers to allow reporting queries to run, which would also have the benefit of offloading the report workload from the primary server. What problems might I run into doing this?

A: This kind of scenario is becoming much more common in the current economic climate, where companies don't like to have servers sitting around that appear to be idle (even though they're providing a redundant copy of the database).

As you probably already know, when you set up log shipping, you can define how the transaction log backups are restored on a secondary server: either WITH NORECOVERY or WITH STANDBY. The former doesn't allow any access to the database, whereas the latter allows read-only access. STANDBY uses a special mode in which recovery runs and makes the database transactionally consistent, but the effects of any uncommitted transactions are saved in a separate undo file so that further transaction log backups can still be restored (I'll discuss this in more detail next month in an article about using RESTORE).
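
Here's a sketch of the two options; the database name and file paths are placeholders.

-- No access to the database between restores:
RESTORE LOG [SalesDB]
FROM DISK = N'D:\LogShip\SalesDB_log_1.trn'
WITH NORECOVERY;
GO
-- Read-only access between restores; undone transactions are kept in the undo file:
RESTORE LOG [SalesDB]
FROM DISK = N'D:\LogShip\SalesDB_log_2.trn'
WITH STANDBY = N'D:\LogShip\SalesDB_undo.dat';
GO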

To allow reporting on the secondary server, you're going to use WITH STANDBY so the reporting queries can connect to the database. Once you allow user connections, you immediately run up against some issues.

First, using the WITH STANDBY option can lead to transaction log backups seeming to take a long time to restore on a secondary server, as the contents of the undo file must be replayed before the next transaction log backup can be restored. This can be a problem if the undo file contains a large number of operations.

Second, the restore of a transaction log backup isn't an online operation. No one can be connected to the database for the duration of the restore. This means that all connections to your secondary reporting server must be dropped, then reconnected after the restore completes. Here you have a dilemma: When it's time to restore the next transaction log backup, do you terminate user connections or allow them to finish their queries? That's completely up to you.
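
If you do decide to terminate them, a sketch like the following (with a placeholder database name) builds and runs KILL statements for every session connected to the standby database so the next restore can get exclusive access.

DECLARE @kill NVARCHAR(MAX);
SET @kill = N'';
-- Build one KILL statement per session in the standby database.
SELECT @kill = @kill + N'KILL ' + CAST(spid AS NVARCHAR(10)) + N'; '
FROM sys.sysprocesses
WHERE dbid = DB_ID(N'SalesDB')   -- placeholder database name
  AND spid <> @@SPID;
EXEC (@kill);
GO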

One question to bear in mind if you decide not to terminate connections: How long do you allow the queries to continue before you forcibly terminate them? The longer you wait, the longer it's been since the last log backup was restored, and the further behind the primary database the secondary one gets. This could be a problem if you then have to fail over to the secondary, because there could be a queue of log backups waiting to be restored to bring the database as far up to date as possible, minimizing data loss.

You can find more information about these options, as well as on monitoring issues such as length of time since the last log backup restore on the secondary server, in the Books Online topic, "Using Secondary Servers for Query Processing."

Q: How can I choose the correct recovery model? From what I've read, it seems like I should be using bulk-logged to cut down on my transaction log size, but my log still keeps growing. Is it possible to use a mode that doesn't use the transaction log at all and completely avoid the whole issue?

A: One of the requests I've heard several times is for a non-logged database, where no transaction log records are generated at all, especially for tempdb, which people view as a scratch database because it's re-created on server startup.

As far as I know, this will never happen for SQL Server. The best you'll get is the BULK_LOGGED recovery model, which drastically cuts down the amount of transaction log generated for certain operations (such as index rebuilds and bulk loads). All databases, even tempdb, must have some level of logging to allow transactions to roll back (that is, to undo all the operations that were part of the transaction) in the event of a user canceling the operation or some error causing the operation to fail.

What's more important, for databases apart from tempdb, is that if a system crash occurs, the database must be able to recover without leaving transactionally inconsistent data or a structurally inconsistent (that is, corrupt) database. Imagine if there were no record of what had been changing in the database before the crash—how would SQL Server run recovery? You can read more about how logging and recovery work in my February 2009 article, "Understanding Logging and Recovery in SQL Server."

For choosing a recovery model, one overriding question will determine your choice: Do you want to be able to do "point-in-time" or "up-to-the-minute" recovery in the event of a disaster? If so, you're going to use the FULL recovery model (and possibly the BULK_LOGGED model on occasion). If you're not interested in doing this, then use the SIMPLE recovery model.

The reason for using SIMPLE instead of FULL, if you don't need to recover the database without losing work since the last full or differential backup, is that with the SIMPLE recovery model you don't have to take transaction log backups to manage the size of the transaction log.

Now, there may be other reasons why you have to use the FULL recovery model—for instance, if you want to use database mirroring (DBM only supports the FULL recovery model) or log shipping (log shipping supports both the BULK_LOGGED and FULL recovery models). In either case, you're going to have to make sure that you're taking transaction log backups so that the log doesn't grow too large (even if you just end up discarding them).
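
As a sketch of the mechanics involved (the database name and backup path are placeholders):

-- Set the recovery model (FULL, BULK_LOGGED, or SIMPLE):
ALTER DATABASE [SalesDB] SET RECOVERY FULL;
GO
-- Under FULL or BULK_LOGGED, regular log backups are what keep the log from growing:
BACKUP LOG [SalesDB] TO DISK = N'D:\Backups\SalesDB_log.trn';
GO
-- Check the current recovery model:
SELECT name, recovery_model_desc FROM sys.databases WHERE name = N'SalesDB';
GO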

I mentioned that you may occasionally want to use the BULK_LOGGED recovery model, rather than constantly running in that mode. This is because there are some limitations around taking and restoring from log backups when there has been a minimally-logged operation in the BULK_LOGGED recovery model since the last transaction log backup. The details are too complex to explain in this column, but you can find out more on this, and about choosing a recovery model in general, in the Books Online topic, "Recovery Model Overview."

Paul S. Randal is the Managing Director of SQLskills.com, a Microsoft Regional Director and a SQL Server MVP. He worked on the SQL Server Storage Engine team at Microsoft from 1999 to 2007. Paul wrote DBCC CHECKDB/repair for SQL Server 2005 and was responsible for the Core Storage Engine during SQL Server 2008 development. Paul is an expert on disaster recovery, high availability and database maintenance and is a regular presenter at conferences worldwide. He blogs at SQLskills.com/blogs/paul, and you can find him on Twitter at Twitter.com/PaulRandal.