Processes like backup and restore and consistency checking can cause some unexpected behaviors, but they do make sense.
Q. I’m working out downtime requirements for some of our SQL Server instances as part of disaster recovery planning. Is it sufficient to consider just the time it takes to restore backups?
A. No, there are a few other things you need to consider. First, consider the total time it takes to restore all necessary backups. That includes your most-recent full database backup, the most-recent differential backup and all transaction log backups. Always assume the worst-case scenario—where the database is destroyed just before taking the next full backup, so you have the largest possible number of log backups.
Next, consider the additional time it will take to restore the initial full backup to create data and transaction log files if they don’t already exist. If you’ve enabled instant file initialization, then data files will be created nearly instantaneously. The transaction log file, however, must be zero-initialized.
If you have a large file that’s upward of hundreds of gigabytes, then a restore could take several hours. If you then have to restore a differential backup, that will again completely zero-initialize the transaction log file. You’ll have to account for this time. If there are any additional transaction log files that were temporarily added (but not removed) you’ll have to zero-initialize these as well—potentially twice.
The last phase of the database restore process is to run a crash recovery. The time necessary for this will depend on how many transaction log records you need to roll back. They’re part of uncommitted transactions at the time of the final log backup. If you have long-running transactions in your database, assume the worst. Assume you’ll have to roll back almost all of the longest-possible transactions. You need to add that time to the equation.
Finally, also consider how long it takes the physical server to get to the point where you can start restoring backups. In other words, how long does it take the server to boot (running POST, memory checks and so on) and start up Windows? This could also add to the downtime.
If you consider all these things at their worst cases, that will give you a maximum possible downtime. You might be surprised when you add everything up.
Q. I recently came across an interesting problem. I tried to interrupt a DBCC CHECKDB process that was taking longer than usual. I found I couldn’t interrupt it and had to wait a long time for the process to end. Can you explain what was happening?
A. This behavior is expected, but not intuitive at all. When DBCC CHECKDB starts, it creates a hidden database snapshot. The database snapshot is required to provide DBCC CHECKDB with a transactionally consistent, unchanging view of the database. That way, DBCC CHECKDB knows it’s checking the consistency of a static database that shouldn’t have corruptions.
The process creates a database snapshot by first checkpointing the database. Then it creates the empty database snapshot and uses the database’s transaction log to run crash recovery on the database snapshot. In other words, it rolls back any active transactions into the database snapshot without actually affecting the real database. The database snapshot thus becomes transactionally consistent.
The time it takes to run crash recovery while creating the database snapshot is proportional to the amount and length of the uncommitted transactions in the database when starting the database snapshot. If there’s a long-running transaction, it might take a long time to roll back. This means creating the database snapshot and the DBCC CHECKDB process will take longer.
In extreme cases, when creating the database snapshot takes much longer than normal and you decide to kill the DBCC CHECKDB process, nothing will happen right away. You have to wait for the database snapshot crash recovery to complete before the process will respond to the kill signal. You can’t interrupt crash recovery, and there’s no distinction in the crash recovery code in SQL Server between real crash recovery after an unexpected shutdown and a crash recovery for a database snapshot.
The only alternative in this case is to restart the instance of SQL Server, which will remove the hidden database snapshot. This doesn’t work in the case of a regular database real crash recovery. In those cases, crash recovery will continue after an instance restart.
There are several ways you can avoid this scenario. Try to only run DBCC CHECKDB when you know there are no long-running transactions active in the database. You’d have to have these rolled back as part of creating the DBCC CHECKDB hidden database snapshot. You could also use a consistency-checking mechanism, which is to restore the database to another server and then consistency check the restored copy. This avoids the possibility of long-running transactions altogether.
Q. Last week I had to restore backups to salvage a table someone had accidently dropped. The default trace had already lost the information about when the table was dropped, so it was a tedious process to find the backup position to which I needed to restore. Is there a way to find the right point in time to which I should restore?
A. Any time you’re trying to determine when a table was dropped, check the default trace. That makes note of Data Definition Language (DDL) events. You can read more about the default trace on SQL Server Books Online.
The only problem with default trace is that it’s a finite size. It has also been deprecated in favor of Extended Events in SQL Server 2012. So if there’s a lot of activity occurring on the server, the record of when the table was dropped might not exist in the trace any more.
That means the only way to find when the table was dropped is to do what I call “inching through the transaction log.” Restore a copy of the database to a time when the table was known to exist. Then repeatedly perform point-in-time restores using the WITH STOPAT and WITH STANDBY options. Move slightly forward in time each time. When you find the time when the table no longer exists, restore the database to just before that time and you can retrieve the table data.
This process is very tedious and can take a long time. Each time you restore the database using WITH STANDBY, all uncommitted transactions at that point are rolled back into an undo file. The next restore in the process undoes the undo, restores a little bit more, and again rolls back uncommitted transactions into the undo file. You have to repeat this process until you find the correct time.
There’s a neat alternative way to do this. Analyze the log records in the transaction log backups to look for transactions called DROPOBJ. Do this with an undocumented table-valued function called fn_dump_dblog. This behaves in the same way as the more widely known fn_dblog, which dumps log records from an active transaction log, working against a database backup.
You can use this function to find the transaction that dropped the object in which you’re interested. Then you can use the transaction’s Log Sequence Number (or LSN) to execute a restore WITH STOPBEFOREMARK = 'lsn:<the LSN of the transaction>'. This will restore the transaction log up to, but not including, the transaction that dropped the table. Doing it this way saves you from having to “inch through the log,” as previously described. You can read more about this function and how to use it on my blog.
Q. Now that SQL Trace has been deprecated in SQL Server 2012, I’d like to understand more about Extended Events. Can you explain how Extended Events are supposed to be more lightweight than SQL Trace?
A. The main reason for the performance differential between the two mechanisms is how events are filtered. When defining a trace or event session, you can filter events in both cases based on various event criteria. Filtering on activity in a certain database is a good example of this.
With SQL Trace, events are generated all the time. The event consumer does the filtering. This means SQL Server is burdened with generating all events, even though some won’t be consumed. This process is very inefficient.
With Extended Events, the Extended Events engine within SQL Server performs event filtering. The Extended Events engine evaluates the predicates specified when the event session was defined. This means when the event fires, only minimal work is required to gather base event data. This lets the Events engine evaluate the predicate. If the predicate evaluates to false, the event is immediately discarded. The Events engine performs no further processing. This minimizes the performance overhead of collecting events when compared with SQL Trace.
Also, SQL Trace collects all columns associated with an event and discards any columns not required. Extended Events, on the other hand, only collects columns and other data specified. This further minimizes the effort required to fire an event.
Although Extended Events is a far superior mechanism for collecting troubleshooting data, it can still adversely affect SQL Server performance if the event session isn’t carefully constructed. If an event session requires producing a T-SQL call stack every time a very common event happens (such as acquiring a lock or a thread wait), this will obviously affect performance.
With either mechanism, you should test event collection before putting it into production. You need to ensure that workload performance won’t be compromised.
Paul S. Randal is the managing director of SQLskills.com, a Microsoft regional director and a SQL Server MVP. He worked on the SQL Server Storage Engine team at Microsoft from 1999 to 2007. He wrote DBCC CHECKDB/repair for SQL Server 2005 and was responsible for the Core Storage Engine during SQL Server 2008 development. Randal is an expert on disaster recovery, high availability and database maintenance, and is a regular presenter at conferences worldwide. He blogs at SQLskills.com/blogs/paul, and you can find him on Twitter at twitter.com/PaulRandal.