I/O Errors, Database Mirroring, and More
Paul S. Randal
Q I've started running regular consistency checks on the databases that I manage, and I've even added some SQL Agent alerts to catch any I/O errors that user queries encounter. I don't know if the logic I've implemented around the checks and alerts will work because none of my databases have corruption issues. How can I create some corruption in a test database so I can make sure everything I've set up works correctly? Also, is there anything else I should do to detect I/O errors?
A In SQL Server® 2000, the old trick to create a corrupt database for testing was to manually delete a row from the sysindexes table of a test database. But with SQL Server 2005, corrupting a system table in such a manner is very difficult. The best way to corrupt a test database yourself is to use a hex editor to alter a data file while the database is shut down. Here's what to do:
- Shut down the database so the data files are not locked. (Be careful that you don't detach the database, though, because if you corrupt the wrong page, you may not be able to attach it again.)
- Pick an offset more than, say, 100 pages into the file (at least 819200 bytes), but make sure it's aligned on an 8192-byte boundary (a page boundary). This avoids critical metadata pages and allocation bitmaps, letting you start the database and run DBCC CHECKDB on it.
- Write a few bytes of zeroes into the file at the chosen offset. Using this technique is almost guaranteed to introduce some page header corruption errors.
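The steps above can be sketched in T-SQL. This is a minimal sketch assuming a hypothetical test database named CorruptTest; the hex-editing step itself happens outside SQL Server:

```sql
-- Take the database offline so its files are unlocked (do NOT detach it).
ALTER DATABASE CorruptTest SET OFFLINE;

-- ... now edit the .mdf file with a hex editor: write a few zero bytes at an
-- offset that is at least 819200 bytes into the file (100 pages) and aligned
-- on an 8192-byte page boundary ...

-- Bring the database back online and see what damage was introduced.
ALTER DATABASE CorruptTest SET ONLINE;
DBCC CHECKDB (CorruptTest) WITH NO_INFOMSGS, ALL_ERRORMSGS;
```

Only attempt this on a throwaway test database, never on anything you care about.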
That said, the fastest way to get a corrupt test database is to use one that someone else has already created. You can find samples of corrupt SQL Server 2000 and SQL Server 2005 databases (with explanations) on my blog at go.microsoft.com/fwlink/?LinkId=115151.
As for your second question, about what else you should do to detect I/O errors: you should enable page checksums. This feature was introduced in SQL Server 2005 as a way of protecting an entire database page from errors introduced by the I/O subsystem.
Basically, when a page is written to disk, the last thing SQL Server does is calculate a checksum over the whole 8KB page and stamp the checksum on the page. When a page is read from disk, if it has a page checksum, the checksum is recalculated and compared to the one stored on the page. If they don't match, something outside SQL Server corrupted the page and an 824 error is raised. The error is displayed to the connection that caused the page to be read, as well as logged to the SQL Server error log and the Windows® Application Event Log.
Page checksums are on by default for all databases created on SQL Server 2005 and SQL Server 2008. However, they must be manually enabled for databases upgraded from previous versions of SQL Server. You can enable page checksums with this code:
ALTER DATABASE mydb SET PAGE_VERIFY CHECKSUM;
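To round out the setup, you can check which databases are not yet using page checksums, and create a SQL Agent alert that fires on error 824 (the alert name below is just an illustration):

```sql
-- Which databases are not yet protected by page checksums?
SELECT name, page_verify_option_desc
FROM sys.databases
WHERE page_verify_option_desc <> 'CHECKSUM';

-- Alert on error 824 for all databases on the instance.
EXEC msdb.dbo.sp_add_alert
    @name = N'Error 824 - logical consistency I/O error',
    @message_id = 824,
    @severity = 0,
    @include_event_description_in = 1;
```

Pair the alert with an operator notification so someone actually hears about the error when it occurs.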
Q To remove all the fragmentation in my database, I've set up a nightly maintenance plan that rebuilds all the indexes in the production database, which is running on SQL Server 2005 Enterprise Edition with SP2. I've noticed that this causes the database to grow excessively, so I added a step to shrink out all the extra space, as there isn't much space on the disk. Now it seems like the rebuild step isn't working. What's going on?
A You've hit upon a common problem people face when setting up a maintenance plan. You're caught in a shrink-grow-shrink-grow cycle.
When an index is rebuilt, a new copy of the index is created before the existing index is deleted. This procedure requires extra space in the database files—usually around the same amount of space that the current index uses. In SQL Server 2000, extra space was also required for sorting the index rows (roughly 20 percent of the size of the index), but that requirement has been eliminated for a simple index rebuild in SQL Server 2005.
Admins sometimes want to remove the extra space that was created during the index rebuild, so they add a shrink operation to the maintenance plan after the rebuild step. It is not widely known, however, that this shrink will cause index fragmentation due to the nature of its algorithm. This means that the newly rebuilt and defragmented index will immediately become fragmented—canceling out the effect of rebuilding it in the first place.
Given that the database file will just grow again the next time the index rebuilds take place, it is better to allow the database to have the extra space in it and entirely avoid running the shrink operation. (In addition, continually growing and shrinking the database files will cause file fragmentation at the OS level—which can contribute to poor performance in the same way as index fragmentation.)
Finally, one thing you might consider is reducing the frequency of when you rebuild indexes. You may even want to try using an alternative method, such as the old DBCC INDEXDEFRAG that I wrote for SQL Server 2000 or the newer ALTER INDEX REORGANIZE syntax in SQL Server 2005 and SQL Server 2008.
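Rather than unconditionally rebuilding everything each night, you can check fragmentation first and only act when needed. A minimal sketch, assuming hypothetical table and index names dbo.MyTable and MyIndex:

```sql
-- Measure fragmentation before deciding what, if anything, to do.
SELECT index_id, avg_fragmentation_in_percent, page_count
FROM sys.dm_db_index_physical_stats
    (DB_ID(N'mydb'), OBJECT_ID(N'dbo.MyTable'), NULL, NULL, 'LIMITED');

-- Light fragmentation: reorganize in place (always online, needs no
-- extra space in the data file).
ALTER INDEX MyIndex ON dbo.MyTable REORGANIZE;

-- Heavy fragmentation: rebuild (requires free space roughly equal to
-- the size of the index while the new copy is built).
ALTER INDEX MyIndex ON dbo.MyTable REBUILD;
```

The white paper below gives threshold guidance on when each option is appropriate.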
There is a useful white paper that discusses index fragmentation and gives guidance on when to remove it (at go.microsoft.com/fwlink/?LinkId=115154). While this paper was written for SQL Server 2000, the concepts still apply.
Q We've been evaluating the disaster recovery strategy at my organization, and I think database mirroring is the right way to go for our situation. The server I'm trying to protect has a lot of unrelated databases (the result of a prior server consolidation project), and I'd like to use database mirroring for all of them. The question I'm struggling to answer is how many databases will I be able to mirror before performance starts to degrade?
A The answer to this question is one I give very frequently: It depends! The published guidelines say not to mirror more than 10 databases per instance, but 10 is just a rough estimate of the maximum for most users. You need to weigh the following factors for your hardware setup:
- How much memory do the principal and mirror instances have? (Ideally, this should be the same.)
- How much processing power do the principal and mirror instances have? (This should also be the same.)
- How much bandwidth does the I/O subsystem have on the mirror instance? (This should be the same as on the principal.)
- How much transaction log does the workload on each of the databases generate?
- How much network bandwidth is available between the principal and the mirror instances?
The last two factors are the most critical. If the network bandwidth available between the two instances is not enough to handle the combined transaction log generation rate per second from all the databases being mirrored, then performance will drop on the principal databases. SQL Server 2008 helps alleviate some of this burden with log stream compression.
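You can measure the combined log generation rate directly. This sketch samples the cumulative 'Log Bytes Flushed/sec' counter twice, ten seconds apart, and computes the per-database rate (the temporary table name is arbitrary):

```sql
-- First sample of the cumulative log-flush counter.
SELECT instance_name, cntr_value
INTO #log_sample
FROM sys.dm_os_performance_counters
WHERE counter_name = 'Log Bytes Flushed/sec';

WAITFOR DELAY '00:00:10';

-- Second sample; the delta divided by the interval is the actual rate.
SELECT t2.instance_name AS database_name,
       (t2.cntr_value - t1.cntr_value) / 10 AS log_bytes_per_sec
FROM sys.dm_os_performance_counters AS t2
JOIN #log_sample AS t1 ON t1.instance_name = t2.instance_name
WHERE t2.counter_name = 'Log Bytes Flushed/sec';

DROP TABLE #log_sample;
```

Sum the rates for the databases you plan to mirror and compare the total against the network bandwidth available between the two instances, sampling while your heaviest workload (including maintenance) is running.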
The next most critical thing to consider is the memory and thread requirements for mirroring. Each mirrored database takes one thread plus some memory. On low-powered servers, lots of mirrored databases may be too much load on the server when combined with the regular workload.
You also need to consider how you're going to run database mirroring. In synchronous mode, transactions on the principal database cannot commit until all the transaction log records have been copied to the mirrored database's transaction log. Therefore, any delay caused by an overloaded network could cause a workload performance issue on the principal.
In asynchronous mode, transactions can commit on the principal without having to wait, but a network delay could cause the amount of transaction log waiting to be sent to the mirror to grow. This can cause issues with transaction log size. Worse, any unsent transaction log will be lost in the event of failure—so the more unsent transaction log, the more potential there is for data loss in a recovery situation.
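Once mirroring is established, the operating mode is a per-database setting. A sketch, using the same hypothetical database name as earlier:

```sql
-- Synchronous (high safety): a commit on the principal waits until the
-- log records are hardened on the mirror.
ALTER DATABASE mydb SET PARTNER SAFETY FULL;

-- Asynchronous (high performance): commits don't wait, but any log not
-- yet sent to the mirror is lost if the principal fails.
ALTER DATABASE mydb SET PARTNER SAFETY OFF;
```

Note that asynchronous mode requires Enterprise Edition.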
Scenarios can vary greatly, and I've seen some interesting examples in real production environments. For instance, I saw one environment with 150 databases, each with a very small amount of activity and not all active at the same time. All 150 databases are mirrored with no problems.
Conversely, I've seen a setup with only three heavily loaded databases but no good network connection between the instances. In that scenario, they could barely mirror one database before the lack of network bandwidth caused workload degradation.
The key to success is to do the log generation calculation first. If the available network bandwidth will support the number of databases you want to mirror, you may be OK. Test your configuration before you put it into production, and make sure to include all operations that generate transaction log, especially any database maintenance you perform.
Paul S. Randal is the Managing Director of SQLskills.com and a SQL Server MVP. He worked on the SQL Server Storage Engine team at Microsoft from 1999 to 2007. Paul wrote DBCC CHECKDB/repair for SQL Server 2005 and was responsible for the Core Storage Engine during SQL Server 2008 development. An expert on disaster recovery, high availability, and database maintenance, Paul is a regular presenter at conferences. He blogs at SQLskills.com/blogs/paul.
© 2008 Microsoft Corporation and CMP Media, LLC. All rights reserved; reproduction in part or in whole without permission is prohibited