Top Tips for SQL Server Clustering
Tom Moreau, PhD
At a Glance:
- Running SQL Server on a cluster
- Hardware and software requirements
- Clustering one node
- Cost-effective options
A server cluster allows you to connect a number of physical servers—or nodes—that act as failover partners for each other. The redundancy a cluster provides spells more uptime for your critical
operations. I've implemented lots of clusters in my 13 years working with SQL Server™, and each had its own set of issues. That experience allowed me to gather a number of tips that can help make your clustering efforts easy and successful.
Server clusters take advantage of the built-in clustering capabilities of the Enterprise Editions of the Windows Server® family. In fact, for clustering purposes, Windows Server 2003 is significantly better than Windows 2000 Advanced Server. To maximize the benefits you'll gain from clustering, you need the right hardware-and that involves some expense. It's not sufficient to slap a couple of servers together with a shared disk, and you can't rely on the fact that individual hardware components may be in the Windows® Catalog (formerly known as the Hardware Compatibility List). The system as a whole must be in the Windows Catalog. But don't worry-there are some approved, lower-cost cluster solutions available. Figure 1 shows a typical cluster configuration.
Figure 1 A typical cluster (Click the image for a larger view)
Of course, there's more to clustering than hardware-you also need to choose the right edition of SQL Server 2005. The Enterprise Edition enables clustering, as well as other useful features, such as the ability to leverage more CPUs, distributed and updateable partitioned views, built-in log shipping, automatic use of indexed views. If you already have an Enterprise Edition license, you should consider clustering, whether or not you have the two to eight servers necessary to form a traditional cluster (we'll talk about one-node clusters in a minute). If you have SQL Server 2005 Standard Edition, you can install a two-node cluster.
Windows Server 2003 Enterprise and Datacenter Editions come with clustering built in. All you have to do is run Cluster Administrator to set up your cluster. You can add all nodes at once or one at a time. Similarly, when you install SQL Server, you can elect to install on an individual, non-clustered server, or you can install a virtual instance on a cluster. If you elect to install a virtual instance, you can install on all nodes of the cluster, just some nodes, or even only one node.
Finally, to achieve the true aim of clustering-high availability-you need qualified people and well-rehearsed procedures to follow if bad things happen. Although clustering is good insurance against hardware failure, it doesn't prevent user errors, such as the dropping of a critical table by a sleep-deprived DBA in the middle of the night.
Even if all you need at the moment is a single server, consider creating a one-node cluster. This gives you the option of upgrading to a cluster later, thus avoiding a rebuild. Just be sure that the hardware you choose is on the cluster portion of the Windows Catalog.
It's not merely for high availability that you'd want the option to add a node at a later date. Consider what happens if you find that your server just doesn't have the necessary capacity. That translates to a migration-and that takes time and effort. If you have a one-node cluster, migration is easier with far less downtime. You add the new node to the cluster, add the SQL Server binaries and service packs to the new node, and then failover to the new node. Then you add any post-service pack updates and, finally, evict the old node. The downtime is only the time it takes to failover and add the updates (if any).
Since all nodes in a cluster must be the same, you'll want to act sooner, rather than later, to get that extra node. If you wait too long, the node may go out of production. On one project, I had to rebuild a node in a SQL Server 2000 cluster. I had the OS/network admin handle the basic machine build, then I jumped in to add it back to the cluster and prepare it for service as a SQL Server node. All went well until I failed over to the new node. Much to my dismay, it failed right back. To make a long story short, although I had prepared a detailed document on building a new cluster, including adding the cluster service and SQL Server service accounts to both nodes, the document wasn't followed explicitly. The admin didn't add those service accounts to the rebuilt node, so the privileges they had before the rebuild no longer existed.
It took me a long time to track that one down. One day it occurred to me to look at local group membership. Once I added the two accounts, failover went smoothly. Then I got to thinking. Rebuilding a node is something you don't do frequently and, if you do, it's an emergency. Although I had a document in place, it wasn't used. We could have automated the security part by simply writing a brief script to add those two accounts and make any other necessary customizations. Things have improved in SQL Server 2005, though. The installer requires you to set domain-level groups for the SQL Server service accounts.
Of course, this got me thinking even more. You can create scripts that invoke CLUSTER.EXE to add the node to your Microsoft® Cluster Server (MSCS) cluster. All you have to do is feed the script the name of the node and it can handle the rest. In an emergency, automation is really your friend.
Sometimes, the reason for adding a node to a cluster isn't that you're replacing a node. You could be adding more SQL Server instances to your cluster and each instance needs separate disk resources. Though multiple instances can run on a single node, they would be sharing CPU and RAM-and that could spell poor performance. Ideally, only a single instance should run on a single node. How do you ensure that when you fail over? Simple: the answer is that one node has no services running on it, while the other nodes each run one SQL Server instance. In fact, that's the definition of an N+1 cluster: N instances running on N+1 nodes. The extra node is the backup.
Upgrading SQL Server
Upgrading a clustered instance of SQL Server is not for the faint of heart: it's clustered for a reason-you need uptime. But SQL Server 2005 offers a number of enhancements you want to take advantage of, so if and when you've got to do it, you can proceed without a lot of downtime.
What are your choices? Let's look at the most expensive solution first: creating a whole new cluster, which means new servers and perhaps a new storage area network (SAN). You can probably keep the existing network switches, but that's about all. Obviously, this approach isn't cheap but it has advantages. New hardware generally performs much better than old, with disk capacity and speed increasing at an ever-growing rate. Thus, you'll get a performance boost with new hardware alone. You may even want to lease your equipment just to stay ahead of the curve.
Once you have the hardware in place, you can create your new virtual SQL Server on this setup, copy your production databases over, and then put the new system through its paces, leaving plenty of time to shake out the bugs before cutover day. Just be sure to script out the logins from your existing server. (Check out support.microsoft.com/kb/246133. It's also a good idea to update your login build script in case of catastrophic failure.)
To minimize downtime, you'll likely have to use log shipping, unless your databases are quite small and you have a period of time in which no users are connected. You can log-ship right up to just before cutover. Then, kick the users out, cut and ship the final log, then point the app at the new instance. (Check out the database mirroring section below for an interesting alternative to log shipping.) If you use DNS aliases, you probably won't even need to point the apps to the new instance. Just update the DNS alias instead. This approach has the advantage that if you get part way through the migration and have to revert back to the original, at least you have the original.
You can take a less expensive route, but it requires more advance planning. A cluster can support more than one SQL Server instance, but each instance must have its own disk resources. So when you're carving up your SAN, set one LUN aside for a future upgrade. To perform the upgrade, install SQL Server binaries on this disk resource. You can exercise the system and, when you're ready, shut down the current SQL Server, move the disk resources from the old SQL Server group, update the dependencies, and bring the new SQL Server instance online. Attach the databases from the old instance, and you're up and running. (You did back everything up ahead of time, right?)
That's the less-expensive approach-and it carries with it some risk. If something goes bad, you can't detach the databases from the new instance and put them back. You're reduced to restoring from backups-and that can mean some serious downtime.
An alternative is putting two instances of SQL Server on your SAN, assuming you have enough space. You restore production backups (and log ship) to the new instance, and proceed much as I described earlier. However, now you have a fallback. Once you've done the migration, you can free up the SAN resources from the old instance. It'll cost you only the price of the extra disks.
Let's start by debunking a common misconception. You use MSCS clustering for high availability, not for load balancing. Also, SQL Server does not have any built-in, automatic load-balancing capability. You have to load balance through your application's physical design. What does that mean?
As a table grows, you can expect to see some degradation in performance, particularly when table scans are involved. When you get into the millions or billions of rows, the traditional solution has been to use partitioned views, which are made up of tables with identical schemas hooked together with UNION ALL's. Also, CHECK constraints are put in place to differentiate the member tables, and this prevents data duplication across the partitioned view. If the column that is used in the CHECK constraint is also part of the primary key, the view is updatable.
If the member tables are on their own filegroups, you may get better disk performance if the files in those filegroups are on separate physical drives. The tables can even be in separate databases. In SQL Server 2005, however, as long as all of the data is in the same database, you can use table partitioning, which is far easier to implement.
But let's say you've tried your best with table partitioning or (local) partitioned views and still things are slow. If you have SQL Server 2000 or SQL Server 2005, you can use distributed partitioned views. The major difference is that the member tables can reside on different instances of SQL Server and those instances can be installed on an N+1 cluster. Why is this a good idea? If any one member table goes offline in a partitioned view, the entire view goes offline. Making those members part of a cluster then gives you the reliability you need to support performance and provide load balancing.
Do You Really Need a Cluster?
Perhaps you have some spare servers lying around, but they aren't in the Windows Catalog for clusters. It's a shame to have to go out and buy new ones just to support a cluster when those servers are available.
Database mirroring may be an attractive alternative to clustering. Mirroring involves three elements: an instance that houses the mirrored database is known as the principal; the backup server is known as the mirror; and, if you want automatic failover, a third server-known as the witness-is required. Briefly, a transaction in a database on the principal gets run again in the mirror. If the principal goes down, the database can fail over to the mirror, automatically if you have a witness. You have to set up mirroring for each of your application databases and you can't mirror system databases.
The mirror is a separate instance of SQL Server, unlike in a cluster, and can be located thousands of miles away. Its caches get populated by the update activity that occurs as a result of the transactions duplicated from the principal. Assume, of course, that there is no activity on the mirror other than receiving the mirrored transactions from the principal. Failover is generally quicker than in a cluster since SQL Server is already running on the mirror. Because the caches are at least partially primed, the initial performance is not as sluggish as it might be in the clustered scenario. And note that when a mirrored database fails over, the role of principal and mirror are reversed.
The downside of database mirroring is the need for double the total disk capacity than with a cluster. You'll also need more CPU power-if you go for synchronous mode with no data loss. As I said, high availability isn't cheap.
A Combined Approach
Since a mirror can be located quite remotely from the principal, it's a good choice for Disaster Recovery (DR) plans. Your cluster can be your first line of defense, but what happens if you employ both clustering and mirroring? In a cluster failover, if you have a witness as part of your mirroring configuration, the mirror will become the principal while the clustered SQL Server is coming online. However, note that failover from the new principal back to the (clustered) new mirror is not automatic. Consequently, it's better not to enable automatic failover for your mirrored databases when used in conjunction with a cluster.
DR isn't the only reason you'd use mirroring; it's also useful if you have to apply a service pack or hotfix to your principal, in which case you can manually failover to your mirror. While applying the service pack or hotfix, the old principal server is temporarily offline and the committed transactions occurring at the new principal are queued up, waiting to be sent back to the new mirror (old principal). Once the service pack or hotfix installation is complete, synchronization will take place and eventually the two servers will be completely in sync. Now, you can switch the roles of principal and mirror. Downtime was the couple of seconds to failover and back. You can use this approach to migrate your SQL Server to another box. Just don't fail back.
Virtual Server Adds Flexibility
Virtualization allows you to run one or more operating systems concurrently on a single physical server. Virtualization software adds another layer of capabilities to the cluster concept because you can cluster the software. Consequently, if the server on which the host is running fails, then it-and its guest OSs-failover to a backup node. This could be a very easy way to migrate a guest server. Plus, the guest OS does not have to be cluster-capable. Thus, you could run SQL Server Workgroup Edition inside a guest Windows Server 2003, running on Microsoft Virtual Server 2005 on a cluster. Indirectly, you have essentially clustered Workgroup Edition (see Figure 2).
Figure 2 Using a virtual server (Click the image for a larger view)
If you're in charge of a SQL Server implementation, you need to know that your server is always available. Server clustering helps to ensure that's always the case. This article provides some hard-earned tips to help you get started, and you'll find more useful information in the "Clustering Resources" sidebar.
Tom Moreau, PhD, BSc, PhD, MCSE, MCDBA, is an independent consultant specializing in SQL Server database administration, design, and implementation, and is based in the Toronto area. Tom’s been using SQL Server since 1993 and has been an MVP since 2001. He’s written over 100 articles and coauthored a book on SQL Server. Thanks to SQL Server MVP Geoff Hiten for his useful input.
© 2008 Microsoft Corporation and CMP Media, LLC. All rights reserved; reproduction in part or in whole without permission is prohibited