Scaling Out SQL Server with Data Dependent Routing

By Man Xiong and Brian Goldstein

Summary: As an application expands to service millions of global customers, it may become advantageous to move to a scale-out architecture rather than to host on a single, mainframe-class machine. This paper discusses the reasons why some companies choose to scale out database applications and how to use a method known as Data Dependent Routing to partition and access data across a federation of servers. In order to demonstrate manageability and linear scaling, we simulated a real-world customer scenario, part of the communication services platform (CSP) for MSN (The Microsoft Network), in the SQL Server Scalability lab. Tests were performed using Microsoft® SQL Server™ 2005 Beta 2 running on Microsoft Windows® Server™ 2003, Enterprise Edition.

On This Page

Introduction
Why Scale Out?
Data Dependent Routing
MSN Communication Services Platform (CSP)
Pilot Study of MSN CSP on SQL Server 2005
Conclusion
Appendix A: Hardware Configuration

Introduction

As an application expands to service millions of global customers, it may become advantageous to move to a scale-out architecture rather than to host on a single, mainframe-class machine. This paper discusses the reasons why some companies choose to scale out database applications and how to use a method known as Data Dependent Routing to partition and access data across a federation of servers. In order to demonstrate manageability and linear scaling, we simulated a real-world customer scenario, part of the communication services platform (CSP) for MSN (The Microsoft Network), in the SQL Server Scalability lab. Tests were performed using Microsoft® SQL Server™ 2005 Beta 2 running on Microsoft Windows® Server™ 2003, Enterprise Edition.

Target Audience

The target audiences for this white paper include:

  • Developers and database administrators who are planning to implement scale-out database applications. This audience will benefit by reading this white paper in its entirety.

  • Developers and database administrators who have already implemented scale-out database applications in Microsoft SQL Server. They may benefit from the sections on using SQL Server transactional replication as a high availability and system maintenance platform.

  • Database and system administrators who are interested in Storage Area Network management and I/O system scaling.

Why Scale Out?

This past decade has seen an explosion in the storage of data. Now that many applications are available on the Internet, companies have to contend with millions of online users who are shopping, storing e-mail messages, viewing financial information, and other activities. Database systems are at the heart of these enterprise applications. And SQL Server is among the leading database platforms in these large data centers.

A scalable database platform allows the application designer to start small and to grow the system as large as necessary. Traditionally, most scalability has been achieved thorough symmetric multiprocessing (SMP) scale up—by adding more processors, memory, disks, and network cards to a single server. Scale up is sufficient in the majority of SQL Server implementations to date. However, there is a class of applications where a single database server (also referred to as a “node” throughout the document) hits its capacity limitation and cannot grow any further. This bottleneck can surface in a number of ways, but typically we see customers hit this bottleneck on applications that require thousands or even millions of customer requests every second. Each connection and request requires CPU, memory, disk and network resources which can only scale so far on a single system.

When faced with the possibility of a single system bottleneck, application designers may choose to adopt a scale-out architecture in which the workload and database are partitioned among an array of SMP nodes. These scaled-out systems grow by adding more nodes to the array. Ideally, this partitioning is transparent to the clients and to the application. The cluster is programmed and managed as a single system, but it is in fact an array of nodes.

Note:  This array of nodes is often referred to as a federation of servers.

It is important to note that there are several ways to scale out an application. One approach is to implement a partitioned service-oriented architecture where you spread services across nodes. A good example of this is to deploy a shopping catalog on one server, the product inventory database on another server, and the shopping basket application on yet another server. The application middle tier understands which server to access for the different information sources.

Another partitioning strategy, and the focus of this paper, is data partitioning where you divide a large table across a number of database nodes.

Scale out by any method poses complex management challenges, since more components must be managed and not every application can be conveniently partitioned across nodes. So, while scale out architecture is well suited for some applications, it doesn’t work for all applications.

If your application does lend itself to scale out, this approach has several advantages over a monolithic scale up system:

  • Arrays can grow in small increments of commodity components, often leading to significant cost savings.

  • The loss of a single node does not necessarily make the entire application unavailable.

  • The relative independence of nodes provides a natural failover and high availability design.

Data Dependent Routing

One of the major design decisions for any scaled-out data platform is to determine the best way to partition the data among the different nodes. Some applications can be partitioned by a key value, such as customer name, store location, time/date, or register name. The challenge is to align the partitioning scheme with how the data is accessed. For example, each branch office of a large insurance company can maintain their client records on the branch office servers that are running SQL Server. In this case, the application is partitioned by branch office. Every night a batch job can run to replicate the new or modified records to the central server that is running SQL Server and is located at headquarters. In most cases, agents working out of the branch offices do not access the central server, and corporate analysts can run their reports against the central database without needing to access every branch office server. In this example, the data access is local.

Let's look at another example. Suppose an online retailer wants to store all retail transactions. As you might expect, the data size can grow quickly. There are many ways to partition the data — by product ID, by date, by customer ID, and so on. But how do you accommodate the various consumers of the data? Customer service people want to search on date, customer ID, or product ID, depending on the information that is provided by the customer who is requesting customer support. Market analysts need to search by product ID and customer ID. It makes a lot of sense to scale up this application if at all possible so that all of these queries can be localized to a single SQL Server instance. Suppose you run out of hardware resources or are budget constrained and need to use off-the-shelf commodity hardware to scale out this retail sales application? And you choose to partition on customer ID because displaying a customer's records quickly is a top priority for customer satisfaction.

There are several ways of scaling out this retail application. You could use SQL Server distributed partition views. One requirement of distributed partition views is to horizontally partition the database and distribute the partitions across a federation of SQL Servers. All participating servers must have a similar database schema and a UNION ALL statement to combine the tables into an updateable view — the distributed partition view.

Note:  Distributed partition views have certain deployment requirements. See SQL Server 2000 Books Online for a discussion of distributed partion views. Another good reference is Don Jones’ e-book The Definite Guide for Scaling Out SQL Server.

Let's consider another scale out technique; Data Dependent Routing (DDR). DDR requires enough intelligence in the client application, typically in the middle-tier layer, to route the database requests to the appropriate nodes. With DDR, there are no views across nodes — each federated server is independent of the others (with the exception of sharing the database schema). The middle tier contains mappings to how the data is partitioned and which node contains the data.

Returning to the retail application example with partitioning, we need a way to track where records are located. Let's assume we create a SQL Server database on the middle-tier Web server. We've partitioned our database by Customer ID across a number of federated servers and we create a lookup table on the middle tier that maps the Customer ID to the node where the data resides.

This lookup table would contain records such as these:

Customer ID

Partition ID

10015

1

10016

2

10017

1

10018

3

Table 1: Partition Lookup Table

When customer service wants to display all transaction records for customer 10015, the application can determine from the middle tier to send the request to node 1. There is no need for nodes 2 and 3 to receive any requests. Access is localized to a single node.

What happens when we run our nightly inventory reports based on Product ID? Since we partitioned the application on Customer ID, for each Product ID there could be records stored on every database node. This requires the application to query every node, bring back all records that match each Product ID, then combine and sort the results. Since access is not localized, this is a time-consuming operation. However, this could be run as a background batch job in such a way that it does not impact the customer online experience.

Challenges to Scaling Out

There are several challenges you face when scaling out an application:

  • Management – An increased number of nodes means increased operations management overhead. All planned maintenance tasks (such as database backup, OS and application service pack and bug fix patches, and index defragmentation) must be applied across many nodes rather than a single node. Adding and removing nodes must be done without impacting application users.

  • Data partitioning – Selecting the right partition key is not always straightforward. As your application grows, you may find your business needs change, requiring you to rethink the partition key. And load balancing across nodes can be difficult to achieve. For example, if you partition your database by Customer Last Name and implement this on 26 servers, one for each letter of the alphabet, it is likely that the server with last names beginning with “S” will see a lot more read/write activity than the server with last names beginning with “X.”

  • Application development and revision – Business and data access needs will change over time. How will these changes impact application availability?

  • High-availability practices – If you lose a single node, will your application continue to serve your users? How long does it take to restore a database to a single node? Can you take a single node offline without impacting your users?

We will address these challenges in the following sections, where we present an application where scale out has been very successful.

MSN Communication Services Platform (CSP)

Millions of people worldwide use Microsoft's MSN Messenger and Microsoft's Hotmail services. At the heart of these services is a communication services platform (CSP) stored in a large SQL Server database. Today the database that supports CSP is partitioned on 100 four-processor back-end servers running Microsoft SQL Server 2000 Enterprise Edition. This federation of servers services hundreds of millions of user accounts.

This application is a good example of how well SQL Server scales out using federated servers with Data Dependent Routing (DDR) because of the following considerations:

  • The sheer amount of workload exceeds the processing power of any available single server hardware system.

  • The isolation of queries on accounts makes the application a perfect fit for row-based partitioning and DDR.

  • By intelligently partitioning the database, the application can be run on low-cost commodity hardware (four processor servers).

Figure 1 shows an overview of the system architecture. This part of the MSN communication services platform consists of four tiers:

  1. Web servers running Microsoft Internet Information Services (IIS)

  2. Lookup partition database servers (LPS) running SQL Server 2000

  3. Database back-end servers running SQL Server 2000

  4. MSN scale-out management layer

Records are ordered and partitioned by Passport User ID (PUID) across the back-end database servers. The scale-out management layer stores the mapping of data partitions to the physical back-end database servers in its own SQL Server database, which is independent from LPS and back-end databases. The LPS database stores the mapping of PUID to data partitions, and is partitioned across multiple LPS servers to accommodate growth. Consumers of the communication services post requests to the Web server, which queries the LPS repository with the PUID to obtain the data partition where the records are located. Then the Web server queries the scale-out management layer to determine which back-end database server contains the information for that user. Information is returned to the client in a matter of seconds.

Cc966448.scddrtng01(en-us,TechNet.10).gif

Figure 1: Overview of MSN CSP architecture

MSN Scale-Out Management Layer

The MSN scale-out management layer provides a platform for the MSN CSP to deploy the partitioning, DDR, and failover topology for the LPS and back-end database servers. It is managed through an administration console.

The MSN scale-out management layer defines a failsafe set as a set of databases that contains one primary database and its replicas, called secondary databases. A failsafe set is the high-availability unit for the MSN scale-out management layer. One failsafe set can have one or multiple secondary databases. In practice, the primary and secondary databases are placed on different servers for high availability.

Primary databases and secondary databases for CSP are synchronized by SQL Server transactional replication. It is also possible to use log shipping instead of replication. The replication solution provides lower latency for synchronization while log shipping allows for higher transaction rates at the cost of latency.

A partition is defined as a set of partitioned data, which is the unit for data partitioning and DDR. A partition is stored in a failsafe set with its master copy on the primary database and its replicas on the secondary databases. In general, one failsafe set can store one or multiple partitions.

A failover group is defined as a group of servers that function as backups for one another. The primary and secondary databases of each failsafe set are placed on different servers for high availability. Since workload only goes to primary databases, primary databases for partitions are placed across servers carefully to distribute and balance the workload among servers of the failover group. Failover groups are made independent from one another by not allowing a failsafe set to go across failover group boundary.

The example shown in Figure 2 is a simple failover group consisting of two servers. This group hosts two failsafe sets, one colored in blue and the other in gold. In this particular example, each failsafe set stores only one partition and has only one secondary database. The data on the primary database is replicated to its replica on the secondary database. The primary database of Failsafe set 1 is placed on Server 1 and its secondary database is placed on Server 2. The primary database of Failsafe set 2 is placed on Server 2 and its secondary database is placed on Server 1. Both arrangements are designed for high availability. The primary database for Partition #1 is placed on Server 1 and the primary database for Partition #2 is placed on Server 2 to balance the workload.

Cc966448.scddrtng02(en-us,TechNet.10).gif

Figure 2: A Simple Failover Group

The scale-out management layer maintains a configuration database to store information about deployment and current status, including:

  • The topology of the failsafe sets and failover groups.

  • The mapping of each data partition to the appropriate SQL Server database for DDR.

  • The current status of all databases, servers running SQL Server, and failsafe sets for automatic failover and real-time DDR.

It reads configuration files for the partitioning and failover topology of the database servers and stores the information in its own SQL Server database, then configures the servers accordingly. It monitors the status of servers for failover operation and maintains the partition mapping for DDR.

This application also provides an administrative interface for common system maintenance operations on scale out systems using replication as the high-availability solution:

  • Promoting a database, which converts a secondary database into a primary database by redirecting workload and establishing replication from the primary to the secondary database.

  • Demoting a database, which converts a primary database into a secondary database. This results in draining the replication queue and dropping replication for that database. If this is the primary database in a failsafe set, the appropriate secondary database will be promoted.

  • Marking a database as “offline,” which prevents client applications from querying that database and pauses all replication processes. If this is the only primary database, the appropriate secondary database will be promoted.

  • Marking a database as “online,” which resumes replication processes to and from this database.

  • Marking a database as “needs repair,” which results in draining the replication queue and dropping replication for that database. If the database is a primary database, causes a secondary database to be promoted.

  • Repairing a database, which recreates a database that is marked for repair through a backup/restore process, after which, the database is left in an offline state.

  • Marking a server as “offline,” which causes all databases on that server to be marked as “offline.”

  • Marking a server as “online,” which causes all databases on that server to be marked as “online.”

Although the MSN scale-out management layer code is not available outside of Microsoft, customers can implement these functionalities leveraging SQL Server’s programmability support for the .NET Framework, Distributed Management Objects (DMO), and Transact-SQL.

In the following sections we will discuss how this system overcomes the challenges of scale out, notably manageability and high availability.

Scaling Out to Accommodate Data and Workload Growth

The CSP architecture, leveraging scale-out management layer functionality, allows for natural data growth and increased client requests. The data is divided among multiple partitions, each of which holds a portion of the total data set. Partitions are hosted on failsafe sets using the scale-out management layer, which are placed on different servers for high availability.

When new records are added to the system, the number of client requests and the CPU usage on the back-end servers running SQL Server increase. There is a maximum operational guideline of resource usage across all the back-end database servers to accommodate for server failover. The limit for the simplest design as shown in Figure 2 is 50%. When this threshold is exceeded, CSP considers it is necessary to add new database servers. This is accomplished by adding a new failover group.

A sophisticated failover group design is used for MSN CSP in production. As shown in Figure 3, to allow better utilization of server resources, the primary database on Server 1 hosts four data partitions, which are replicated across Servers 2, 3, 4, and 5 accordingly. When Server 1 fails, Servers 2, 3, 4, and 5 each receive 25% of Server 1's previous load. Although not shown in the figure, all the primary databases on other servers in the group are replicated in the same way. With this configuration, up to 80% of the resources can be used on all servers because a single server failover adds only 80%*25%=20% additional load to the secondary servers. In practice, the operation team leaves a little bit of room by setting the limit to 75%.

The failover group shown in Figure 2 has the simplest design with a resource utilization limit at 50%. Increasing the number of partitions and number of servers in a failover group raises the upper limit with the tradeoff of an increase in management complexity.

Cc966448.scddrtng03(en-us,TechNet.10).gif

Figure 3: Actual failover group architecture used by MSN CSP in production

Adding Failover Groups

A new failover group can be added to the system by using the scale-out management layer interface. One updates the information in the scale-out management layer configuration database for DDR and failover topology. LPS servers examine the data distribution across the back-end database servers when new accounts are requested, so adding a new failover group results in new accounts being directed automatically to the new group. Figure 4 shows the DDR and high-availability architecture of a system, before and after adding a failover group. The details of actual execution will be explained in the Pilot Study portion of the paper.

Cc966448.scddrtng04(en-us,TechNet.10).gif

Figure 4: Adding a failover group

Load Balancing

The load rebalancing of database servers after adding nodes is automatically done in a gradual way. When a new account is created, the LPS servers estimate the load on each database server and add the account to the server that has the most capacity. Using a heuristic indicator of the load of a database server, the new database servers receive all the new user accounts and the corresponding workload until their load reaches a similar level to the pre-existing database servers. This ensures a smooth scale out with the total throughput scaling linearly to the number of nodes without sacrificing response time or performance during a failover.

High Availability

The database uptime requirement for MSN CSP is 100% for reads. Given the use environment over the Internet, a 10-minute downtime per year for write access is allowed. CSP has achieved these goals 100% of the time in the two years of operation.

There are advantages and disadvantages for high availability using a scale out database design. Since there are more servers, the chance of a single system failure increases. However the affected data is smaller when compared to the loss of a single server that is hosting the entire dataset. The scale-out management layer provides the deployment and administration foundation for replication-based failsafe sets for MSN CSP, working in conjunction with various hardware failsafe mechanisms such as RAID and redundant power supplies.

Replication as a High-Availability Solution

CSP uses SQL Server transactional replication to achieve high availability because it offers a combination of low latency and transactional consistency guarantees. If the hardware, operating system, or SQL Server primary instance fails, or is taken down for maintenance, the secondary copy picks up the workload. The secondary copy is only used for failover since CSP is not designed to read or write to both the primary and the secondary concurrently. There are two reasons for this decision:

  1. Application design would be much more complex. It would be necessary to implement bidirectional replication between primary and secondary.

  2. The secondary copies coexist with primary copies of different dataset partitions on the same nodes. Reading from the secondary database would take resources from those primary databases.

Transactional replication uses the transaction log to capture incremental changes that were made to data in a published table. Microsoft® SQL Server™ 2000 and 2005 monitor INSERT, UPDATE, and DELETE statements, or other modifications made to the data, and store those changes in the distribution database, which acts as a reliable queue. Changes are then propagated to Subscribers and applied in the same order in which they occurred by opening a connection to the subscriber database and issuing SQL commands to the subscriber database.

In applications where there are high write to read transaction ratios, replication may lag behind the transaction processing. The limitation is basically how fast the replication commands are executed. The common limiting factors are network latency, index overhead on the subscriber database, and number of connections to the Subscriber that is executing the commands. When the source system transactions per second exceed the replication capacity of the system, replication latency will keep climbing until the transaction load is reduced. Queue buildup can be monitored using the performance counter: \SQL Server: Replication dist.\ dist: Delivery latency for transactional replication latency.

Losing the primary system causes the loss of transactions in the replication queue. The tolerable level of transaction loss depends on the business needs and desired end user experience. For CSP, the upper design limit is ten minutes. In other words, the CSP client applications will tolerate the loss of up to ten minutes of write requests to the database. This may not be acceptable for other applications where the Service Level Agreement (SLA) does not permit data loss. Other high-availability solutions will then be required.

There are a number of ways to get around the replication bottleneck:

  • Spread the data and workload out across more server machines, which reduces the workload per distributor, although this can lead to underutilized hardware. This is the option elected for CSP.

  • Since each server running SQL Server has only one distributor, having multiple SQL Server instances provides multiple distributors per server. By spreading the original workload across these instances, the replication load can be processed by more distributors per server. In this way, additional servers are not needed but additional management is required for allocating hardware resources among instances, such as CPU and memory. For best practices on running multiple instances, please see SQL Server Consolidation on the 32-Bit Platform using a Clustered Environment.

  • Microsoft SQL Server 2005 supports parallel replication and guarantees that transactions are processed to the Subscriber in the same order as to the Publisher. Since SQL Server 2005 has not been released at the time of this writing, this feature is not yet implemented on the CSP production site.

  • Increase parallelism by creating multiple publications in SQL Server 2000 and using the independent agent option. This option does not guarantee that transactions are processed to the Subscriber in the same order as to the Publisher. Therefore, it does not guarantee transaction consistency between all sets of published data and was not used by the CSP team.

The MSN CSP operations team monitors the stress level of client requests on each node. As the number and size of the accounts increases, so does the number of queries per node. When the stress level reaches a threshold, another failover group is added to the system.

System Failure Detection and Failover

The MSN scale-out management layer monitors the status of all nodes. It detects the failure of a server or a database and promotes the secondary database for the failed partition by redirecting the workload traffic to the secondary database.

The Web server application establishes connections to the back-end databases according to the information in the scale-out management layer configuration database. The Web server application makes requests against the correct physical database instance during processing. Return codes from SQL Server indicate connection problems. Connection timeout is also treated as a failure. Web server runs a client of the MSN scale-out management layer, which communicates the failure to the management layer. The management layer blacklists the failed databases and re-directs the Web server to their backups very quickly in a matter of seconds.

System Maintenance

As discussed earlier in this paper, system management presents a special challenge for a scale out system. For an OLTP system like MSN CSP, common routine management tasks include OS and application patching, adding nodes, database backups, and index defragmentation. These tasks need to be accomplished with minimal impact to data availability and workload performance. A few of the tasks can be done without taking any databases offline, while others require offline processing or even bringing down the entire server. Here we are going to discuss how MSN CSP uses the scale-out management layer administration operations to maintain application availability for all of these common management tasks.

Index Defragmentation and Rebuild

To help meet high availability SLA, online index defragmentation (DBCC INDEXDEFRAG) is generally preferred to rebuilding the index (DBCC REINDEX) which requires taking the database offline. The MSN CSP operations team runs a DBCC INDEXDEFRAG job every Saturday night, when client requests are lowest. Index rebuild is done less frequently and only when fragmentation increases to 40%, approximately every 6 to 8 weeks. The workload throughput is improved by 5-10% after the indexes are rebuilt. Please see Microsoft SQL Server 2000 Index Defragmentation Best Practices for information on how to measure and reduce index fragmentation.

Using the scale-out management layer, offline index rebuild can be accomplished with no impact to application availability as discussed later in the Pilot Study testing.

Repairing a database

It is possible for a database to become damaged by a hardware failure or operator error. If the primary databases are damaged, the workload is redirected to the secondary copies by the scale-out management layer’s automatic failover operation. To restore the damaged copy, the scale-out management layer operates as follows:

  1. Marks the database as “need repair.”

  2. Executes “repair” operation on the database, which backs up the primary database and restores it to the database marked for repair, leaving the database in the offline state.

  3. Marks the repaired database as “online” after the latency between the primary and the newly restored secondary database comes down to the target level of ten minutes.

The database can now be promoted to a primary or it can continue to play the secondary role depending on the operator’s preference.

Patching of OS or SQL Server

Some patches require a system reboot or a restart of the SQL Server service. In these cases, the node needs to be taken offline, and then placed back online by the MSN scale out management layer as follows:

  1. Mark the server “offline”.

  2. Patch this server.

  3. Mark this server “online,” which will resume replication between the two copies and synchronize them.

  4. Switch roles of the pair by demote/promote and repeat steps 1-3.

  5. Switch roles of the pair by demote/promote to return to the original configuration.

Pilot Study of MSN CSP on SQL Server 2005

We deployed a smaller scale of MSN CSP using the simplest failsafe set in the SQL Server scalability lab. The pilot environment allowed us to conduct tests, try different configurations, and establish the various scalability data points. Our pilot lab deployment consisted of twelve clients, three Web servers, two LPS servers, four back-end database servers and one scale-out management layer management server. The three Web servers are connected to the network through a switch for load balancing. The data and log files for all database servers were stored across the same group of disks on an EMC Clariion Storage Area Network with two Emulex Host Bus Adaptors installed on each server to provide for I/O load balancing and failover. See Appendix A for more details regarding the hardware configuration.

Throughput Increases with Additional Nodes

Figure 5 shows how the total number of queries processed by all the back-end database servers increases proportionally to the number of back-end database servers in our tests. This demonstrates the reason why some customers choose to scale out applications; they expect linear scaling with additional nodes. This is one of the reasons that the MSN CSP team could grow their application to the current size of 100 back-end servers as their user base grows and they will be able to grow it beyond 100 when necessary.

Cc966448.scddrtng05(en-us,TechNet.10).gif

Figure 5: Workload performance scaling versus number of back-end database servers

Storage Subsystem Expansion

We picked Storage Area Network storage over Direct Attached Storage (DAS) for the following reasons:

  • Centralized management

  • Increased flexibility and scalability since storage can be added without adding servers

  • Non-disruptive business operations when adding or re-deploying storage resources

  • Storage Area Network provides hardware-level high-availability solution

Potential drawbacks to Storage Area Network storage include:

  • Storage Area Network storage can be significantly more expensive than DAS

  • Requires special expertise to manage

When adding servers, it is possible to reach a performance bottleneck on the disk subsystem resulting in increased disk queuing (as monitored by performance counters Logical Disk/Avg. Disk sec/[Read,Write]]). This is a different issue than running out of disk space. Several Storage Area Network vendors, including EMC Clariion, support online disk LUN expansion. As part of our testing, we simulated a disk subsystem performance bottleneck on the EMC Clariion Storage Area Network by intentionally limiting the number of physical spindles available to the back-end database servers.

Microsoft Windows Server 2003 provides an online disk expansion feature, which can be used to concatenate more spindles to the LUN instead of rebuilding the entire stripe set. EMC Clariion provides a feature to rebuild the entire stripe set by copying and moving data at the disk level. The EMC approach provides advantages for our application:

  1. Data is more evenly distributed on physical spindles after re-striping; this offers improved performance across the entire LUN.

  2. We have more flexibility for disk subsystem configuration, such as increasing the number of spindles while maintaining the same size of LUNs, allowing new capacity to be used for other databases, or increasing the capacity of LUNs without increasing the number of spindles when there is extra space on the spindles.

  3. The operation is transparent to the OS and can be performed on both basic disks and dynamic disks.

In our tests, we started with two back-end database nodes, and then added two more nodes, sharing the same group of physical spindles for the database files. After adding the two additional nodes, we doubled the workload stress level. The disk queue increased accordingly and the disk latency increased as well; the workload performance was I/O bound. The red line in Figure 6 shows a non-linear scaling of workload from two nodes to four nodes.

Note:  In a real production environment, it is recommended that the LUN expansion operation precede the workload growth. This allows the storage to be extended and redistributed prior to the addition of new scale-out database servers without impacting the performance of the live environment.

EMC provides three priority levels for the Storage Area Network LUN expansion operation. We used the default setting (low priority) for our testing, which takes longer to complete but minimizes the disk activity of the operation and therefore the impact on the concurrent workload.

 

Before Expansion

After Expansion

Total size for data

320 GB

320 GB

Total size for log

200 GB

200 GB

Number of Spindles for data

24

32

Number of Spindles for log

16

24

Workload Scale factor

1.6

2.0

Table 2: Comparison of EMC Clariion Storage Area Network before and after LUN expansion

Table 2 lists the EMC Clariion Storage Area Network disk configuration before and after LUN expansion. Notice that we increased the number of spindles for LUNs without increasing the LUN size. The whole process took 44 hours with no measurable impact on the workload. It would complete in less time when using the medium or high priority levels for the operation, but the impact on the workload would increase. Given the same priority setting, the time of the operation is proportional to the disk space to be reconfigured.

Figure 6 shows how the I/O bottleneck was eliminated after the expansion and the workload performance was restored to the level of linear scaling.

Cc966448.scddrtng06(en-us,TechNet.10).gif

Figure 6: Workload performance scaling with and without disk I/O expansion

Replication as a High-Availability Solution

As discussed previously in this paper, replication combined with the scale out management layer provides a great high-availability solution. One drawback is the single-threaded limitation of transactional replication in SQL Server 2000, which prevents MSN CSP from fully utilizing the available server resources.

This issue is resolved in SQL Server 2005, which supports parallel streaming of replication. The distributor can process replication commands in multiple streams (1-64) with a guarantee of strict transaction serialization. The optimal number of streams and how well the performance will scale depends on multiple factors:

  • The number of CPUs. It is recommended that you not use more streams than the number of CPUs. Our test results showed that using 64 streams of replication on 4-processor machines adds significant CPU usage while giving throughput comparable to that of using between one and two streams.

  • Blocking. If the transactions overlap on the tables, the streams may block one another. The write access requests that are issued by the transactions of CSP workload are almost randomly distributed across the large tables, each containing only very few rows. So in our pilot, blocking is not significant when no more than four streams of replications are used.

  • Spare CPU capacity. Adding more streams will increase CPU usage and spare CPU capacity should be available.

  • The number of replication commands in a given period. There is no need to run more streams if the replication queue can be drained in a timely way.

Figures 7 and 8 show that in our testing, the replication throughput indicator (number of replication commands delivered per second) increased significantly when the number of replication streams was increased. Each additional stream adds about 1-2% CPU usage.

Figure 7: Replication throughput versus number of streams

Figure 7: Replication throughput versus number of streams

Figure 8: CPU usage versus number of replication streams

Figure 8: CPU usage versus number of replication streams

DBMS Maintenance

We tested the following three operations to demonstrate how proper application and management design can result in minimal impact to application availability.

  1. Adding a failover group

  2. Index defragmentation

  3. Major version upgrade of SQL Server

Adding a Failover Group

This was accomplished by adding a new failover group to the configuration file of the scale-out management layer and setting up the hardware configuration, OS, and Server SQL instances on the new machines. The scale-out management layer then proceeded to set up the databases, including schema, stored procedures, and replication on the new SQL Server instances. The scale-out management layer then updated the partition mapping and DDR information on the LPS servers to reflect the newly available partitions, as illustrated in Figure 4. Afterwards, the requests from Web servers were directed to both pre-existing servers and new servers according to the new DDR information.

Index Defragmentation

Database applications with large numbers of insert and update commands will experience database index file fragmentation, resulting in performance degradation for some workload. Eventually defragmentation is required to maintain optimal I/O performance. For more discussion and best practices of index defragmentation, see Microsoft SQL Server 2000 Index Defragmentation Best Practices.

SQL Server 2000 provides two options for index defragmentation: DBCC INDEXDEFRAG and DBCC REINDEX. DBCC REINDEX runs significantly faster than DBCC INDEXDEFRAG when the fragmentation level is high and multiple processors are available. However, in SQL Server 2000 the database has to be taken offline to rebuild indexes. On SQL Server 2005, these two commands are replaced by ALTER INDEX <table> REORGANIZE and ALTER INDEX <table> REBUILD WITH (OFFLINE) correspondingly. Offline index rebuild was performed in our test lab using the scale-out management layer as follows:

  1. Marked the secondary copy as “offline.”

  2. Ran ALTER INDEX <table> REBUILD WITH (OFFLINE) to rebuild all the indexes on the secondary copy.

  3. Marked the database as “online,” which resumed replication between the two copies and synchronized them. It took 20 minutes for the Latency between the primary and the newly restored secondary database to come down to the acceptable level of 10 minutes. If the primary database fails, the potential transaction loss will be acceptable under the SLA as discussed earlier.

  4. Switched roles of the pair using demote/promote and repeated steps 1-3 on the new secondary copy.

  5. After defragmenting the original primary, switched roles of the pair using demote/promote to return to the original configuration.

On Microsoft® SQL Server™ 2005 Beta 2, indexes can also be created, rebuilt, or dropped online by using a new feature: Online Index Operation. The ONLINE option allows concurrent user access to the underlying table or clustered index data and any associated nonclustered indexes during these index operations. We executed this SQL Server command to rebuild all the indexes of a table online in our testing:

ALTER INDEX ALL ON <table> REBUILD WITH (ONLINE, MAXDOP = degree of 
parallelism desired)

The degree of parallelism of index rebuild was set in steps from one to four. Our servers have 4 CPU each. We tested index defragmentation using the three different methods on 55 GB of data with average 20% logical scan fragmentation and normal stress level of concurrent workload. The workload stress level was adjusted to consume 52% of CPU and the number of transactions was 286 on the Primary Database. Running ALTER INDEX <table> REORGANIZE against all the tables resulted in minimal additional CPU usage while online index rebuild consumed significant CPU depending on the value of MAXDOP. Online index rebuild with high MAXDOP settings may impact the workload performance. Offline index rebuild is much faster than online operations. Online index rebuild is faster than online ALTER INDEX <table> REORGANIZE, with the actual time dependent on MAXDOP. In most cases, for high-availability environments, online index rebuild is preferred over offline index rebuild, assuming that you can tolerate the longer duration required to complete the operation. Whether to run online index rebuild or online ALTER INDEX <table> REORGANIZE mainly depends on the following three factors:

  • The stress level of workload. ALTER INDEX <table> REORGANIZE can scale back resource usage depending on the current stress level.

  • How often the operation has to be run. ALTER INDEX <table> REORGANIZE takes longer to complete if it scales back.

  • The characteristics of the fragmentation. ALTER INDEX <table> REORGANIZE leaves the interleaved extents within a data file. ALTER INDEX <table> REORGANIZE does not correct extent fragmentation on indexes, either. Interleaving occurs when index extents (a group of eight index pages) for an index are not completely contiguous within the data file, leaving extents from one or more indexes intermingled in the file.

Online Upgrade of SQL Server

Upgrading the version of the DBMS is a challenging task in any data center. We wanted to demonstrate that we could perform a major version upgrade of SQL Server without incurring any downtime. This rolling online upgrade is only possible in conjunction with a sophisticated application management layer like MSN CSP.

We upgraded a failover group of two back-end database servers from SQL Server 2000 to SQL Server 2005 as follows:

  1. Ran SQL Server Best Practices Analyzer (BPA) against the back-end databases. Compatibility issues were identified by the tool and were corrected. BPA with all related documentation can be downloaded from the Microsoft Download Center. The use of BPA tool for this purpose is subject to change.

  2. Marked Server 1 offline. The primary database of Partition #1 on Server 1was demoted and the secondary database of Partition #1 on Server 2 was promoted. Replication processes are paused for both databases on Server 1. All the workload for Partitions #1 and #2 was directed to Server 2.

  3. Ran SQL Server 2005 upgrade on Server 1, which took 30 minutes.

  4. After the upgrade, Server 1 was brought back online using the Scale out Management Layer admin console. It took 47 minutes for the replication latency to reach the 10 minute target.

  5. Repeated steps 2-4 for Server 2. All the workload for Partitions #1 and #2 is directed to Server 1.

  6. Promoted and demoted to restore the original distribution of primary and secondary copies so the workload is balanced among Server 1 and Server 2.

The entire operation took 3 hours with no loss of system availability.

Conclusion

This paper discusses the benefits and challenges of scaling out database applications. Using a real-world customer application, the MSN communication service platform (CSP), we executed various scenarios in a pilot environment to demonstrate how Data Dependent Routing can be used to accommodate data and workload growth with linear performance scaling, to achieve high availability using SQL Server transactional replication and to perform online system maintenance. The results demonstrate that one can successfully manage and scale out an enterprise class application using SQL Server.

For More Information

Appendix A: Hardware Configuration

Machine Configuration

Machine Role

Model

CPU

Physical memory

Storage

OS Version

Application

Database Server (4)

Dell 6650

2 GHz Xeon x 4

8 GB

Storage Area Network (see Storage Area Network configuration)

Windows Server 2003 Enterprise Edition

SQL Server 2005 Beta 2

LPS Server (2)

Dell 6650

2 GHz Xeon x 4

8 GB

Direct attached SCSI disk array
146 GB x 5

Windows Server 2003 Enterprise Edition

SQL Server 2005 Beta 2

Web Server (3)

Dell 2650

2.4 GHz Xeon x 2

4 GB

Local disk

Windows Server 2003 Enterprise Edition

IIS 6.0

Scale Out Management Layer Server

Dell 2650

2.4 GHz Xeon x 2

4 GB

Local disk

Windows Server 2003 Enterprise Edition

 

Web Client (12)

Dell 1650

1.4 GHz PIII x 2

2 GB

Local disk

Windows Server 2003 Standard Edition

 

Table 3: Machine configuration

Storage Area Network Configuration

EMC Clariion CX600
Disk speed: 10,000 RPM
Disk size: 146 GB
Each back-end server is connected to a 2 gigabit/s switched Storage Area Network via 2 HBA (see HBA configuration) running PCI-X protocol.

Table 4 shows the disk layout for the four database servers in the scalability test.

File Groups

Disk Array Layout Before Expansion

Disk Array Layout After expansion

Log and tempdb RAID groups

Total of 16 disks on RAID 10. 8 LUNs (25GB each) per server.

Total of 24 disks on RAID 10. 8 LUNs (25GB each) per server.

Data RAID groups

Total of 24 disks on RAID 10. 8 LUNs (40 GB each) per server.

Total of 32 disks on RAID 10. 8 LUNs (40GB each) per server.

Table 4: Disk layout for the database servers

Figures A-1 and A-2 depict the storage configuration changes executed during the growth from two back-end servers to four back-end servers

Cc966448.scddrtnga1(en-us,TechNet.10).gif

Figure A-1: Disk layout for two servers

Cc966448.scddrtnga2(en-us,TechNet.10).gif

Figure A-2: Disk group expansion in preparation for four servers

Host Bus Adaptors

Emulex LP9802 Host Bus Adapter
Bus speed: 133/100/66 MHz
Link speed: 2 gigabit/s fiber channel