SQL Server 2000 Incremental Bulk Load Case Study
By Man Xiong and Sunil Agarwal
Summary
Introduction
Understanding Bulk Load
Scenarios
Appendix A: Test Environments
Appendix B: Predicting Crossover Point
Incremental bulk load refers to loading data into a nonempty table. The key question during incremental bulk load is whether indexes should be dropped before bulk load. The answer depends on multiple factors. This paper attempts to answer this question through a case study of incremental bulk load using the BULK INSERT statement on a representative decision support system (DSS) running Microsoft® SQL Server™ 2000 on Microsoft Windows 2003 Server®. You should view the results presented in this paper as recommendations instead of absolute answers, because your results will depend on your hardware (for example, number of CPUs, I/O, and network bandwidth), data distribution in the table, and the data files and parameters supported by the BULK INSERT statement. All tests were conducted on a Hewlett-Packard hardware system (for more information about the test environments, see Appendix A, "Test Environments"). This paper provides general recommendations for the following scenarios:
Target table with no indexes:
Execute multiple bulk load commands concurrently.
Set your database must be set in Bulk-Logged recovery mode.
Use the TABLOCK parameter to minimize locking overhead and load each data file in a single batch.
Note The recommendations for the following scenario apply to the type of workload that was tested in the case study presented in this paper. These recommendations provide a good starting point for other kinds of workloads.
Target table with a clustered index:
Bulk load with clustered index in place for better loading performance and data availability.
Execute multiple concurrent bulk load commands with the optimal batch size to minimize blocking. The optimal batch size will depend on data size and distribution of existing and incremental data.
Target table with a single nonclustered index:
Bulk load with the nonclustered index in place when the size of the incremental data is in the same order as the existing data in the table.
Execute multiple concurrent bulk load commands with optimal batch size to minimize blocking. Optimal batch size will depend on data size and distribution of existing and incremental data.
Drop the nonclustered index before the bulk load if the size of incremental data is larger than the existing data in the table. You will need to create the nonclustered index afterward.
Target table with a clustered index and multiple nonclustered indexes:
If the size of the incremental data is a small fraction of the existing data in the table, load data with all the indexes in place by using a single bulk load command with the optimal batch size (determined by testing) for the highest loading rate.
If the incremental data is much more than a small fraction of the existing data in the table, drop all nonclustered indexes and then bulk load with the clustered index in place. You will need to recreate the nonclustered indexes after the bulk load.
With few exceptions, these recommendations are applicable when bulk loading data by using the bcp utility and Data Transformation Services (DTS). Exceptions are noted in this paper.
In a typical DSS environment, data is periodically loaded from external sources (for example, an online transactional processing [OLTP] system). The tables usually have multiple nonclustered indexes in addition to a clustered index for optimal performance of queries. This makes loading slower compared to loading into a table with no index because every insert requires an update of all indexes. This paper recommends best practices for incremental bulk load and explains the rationale behind these practices by using test data from four different scenarios. The recommendations are mainly applicable to achieving maximum bulk load throughput. Bulk loading is also common in OLTP workloads; it is not unique to DSS workloads. The recommendations provided in this paper are equally applicable to OLTP workloads.
Several factors affect the performance of bulk load operations (for example, the BULK INSERT statement and bcp utility). The following sections describe these factors and suggest ways to improve performance. The parameters described are applicable to the BULK INSERT statement. Similar parameters are also available for other bulk load methods.
SQL Server supports three bulk load methods:
Bulk Insert task as a Transact-SQL command
bcp utility
DTS
Transact-SQL BULK INSERT statement executes in-process with SQL Server, sharing the same memory address space. Because the data files are opened by a SQL Server process, cross-process data copying is avoided.
BCP and DTS executables are both run out of process and therefore require inter process data copying and parameter marshaling to move data across process memory spaces. Inter process data marshaling is the process of converting parameters of a method call into a stream of bytes and can add significantly to CPU overhead. With Bulk Insert, you can skip these steps and can go straight to OLE-DB. However, unlike Bulk Insert, with BCP parses the data and does the data conversion into native storage format in the client process. This can provide a performance advantage over Bulk Insert if SQL Server is running say on a single processor box as otherwise SQL Server process is burdened, when doing BULK INSERT, with the parsing and data conversion. Depending on the processing and copying, you may see a gain of 20 percent or a degradation up to 100 percent (for example, when bulk loading LOB data) in the bulk load performance with BCP over Bulk Insert.
For this case study, we chose Bulk Insert command for bulk load operations.
Bulk load operations can perform optimized logging (that is only log page allocations are logged, not the actual data or index rows) under certain conditions. If the database is configured in full-recovery mode, specialized logging for bulk and sort operations is not enabled. If possible, you should set the recovery mode of the database to either Bulk-Logged Recovery (preferred) or Simple Recovery during bulk load operations.
Note Setting the database recovery mode is just one of the conditions to enable optimized bulk logging. It also depends on other factors discussed later in this paper.
The schema of the target table affects query plans for bulk load operations, optimized logging, and concurrent bulk load and ultimately has bearing on the throughput of the bulk load operation. The following sections explain elements of the target table schema that affect the query plan.
Most tables have indexes; the target table for bulk load operation is no exception. However, having indexes on the target table affects the performance of bulk load. You have two options for the target table indexes: drop one or more indexes, bulk load the data, and then recreate the dropped indexes or do not drop any indexes. Consider the following cases when the target table has:
No index When the target table has no indexes and TABLOCK is specified, you can execute the multiple bulk load commands concurrently. These concurrent bulk loads use a Bulk Update (BU) lock at the table level that inserts the data without blocking other sessions. If you use TABLOCK with no index, the bulk-logged optimizations are also available. It does not matter if the table is empty or full. This is the fastest way to bulk load data. The BU lock, however, conflicts with regular share locks and exclusive locks.
If TABLOCK is not specified, you can still execute multiple bulk load commands concurrently, however, each of these commands uses regular locking, like the locking used by Transact-SQL Insert, and the bulk-logged optimizations are not available.
Single nonclustered index or a clustered index When the target table has an index, concurrent bulk loading with BU lock is not possible with TABLOCK.
Note You can still execute concurrent bulk loads without specifying TABLOCK (as discussed in the case of no index), and it will be done using regular locking. However, the concurrent bulk load sessions may become blocked. In this case, the bulk-logging optimizations are only available when the table is empty to start with and the bulk load is done in one batch.
Note If you bulk load data in multiple batches in this case, the bulk-logged optimization will not be available starting with the second batch because the table becomes "nonempty" after the first successful batch.
Typical query plan on a table with clustered index is
(datafile-Scan) --\> (Sort on Clust-Key) --\> (Insert into Clust-Index).
You can avoid the sort if the data in the data file is already sorted on clustered index key columns by specifying an ORDER hint.
Similarly, a typical query plan on a table with single nonclustered index is
(datafile-Scan) --\> (Insert data into the table) --\> (Sort on NonClust-Key) --\> (Insert into NonClust-Index).
Note Even if the input data is sorted nonclustered key columns, the sort cannot be eliminated. ORDER hint is applicable only when inserting data into the base table (with clustered index). It is ignored for all other cases.
- Clustered and multiple nonclustered indexes When the target table has a clustered index and multiple nonclustered indexes, concurrent bulk load with BU lock is not possible with TABLOCK.
Note You can still execute concurrent bulk loads without specifying TABLOCK (as in the case of no index), and it will be done using regular locking. However, the concurrent bulk load sessions may become blocked. In this case, the bulk-logging optimizations are only available when the table is empty to start with and the bulk load is done in one batch.
Note If you bulk load data in multiple batches in this case, the bulk-logged optimization will not be available starting from the second batch because the table becomes "nonempty" after the first successful batch.
A typical query plan under this case is more complex, but the initial part is same as in the case of a single nonclustered index or a clustered index.
--\> Spool --\> Sort on NonClust-Key --\> insert into NonClust-Index1
(previous plan) --\>| .......
--\> Spool --\> Sort on NonClust-Key --\> insert into NonClust-Index2
After the clustered index is created, all the attributes required to create nonclustered indexes are spooled. These attributes include all key columns of nonclustered indexes and the clustered key columns. The clustered index key columns are used to point to the data row from the nonclustered index leaf pages. In the preceding query plan representing a bulk load, the data into each of the nonclustered indexes is inserted in parallel.
When the target table is empty, it makes sense to create indexes after the bulk load has finished for two reasons. First, you can perform concurrent bulk load with BU lock and optimized bulk logging. Second, you can perform each index creation in parallel.
Note In this paper, parallel refers to mean a single command that is executed by multiple threads. For example, a single Transact-SQL CREATE INDEX statement can be executed by multiple threads. In this case, a BULK INSERT statement cannot be executed in parallel. You must invoke multiple BULK INSERT statements to load data in parallel.
The constraints are checked for every row that is inserted. If appropriate, it is recommended that you disable checking of constraints. The only constraint that can be disabled is the check constraint. You cannot disable the unique constraint, primary-key/foreign-key constraint, or NOT NULL constraint. When the check constraint option is re-enabled, SQL Server must check the entire table to revalidate the constraints. For this reason, disabling constraints during incremental bulk load is not recommended because the cost of revalidating the constraint for the entire table exceeds the cost of applying constraints to the incremental data.
A situation in which you might want to disable constraints is if the input data contains rows that violate constraints. By disabling the constraints, you can load the data and then use Transact-SQL statements to clean up the data.
If there are triggers defined for insert operations on the target table, they will be fired for every batch completed. If appropriate, disable the execution of triggers during bulk load operation.
The BULK INSERT statement accepts the TABLOCK hint, which allows the user to specify the locking behavior used.
When bulk loading into a heap, TABLOCK specifies that a bulk update (BU) table-level lock is used for the duration of the bulk load. This can improve the performance of the bulk load operation because of reduced lock contention on the table. Also, TABLOCK is a required parameter to bulk logging when the recover mode of the database is configured in BULK-LOGGED or SIMPLE recovery mode.
When bulk loading into a table with one or more indexes, TABLOCK forces bulk load operations to acquire X table lock so no concurrent bulk load is possible.
If you do not specify TABLOCK, bulk load does not acquire table lock, but acquires locks on rows or pages instead. However, row or pages locks may be escalated to X lock depending on your batch size and the concurrent activity on the target table. SQL Server tries to escalate the row or page locks to table level if the number of locks exceeds 5,000.
Table 1 describes logging and locking behavior when the target table is nonempty (meaning that it is an incremental bulk load).
Table 1 Logging and locking behavior for a nonempty target table
Table schema |
TABLOCK |
Logging type |
Concurrent bulk load |
---|---|---|---|
Heap (including LOBS) |
Yes |
Bulk-logged |
Yes (BU lock) |
Heap (including LOBS) |
No |
Fully-logged |
Yes (regular X lock on rows or pages) |
Table with indexes |
Yes |
Fully-logged |
No |
Table with indexes |
No |
Fully-logged |
Yes (regular X lock on rows or pages) |
Note Regarding table schema, heap refers to a table with no index clustered or nonclustered. LOBs are allowed even though LOBs are represented internally as an index for allocation purposes.
Note Concurrent bulk load refers to multiple bulk load commands each with its own data stream. If TABLOCK is specified, the bulk load command waits to acquire the table lock. However, in the absence of TABLOCK, each bulk load command acquires either a row or page lock (assuming no lock escalations) depending on the locking granularity setting on the table.
Bulk load is typically CPU bound. Therefore, if SQL Server is running on a multiprocessor computer, you will see higher bulk load throughput when multiple bulk load commands are invoked concurrently, assuming no blocking. Additionally, during bulk load, a special table-level lock (BU lock) is available when you specify TABLOCK. The BU lock functions as an exclusive lock because it blocks access to the table through regular Transact-SQL statements, but concurrent bulk load threads are not blocked. BU lock is available only on the heap (meaning a table with no indexes) and provides the best concurrent load performance for this case. Table 1 describes conditions for concurrent bulk load.
Note It is not necessary for the database to be in SIMPLE or BULK-LOGGED recovery mode for concurrent bulk load.
To execute bulk load concurrently, you need to issue multiple bulk load commands with each command reading from its own data file. Multiple bulk load commands cannot read from the same data file.
If the input data is sorted on the clustered key column, it eliminates the sort as described in the query plan for single nonclustered index or a clustered index in the section "Indexes" earlier in this paper.
The default setting of BATCHSIZE is the length of the input file for a BULK INSERT statement. Using a nondefault value splits the bulk load into one or more transactions. Each transaction inserts up to BATCHSIZE number of rows. A smaller batch provides the following benefits:
When indexes are present, a smaller batch reduces the memory needed for sorting. During concurrent bulk load, it may also reduce blocking depending on the data pattern and distribution in the input data files.
In case of failure or error, you will need to reload only the data starting from last unsuccessful batch. For example, if your bulk load takes 3 hours and it fails near the end, you will need to reload only the last batch instead of starting the whole bulk load over again.
The data in the data file can be in character format or in native (that is, binary representation) format. Loading data stored in character format requires parsing and then converting it to the native storage format based on the type of the column in the target table, which requires server resources. Character format is the most common format for data; however, it is more efficient to load data that is available in the native format.
The size of the input data is important in the case of indexes. When creating a clustered index, SQL Server sorts the data and increases its size. The larger the size of the input data the more memory (the sort may need to be run in multiple passes) that is required. Similarly, for nonclustered indexes except that the size of the data to be sorted depends on the size of the index key and number of rows. You can control the size of the data by specifying a smaller batch size. Also, depending on the size of data you want to load, the statistical information on the target table may change significantly. It is recommended that you recreate the statistics after bulk load operation, especially if you disabled the automatic regeneration of statistics.
In DSS workloads, it is beneficial to consider scenarios for incremental bulk load that involve a target table with one or more indexes. It is also beneficial to consider a scenario that involves bulk loading into a table with no index because in some situations (referred to as the crossover point) it is better to drop all indexes on the table, perform the incremental bulk load, and then recreate all the indexes instead of performing incremental bulk load with indexes in place.
A crossover point is defined as a ratio of the incremental data over the existing data in the target table below which you can get better incremental bulk load performance without dropping indexes. To put it simply, if you know the crossover point for a specific target table, you can determine if you should drop indexes or not before you bulk load the data.
The key, however, is to determine what the crossover point is. The crossover point depends on a number of variables such as the schema of the table, hardware resources, and data distribution. However, for a system with consistent resource and stress levels, it may be advantageous to determine the crossover point to define a good bulk load strategy.
Note If any dependent factors change, the crossover point needs to be recomputed. In this paper, the crossover point is calculated for each scenario. It is highly recommended that you view the best practices presented in this paper as general recommendations; your experience may be different because of differences in hardware and software.
Testing for the scenarios presented in this paper was based on the table defined in Appendix A, "Test Environments." The data that was bulk loaded was stored in character mode. There is CPU overhead as a result of loading character mode data because it must be parsed and converted to the target table column type before loading. You can eliminate this overhead if you load data in native mode.
The test results for this scenario suggest using concurrent bulk load commands with the default batch size and TABLOCK hint for the highest loading rate. This can be done by launching the same number of Bulk Insert commands as the number of CPUs using the TABLOCK hint. The database should be set in bulk-logged recovery mode.
The following command was used to bulk load the data for each of the eight concurrent bulk load commands:
BULK INSERT fact_table <data-file> WITH (TABLOCK)
Parameters:
TABLOCK: to execute multiple bulk load commands concurrently. The query gets a BU lock.
BATCHSIZE: not specified. Each bulk load command loads the data in a single batch.
Loading throughput goes up linearly with the number of processors. No I/O bottleneck was encountered. However, there was no other workload on the system. During the bulk load, the CPU utilization was nearly 100 percent. Figure 1 indicates the combined throughput that was achieved with eight concurrent bulk load commands.
Figure 1 Incremental bulk load into a heap
The following are recommendations for achieving optimal performance for incremental bulk load:
Use default setting for the batch size. It equates to the size of the data file and reduces the overhead of opening and closing the files.
Use the TABLOCK hint to improve performance. The TABLOCK hint improves performance by:
Enabling bulk logging.
Reducing the locking overhead because there are no row-level locks.
Executing multiple concurrent bulk load commands without blocking by using the BU lock mode.
It is better to keep the clustered index during incremental bulk load. In this scenario, the following two plans were used:
Plan A Incremental bulk load was performed by dropping the clustered index, loading the data into the target table, and then recreating the clustered index.
Plan B Incremental bulk load was performed without dropping the clustered index.
The test results suggest that it is better to load the data with the clustered index intact unless the incremental data is larger than the data in the original table.
The following commands were executed in sequence:
Drop clustered index fact_table.ci. (Executed by using a single thread. No parallelism possible.)
Bulk Insert fact_table <data-file>. (Executed with eight concurrent bulk load commands.)
create clustered index ci on fact_table(order_id). (Executed with eight-way parallelism.)
Parameters for Bulk Insert:
TABLOCK: To do bulk load concurrently. It gets a BU lock.
BATCHSIZE: Not specified. Each bulk load command loads the data in a single batch.
Bulk Insert fact_table <data-file>. (Batch size was set to 5,000 rows.) (Executed with eight concurrent bulk load commands without dropping the clustered index.)
Parameters for Bulk Insert:
TABLOCK: Not specified. If specified, the result would have been an X table lock.
BATCHSIZE: Specified to reduce locking contention.
For Plan A, performance was measured for dropping and creating a clustered index. After the clustered index was dropped, the incremental load was essentially loading data into a heap. For more information about bulk loading without an index, see "Incremental Bulk Load Without an Index" earlier in this paper.
As shown in Figure 2, the result was approximately linear scalability in creating the clustered index. Eight-way parallelism was used to create the clustered index. Because the index did not fit in the memory, sort was performed by using multiple-passes.
Figure 2 Create or drop clustered index with increasing table size
Note The actual degree of parallelism for index creation is determined by the following settings:
Number of processors presented by the operating system, which is set by the boot.ini option numproc.
SQL Server CPU affinity, which is set by the sp_configure command option.
Maximum degree of parallelism setting for SQL Server, which is set by the sp_configure command option.
Available system resources at the time index creation starts, which is determined dynamically. If the system is busy, the degree of parallelism is scaled back to a smaller number. After index creation starts, the degree of parallelism will not change.
Note The time to create a clustered index depends on several factors, such as I/O bandwidth, number of CPUs, available memory, size of the table, size of the clustered key, recovery mode of the database, and the current load on the system. In the tests presented in this paper, the index creation was CPU bound.
It is expensive to drop a clustered index because after the clustered index is dropped, the database engine must update the available space for each data page. This involves a complete scan of the table. The tests showed that dropping a clustered index was more expensive (based on time) than creating the same clustered index. This is because the create index can take advantage of the eight processors by using parallel query plans, whereas, dropping the clustered index is single threaded.
For Plan B, tests indicated that the loading rate stays constant when loading tables with only a clustered index (Figure 3). The loading throughput remains constant throughout the loading phase as the size of the table was varied from 5 gigabytes (GB) to 1 terabyte. Because eight concurrent bulk load commands were executed, there was no lock escalation.
Figure 3 Incremental bulk Load
Test results showed the crossover point to be 350 percent (that is, when the size of incremental data exceeds 3.5 times the original data in the table, it is worthwhile to go with plan A). For more information about the mathematical model used in the tests, see Appendix B, "Predicting Crossover Point." Another benefit of not dropping the index is that the table is available for queries during bulk load.
Note In the test presented in this paper, the existing data and incremental data was uniformly distributed on clustered key columns. If the distribution of the data is skewed, it may lead to significant contention. You can minimize contention by reducing the batch size or the number of concurrent bulk load commands
Figure 4 illustrates the crossover point as the size of the incremental data was increased. The initial size of the table was 91 GB for this test. This initial size should not affect the crossover point because it is based on the percentage of the initial size of the table. For every plotted point, the time for loading includes the time spent performing all the steps of the plan. For example, the total time for loading an additional 200 percent of the data into this table took 26,663 seconds for Plan A, which consists of 10,391 seconds for dropping the clustered index, 4,203 seconds for loading an additional 182 GB of data into the table with no index, and 12,069 seconds for recreating the clustered index on the table with 273 GB of data. Creating the clustered index took longer than the dropping it because the size of the table increased 200 percent after the incremental load of the data.
Figure 4 Crossover point for incremental bulk load with a clustered index
The following are recommendations for achieving optimal performance for incremental bulk load. These recommendations are applicable to the type of workload that was tested in the case study presented in this paper. They also provide a good starting point for other kinds of workloads:
Unless the size of incremental load is significantly larger than the current size of the target table, you are likely to get better bulk load performance by using Plan B.
Check the clustered key distribution in the target table and the incremental data. If there is no overlap in the clustered key values, you can bulk load data in parallel with minimal or no blocking. Typically, in a data warehouse scenario, the incremental data is concentrated at the tail of the clustered key, which can lead to more contention with concurrent bulk load commands.
Table 2 shows how loading throughput increases with multiple concurrent bulk load commands. The tests showed that the bulk load throughput does not go up linearly with the number of processors. This is caused by the lock or latch contention between individual bulk load commands and the overhead of smaller batch size or transactions, including closing and opening of the input data files. The loading throughput increases approximately twofold when a single bulk load command is increased to eight concurrent bulk load commands. However, total CPU usage increases from 12 percent (that is, one CPU at 100 percent) to 95 percent—an eightfold increase. Different batch sizes were chosen depending on the number of concurrent bulk load commands required to achieve optimal performance. For example, for the single bulk load command, we used a single batch, while for eight concurrent bulk load commands we used a batch size of 5,000 rows.
Table 2 Increase in loading throughput with multiple, concurrent bulk load commands
Number of bulk load commands
Loading throughput (MB/sec) with optimal batch size
CPU percent usage (across all 8 CPUs)
1
4.48
12
2
6.25
24
8
9.03
95
Use a smaller batch size to minimize contention if the data overlaps. Each batch in the bulk load runs as a separate transaction. So a smaller batch size will limit the number of rows locked, thereby minimizing blocking. Also, a smaller batch size enables in-memory sort on the clustered key column of the data to be inserted into the target table.
In this scenario, the following plans were tested:
Plan A Incremental bulk load was performed by dropping the nonclustered index, loading the data into the target table, and then recreating the index.
Plan B Incremental bulk load was performed without dropping the nonclustered index.
In this scenario, it is better to keep the nonclustered index during incremental bulk load if the size of the incremental data is similar to the initial size of the table. If it is significantly larger, it is better to drop the index before the load and the recreate it.
The following commands were executed in sequence:
Drop nonclustered index fact_table.nci_1. (Executed single threaded.)
Bulk Insert fact_table <data-file> with TABLOCK. (Executed by using eight concurrent bulk load commands.)
Create nonclustered index nci_1 on fact_table(order_id). (Executed by using eight-way parallelism.)
Parameters:
TABLOCK: To execute bulk load commands concurrently. It gets a BU lock.
BATCHSIZE: Not specified. Each bulk load command loads the data in a single batch.
Bulk Insert fact_table <data-file> with batchsize=5000. (Executed with eight concurrent bulk load commands.)
Parameters:
TABLOCK: Not specified. If specified, the result would have been an X table lock.
BATCHSIZE: Specified to reduce locking contention.
For Plan A, dropping a nonclustered index in SQL Server 2000 is instantaneous. After the clustered index is dropped, the incremental load essentially loads data into a heap. Please refer to the earlier scenario. For more information about incremental bulk load performance in a heap, see "Incremental Bulk Load Without any Index" earlier in this paper.
Given the small size of index entry and the shallow depth of the B-tree, the function is close to linear. Figure 5 illustrates how the time to create a nonclustered index increases almost linearly with the size of data in the table.
Figure 5 Time required to create a nonclustered index
Note The time required to create the nonclustered index is only 20 percent less than the clustered index, even though the size of the data to be sorted in creating the clustered index is approximately five times larger than that of the nonclustered index. This is because the creation of index was CPU bounded. If it were I/O bounded, there would be a more significant performance difference.
For Plan B, the rate of loading with only a nonclustered index in place was constant. As Figure 6 illustrates, the loading throughput remained constant throughout the loading phase from 5 GB to 1 terabyte (size of data only, index was not included) for the same table with only a nonclustered index.
Figure 6 Throughput versus size of existing data
Figure 7 illustrates the crossover point as the size of the incremental data was increased. The crossover point was at 98 percent for the table in this test. This number is similar to what is predicted by a formula described in Appendix B. For this test system, it was concluded that when the additional data is less than the existing data, it is more efficient to keep the nonclustered index to load incremental data. Otherwise, it is more efficient to drop the index before loading.
Figure 7 Crossover point for incremental bulk load
The following recommendations apply to the type of workload that was tested in the case study presented in this paper. These recommendations provide a good starting point for other kinds of workloads. Recommendations for optimal bulk load rate are:
Keep the nonclustered index during incremental bulk load if the size of the incremental data is similar to the target table. Otherwise, it is better to drop the index before the load and the recreate it.
Use a smaller batch size to minimize contention if the data overlaps. Each batch in the bulk load runs as a separate transaction, therefore, a smaller batch size limits the number of rows locked, thereby minimizing blocking. Also, a smaller batch size enables you to perform in-memory sorts on the nonclustered key column of the data that you want to insert into the target table.
Similar to what we saw in a table with a clustered index, if you bulk load without dropping the index because of lock or latch contention, the bulk load does not scale up linearly by increasing the concurrency. Table 3 shows that the bulk load throughput increased only 2.5 times with eight concurrent bulk load commands. Therefore, unless there are no competing requests for CPUs, you receive limited benefit by running concurrent bulk load commands. To achieve optimal performance in the test presented in this paper, different batch sizes were chosen depending on the number of concurrent bulk load commands. For example, for the single bulk load command, a single batch was used, while for eight concurrent bulk load commands, a batch size of 5,000 rows was used.
Table 3 Increase in bulk load throughput
Number of concurrent bulk load commands
Loading throughput (MB/sec) with optimal batch size
Percent of CPU usage (across all 8 CPUs)
1
4.56
12
2
7.02
24
8
10.11
94
In this scenario, the following plans were tested:
Plan A Drop all indexes including the clustered index, load the data, and then recreate all indexes.
Plan B Drop all nonclustered indexes, load the data with the existing clustered index, and then recreate all nonclustered indexes. The decision to keep the clustered index intact and drop the nonclustered indexes was made for the following reasons:
It is expensive to drop a clustered index. When a clustered index is dropped, SQL Server goes through all data pages to update free space information for each allocated page.
It is relatively expensive to create the clustered index depending on the size of the data rows.
Plan C Incremental bulk load without dropping any indexes.
The test results recommend Plan B, unless the incremental data is a very small fraction of the existing data.
The following commands were executed in sequence:
Drop all nonclustered indexes.
Drop the clustered index (executed single-threaded).
Bulk Insert fact_table <data-file> (executed with eight concurrent bulk load commands).
Create the clustered index.
Create all nonclustered indexes.
Parameters to Bulk Insert when loading into a heap:
TABLOCK: To execute bulk load commands concurrently. It gets a BU lock.
BATCHSIZE: Not specified. Each bulk load command loads the data in a single batch.
The following commands were executed in sequence:
Drop all nonclustered indexes.
Bulk Insert fact_table <data-file> with a batch size of 5,000. (Executed with eight concurrent bulk load commands.)
Create all nonclustered indexes.
Parameters to Bulk Insert:
TABLOCK: Not specified. If specified, the result would be an X table lock.
BATCHSIZE: Specified to reduce locking contention.
Command executed without dropping any indexes.
Bulk Insert fact_table <data-file>. (Executed with a single bulk load command.)
Parameters to Bulk Insert:
TABLOCK: Not specified. If specified, the result would be an X table lock.
BATCHSIZE: A batch size of 20,000,000 rows was used.
For Plan A, the time taken includes the sum of the time required to drop all nonclustered indexes and the clustered index, bulk load the data into the heap, and then recreate the clustered and nonclustered indexes. After all indexes are dropped, the incremental load is essentially a bulk load into a heap. Please refer to earlier scenarios. For more information about performance during the incremental bulk load into a heap, see Figure 1 earlier in this paper. For more information about performance for dropping and recreating indexes, see Figure 2 and 5 earlier in this paper.
For Plan B, the time taken includes the sum of the time required to drop all nonclustered indexes, bulk load the data into the table with the clustered index, and recreate all nonclustered indexes. For more information about incremental bulk load performance into a table with clustered indexes, see "..." earlier in this paper. For more information about performance for dropping and recreating nonclustered indexes, see Figure 2 and 5 earlier in this paper.
For Plan C, the bulk load throughput with all indexes intact is significantly slower than the corresponding throughput into a heap. As a general observation, when the number of indexes increases, the loading throughput decreases significantly because of locking contention. In the tests presented in this paper, there was a slight dip in the bulk load throughput as the size of the existing data increased, as shown in Figure 8.
Figure 8 Incremental bulk load throughput versus size of existing data
Test results indicated that Plan B is more efficient than Plan A, as illustrated in Figure 9. The crossover point for Plan C and Plan B is predicted to be 1 percent for the table with one clustered index and seven nonclustered indexes by the formula explained in Appendix B. In other words, when the additional data is less than 1 percent of the existing data, it is more efficient to keep all the indexes while loading additional data. Otherwise, it is faster to drop all nonclustered indexes, load data with the clustered index, and then recreate the nonclustered indexes. Plan B also has the advantage of keeping data and the clustered index available during the process.
Figure 9 Crossover point for incremental bulk load
The following recommendations apply to the type of workload that was tested in the case study presented in this paper. These recommendations provide a good starting point for other kinds of workloads.
It is common for a table in a DSS to have multiple indexes, including nonclustered indexes, clustered indexes, or both, to optimize queries. When the number of indexes increases, blocking becomes a more serious performance issue with concurrent bulk load commands. Because of this, it is recommended that you execute a single bulk load command on tables with multiple indexes.
Consider keeping all indexes during incremental bulk load if the size of the incremental data is less than 1 percent of the initial size of the table; otherwise, it is better to use Plan B.
During incremental bulk load with indexes, unless the data is already sorted on the clustered key and a sort hint is used, the data is sorted before insertion to improve the cache hit ratio during row-by-row insert. Performance gained by fitting the batch data into the physical memory outweighs the performance that is lost by the overhead of opening and closing the file. Figure 10 illustrates how batch size affects average loading performance (including spooling, sorting, and inserting phases) when loading data into a table with one clustered index and seven nonclustered indexes. The large batch size causes data spooling and sorting to overflow from memory to tempdb, which requires a large space for tempdb and a sufficient number of disk spindles for concurrent sorts in tempdb. Given the size and disk bandwidth of tempdb for a typical DSS, this is a small concern. Large physical memory will improve performance by keeping more data in memory and therefore reducing the overhead produced by opening and closing the file.
Figure 10 Incremental bulk load to a table with one clustered index and seven nonclustered indexes
Main database server: Hewlett Packard Proliant 8500
CPU: 8 processors with 733 megahertz (MHz)
RAM: 8 gigabytes (GB)
Host bus adapters (HBAs): 4 Emulex 952 HBAs
Storage: Hewlett Packard Enterprise Virtual Array (EVA)
Storage cabinets: 2
Controllers: 4
Disks: 168 disks with 72 GB Fibre Channel and 10-kilobyte (KB) RPM
84 disks: 36-GB Fibre Channel 10-KB RPM
Storage Area Network (SAN) Switch: Brocade Silkworm 3800
SAN Management Appliance: Proliant DL380
Software
Windows 2000 Advanced Server with Service Pack 3
Microsoft SQL Server 2000 Enterprise Edition with Service Pack 3
SQL Server settings
AWE was enabled
Maximum server memory was set to 7 GB to allow SQL Server to use up to 7 GB of memory
Data and database schema
A representative DSS database schema was used. The fact table was used as the target of bulk load. Data types in the table include int, money, datetime, char, and varchar. Here is the schema of the table:
Table Name: fact_tabe
Columns:
Order_id int,
Part_id int,
Supplier_id int,
Line_no int,
Price money,
Sale_price money,
Special money,
Tax money,
Status01 char(1),
Status02 char(1),
Shipdate datetime,
Commitdate datetime,
Receivedate datetime,
Directions char (25),
Comments varchar (40)
General char(10)
Clustered Index: ci: key: (shipdate)
Nonclustered index: nci_1: key: (order_id)
Nonclustered index: nci_2: key: (part_id, supplier_id)
Nonclustered index: nci_3: key: (price)
Nonclustered index: nci_4: key: (status01)
Nonclustered index: nci_5: key: (commitdate)
Nonclustered index: nci_6: key: (shipdate)
Nonclustered index: nci_7: key: (comments)
Eight flat files containing unsorted and nonunique data.
Both existing and incremental data are uniformly distributed on the index keys. Data ranges overlap significantly among flat files (for example, bulk load threads) and between existing data and incremental data.
The performance of bulk load depends on various factors like the table schema, data distribution in the target table and the incremental data, hardware configuration, and the workload. No standard number applies across all situations. However, if you can compute bulk load performance on a representative dataset and environment, you will be able to predict, with a reasonable level of certainty, the crossover point. This appendix describes how to compute the crossover point by using representative data.
Step 1: Collect data
Data that needs to be collected in experiments:
Loading time / size of data without indexes (T with no indexes): 22 sec/ GB.
Loading time / size of data with the clustered index (T with the index): 112 sec/GB.
Time to create the index / size of table(ac): 45sec /GB.
Time to drop the index / size of table(ad): 114 sec/GB
You can perform these benchmark experiments by using a smaller scale or collect data from previous loads.
Step 2: Estimate the time for each execution plan
Assume that you need to load X amount data into a table of size X0.
Plan A: Load without indexes Because the loading speed stays constant on tables with no indexes, the time to load X amount of data with no indexes equals to X * T with no indexes.
Plan B: Load with the index Because the loading speed stays constant on tables with a clustered index, the time to load X amount of data with the index equals to X * T with the index.
Because the time to drop a clustered index on a table increases linearly with the size of existing data in the table, the time to drop an index on a table of size X0 = ad X0.
Because the time to create a clustered index on a table increases linearly with the size of the existing data in the table, the time to create an index on a table of size X0 = ac (X0 + X).
So the total time for plan Bis X * T+ ad X0 + ac (X0 + X).
Step 3: Predict crossover point for loading plans
At the crossover point between Plan A and Plan B,
X * T with the index = X * T with no indexes + ad X0 + ac (X0 + X) so
X / X0 = (ad + ac) / (T with the index – T with no indexes - ac)
If (T with the index – T - ac) < 0, no positive crossover point exists, so it is always better to load data with indexes.
Computing the crossover point using data from tests presented in this paper:
X/ X0 = (114 + 45)/(112 – 22 – 45) = 350% (approximately)
This number was found in the tests presented in this paper. Considering the table is not available during dropping and recreating the clustered index and 350 percent additional data to be loaded is not very common. Loading data with the clustered index is more favorable for most of the cases.
Step 1: Collect data
Data that needs to be collected in experiments:
Loading time / size of data without indexes (T with no indexes): 22 sec/ GB.
Loading time / size of data with the clustered index (T with the index): 99 sec/GB.
Time to create the index / size of table (ac): 38 sec/GB.
You can perform these benchmark experiments by using a smaller scale or collect data from previous loads.
Step 2: Estimate the time for each execution plan
Assume that you need to load X amount data into a table of size X0.
Plan A: Load with no indexes Because the loading speed stays constant on tables with no indexes, the time to load X amount of data with no indexes equals X * T with no indexes.
Plan B: Load with the indexes Because the loading speed stays constant on tables with a nonclustered index, the time to load X amount of data with the nonclustered index equals X * T with the index.
Because the time to create a nonclustered index on a table increases linearly with the size of existing data in the table, the time to create an index on a table of size X0 = ac (X0 + X).
Therefore, the total time for Plan A is X * T with no indexes + ac (X0 + X). In this case, the time to create the nonclustered index is ignored because it is instantaneous.
Step 3: Predict crossover point for loading plans
At the crossover point between Plan A and Plan B,
X * T with the index = X * T with no indexes + ac (X0 + X) so
X / X0 = ac / (T with the index – T with no indexes - ac)
If T with the index – T with no indexes - ac < 0, no positive crossover point exists. In this case, it is always better to load data with the indexes.
Computing the crossover point using data from the tests presented in this paper:
X/X0 = (38)/(99 – 22 – 38) = 38/39 = 100% (approximately)
This number was found in the tests presented in this paper. Without the time required for dropping indexes, dropping the nonclustered index may be favorable if a significantly larger amount of additional data compared to the existing data needs to be loaded.
Step 1: Collect data
Data that needs to be collected in experiments:
Loading time / size of data without indexes (T with no indexes): 22 sec/ GB.
Loading time / size of data with all the indexes (T with all the indexes): 35067 sec/GB.
Note Only the speed collected with > 5 GB existing data was used.
Loading time / size of data with only the clustered indexes (T with the clustered index): 112 sec/GB.
Time to create all the indexes / size of table(ac all): 324 sec /GB.
Time to create all the indexes / size of table(ac all nonclustered): 362 sec /GB.
Time to drop the clustered index / size of table(ad): 114 sec/GB
You can perform these benchmark experiments by using a smaller scale or collect data from previous loads.
Step 2: Estimate the time for each execution plan
Assume that you need to load X amount data into a table of size X0:
Plan A: Load with no indexes Because the loading speed stays constant on tables with no indexes, the time to load X amount of data with indexes = X * T with no indexes.
Because the time to drop a clustered index on a table increases linearly with the size of existing data in the table, the time to drop an index on a table of size X0 = ad X0.
Because the time to create a nonclustered index or a clustered index on a table increases linearly with the size of existing data in the table, the time to create a group of nonclustered indexes and a clustered index increases linearly with the size of existing data in the table, so the time to create an index on a table of size X0 = ac all (X0 + X).
Therefore the total time for Plan A = X * T with no indexes + ad X0 + ac all (X0 + X).
Plan B: Load with only the clustered index Like Plan A, the total time for Plan B = X * T with the clustered index + ac all nonclustered (X0 + X).
Plan C: Load with the index The time to load X amount of data with all the indexes = X * T with all the indexes.
Step 3: Predict crossover point for loading plans
At the crossover point between Plan A and Plan B,
X * T with the clustered index + ac all nonclustered (X0 + X) = X * T with no indexes + ad X0 + ac all (X0 + X).
X / X0 = (ad + ac all) / (T with the clustered index – T with no indexes - ac all nonclustered)
In the tests presented in this paper, T with the clustered index – T with no indexes - ac all nonclustered < 0, therefore, it is always better to keep the clustered index for bulk load than drop it, which is verified by earlier tests.
At the crossover point between Plan B and Plan C,
X * T with all the indexes = X * T with the clustered index + ac all nonclustered (X0 + X) so
X / X0 = ac all nonclustered / T with all the indexes – T with the clustered index - ac all).
Computing the crossover point using data from the tests presented in this paper,
X/ X0 = (362) / (35067 – 112 – 362) = 362/34593 = 1% (approximately)
Therefore, Plan B is better if incremental data is larger than 1 percent of the existing data. In the tests presented in this paper, this crossover point ≈ 1 percent, which is verified by testing. Dropping the nonclustered indexes and keeping the clustered index may be favorable, unless a small amount of additional data, compared to existing data, needs to be loaded.