We Loaded 1TB in 30 Minutes with SSIS, and So Can You
SQL Server Technical Article
Writers: Len Wyatt, Tim Shea, David Powell
Published: March 2009
Applies to: SQL Server 2008
Summary: In February 2008, Microsoft announced a record-breaking data load using Microsoft® SQL Server® Integration Services (SSIS): 1 TB of data in less than 30 minutes. That data load, using SQL Server Integration Services, was 30% faster than the previous best time using a commercial ETL tool. This paper outlines what it took: the software, hardware, and configuration used. We will describe what we did to achieve that result, and offer suggestions for how to relate these techniques to typical scenarios. Even for customers who don't have needs quite like this benchmark, such efforts can teach a lot about getting optimal performance.
Businesses have ever-increasing volumes of data stored in many heterogeneous systems. To provide meaningful, consistent, and reliable information to their end users, these businesses rely on data integration technologies to extract, transform, and load data (commonly known as ETL operations) as they move data between systems. ETL tools such as Microsoft SQL Server Integration Services (SSIS) support these types of data integration activities. Businesses want to know that an ETL tool they choose will be able to support any data volume they might require and at the same time allow them to integrate data from any of their heterogeneous data sources.
To illustrate the ability of SSIS to meet such performance needs, Microsoft and Unisys arranged to load over 1 terabyte (TB) of data that was read from flat files on four source servers into a SQL Server database on a single destination server. In this test, the data was read, converted from text fields to database data types, transferred over the network, and inserted into the destination database in less than 30 minutes. To be precise, 1.18 TB of flat file data was loaded in 1,794 seconds. This is equivalent to 1.00 TB in 25 minutes 20 seconds or 2.36 TB per hour.
It is important to note that this is not the same as simply doing a bulk load of the data into the database. If data is available on the destination system, and if it does not need to be standardized or corrected for errors before it is loaded, bulk loading data makes sense. In the more common case where data must be moved between systems and transformed along the way, ETL tools are needed. ETL tools like SSIS can perform functions such as moving data between systems, reformatting data, integrity checking, key lookups, and tracking lineage. In our 1 TB loading experiment we did not perform extensive transformations, because we wanted to run an experiment that would be comparable to what other ETL tool vendors have published.
The idea of being able to compare ETL tool performance is an important one. Customers should be able to compare ETL tools in the same manner that they can use a TPC-E or TPC-H benchmark today to compare relational databases. There is no commonly accepted benchmark for ETL tools. Microsoft thinks there should be. Industry-standard benchmarks can lead to healthy competition, better products, and better publication of the techniques used to get high performance. Microsoft supports the idea of defining an industry-standard benchmark that reflects the real-world uses of ETL tools.
Conceptually, the design of this system is very simple, as shown in Figure 1. There are 56 streams of data generated as flat files using the TPC-H data generator, so there will be 56 instances of the SSIS package executing in parallel. They will write to a single destination database. Each instance of the package will read one of each of the file types created by the data generator: Customer, LineItem, Orders, Part, PartSupp, and Supplier. The file names to use are passed in as arguments to the package. The source files will be described in the next section, and the SSIS package will be described in detail in the section on SSIS package design.
Figure 1: Conceptual data flow
On the output side, each package will write to a different partition in the destination tables. More precisely, as illustrated in Figure 2, each package will write into a separate table for highest performance, and the tables will be “switched in” to partitions of the larger table. This will be described more fully in the section on database setup. There are a number of times when partitioning a table is a good practice, one of them being when multiple large insertions need to be performed concurrently–exactly the situation we have for this scenario. Each of the major TPC-H tables (Customer, LineItem, Orders, Part, PartSupp, Supplier) was partitioned for this exercise. The tiny Nation and Region tables were not partitioned.
Figure 2: From flat files to table partitions
The SSIS package that loaded the data was quite simple. Figure 3 describes it in detail. The control flow of the package consisted of six data flows, one for each of the major tables, which run sequentially. Each data flow has a flat file source sending data to an OLE DB destination. The flat file source contains the field definitions for its file type, and the data types the fields should be converted to (32-bit integer, date/time types, money, and so on). The OLE DB destination contains the mapping of columns in the data flow to columns in the destination table in the relational database. Note that we used an OLE DB destination, not a SQL Server destination. This means that the package could have written data to any type of relational database that has an OLE DB adapter, not just SQL Server, and that the destination database does not have to be on the same machine that is running the SSIS package. More details on the package follow in the section on SSIS package design.
Figure 3: SSIS package overview
Physically, the source files are distributed across four source servers and the destination database is on a Unisys ES7000 server. As shown in Figure 4, the SSIS package instances execute on the source servers, and the SQL Server relational engine runs on the ES7000. Each source server is connected to the destination using two 1 Gb Ethernet connections in order to provide sufficient bandwidth for this scenario. The network connections were driven to between 70% and 100% of capacity, so it was essential to create ideal network settings to avoid overloading the CPUs with network interrupt work. More details about the server, operating system, and network setup are in following sections of this document.
Figure 4: Physical topology
We used the data generator from the Transaction Processing Council’s TPC-H benchmark to generate data for this experiment. There were two reasons for that choice:
· The generator provided a convenient way to generate realistic data that would be recognizable by many people.
· Choosing this data set allowed us to make a direct comparison to the results achieved by other vendors who have used the same data set.
Although the TPC-H data generator was used, this is not a TPC-H benchmark result. It was simply a convenience to use that data set. As noted earlier, there is not a commonly accepted benchmark for ETL tools at this time.
The TPC-H data generator is called DBGEN, and it generates text files for loading into databases. Although SSIS can move data directly from one database to another, and this usage pattern is encouraged because of its efficiency, it is not uncommon in the ETL world to unload data from one database to flat files and load those into another. In effect, flat files are the “lowest common denominator” for data exchange. Because DBGEN generates flat files, we decided to use them exactly as they were generated.
DBGEN provides the option to partition the data into a number of streams that can be loaded in parallel. For example, the command
dbgen –T o –fF –q –b dists.dss
-s 1000 -C 56 -S 6
will generate data for the ORDERS and LINEITEMS tables at the scale point 1000 (roughly 1 TB total data size). The data will be partitioned into 56 streams, and the command will generate data for stream 6.
When all the data for stream 6 is generated, the following files will be present:
02/14/2008 07:31 PM
02/14/2008 08:12 PM
02/14/2008 08:12 PM
02/14/2008 08:17 PM
02/14/2008 08:17 PM
02/14/2008 08:18 PM 25,768,892 supplier.tbl.6
We loaded data from 56 streams in parallel, so correspondingly we generated 56 customer files, 56 line item files, and so on. The data for each of the 56 streams is fairly similar in size, and collectively the set adds up to 1.18 TB of source data. Later we will describe how the files were laid out on disk to achieve the needed performance. The files are plain text files with variable length fields and vertical bars (“|”) as the field separators. As an example, here are the first three lines of the part.tbl.6 file (truncated for readability):
NICKEL|17|Brand#54|cream peru …
TIN|47|Brand#51|violet drab …
BRASS|3|Brand#32|orange yellow …
At run time, SSIS will read the files in this format as the first step in the preparing, transferring, and loading the data.
In many ways the configuration of the SQL Server database used default settings or common configuration practices. We will review the details, but the two unusual settings were the use of software non-uniform memory access (soft-NUMA) to distribute work evenly among the processors and the use of the –x startup flag for SQL Server. Soft-NUMA will be discussed in more detail below. The –x option turns off the collection of run-time performance statistics that appear in performance counters and DMVs. There was one particular counter that was impacting this effort, and in subsequent builds a fix for that problem has been introduced. This fix is in the released product code. It is no longer necessary to run with –x to obtain performance similar to what we achieved.
File Groups and Tables to Load
The database was created on 16 data volumes (logical disk drives) plus a log volume. All volumes were provided by the EMC SAN. A separate file group was created for each incoming stream of data, one file per file group, with the files placed on the data volumes in a round-robin assignment [ref01]. Here is how the database was created.
CREATE DATABASE TPCHdestination ON PRIMARY ( NAME = N'TPCHdata0', FILENAME = N'C:\Mount\Drive1\SQLdata\TPCHdata0.mdf' , SIZE = 1GB , MAXSIZE = UNLIMITED, FILEGROWTH = 10% ), FILEGROUP FG1 ( NAME = N'TPCHdataG1F1', FILENAME = N'C:\Mount\Drive1\SQLdata\TPCHdataG1F1.mdf' , SIZE = 24GB , MAXSIZE = 24GB ), FILEGROUP FG2 ( NAME = N'TPCHdataG2F1', FILENAME = N'C:\Mount\Drive2\SQLdata\TPCHdataG2F1.mdf' , SIZE = 24GB , MAXSIZE = 24GB ), . . . LOG ON ( NAME = N'TPCHdata_log', FILENAME = N'C:\Mount\Log\SQLlog\TPCHdata_log.ldf' , SIZE = 25GB , MAXSIZE = 25GB ) GO ALTER DATABASE TPCHdestination SET RECOVERY SIMPLE GO sp_dboption 'TPCHdestination','auto create statistics','OFF' ; go sp_dboption 'TPCHdestination','auto update statistics','OFF' ; go alter database TPCHdestination set PAGE_VERIFY NONE ; go With the file groups created, tables to be loaded with data were placed in the file groups, for example, create table ORDERS_6 (O_ORDERDATE smalldatetimenot null, O_ORDERKEY bigint not null, O_CUSTKEY int not null, O_ORDERPRIORITY char(15) not null, O_SHIPPRIORITY int not null, O_CLERK char(15) not null, O_ORDERSTATUS char(1) not null, O_TOTALPRICEmoney not null, O_COMMENT varchar(79) not null) on FG6
Remember that each of these tables is really a temporary location that SSIS can rapidly load data into. It will then be switched in to partitions of the full table.
Tables are all created without indexes. This is typical for environments where large amounts of data are loaded – it is often better to load the data first, and then add indexes. That’s how this test was run. We assume that adding indexes is a separate operation that follows the data load. Indexing is not included in the load time.
Data is loaded into separate temporary tables because it is the fastest way to load data. Later in this paper, we explain that loading into a single heap is almost as fast, but simpler to set up and manage. We expect most sites would find the simpler design sufficiently fast.
After the data is loaded into the temporary tables, they are then switched in to become partitions in the full portioned tables. This switch is a metadata operation only – the data does not need to be copied or physically moved to accomplish it. This is a significant piece of the power of partitions, and this technique has been a recommended data loading practice since SQL Server 2005.
Every partitioned table uses a partition function and a partition scheme [ref02]. In short, the partition function gives the boundary points for the partitions in the table, and the partition scheme tells what file group each partition will reside on. In a many implementations, partitions will be divided along time boundaries, such as creating a new partition each week for new data. In this experiment the partitioning of streams generated by DBGEN was simply by the primary key of each table. While this is not an entirely typical implementation, it does show how partitioning works. The partitioning method should always be driven by the data needs of the application.
Below are the definitions of the partition function, the partition scheme, and the table for the Orders table. For the Orders table, partitioning is by O_ORDERKEY, so the values in the partition function are O_ORDERKEY values. Note that the definition of the Orders table here is identical to the temporary table (shown above) that data is initially loaded into. This is a requirement for being able to switch the data into a partition.
CREATE PARTITION FUNCTION pfnORDER (bigint) AS RANGE LEFT FOR VALUES ( 107142850,214285700,321428550,428571424,535714274,642857124, 749999974,857142848,964285698,1071428548,1178571398,1285714272, 1392857122,1499999972,1607142822,1714285696,1821428546,1928571396, 2035714246,2142857120,2249999970,2357142820,2464285670,2571428544, 2678571394,2785714244,2892857094,2999999968,3107142818,3214285668, 3321428518,3428571392,3535714242,3642857092,3749999942,3857142816, 3964285666,4071428516,4178571366,4285714240,4392857090,4499999940, 4607142790,4714285664,4821428514,4928571364,5035714214,5142857088, 5249999938,5357142788,5464285638,5571428512,5678571362,5785714212, 5892857062) -- maximum is 6000000000 CREATE PARTITION SCHEME pscORDER AS PARTITION pfnORDER TO ( FG1, FG2, FG3, FG4, FG5, FG6, FG7, FG8, FG9, FG10, FG11, FG12, FG13, FG14, FG15, FG16, FG17, FG18, FG19, FG20, FG21, FG22, FG23, FG24, FG25, FG26, FG27, FG28, FG29, FG30, FG31, FG32, FG33, FG34, FG35, FG36, FG37, FG38, FG39, FG40, FG41, FG42, FG43, FG44, FG45, FG46, FG47, FG48, FG49, FG50, FG51, FG52, FG53, FG54, FG55, FG56) GO create table ORDERS (O_ORDERDATE smalldatetimenot null, O_ORDERKEY bigint not null, O_CUSTKEY int not null, O_ORDERPRIORITY char(15) not null, O_SHIPPRIORITY int not null, O_CLERK char(15) not null, O_ORDERSTATUS char(1) not null, O_TOTALPRICEmoney not null, O_COMMENT varchar(79) not null) on pscORDER(O_ORDERKEY)
The final requirement for partition switching is that there must be constraints on the temporary tables so that only qualifying data can be in the tables. There must be range checks on the tables so the data will not violate the ranges given in the partition function for the partitioned table. Here is an example of the range constraint placed on the Orders_6 temporary table. Again, the actual key values given are driven by the input data:
ALTER TABLE ORDERS_6 WITH CHECK ADD CONSTRAINT check_ORDERS_6 CHECK (O_ORDERKEY >= 535714275 AND O_ORDERKEY <= 642857124)
With the tables set up as described, after SSIS has loaded data into the temporary tables, executing the switch is as simple as this:
When all of the switches are done, there will be a uniform Orders table with all of the data. The same process is followed for each of the major tables.
Soft-NUMA and Port Mapping
Soft-NUMA provides a way of subdividing the processors in a server into smaller logical groups. The SQL Server scheduler is aware of these groups, so work stays more localized than the system hardware might otherwise require. After logical nodes are defined, port mapping can be used to direct work to a certain node based on the port number used in the TCP connection. More information about soft-NUMA and port mapping, as well as a discussion of hardware NUMA, which we reference later in this paper, is available in SQL Server Books Online [ref03].
For this exercise, we used soft-NUMA and port mapping to force each incoming data stream (one from each SSIS package) to go to a different processor. Partly this was to work around some idiosyncrasies in the SQL Server scheduler (which we hope to change in future releases), but we probably would have used NUMA and port mapping anyway to keep network traffic local to each hardware NUMA node in the server. The details behind this will be described further in the section on network setup. For now we’ll just say that there were eight CPU cores in each hardware NUMA node. We directed all the network DPC traffic for each NUMA node to one core from that node, and we used the other seven cores for SQL Server processing. This kept incoming data from the network local to each hardware NUMA node. Each SSIS package connected to one of the available soft-NUMA nodes. Because there were eight hardware NUMA nodes in the server and each had seven soft-NUMA nodes available to do work, we ran a total of 56 streams concurrently.
We did not need to take steps to ensure locality of disk I/O traffic to disk controllers on hardware NUMA nodes. That is another optimization that could be considered.
The creation of soft-NUMA nodes and port mappings is done in the system registry. The registry settings used were [ref04]:
Windows Registry Editor
. . .
Other Database Settings and Options
We set the network packet size to 32k instead of the default 4k. This was implemented on both the database side using the option network packet size (B) and in the SSIS package as discussed in the next section.
SSIS Package Design
The design overview section and Figure 3 above gave the high-level overview of the SSIS package design. The package is executed in 56 parallel instances. Each instance of the package loads one Customer file into one Customer table, one Supplier file into one Supplier table, and so on, in sequence. The sequencing is determined by the task dependencies set up in the control flow for the package. Each of the tasks is one data flow, with a single flat file source and an OLE DB destination.
One important consideration in an SSIS data flow is the choice of data types to be used. In general it’s a good idea to get the types converted in the SSIS pipeline to data types that will be native for the destination database. That way the data does not have to be converted again by the database. When they are read from text files, the data types are specified in the Flat File Connection Manager editor. Figure 5 shows an example. It shows that the column O_ORDERKEY, when read from an orders.tbl.n file, should be converted to an eight-byte signed integer, which is the same as bigint in Transact-SQL. Each input column can be mapped appropriately in the Connection Manager editor.
Figure 5: Setting data types in the Connection Manager editor
In addition, there is an option to speed up parsing of integer, date, and time types if the conversion does not have to be locale-sensitive. This option is set on a per-column basis using the Advanced Editor for the flat file source, as shown in Figure 6.
Figure 6: Setting FastParse in the Advanced Editor
The destination chosen is the OLE DB destination, because the SSIS package is running on a different server than the destination database. Had this been a local database, we could have gotten further optimization by using the SQL Server destination.
Setting up the outputs is a little simpler than the text-file inputs, because the data types are already defined by the SSIS pipeline and by the destination table definition. Because we have taken care to select the SSIS data types carefully, they will match the database data types. What remains is to select the options for the destination. Note in Figure 7 that we used the fast load functionality. We now have the option to place a table lock, to turn off checking constraints, and to set the batch size. Although we ran with a batch size of 100,000 rows for this project, that may have more than needed – we saw no significant difference for this data set when running with 1,000 or 10,000 row batches. We found that is was best to leave the commit size at the default setting.
Also note in Figure 7 that the name of the table to load is obtained at run time from a variable. This allows each package instance to write to a different table.
Figure 7: The OLE DB destination
With the package defined, it needs to be called at execution time. This is done using the DTExec command prompt utility, as shown below. White space has been added for clarity. Naturally this command line was generated in a loop that started all 56 streams; this represents stream 6. Each of the file connection managers has the file name set, and each of the variables that names a destination table is set accordingly. This type of coordination could also have been accomplished inside SSIS – we could have passed in a single variable with the stream number, and SSIS could have determined the file names and table names. That would be a reasonable option for many sites.
Source=10.1.1.2,2006; Initial Catalog=TPCHdestination;
Two other items to note in the DTExec command are the Data Source and the PacketSize values. The Data Source value is the IP address and port number for the soft-NUMA node this package will connect to. In most sites the Data Source value will simply be a machine name; sometimes it will be accompanied by a SQL Server instance name. Here we bypassed the usual naming and went directly to the IP address and port number. The PacketSize value was set to 32K instead of the default 4K to reduce the number of network round-trips.
Disk Setup on the Database Server
The database server was attached to a SAN with a single rack of disks in a mirrored RAID configuration. In this way we created a customer-realistic storage system using the performance of current common technology, including the ES7000 Server, EMC SAN technology, Fibre Channel connections, and standard hard drives. The SAN was configured to expose 16 data volumes and a log volume, which were formatted from Windows® and used by SQL Server.
To host the SQL Server database files, Unisys supplied the EMC SAN, a 6 by 4 Gbps Fibre Channel-connected EMC CLARiiON CX3-80 with a single rack (11 trays by 15 disks) of 146 GB drives spinning at 15k RPM. The SAN was configured to expose 17 LUNs, each of which mapped back to 8 drives, arranged in a 4x2 RAID 10 configuration. In this way, on each LUN, the database files are striped across four sets of mirrored drive pairs. This configuration is fault tolerant and leaves one hot spare per tray. At the stripe size of 64k, across four mirrored drive pairs, the LUN should nicely handle writes of 256k each. The lab tested and confirmed a maximal write rate at an I/O size of 256k per write operation.
Each LUN was formatted as a logical volume and assigned to a Windows NTFS mount point on the server's file system. This task could have been performed in the Windows Server® 2008 Management Console Disk Management snap-in; we used DISKPART for convenience in scripting the configuration tasks. Note that in Windows Server 2003 you can use DISKPART to set the correct partition alignment on disk by specifying the partition offset; Windows Server 2008 does this automatically.
Testing with the sqlio.exe tool confirmed a write rate just over 800 MB/sec (parallel threads writing 256 KB buffers to all LUNs).
The 16 data volumes mentioned in the Database Setup section were mapped to the first 16 mount points. The 17th volume was used for the SQL Server database’s log file. During the test we measured an average write rate of 608 MB/sec with a peak of 872 MB/sec from SQL Server as shown in Figure 8. Note how the rate changes as the SSIS servers load different TPC-H tables at various times in the execution.
Figure 8: Database server write rate during a test run
All data and log files were precreated to eliminate file growth during the ETL process. SQL Server Instant File Initialization [ref05] was configured for creation of the database files; the time to set up the database was essentially the time to zero the log file. We found a 50 GB log file sufficient, and most likely it could have been a good bit smaller.
NUMA on the Database Server
The ES7000 is a non-uniform memory access (NUMA) architecture machine. We looked at two NUMA memory options: fully interleaved, and stacked cell pair. We used four pairs in total.
The ES7000 was configured as a single 32 socket server from eight four-CPU nodes, in a single rack. Each node was paired with a single neighbor via a rigid side-panel interconnect, which is an electrical connection linking the system bus of each machine to its twin. Such a pair of nodes comprises a super cell, and in this case, the memory of each node is interleaved to create a single memory system local to the super cell. Four super cells were linked via a network of external so-called crossbar cables. In this configuration we had the option of "stacking" the memory, where the memory within a super cell remains local to that super cell and the effects of NUMA are maximized, or of interleaving the memory across all super cells. We would expect the latter approach to reduce both the pros and cons of a NUMA design machine.
Other memory configurations may have been physically possible but were not tested.
We tested against the two memory configurations described above. Performance of the test was slightly better in the fully interleaved configuration. Our working assumption is that the performance difference was a result of poor locality of the network receive buffers with respect to the network interface cards. In other words, the memory configuration that maximized NUMA effects did not perform as well as the memory configuration that minimized them. This can indicate a problem with network receive buffer placement within the memory space of the node that physically hosts the NIC using each buffer. It can also indicate that the network receive buffer is missing. This could be an area for further research; we did not pursue it because the performance difference is small, and is likely to vary on a newer server with updated ACPI support. Information on the ACPI standard is available online [ref06].
The difference in the performance of the test when comparing the two NUMA memory configurations can be observed in the average write rate on the SQL Server (note the scale is for the average across the physical disk instance counters), where in both instances the test is CPU-bound on the server. In follow-up testing, we observed a rate of 38.6 MB/sec with stacked memory and a rate of 39.7 MB/sec with interleaved memory, a difference of 2.9%, in line with the observed difference in test latency of 2.6%.
Disk Setup on SSIS Servers
The disk configuration on the SSIS servers was straightforward and uncomplicated. We used four commodity class servers to source the ETL data and execute the SSIS packages, and two storage arrays to physically hold the 1.18 TB of flat file source data. In this way we modeled the scenario in which multiple data sources, or ETL “source” servers, write to a target warehouse simultaneously. The data processed by each SSIS server was unique; there was no replicated or shared source data.
Four Unisys ES3220Ls (dual socket, quad core) were connected to two EMC CLARiiON CX600s with two, 2Gbit FC connections from each server to a CX600. Each CX600 held 45 spindles, for a total of 90 spindles serving the flat file source data.
Each of the four SSIS servers drove seven streams of data from each of two source volumes, for a total of 56 parallel streams of data being processed and sent to the database server during the ETL process. Each source volume was hosted on a LUN exposed to the source server. Each LUN was formatted and assigned to a Windows NTFS mount point on the server's file system using the DISKPART utility. This configuration supplied the required read rate of 164 MB/sec to read 1.18 TB of flat file data in 1,800 seconds.
It would have been possible, but was not necessary, to configure the SAN to expose a higher number of LUNs, with a lesser number of spindles each, to optimize for sequential read. In our test configuration each source server pulled seven streams of flat file data from each of two LUNs.
The TPC-H source data files were generated with the DBGEN utility from the benchmark kit. We initially used a parallel process to write most of the source files out in parallel. This strategy created extremely high (unrealistically so) levels of fragmentation in the file system. By generating the data into spare storage before moving them onto the source volumes, we realized a small gain in read latency.
Few adjustments were made to system settings in reaching the goal of a 30-minute 1.18 TB ETL load with SQL Server and Integration Services.
On the database server, we enabled the available CPU cache performance options: hardware prefetcher, and adjacent cache line prefetch [ref07].
A hardware cache prefetcher is a system by which a CPU's memory controller detects a sequential pattern of memory access, and when such a pattern is detected, predicts the next piece of memory that will be requested and places it in the cache. If the controller's prediction is correct, memory access become significantly faster for the rest of the reads in the sequence, because the CPU does not have to wait for a slow round trip to main memory and back (or even worse, to a remote NUMA node).
Adjacent cache line prefetch simply means the memory controller grabs twice as much memory at a time, or two lines of cache, instead of the usual one line. In current systems a line of cache (or, cache line) is often 64 or 128 bytes. Cache line size is a fixed attribute of the hardware.
As noted earlier, we enabled the fully interleaved mode for memory access. With respect to software, we enabled large code pages for the SQL Server program image[ref08], [ref09]; please note that this is not required even on a 64-bit version of Windows, and it should only be enabled in production when recommended on a specific case basis by Microsoft.
On the source servers, we gave the SSIS process (DTEXEC) a slight boost in priority class. In similar tests, we have observed that such a boost evens out the processing rate of the data streams, which reduces overall processing time by reducing the variability in data stream processing latency. Because the latency of the test is defined as the time from the start of processing to the completion of the last stream, reducing the variability in individual stream latencies reduces the test latency. In other words, the wait time for the final streams to complete is reduced. This effect was only observed when the SSIS Server was CPU bound.
Operating System Selection
At the start of this project, we debated various system architecture choices. Unlike Windows Server 2008 R2, Windows Server 2008 is limited to 64 logical processors, and some early experiments suggested we couldn’t accomplish our goals if we restricted ourselves to running everything on a single system. Because this was our first time running a major ETL benchmark workload, we made major choices early on that favored control and scale-out flexibility over ease of use. We will likely pursue a different (simpler) approach the next time.
Being able to scale out the front end of the system meant running SSIS on different servers than the system that would host the relational database, so we committed early on to loading over a network. Because Windows Server 2008 would be launched at the same time as SQL Server 2008, and the new operating system included an all-new, more modern TCP/IP stack, we felt this design choice would also give us an opportunity to kick the tires on the new operating system’s next-generation networking.
Network Interface Card (NIC) Selection
The Unisys ES7000 server model we used was an older mature cache-coherent NUMA machine [ref10], since replaced by Unisys by a newer model with a faster memory interconnect. We quickly rejected the idea of using NIC teaming, for fear of not being able to control NUMA placement if we hit chokepoints in the server’s interconnect. This older server system did not have the range of qualified NIC choices available in newer machines today, so we steered clear of using 10 Gigabit Ethernet NICs for this first go at setting a new ETL world record.
The NICs on the Unisys server were the built-in, Intel PRO/1000 MT LAN-on-Motherboard (LOM) controllers. We used what was then the latest publicly-available version of Intel’s driver for this Ethernet controller. On the client side, we used Intel PRO/1000 PT NICs. These worked really well, both in terms of performance and rock-solid stability.
Parallelism in the Network Topology
We settled on using four front-end multicore “client” servers, to run SSIS. These systems would read the flat file data from RAID-backed file systems, do needed type transformations, and then load the data over the network into SQL Server on the 64P ES7000. The networking architecture we chose to connect the clients and the server was to use a dedicated 1 GbE link for each Unisys NUMA node, eight 1 GbE links in total. To be able to track and manage things tightly, we established a clear mapping between the physical Ethernet connections and the Layer 3 IP addresses on each end, like this.
Figure 9: Logical diagram of the networking architecture
A Gigabit Ethernet link connected each of the ccNUMA server modules of the Unisys ES7000 with one of four clients, so each client had two dedicated 1 Gb/s connections to the server. This arrangement eases performance investigation work, and is not necessarily something we’re suggesting for normal production deployments.
Extending Parallelism up the Stack with IntPolicy
We chose to extend our parallel, partitioned network architecture up into the database server, by using a freely available Microsoft tool, called IntPolicy, to bind NIC interrupts and DPCs [ref11] to specific processors in the server. We wanted to be able to clearly observe CPU utilization from network interrupts/DPCs, to be able to make adjustments during our work. IntPolicy is available on Microsoft’s download center [ref12], and it runs on Windows Server 2008 and Windows Server 2008 R2.
Figure 10: Setting processor affinity using IntPolicy
This structured configuration was useful for controlling NUMA locality but also allowed us to easily map performance issues end-to-end through our client/server architecture.
The only drawback to using IntPolicy is that it’s a bit of a pain to configure. You have to sort out your mapping between physical interface, the network connection device name, and the physical device object (bus/device/function #s), to be able to figure out which NIC’s interrupts/DPCs should land on the desired CPU. This is not for normal people or the faint of heart, but it isn’t rocket science either. Using ipconfig /all, devmgmt.msc, and IntPolicy, it’s helpful to construct a table like this to keep the mappings straight.
Device Name NetConn
New CPU Range
New DPC CPU
Table 1: An example of tracking the mapping between physical network interfaces and the ES7000 CPU that would receive network interrupts/DPCs
Initial Tuning of the Network Configuration with NTttcp
After we had all of this cabled up and configured, we decided to pre-flight the ETL experiments by validating the performance of the network, using an internal tool called NTttcp [ref13]. Based on the same concepts introduced in Mike Muuss’ original ttcp tool [ref14], NTttcp provides some additional control features that take advantage of Windows kernel features. We used NTttcp to pre-flight the basic performance of our network before beginning runs of the ETLWR workload using SSIS and SQL Server.
A single-stream experiment with NTttcp showed some value in running with L2 jumbo frames (9014-byte Ethernet frames), so we set that through the Intel driver’s advanced configuration settings. Although we ended up running the world record this way, later networking discoveries showed us we probably could have stuck with the standard default setting (1500 bytes). To keep performance complications to a minimum, we disabled NetBT (NETBIOS support) and Windows Firewall on the private benchmark network interfaces.
In running NTttcp, we typically used the –v, –a, –fr, and –m options on the receive side, to get: (1) verbose output, (2) async socket I/O with a number of pending requests, (3) full buffer (not partial) receives, and (4) to specify the number of threads and on which CPUs those threads should execute the ttcp requests. For example:
ntttcpr.exe -v -a 6 -fr -m
2,1,10.1.1.2 2,9,10.1.2.2 2,17,10.1.3.2 2,26,10.1.4.2\
2,33,10.1.5.2 2,40,10.1.6.2 2,48,10.1.7.2 2,56,10.1.8.2
establishes two receive threads per CPU, on CPUs 1, 9, 17, 26, 33, 40, 48, and 56, and allows up to six outstanding async I/O requests per thread. And on the send side, we used similar options, but obviously could skip full receives:
ntttcps.exe -v -a 6 -m
2,1,10.1.1.2 2,5,10.1.2.2 2,9,10.1.3.2 2,13,10.1.4.2
In no time, we were running at line rate with a single stream, so the next step was to run eight streams in parallel, one for each GbE NIC pair between the systems. Right away we hit a problem. We could run with two streams and get line rate, but as soon as we added a third, the other streams would be impaired, dropping to a lower rate of throughput. As we drilled into this issue, we found a fairly knotty, related set of issues that involved the version of ACPI that this older server supported, the way NDIS allocates NIC receive buffers, and NUMA fabric behavior. In the end, we cut this Gordian knot by enabling full memory interleaving on the ES7000, and by changing two BIOS settings, affecting cache prefetch behavior (thanks to Unisys’ Bob Murphy). In the BIOS, we set Hardware Prefetcher Disable = No, and Adjacent Cache Line Prefetch Disable = No.
With these changes, eight NTttcp streams (having 16 threads on the receive-side), were able to sustain about 118 MB/s per 1 GbE flow, when running all in parallel, which is very nearly full line rate.
Figure 11: Sample output from an eight-stream run of NTttcp
This NTttcp CPU utilization meant we were consuming roughly six of the eight CPUs we had dedicated for processing network traffic, which isn’t bad at all considering the age of the Intel PRO/1000 MT LOM controllers in the Unisys server.
Network Discoveries Running the Workload
With the lower level network configuration in place we were ready to begin running scaled-down versions of the full application ETLWR workload. We designed script automation and SSIS packages that could be run at full scale, ingesting 1 TB of source data, and another version that worked with 10% of the full source data. With these scale points and the help of a tool called TCP Analyzer, we were able to identify and work around the final obstacles to an under-30-minute run.
As we began experimental runs with the 10% data set, we observed the 56 CPUs allocated to SQL Server for row processing were completely pegged. To shift some of the CPU time away from TDS packet handling overhead, and into row processing, we increased the TDS packet size from the default of 4k to the maximum value of 32k by reducing the number of packets exchanged between client and server. This improved performance (that is, reduced load time) by about 25%.
At this time we observed large variations in the networking throughput during a test run. We had been using an SSIS package that loaded the tables using a parallelized data flow within the package. To get clarity on the source of the variation, we switched to running an SSIS package that loaded the tables serially, which in turn helped us see the different run-time characteristics of processing each of the different TPC-H tables. For instance, during processing of one of the large, wide tables, such as LineItem, networking bandwidth across the interfaces would drop as the server CPUs became bottlenecked on processing the large rows. Narrower tables, like Orders, would allow SSIS to drive all eight network interfaces at line rate.
Untangling the SSIS packages helped us really understand the networking needs and behavior of processing each table, but we still saw gyrations while SSIS worked on each table. At this point, we decided to pull out a new Windows SDK tool, called TCP Analyzer, to figure out what was happening. This application, which was written by the Windows core networking team, allows you to enable the new TCP/IP stack’s support for IETF TCP Extended Statistics (ESTATS) MIB, to see in real time what’s going on with a specific TCP flow. We had been experimenting with changing the number of L2 NIC buffers allocated for each adapter, but one look at TCP Analyzer showed we had been barking up the wrong tree (thanks to Murari Sridharan for the TCP Analyzer tips).
Figure 12: TCP Analyzer - using default TCP receive window auto-tuning
This TCP Analyzer screenshot, taken during an instance of a 10% ETLWR run, shows what was happening for a single TCP flow, between SSIS on the sender (where TCP Analyzer is best run), and the SQL Server receiver. Looking at the throughput history line graph you can see the bandwidth gyrations. The real eye-opener was watching this run in real time and seeing how much time this TCP flow spent in recovery from congestion. Fortunately, TCP Analyzer also maintains a connection lifetime pie chart that breaks this down, so you can see that a considerable amount of time during this 10% ETLWR run had been spent waiting for the TCP path to become uncongested!
We had been assuming that there were no TCP-level issues, and we had been experimenting with allocating smaller or bigger numbers of L2 NIC buffers. We were clearly barking up the wrong tree, as a set of experiments showed. In fact, giving the NIC driver more L2 send buffers, although they used the default TCP auto-tuning algorithm, actually made performance worse, because the sender could even more easily overwhelm the SQL Server receiver, which today has some limitations around the number of outstanding receive requests that it keeps in flight. The new networking stack supports three TCP auto-tuning levels: default, disabled (no auto-tuning of TCP’s receive window), and restricted (which auto-tunes TCP’s receive window, but less aggressively than the default setting).
Running a few experiments was interesting, not just for the bandwidth and run-time results we got, but also for the stability of data movement. Just switching TCP auto-tuning to restricted, we were able to get a 10% increase in performance. Changing the number of NIC auto-tuning buffers made a slight difference, but the change was within our run-to-run variation, so probably not really significant. Table 2 shows results comparing ETLWR performance with changes to the TCP receive window auto-tuning setting, and the number of NIC transmit buffers configured per Gigabit Ethernet adapter.
NIC # TxBufs
TCP Rx Autotuning
56 Streams 10% data
56 Streams 10% data
Reduce # of NIC buffers
56 Streams 10% data
Disable TCP Rx auto-tuning
56 Streams 10% data
Try Restricted TCP auto-tuning
56 Streams 10% data
Increase # of NIC buffers
Table 2: Effect of TCP receive window auto-tuning and number of NIC transmit buffers on ETLWR performance
Before running each experiment, we simply used the built-in netsh tool to change TCP receive auto-tuning, like this:
netsh int tcp set global
An even bigger win was in the improvement in network bandwidth stability, with Restricted TCP auto-tuning. Another TCP Analyzer screenshot tells the whole story.
Figure 13: TCP Analyzer - using “restricted” TCP receive window auto-tuning
You can see that the flow’s average sending rate is higher, but more importantly, no time is spent waiting on the sender or waiting for the path to become uncongested, and the flow’s throughput is exceptionally stable. At this point, we were receiver bound, which is what we expected. It’s worth mentioning that TCP/IP’s default auto-tuning behavior could improve in later releases of Windows, so be careful assuming too much.
As we got down to short strokes for this effort, TCP Analyzer came in handy one more time, though the root cause of the problem wasn’t really a networking issue. After overcoming the TCP auto-tuning problem, we converged quickly to within striking distance of our goal, but it still danced just out of reach because some CPUs on the database server exhibited a “long tail” at the end of the run. (A long tail is a large number of unique items, each in relatively small quantities.) We had dedicated some CPUs to networking to avoid exactly this kind of an outcome, and yet here it was. We studied the perfmon and SQL Server statistics on the database server, but could not figure it out.
Then one evening (or early morning, more likely), we thought to hook up TCP Analyzer to a networking flow associated with one of the long-tail CPUs. The long tails happened consistently, and not randomly, so this was easy to do (and also a bit fishy by itself). Once again, TCP Analyzer showed us we were looking in the wrong place.
Figure 14: TCP Analyzer - long-tail SSIS streams were sender-bound
You can see that the networking stream, associated with a long-tail SSIS stream, was sender-limited, not receiver-bound. As soon as we shifted our attention back to the clients, we noticed a small but significant anomaly in the disk subsystem of one of the client machines, which had been caused by a fragmented source file system. Fixing that problem got us across the finish line!
Model: ES7000/one Enterprise Server
OS: Windows Server 2008 x64 Datacenter
CPU: 32 socket dual core Intel® Xeon 3.4 GHz (7140M)
RAM: 256 GB
HBA: 8 dual port 4Gbit FC
Database: Prerelease build of SQL Server 2008 Enterprise (V10.0.1300.4)
Storage: EMC Clariion CX3-80 (Qty 1)
11 trays of 15 disks; 165 spindles x 146 GB 15Krpm; 4Gbit FC
OS: Windows 2008 x64 Enterprise
CPU: 2 socket quad core Intel Xeon processors @ 2.0GHz
RAM: 4 GB
HBA: 1 dual port 4Gbit Emulex FC
NIC: Intel PRO1000/PT dual port
Database: Prerelease build of SQL Server 2008 Integration Services (V10.0.1300.4)
Storage: 2x EMC CLARiiON CX600 (ea: 45 spindles, 4 2 Gbit FC)
Setting the ETL World Record is simultaneously very important and unimportant. It is important because it clearly shows that SSIS is among the world leaders in ETL performance. In doing this exercise, we have illustrated techniques that can be applied to achieve outstanding performance. These techniques can be applied by customers today. We have also learned things that can be used to build a better product. Some of these things have been implemented; some will come in following releases.
The ETL World Record, like any benchmark, is unimportant if the workload does not bear some resemblance to what you as a customer need to do. ETL spans a broad spectrum of operations, many of which were not included in this benchmark. While we followed the example that has been set in the marketplace, we also look forward to industry-standard benchmarks that would better represent customer needs. Microsoft has joined with other industry leaders in the Transaction Processing Performance Council (TPC) to develop a standard ETL benchmark. At the same time, we wish to reiterate that the techniques described here are useful in many situations. We have demonstrated that SSIS can run like a race horse.
The authors wish to thank our partners at Unisys for their assistance and advice, especially Henk van der Valk, who spent many days and nights in the lab working with us to perfect the system.
The authors also wish to thank:
- Intel, for lending the PRO/1000 PT network interface cards that worked so well in the client systems.
- Ahmed Talat, for advice on NTttcp and IntPolicy.
- Bob Murphy, of Unisys, for his ideas on BIOS cache configuration settings.
- Murari Sridharan, for the TCP Analyzer tips.
For more information:
http://www.microsoft.com/sqlserver/: SQL Server Web site
http://technet.microsoft.com/en-us/sqlserver/: SQL Server TechCenter
http://msdn.microsoft.com/en-us/sqlserver/: SQL Server DevCenter
Did this paper help you? Please give us your feedback. Tell us on a scale of 1 (poor) to 5 (excellent), how would you rate this paper and why have you given it this rating? For example:
- Are you rating it high due to having good examples, excellent screen shots, clear writing, or another reason?
- Are you rating it low due to poor examples, fuzzy screen shots, or unclear writing?
This feedback will help us improve the quality of white papers we release.
[ref01] This layout may have been more than needed. In later runs we repeated the result with 8 data volumes and 16 file groups.
[ref02] To learn about SQL Server 2008 table partitioning, see http://msdn.microsoft.com/en-us/library/ms188706.aspx
[ref03] SQL Server 2008 Books Online. Understanding Non-uniform Memory Access.http://msdn.microsoft.com/en-us/library/ms178144.aspx
[ref04] Note that these registry entries create nodes on all 64 cores and assign a port number to each, even though eight of the cores are reserved for network DPC processing. We could have left those cores unmapped to ports; we simply chose to control the selection of processors used for SQL Server processing from the SSIS packages rather than enforcing it on the database server.
[ref05] SQL Server 2008 Books Online. Database File Initialization.http://msdn.microsoft.com/en-us/library/ms175935.aspx
[ref06] Intel Corporation. Advanced Configuration & Power Interface (ACPI).http://developer.intel.com/technology/iapc/acpi
[ref07] Intel Corporation. Optimizing Application Performance on Intel® Core™ Microarchitecture Using Hardware-Implemented Prefetchers. http://software.intel.com/en-us/articles/optimizing-application-performance-on-intel-coret-microarchitecture-using-hardware-implemented-prefetchers/
[ref08] MSDN®. Enabling Memory Support for Over 4 GB of Physical Memory.http://msdn.microsoft.com/en-us/library/ms190730.aspx
[ref09] MSDN. How to: Enable the Lock Pages in Memory Option (Windows).http://msdn.microsoft.com/en-us/library/aa366720.aspx
[ref10] Technically abbreviated ccNUMA, but the cc prefix is often not used, and this is what people typically mean by NUMA.
[ ref11] Deferred Procedure Calls (DPCs) are the method used by the Windows family of server operating systems to process network traffic, outside of the critical hardware interrupt processing window.
[ref12] Windows Hardware Developer Central. Interrupt-Affinity Policy Tool.http://www.microsoft.com/whdc/system/sysperf/intpolicy.mspx
[ref13] Windows Hardware Developer Central. How to Use NTttcp to Test Network Performance.http://www.microsoft.com/whdc/device/network/TCP_tool.mspx
[ref14] Michael John Muuss. The Story of the TTCP Program.http://ftp.arl.mil/~mike/ttcp.html