Scaling Up Your Data Warehouse with SQL Server 2008

SQL Server Technical Article

 

Writers: Eric N. Hanson, Kevin Cox, Alejandro Hernandez Saenz, Boris Baryshnikov, Joachim Hammer, Roman Schindlauer, and Grant Dickinson of Microsoft Corporation. Gunter Zink of HP.

 

Technical Reviewer: Eric N. Hanson, Microsoft Corporation

 

Applies To: SQL Server 2008

 

Summary:

SQL Server 2008 introduces many new functional and performance improvements for data warehousing. This paper discusses how to use SQL Server 2008 to get great performance as your data warehouse scales up. We present lessons learned during extensive internal data warehouse testing on a 64-core HP Integrity Superdome during the development of the SQL Server 2008 release, and via production experience with large-scale SQL Server customers. Our testing indicates that many customers can expect their performance to nearly double on the same hardware they are currently using, merely by upgrading to SQL Server 2008 and compressing their fact tables. We cover techniques to improve manageability and performance at high-scale, encompassing data loading (extract, transform, load), query processing, partitioning, index maintenance, indexed view (aggregate) management, and backup and restore.

 

Audience: Data Warehouse architects, developers, and DBAs.

 

Introduction

For high performance, simplicity of management, and reliability, scale-up data warehouse configurations built on Microsoft SQL Server 2008 offer a very attractive solution for your enterprise data warehouse needs. Scale-up data warehouse solutions run on a single shared-memory multiprocessor (SMP) computer system. This paper describes how you can effectively create, manage, and query your enterprise data warehouse with SQL Server 2008 when using a large SMP.

The goals for a data warehouse solution are typically the following, although they may vary somewhat in different environments:

·         Support rapid response time for queries that summarize large volumes of data.

·         Support efficient drill-through queries that extract moderate volumes of detail data for further analysis.

·         Support rapid initial data loading.

·         Support efficient incremental loading (for example, load 24 hours of data in a one-hour window).

·         Support fast maintenance operations (backup, restore, loading, indexing, compression, bulk removal of old data).

·         Given that the above goals are met, limit total cost of ownership (TCO):

·         Minimize the database administrator (DBA) time required during setup and for normal operations.

·         Keep hardware costs reasonable.

Data warehouse designs are most commonly based on a star or snowflake schema, also known as a dimensional model. This approach has advantages for query performance and makes it easy for business users to understand the data. Some parts of this paper assume basic knowledge of star-schema-based data modeling. Ralph Kimball and Margy Ross provide an excellent reference on dimensional modeling [Kim02]. Mundy et al. give another good reference on applying dimensional modeling specifically with Microsoft products [Mun06].

Less commonly, data warehouses are based on normalized schemas. An example of this is a schema containing one-to-many relationships between two large tables, such as Orders and LineItems from an order-entry system. One reason people use this type of schema is that they sometimes mirror the schema of the online transaction processing (OLTP) system that is the primary source of the data. This makes it easier to load the data warehouse simply by accumulating more history from the OLTP system, with few or no other transformations. Also, if the schemas are identical, operational reports can be run on the data warehouse and vice versa. It is also not uncommon to see a single “operational data store” used for transaction processing but containing a large amount of historical data, such as up to seven years. Such a hybrid system has many characteristics that are similar to a data warehouse, and much of the material described in this paper applies to this type of system as well.

Whether you are using a star schema or a normalized schema for your data warehouse, the ideas covered in this white paper apply. Because the star schema approach is common and well-defined, we assume it is the approach being used unless stated otherwise.

Benefits of Scale-Up Solutions for Data Warehousing

Scale-up based solutions can provide an excellent combination of high performance, management simplicity, high reliability, and reasonable total hardware purchase cost. Scale-out approaches, such as the SQL Server scalable shared database configuration [BOL08a], are an alternative to scaling up. For large systems, scaling out can reduce hardware costs, but this is only one part of the total operational cost of a large data warehouse. Another important consideration is system management cost, and scale-up systems have a distinct advantage here, as discussed later in this white paper. Moreover, compared with scale-out approaches, scaling up offers superior performance given that the same number of total processors and memory are used, along with a disk system with the same aggregate total bandwidth. The reasons for this are the following:

·         Scale-up systems offer extremely fast communication between all processors via the memory system, which is typically much faster than the bandwidth available via the network in a scale-out system.

·         During many SQL query execution operations (such as UNION, DISTINCT, TOP, and some sorts), data must be pulled into a single server simply because some operations cannot be parallelized efficiently across multiple separate computers. These types of operations on large quantities of data in a business intelligence (BI) environment are performed more efficiently in a scale-up environment.

·         Scale-up systems can use all available memory more effectively for all operations, including query processing and index creation.

·         Scale-up systems can deliver data from the I/O system to any processor with minimal latency (no network transmission is needed to communicate between nodes).

·         Query optimizers tend to generate more efficient plans in a scale-up environment compared to a scale-out environment.

·         Software licensing costs tend to be lower in a scale-up environment.

·         Operational management for a single server is easier than managing a scale-out federation of servers.

·         Scale-up servers can be partitioned into multiple logical computers. This allows for a more efficient use of system resources.

·         Those who use Microsoft SQL Server Analysis Services OLAP cubes also stand to benefit from a scale-up solution because OLAP queries perform most efficiently when the cube’s cells are memory resident. Scale-up servers may be configured with up to 2 terabytes of memory, compared to 128 GB–256 GB for off-the-shelf servers with four or fewer processors.

Simpler Management Lowers Total Cost of Ownership

Scale-up systems are simpler to manage than scale-out systems, because there are fewer “moving parts.” Fewer individual nodes must be configured (one versus many). It is not necessary to run network connections between multiple nodes. The management tools provide a single system image of the scale-up system, which is not typically the case for a scale-out system. Scale-out systems also carry a larger security burden: each of the many servers can require numerous patches and updates to keep pace with global security issues. A scale-up configuration also does not have the complex distributed failure modes that can be present in a scale-out system. All this contributes to making the administrator’s job easier with a scale-up configuration.

Reliability and Availability Reduce Unplanned Downtime

Scale-up systems using Intel Itanium processors typically offer significantly improved reliability compared to scale-out systems that use x86 components. High-end SMPs like the HP Integrity Superdome offer added RAS (Reliability, Availability, and Serviceability) capabilities. For example, the HP Integrity product line supports the ability to logically remove a single failed processor or memory module (DIMM) from the system automatically, until it can be replaced by a technician. The system can operate even with the failure of up to half of the processors. Similarly, automatic error detection, error correction, error isolation, and recovery from memory failures are supported. The memory systems of computers such as the HP Integrity Superdome offer sophisticated error correction, making them tolerant of transient memory faults and hard faults in individual memory components. High-end scale-up systems also support hot-add of memory. Added memory becomes immediately visible to all the processors of the scale-up system—it is not necessary to restart the server or SQL Server.

Scale-up based systems are often used together with storage area networks (SANs), such as the HP XP 240000. SANs also simplify management compared to the direct-attached storage used in scale-out approaches. For example, SANs typically support operations such as:

·         Very fast copying of database files, which is useful as part of the extract, transform, load (ETL) process.

·         Block-level replication of entire volumes for disaster recovery.

·         Copy-on-write snapshot of a database from the production environment to use for many purposes.

These operations can make it much simpler to manage your data warehouse and they work well in a scale-up computing system.

Best Practices for Building a Scale-Up Data Warehouse Solution

A number of best practices will help you get the best performance and the simplest and most cost-effective system management using SQL Server 2008 in a scale-up configuration. These best practices cover the following:

·         Database design

·         Schema design

·         Physical design

·         Hardware selection and configuration

·         Software selection and configuration

·         Data loading

·         Query specification and tuning

Some of these best practices are independent of the type of hardware being used. Others are more closely related to the use of a scale-up configuration. We focus here on the best practices that are most relevant to scale up, although we touch on other best practices briefly as well. This paper focuses mainly on software techniques associated with SQL Server 2008. For other best practices for using SQL Server on large SMPs from HP with an emphasis on hardware capabilities and configuration, you may wish to consult one of several detailed references on the subject from HP [HP05].

Database Design

This section covers best practices for schema design and for the physical design of a database, such as disk layout, indexes, compression, and summary aggregates.

Schema Design

Database schema design (logical design) can have a major impact on the performance of queries and updates. For best performance, and ease of understanding the data and writing queries, we recommend using a star schema (also known as dimensional modeling [Kim02]) if that is feasible in your data warehouse environment. SQL Server 2008 Enterprise Edition has special star join query optimizations that take advantage of this type of schema [Han07]. If you prefer a normalized schema (typically third normal form [Wiki08a]), you can still get excellent query performance, but certain types of star join optimizations may not be possible for your queries.

The data types you choose for columns have a performance impact. Use integer (4‑byte) surrogate key columns to join your fact table to your dimension tables in a star schema. These compress well and support fast comparisons during joins. For decimal measure values, use the money data type if possible because operations on it are faster than for general numeric or decimal types. Avoid the use of long fields of any type in your fact table unless it is essential to the behavior of your application. This helps keep your fact table smaller and usually speeds up query processing and data loading.
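To make these guidelines concrete, the following minimal sketch shows a dimension and fact table that follow them; all table and column names here are hypothetical and serve as a running example in later sketches in this paper.

-- Hypothetical star-schema fragment: integer surrogate keys, money measures,
-- and no long character or binary columns in the fact table.
CREATE TABLE dbo.DimStore
(
    StoreKey  int          NOT NULL PRIMARY KEY,  -- 4-byte surrogate key
    StoreName nvarchar(50) NOT NULL,
    Region    nvarchar(30) NOT NULL
);

CREATE TABLE dbo.FactSales
(
    OrderDateKey  int      NOT NULL,  -- integer date key in YYYYMMDD form
    StoreKey      int      NOT NULL,  -- joins to DimStore on a 4-byte key
    ProductKey    int      NOT NULL,
    SalesAmount   money    NOT NULL,  -- money is faster to aggregate than general decimal
    OrderQuantity smallint NOT NULL
);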

Physical Database Design

After you choose your logical database design, how you physically structure your data warehouse data has a major performance impact. Physical design considerations include partitioning, indexing, creation of summary aggregates, compression, layout of tables and partitions on storage devices, and configuration of tempdb.

Table Partitioning

With the use of large SMPs for scale up, it becomes feasible to manage and query very large tables, even tables with billions of rows. With tables of this size, it is important to be able to break them down into manageable-sized chunks for system management operations so that these operations complete in a reasonable time. The table and index partitioning feature in SQL Server Enterprise Edition fills this need. For a data warehouse fact table beyond 50 GB, it is recommended that you partition it by time, such as by week. This allows you to bulk purge a week of data at a time with a metadata-only operation. It also can improve performance of loading, and management of summary aggregates maintained using indexed views, as discussed later.

Note: SQL Server table partitioning is separate and distinct from the hardware server partitioning supported by the HP Integrity Superdome.
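As an illustration, the following sketch partitions the hypothetical FactSales table introduced earlier by week on its integer YYYYMMDD date key. The boundary values, single-filegroup mapping, and object names are assumptions, not prescriptions.

-- Weekly partition function and scheme on an integer YYYYMMDD date key.
CREATE PARTITION FUNCTION pfSalesWeekly (int)
AS RANGE RIGHT FOR VALUES (20080107, 20080114, 20080121, 20080128);

CREATE PARTITION SCHEME psSalesWeekly
AS PARTITION pfSalesWeekly ALL TO ([PRIMARY]);  -- one filegroup here for simplicity

-- Create the fact table on the partition scheme, partitioned on the date key.
CREATE TABLE dbo.FactSales
(
    OrderDateKey  int      NOT NULL,
    StoreKey      int      NOT NULL,
    ProductKey    int      NOT NULL,
    SalesAmount   money    NOT NULL,
    OrderQuantity smallint NOT NULL
) ON psSalesWeekly (OrderDateKey);

-- Bulk purge of the oldest week is a metadata-only operation: switch the partition
-- out to an empty staging table with the same structure, then drop or archive it.
-- ALTER TABLE dbo.FactSales SWITCH PARTITION 1 TO dbo.FactSales_Archive;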

Index Design

A basic index design that is often effective for a star schema in SQL Server is to:

1.    Create an integer date key in the format YYYYMMDD as the surrogate key for the date dimension and as a foreign key in the fact table. Create a clustered index on the date key column of the fact table. With the date key in this format, you can express date range filters conveniently and explicitly on the fact table. As we will discuss later, this in turn can give you better parallel query speed in some cases.

2.    Create nonclustered indexes on the fact table on the surrogate keys of the most frequently used dimensions.

3.    Create nonclustered indexes on large dimension tables (such as a Customer dimension with millions of rows) on the columns you search on frequently in those tables.

If you use partitioning to support a sliding-window scenario, partition the fact table and all its indexes on the date key. If your table partition width is one day, there is no need for the clustered index on the date key of the fact table in this situation. Instead, create a clustered index on the most frequently used dimension key column of the fact table besides the date key.
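The following sketch illustrates the index layout described in steps 1 through 3 above, again using the hypothetical FactSales and DimStore tables; the index names are illustrative.

-- Step 1: clustered index on the integer YYYYMMDD date key of the fact table.
CREATE CLUSTERED INDEX CI_FactSales_OrderDateKey
    ON dbo.FactSales (OrderDateKey);

-- Step 2: nonclustered indexes on the most frequently used dimension keys.
CREATE NONCLUSTERED INDEX NCI_FactSales_ProductKey
    ON dbo.FactSales (ProductKey);
CREATE NONCLUSTERED INDEX NCI_FactSales_StoreKey
    ON dbo.FactSales (StoreKey);

-- Step 3: nonclustered index on columns frequently searched in a large dimension table.
CREATE NONCLUSTERED INDEX NCI_DimStore_StoreName
    ON dbo.DimStore (StoreName);

-- If the fact table is partitioned on the date key, these indexes are partition aligned by default.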

Design of Summary Aggregates

Experienced data warehouse developers have long known that one of the best ways to get good data warehouse query performance is to use summary aggregates. These summarize data in the database into sets of rows that are typically much smaller than the fact table. If most of your queries can be answered from these aggregates, which are much smaller than the raw event data, query performance can be improved by an order of magnitude. This in turn can make it possible to meet your users’ performance requirements on a smaller hardware platform, or provide better quality service to your users, or both.

The primary forms of summary aggregates you can use with SQL Server are:

·         SQL Server Analysis Services cubes

·         SQL Server indexed views

·         User-defined summary tables

SQL Server Analysis Services (SSAS) is an online analytical processing (OLAP) system with features tuned for fast, convenient processing of decision support queries [Mel07]. It supports a specialized query language called MDX designed for decision support and analytics. SQL Server 2008 Analysis Services supports highly efficient decision support query processing via materialization and use of summary aggregates. If Analysis Services suits your reporting needs, Microsoft recommends that you use it in conjunction with your relational data warehouse to improve query performance. Like SQL Server, it can benefit from scale-up configurations. Pasumansky provides an excellent collection of resources on best practices and tools associated with SQL Server Analysis Services [Mosh08].

If you wish to keep a relational-database-only configuration with SQL Server without using SSAS, or if you have a significant SQL decision-support query workload that could benefit from aggregates, consider using indexed views [Han05] or user-defined summary tables [Han07] to accelerate your common aggregate queries. Indexed views can be used to maintain summary aggregates automatically. In general, the SQL Server 2008 query processor can automatically use an indexed view to solve an aggregate query matching the structure of the view. If the SQL Server query processor is not able to use an indexed view to solve a query that actually could be solved using the indexed view, you can manually rewrite the query to reference the indexed view by name, and use the NOEXPAND hint on the reference to the indexed view in your query.

Indexed views are automatically maintained by SQL Server when you update your database. If you need to be able to incrementally update your database, such as to correct errors, and have the aggregates be updated automatically, this makes indexed views a good choice. A significant improvement in SQL Server 2008 is partition aligned indexed views [Han07]. This feature enables you to switch partitions of tables in and out even if those tables have indexed views defined on them. This eliminates the need to drop and rebuild indexed views during your ETL process, and can greatly speed up ETL, or make it feasible to use indexed views on partitioned tables when it was not feasible before.

Indexed views do have restrictions regarding what views can be indexed. If you want full flexibility and are able to maintain aggregates yourself in your ETL process, you may wish to use summary tables as described by Adamson [Adam06], and modify your SQL statements to refer to the appropriate summary tables for best performance.
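For illustration, the following sketch defines an indexed view that maintains a daily product-level summary of the hypothetical FactSales table, and a query that forces its use with the NOEXPAND hint; the view and index names are assumptions.

-- Summary aggregate that SQL Server maintains automatically as the fact table changes.
CREATE VIEW dbo.vSalesByDayProduct
WITH SCHEMABINDING
AS
SELECT OrderDateKey,
       ProductKey,
       SUM(SalesAmount) AS TotalSales,
       COUNT_BIG(*)     AS RowCnt   -- required in an indexed view that uses GROUP BY
FROM dbo.FactSales
GROUP BY OrderDateKey, ProductKey;
GO

CREATE UNIQUE CLUSTERED INDEX CI_vSalesByDayProduct
    ON dbo.vSalesByDayProduct (OrderDateKey, ProductKey);
GO

-- Reference the view directly with NOEXPAND if the optimizer does not match it automatically.
SELECT ProductKey, SUM(TotalSales) AS TotalSales
FROM dbo.vSalesByDayProduct WITH (NOEXPAND)
WHERE OrderDateKey BETWEEN 20080101 AND 20080107
GROUP BY ProductKey;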

Compression

Data compression, a new feature available in SQL Server 2008 Enterprise Edition [Agar08], is extremely valuable in a scale-up environment. It can dramatically reduce I/O requirements by (a) reducing the amount of data that must be read from disk, and (b) increasing the percentage of the data that resides in main memory during normal operation. The second effect can be dramatic, speeding up some queries by an order of magnitude or more.

We have observed compression factors ranging from 2X to 7X, with 3X being typical. At 3X, the compressed data is one-third the size of the uncompressed data.

Microsoft SQL Server Customer Advisory Team engineers have determined that for most large data warehouses, once about 20 percent of the data is in main memory, little or no I/O needs to be done to process queries. In practice, this 20 percent represents the most active portion of the database. As a best practice, a rule of thumb is to:

·         Compress your fact tables by using the PAGE compression feature [Agar08].

·         Add enough memory to your system so that at least 10-20% of the compressed data will fit.

Compression is not free. Its main cost is the increased CPU time required to load data. With page compression (the form of compression with the highest compression factor), CPU usage during loading can increase by about a factor of 2.5. Another form of compression, called row compression, results in smaller compression factors, but the cost to compress is significantly less [Agar08]. Query processing time does not differ significantly between row and page compression. The actual benefits and costs of compression depend on your data, hardware, and workload. However, in large-scale internal tests on a 600‑GB and 6‑terabyte data warehouse and a workload of 123 different queries, when using page compression, we observed a 30-40% improvement in query response time with a 10-15% CPU time penalty.

SQL Server 2008 also introduces a backup compression feature [Han07]. Backing up a compressed database is faster than backing up a non-compressed one because reading the source data is faster. Backing up a compressed database normally results in a backup that is 10-15% smaller than when backing up a non-compressed database, depending on the data.

Compression can be applied to individual partitions. If you have a large table that is already partitioned, but the partitions are not compressed, and you do not have a large time window available in which you can compress the full table, consider compressing only the new data during ETL. Then over time, as you age out old partitions and add new ones, the entire data set will gradually become compressed, without the need to extensively re-organize the data all at once.
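A hedged sketch of applying these options to the hypothetical FactSales table follows: estimate the savings, compress the whole table or only the newest partition, and take a compressed backup. The database name, partition number, and backup path are assumptions.

-- Estimate compression savings before committing CPU time to the rebuild.
EXEC sp_estimate_data_compression_savings 'dbo', 'FactSales', NULL, NULL, 'PAGE';

-- Compress the entire fact table with PAGE compression.
ALTER TABLE dbo.FactSales REBUILD WITH (DATA_COMPRESSION = PAGE);

-- Or compress only the newest partition (here, partition 5) as part of ETL.
ALTER TABLE dbo.FactSales REBUILD PARTITION = 5 WITH (DATA_COMPRESSION = PAGE);

-- Backup compression is separate from data compression and is requested per backup.
BACKUP DATABASE SalesDW TO DISK = N'X:\Backups\SalesDW.bak' WITH COMPRESSION;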

Disk Layout

According to the Gartner Group, storage throughput is the most critical resource for a good data warehouse system. It is also one of the most expensive. A rule of thumb for any database system is that multiple small drives are better than one big one. The number of spindles you can apply to your storage system is probably the single most important factor in creating a high-throughput I/O subsystem.

Read-and-write caching in the disk devices or the SAN can be a big benefit up to a certain point. However, very large table scans can quickly fill this hardware-level cache and cause it to not give much benefit.

While a proper configuration of direct-attached storage can equal or exceed the throughput ratings of SANs, SAN systems have features that are highly beneficial for very large databases. SAN snapshots, copy-on-write, and other similar features can make administrative tasks much easier.

To help you determine how good your disk layout is, download SQLIO.exe or IOMETER.exe. Run one of these tools to get a rating for both the number of IO/sec and throughput in bytes/sec. SQL Server uses a variety of block sizes for both reads and writes so it is important to run a variety of tests that will match your usage patterns.

There are three basic parameters to alter when running the I/O tests:  block size, read/write, and serial/random. For SQL Server, the minimum tests to run are 8‑KB and 64‑KB block sizes, both serial and random, and both read and write. Most of the ratings you find on the Web are stated in throughput of bytes/sec for the 64‑KB serial read test (which gives the highest throughput rating). A value of over 250 MB/sec per core for today’s computers is typically sufficient.

A number of configuration settings need to be adjusted for SQL Server depending on the storage subsystem in use. The best advice is to ask your hardware vendor for the recommended settings for use with SQL Server. There is one configuration setting, common to most disk subsystems and host bus adapters (HBAs), that can be beneficial to change. The Queue Depth setting is normally 4, but for SQL Server data warehouse systems it should be set to a minimum of 32. If you have many Fibre Channel connections, use the maximum of 254. For instructions on how to adjust the Queue Depth setting, consult your storage hardware documentation about how to manage the disk controller configuration.

It is also critical to do sector alignment so that one I/O does not cross physical sectors. Use the diskpar.exe program to change the disk alignment.

After you establish that your I/O system can provide the necessary raw throughput, make sure that the filegroups underneath each physical index or heap have adequate I/O bandwidth under them. A data warehouse fact table (or fact table partition if the table is partitioned) ideally should have at least 250 MB/sec I/O capacity per processor core if a significant part of your workload is I/O bound (limited by I/O performance).

There are many explanations published about the proper disk layout for SQL Server. See the SQL Server Customer Advisory Team material [CAT08].

tempdb Configuration

A common problem is having tempdb in a small file on the C: drive. You need to add tempdb files on a SAN or set of direct-attached disks with sufficient I/O throughput, and remove tempdb from the C: drive, because extent allocations are written on the first database file. These allocations may not be fast enough if you leave that file on a relatively slow C: drive. Similarly, log files should also not be on C:.

tempdb should be set to autogrow to handle emergencies, but during normal operation it should not need to grow beyond its original size. It is considered a best practice to create tempdb at an appropriate size up front.

tempdb should also contain one file per CPU core. For example, if you have four dual core CPUs, you should create eight files in tempdb. These files should be of equal size to achieve round-robin loading. This is considered a general best practice but may not fit all projects. And if you do not have heavy tempdb usage, it really doesn’t matter how many files you create.
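As a rough sketch (the file paths and sizes are assumptions, not recommendations), tempdb on an eight-core server could be configured with eight equally sized files similar to the following:

-- Size the primary tempdb data file and add files so there is one per core, all equal in size.
ALTER DATABASE tempdb
MODIFY FILE (NAME = tempdev, FILENAME = N'T:\tempdb\tempdb.mdf', SIZE = 8GB, FILEGROWTH = 1GB);

ALTER DATABASE tempdb
ADD FILE (NAME = tempdev2, FILENAME = N'T:\tempdb\tempdb2.ndf', SIZE = 8GB, FILEGROWTH = 1GB);
-- ...repeat ADD FILE for tempdev3 through tempdev8 with identical sizes...

-- Keep the tempdb log off the C: drive as well.
ALTER DATABASE tempdb
MODIFY FILE (NAME = templog, FILENAME = N'T:\tempdblog\templog.ldf', SIZE = 4GB, FILEGROWTH = 1GB);
-- Moving existing tempdb files takes effect after SQL Server is restarted.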

Whether to place your tempdb files on the same disk fabric as the main database depends on your usage patterns. For some projects it is okay to put the database and tempdb on the same spindles. If your tempdb activity is mostly on in-memory pages, there will be very little physical I/O and it would be okay for tempdb to reside on the same disks. For other projects this would be a mistake. Knowing your usage patterns is essential to making this decision.

Partitioning Server Resources

HP Integrity servers can be partitioned or virtualized, which allows for more efficient operations at lower cost. Virtualization may be implemented in several different ways:

·         nPars (hard partitions - NUMA)

·         Soft partitions (soft NUMA)

·         Pay as you go

·         Manage a single server with Windows System Resource Manager (WSRM) and/or SQL Server 2008 Resource Governor

·         Virtuoso for virtualized containers

The above partitioning strategies of the server allow for various databases and/or applications to share a scale-up server.

Hardware Selection

Best practices for hardware selection include considerations for assessing processor and memory resources.

Processors

The number of processors needed is usually the most difficult hardware characteristic to choose. Several things generally determine how many processors are needed:

1.    The number of queries executed in a specific period of time

2.    Query complexity (simple, medium, complex or very complex)

3.    Desired response time for a single query. This can determine the amount of parallelism you want to achieve per query.

With today’s multi-core processors, the estimate gets slightly more complicated. Each additional core does not add the same processing power as the first. A practical estimate is to assume each additional core adds 50% of the processing power of the first, although this can vary depending on your usage.

Taking your workload to a lab for testing may be the best way to determine the number of processors needed. Hewlett Packard and Microsoft both have a number of facilities with high-end computers that can be scheduled. When creating your suite of tests to run, do not just test the normal part of the application—you may end up with a smaller computer than you need. Add the maintenance tasks to your list of tests. These include index maintenance, backups, and DBCC programs. These maintenance tasks parallelize very well and may require more CPUs than the other parts of your application.

Both Windows Server® 2003 and Windows Server 2008 have a maximum limit of 64 cores. (Note that hyperthreading simulates processor cores, and this limit applies to these simulated cores if hyperthreading is enabled.)

To estimate the amount of hardware required for your application, HP recommends using capacity planning models or, if you have a new application, use either HP’s published Reference Configurations or the sizing tools on HP’s ActiveAnswers Web site.

Memory

Having data already in memory can make query run time more predictable. Having a fast disk subsystem is essential for quickly getting data into memory. It is also best if the data in memory can be used by other queries.

Empirical evidence has shown that 4 GB per core is sufficient for most data warehouse applications. This estimate can be influenced by the number of simultaneous queries or transactions as well as the number of large table/index scans needed.

Software Selection

For the best results in scaling up your computer to its largest configuration, use the following:

·         Windows Server® 2008 Enterprise, Datacenter, or Itanium editions

·         SQL Server 2008 Enterprise Edition (SQL Server 2005 and 2008 Enterprise editions are both available in versions for the Intel Itanium processor)

This gives you the following upper limits:

·         64 cores

·         2 terabytes RAM

SQL Server 2008 does not have a Datacenter edition. The highest edition is Enterprise. Many of the data warehouse scale enhancements in SQL Server 2008 are only available in the Enterprise edition.

SQL Server Settings

There are several issues to consider when configuring the SQL Server 2008 Database Engine for data warehouse workloads.

Configuring “max server memory” [BOL08b]

The SQL Server 2008 Database Engine manages memory dynamically and by default is configured to use as much memory as it can get from the operating system. While this is desirable on a dedicated system where nothing else is running, it may starve other applications of memory if they share the computer. This is especially important if multiple instances of the SQL Server 2008 Database Engine, or other components such as Analysis Services, Reporting Services, and Integration Services, are running on the same system. We recommend that you configure max server memory to a value that accounts for other memory-consuming applications and the operating system itself.

Configuring “max degree of parallelism” [BOL08b]

We recommend setting max degree of parallelism for the server to its default value 0 (unlimited) to be able to fully use the parallelism enhancements in SQL Server 2008. Note that you can control the degree of parallelism of a particular workload with Resource Governor as outlined in the following sections. You can also cap the degree of parallelism for a specific query by using the MAXDOP query hint.
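For illustration, the following statements set the two options discussed above with sp_configure and cap parallelism for a single query with the MAXDOP hint. The numeric values are examples only and should be sized for your own server; the query reuses the hypothetical FactSales table.

-- Leave headroom for the operating system and other services (example value).
EXEC sp_configure 'show advanced options', 1;
RECONFIGURE;
EXEC sp_configure 'max server memory (MB)', 240000;  -- for example, on a 256 GB server
RECONFIGURE;

-- Keep server-wide parallelism at the default (0 = unlimited).
EXEC sp_configure 'max degree of parallelism', 0;
RECONFIGURE;

-- Cap parallelism for one query only, when needed.
SELECT ProductKey, SUM(SalesAmount) AS SalesAmount
FROM dbo.FactSales
GROUP BY ProductKey
OPTION (MAXDOP 8);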

Configuring affinity masks [BOL08b]

Affinity masks (CPU and I/O) are generally used when consolidating multiple instances on the same system or reserving CPUs for other applications, but their use may result in CPU underutilization. There are other means of controlling CPU usage in such environments, including Resource Governor and Windows System Resource Manager (WSRM).

Using Resource Governor [BOL08c]

Resource Governor is a new feature of SQL Server 2008 that helps improve the predictability of workload execution on a server running multiple combined workloads (for example, ad-hoc queries alongside pre-canned reports). Separating workloads into distinct workload groups helps you monitor their resource usage. Furthermore, Resource Governor provides maximum/minimum CPU bandwidth and maximum/minimum aggregate memory usage controls. Each individual workload can be configured for a particular maximum degree of parallelism depending on the style and type of queries being run in the workload [BOL08c]. Setting maximum resource usage helps contain large workloads, while setting minimums guarantees that mission-critical workloads have predictable execution times on a loaded server. For details on parameters controlled by Resource Governor, see SQL Server Books Online [BOL08c] and the Resource Governor best practices white paper [RGWP08].
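A minimal Resource Governor sketch along these lines separates ad-hoc queries from canned reports into two workload groups with different limits. The pool, group, and login names and the numeric limits below are hypothetical and should be replaced with values suited to your own workloads.

-- Resource pools and workload groups with illustrative limits.
CREATE RESOURCE POOL AdhocPool  WITH (MAX_CPU_PERCENT = 50);
CREATE RESOURCE POOL ReportPool WITH (MIN_CPU_PERCENT = 30);

CREATE WORKLOAD GROUP AdhocGroup  WITH (MAX_DOP = 8) USING AdhocPool;
CREATE WORKLOAD GROUP ReportGroup WITH (MAX_DOP = 0) USING ReportPool;
GO

-- Classifier function (created in master) routes sessions to workload groups by login.
CREATE FUNCTION dbo.fnClassifyWorkload() RETURNS sysname
WITH SCHEMABINDING
AS
BEGIN
    RETURN CASE WHEN SUSER_SNAME() = N'ReportUser' THEN N'ReportGroup'
                ELSE N'AdhocGroup' END;
END;
GO

ALTER RESOURCE GOVERNOR WITH (CLASSIFIER_FUNCTION = dbo.fnClassifyWorkload);
ALTER RESOURCE GOVERNOR RECONFIGURE;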

Using Windows System Resource Manager (WSRM) [WSRM03]

SQL Server Resource Governor provides controls to manage CPU and memory consumption within the SQL Server process (sqlservr.exe). WSRM is a recommended way to manage CPU bandwidth allocation between different processes on the server, such as different database instances or other components such as Analysis Services.

WSRM is best used when more than one service that consumes a significant amount of resources is installed on the same computer. These services could include any combination of SQL Server, Analysis Services, Reporting Services, and Integration Services, your own services, and third-party services. In this scenario, WSRM would be used to manage the processor affinity for Analysis Services, Reporting Services, and Integration Services. It is best to use the built-in controls for processor affinity for SQL Server. WSRM would also be used to control the memory consumption of Reporting Services and Integration Services, since Analysis Services and SQL Server have their own controls.

Query Design

Designing your queries well can have a big positive impact on how they perform. Query design is, of course, closely related to schema design. Again, we recommend using a star schema if that is workable for your application. Important things to remember when you design your queries are:

Keep your queries simple.

Specifically, try to use a single star join with grouping and aggregation when possible. SQL Server Enterprise Edition has specific optimizations for these types of queries. In addition, avoid using expressions, functions, or local variables in the WHERE clause, because these cause SQL Server to “guess” the selectivity of a condition; avoiding them makes the query optimizer much more likely to choose a good query plan. See the white paper on statistics in SQL Server for additional details on what constructs to avoid in queries to help the optimizer get a good query plan [HanKol05].

In addition, avoid including parameters or variables (such as @ParameterName or @VariableName) in your data warehouse queries unless you know you need to save on compile time. Use literals (that is, actual numbers, dates, or strings) in your query instead of parameters. This will help you get a better query plan. For data warehouse queries, the cost of compiling the plan is usually dwarfed by the cost of execution, so plan reuse is not important. If you want to parameterize your queries to make it easier to program your application or to improve security as a way to prevent SQL injection [How03], use the OPTION(RECOMPILE) hint on your query to make sure you get the best plan for the specific parameter values you are passing in to your query.
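For example, if a query must be parameterized for application or security reasons, a hedged sketch of this recommendation, using the same modified AdventureWorksDW-style schema as the later examples in this section, looks like the following (the variable names are illustrative):

-- Parameterized date range, with OPTION (RECOMPILE) so the plan is built
-- for the actual parameter values rather than a generic estimate.
DECLARE @StartDateKey int = 20080101;
DECLARE @EndDateKey   int = 20080107;

SELECT TOP 10 p.ProductKey, SUM(f.SalesAmount) AS SalesAmount
FROM FactInternetSales f
JOIN DimProduct p ON f.ProductKey = p.ProductKey
WHERE f.OrderDateKey BETWEEN @StartDateKey AND @EndDateKey
GROUP BY p.ProductKey
ORDER BY SUM(f.SalesAmount) DESC
OPTION (RECOMPILE);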

Consider using temp tables instead of CTEs.

Although common table expressions (CTEs) are convenient for programming, they do not force SQL Server to generate a temporary partial query result. Heavy use of CTEs in your queries may lead to redundant computation of the CTE expression results by the query plan in some cases. For better performance, consider using temp tables instead of CTEs, in order to break down a large business question into manageable chunks. This may of course mean you have to split a single query with several CTEs into multiple queries that retrieve results into temp tables. Temp tables have statistics while table variables do not, which is why we do not recommend table variables here.
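The following sketch, again using the modified AdventureWorksDW-style schema, shows this approach: the intermediate result is materialized once in a temp table (which has statistics) and then reused by later steps.

-- Materialize the expensive aggregation once instead of repeating it as a CTE.
SELECT f.ProductKey, SUM(f.SalesAmount) AS SalesAmount
INTO #ProductSales
FROM FactInternetSales f
WHERE f.OrderDateKey BETWEEN 20080101 AND 20080107
GROUP BY f.ProductKey;

-- Subsequent steps reuse the temp table, possibly several times.
SELECT TOP 10 p.ProductKey, p.ModelName, s.SalesAmount
FROM #ProductSales s
JOIN DimProduct p ON p.ProductKey = s.ProductKey
ORDER BY s.SalesAmount DESC;

DROP TABLE #ProductSales;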

Put an explicit date filter on the fact table.

Avoid implying a range filter on the fact table by using a join with a “Date,” “Time” or “Period” dimension unless it is truly necessary. This will help avoid some situations with partitioned fact tables where partitions are touched even though they cannot contain relevant data for the query. Even more importantly for scale up, this also allows pure hash-join plans to be chosen more frequently. These pure hash-join plans parallelize well and enable you to benefit from the power of all the processor cores on your shared-memory multiprocessor server.

In order to be able to explicitly filter on date on the fact table easily, encode the surrogate keys for date as integers in the form YYYYMMDD. For example, April 25, 2008 would be represented as 20080425.

For example, suppose we have a database with the same schema as the AdventureWorksDW sample database available with SQL Server [Samp08], except that it has a very large partitioned fact table for Internet sales. Furthermore, we modified the DimTime dimension and this fact table to use surrogate keys for the date, which is encoded as YYYYMMDD. To find the top 10 products in total sales between January 1 and 7, 2008, the following would be a good query to use:

 

select top 10 p.ProductKey, sum(f.SalesAmount)

from FactInternetSales f, DimProduct p

where f.ProductKey=p.ProductKey and p.ProductAlternateKey like N'BK-%'

and f.OrderDateKey between 20080101 and 20080107

group by p.ProductKey

order by sum(f.SalesAmount) desc

 

Avoid a query like the following when possible because it filters on date via join:

 

select top 10 p.ProductKey, sum(f.SalesAmount)

from FactInternetSales f, DimProduct p, DimTime t

where f.ProductKey=p.ProductKey and p.ProductAlternateKey like N'BK-%'

and f.OrderDateKey = t.TimeKey

and t.MonthNumberOfYear = 1

and t.CalendarYear = 2008

and t.DayNumberOfMonth between 1 and 7

group by p.ProductKey

order by sum(f.SalesAmount) desc

 

If you must include values from a period dimension in the output of your query to include it in a report or to control the grouping level for aggregation, you can join it to the fact table but still not use it to filter. You would still filter explicitly on the fact table. For example, the following query joins on DimTime but does not filter using it, and pulls in the year and month for display:

 

select top 10 p.ProductKey, t.CalendarYear, t.EnglishMonthName,

sum(f.SalesAmount)

from FactInternetSales f, DimProduct p, DimTime t

where f.ProductKey=p.ProductKey and p.ProductAlternateKey like N'BK-%'

and f.OrderDateKey between 20080101 and 20080107

and f.OrderDateKey=t.TimeKey

group by p.ProductKey, t.CalendarYear, t.EnglishMonthName
order by sum(f.SalesAmount) desc

 

Avoid complex expressions inside aggregates in your query.

For example, instead of SUM(1.1*x) use 1.1*SUM(X). This can reduce the amount of time spent evaluating expressions at the innermost level of query execution.

Avoid mixing DISTINCT and non-DISTINCT aggregates in the same query.

SQL Server sometimes generates multi-consumer spool operators as part of a query plan. The portion of the query plan above a multi-consumer spool typically is not parallelized. For this reason, if you have a performance problem with a query that has a multi-consumer spool, consider rewriting the query to eliminate the multi-consumer spool. Multi-consumer spools typically arise when you mix DISTINCT and non-DISTINCT aggregates in the same query. In the following query, we use AGG1 and AGG2 to represent any aggregates, such as SUM, COUNT, MIN, and MAX. For example, a query or subquery of this form may generate a multi-consumer spool:

 

SELECT a, AGG1(DISTINCT b), AGG2(c)

FROM table

GROUP BY a

 

You can rewrite this to eliminate the use of DISTINCT and thus remove the multi-consumer spool from the plan, as follows:

 

SELECT a, AGG1(b), AGG2(c)

FROM

(

      SELECT a, b, AGG2(c) AS c

      FROM table

      GROUP BY a, b

) as T

GROUP BY a

 

This rewrite is only correct if the aggregates AGG1 and AGG2 can be computed incrementally from their input, regardless of how the input has been ordered, or partially grouped and aggregated. SUM, MIN, and MAX have these required characteristics; for COUNT, the outer aggregate over the inner partial counts must be written as a SUM. However, user-defined CLR aggregates might not have the necessary properties.

Use GROUPING SETS.

The new GROUPING SETS feature in SQL Server 2008 can enable you to avoid redundant computation and shorten your queries. For example, suppose you want to create a report that shows sales totals by year, quarter, and country, in addition to country totals, period totals, and a grand total. A typical way to display this information would be a report with the format shown in the following figure.


Figure 1: Pivot table report layout suitable for displaying data retrieved by GROUPING SETS query

The GROUP BY CUBE notation in SQL is not sufficient to express this query because the period of interest is represented by two columns, not one. The new GROUPING SETS feature enables you to specify all sets of grouping columns you are interested in, in a single relatively simple query as follows:

 

USE AdventureWorksDW;

SELECT      D.CalendarYear, D.CalendarQuarter, T.SalesTerritoryCountry,

      SUM(F.SalesAmount) AS SalesAmount

FROM  dbo.FactResellerSales F      

            INNER JOIN dbo.DimTime D ON F.OrderDateKey = D.TimeKey

            INNER JOIN dbo.DimSalesTerritory T ON

                  F.SalesTerritoryKey = T.SalesTerritoryKey

WHERE D.CalendarYear IN ('2003', '2004') 

GROUP BY GROUPING SETS (

      (D.CalendarYear, D.CalendarQuarter, T.SalesTerritoryCountry),

      (D.CalendarYear, D.CalendarQuarter),

      (T.SalesTerritoryCountry),

      ()

);

 

Without using GROUPING SETS, the simplest way to write this query is as the union of several SELECTs, one for each grouping set, similar to the following:

 

SELECT      D.CalendarYear, D.CalendarQuarter, T.SalesTerritoryCountry,

      SUM(F.SalesAmount) AS SalesAmount

FROM  dbo.FactResellerSales F      

            INNER JOIN dbo.DimTime D ON F.OrderDateKey = D.TimeKey

            INNER JOIN dbo.DimSalesTerritory T ON

                  F.SalesTerritoryKey = T.SalesTerritoryKey

WHERE D.CalendarYear IN ('2003', '2004') 

GROUP BY D.CalendarYear, D.CalendarQuarter, T.SalesTerritoryCountry

UNION ALL

 

SELECT      D.CalendarYear, D.CalendarQuarter, NULL,

      SUM(F.SalesAmount) AS SalesAmount

FROM  dbo.FactResellerSales F      

            INNER JOIN dbo.DimTime D ON F.OrderDateKey = D.TimeKey

            INNER JOIN dbo.DimSalesTerritory T ON

                  F.SalesTerritoryKey = T.SalesTerritoryKey

WHERE D.CalendarYear IN ('2003', '2004') 

GROUP BY D.CalendarYear, D.CalendarQuarter

UNION ALL

 

SELECT      NULL, NULL, T.SalesTerritoryCountry,

      SUM(F.SalesAmount) AS SalesAmount

FROM  dbo.FactResellerSales F      

            INNER JOIN dbo.DimTime D ON F.OrderDateKey = D.TimeKey

            INNER JOIN dbo.DimSalesTerritory T ON

                  F.SalesTerritoryKey = T.SalesTerritoryKey

WHERE D.CalendarYear IN ('2003', '2004') 

GROUP BY T.SalesTerritoryCountry

UNION ALL

 

SELECT      NULL, NULL, NULL,

      SUM(F.SalesAmount) AS SalesAmount

FROM  dbo.FactResellerSales F      

            INNER JOIN dbo.DimTime D ON F.OrderDateKey = D.TimeKey

            INNER JOIN dbo.DimSalesTerritory T ON

                  F.SalesTerritoryKey = T.SalesTerritoryKey

WHERE D.CalendarYear IN ('2003', '2004') 

;

 

This second form of the query is of course much longer, and the query plan generated for it is less efficient, with a run time about 2.5 times longer than the GROUPING SETS form.

Query Tuning

If you are having a performance problem with a query in your data warehouse on a large SMP system, you can often improve performance by tuning the query, that is, modifying it so that it runs faster. Before you resort to tuning a query, make sure you have appropriate indexes and your statistics are up to date. Also, make sure you have chosen your hardware (processors, memory, and storage) and configured it according to the guidelines outlined earlier in this paper. If you have good index structures and fresh statistics, and your hardware is up to the task of running the query in the desired response time, first look at the query and make sure you are following the guidelines outlined in the preceding section, Query Design.

If you still have a performance problem, you may need to tune the query. There are two main classes of query plans for star join queries typical in data warehouses: scan plans and seek plans. Scan plans scan a range of the fact table and join the resulting rows with one or more dimensions. Seek plans find rows from dimensions that qualify, and then seek into the fact table to find joining rows, via clustered or nonclustered indexes on the dimension key columns of the fact table.

Seek plans are typically the most efficient for highly selective queries where a tiny percentage of rows qualify. Scan plans typically are the most efficient when a relatively large fraction of the rows qualify.

If you are getting poor performance for a query even though everything else checks out, the main categories of problems to consider are these:

·         Plan choice: The query plan is not efficient enough and you need a better plan (independent of parallelism).

·         Parallel performance: The query plan is a parallel plan, but the full parallel power of the computer is not being utilized effectively.

You can tell that parallel performance is a problem if you observe the following:

1.    You are getting a serial plan but runtime is too long. You can tell the plan is serial if the plan does not have any parallelism (exchange) operators [BOL08d].

2.    You run the query with degree of parallelism K and during a significant part of execution, fewer than K cores are shown to be busy in Windows Task Manager under the Performance tab.

3.    In the STATISTICS XML plan or runtime .sqlplan file generated from graphical showplan, the RunTimeCountersPerThread information shows that the ActualRows count for each thread is skewed so that a small number of threads are doing most of the work.

For an example that illustrates both symptoms 2 and 3 above, if you run a query that does a nested loop join between a Date dimension and a partitioned fact table, and only three different Date rows qualify, you may see RunTimeCountersPerThread information that looks similar to this:

 

<RunTimeInformation>

  <RunTimeCountersPerThread Thread="7" ActualRows="0" ActualEndOfScans="1" ActualExecutions="1" />

  <RunTimeCountersPerThread Thread="5" ActualRows="0" ActualEndOfScans="1" ActualExecutions="1" />

  <RunTimeCountersPerThread Thread="6" ActualRows="0" ActualEndOfScans="1" ActualExecutions="1" />

  <RunTimeCountersPerThread Thread="4" ActualRows="0" ActualEndOfScans="1" ActualExecutions="1" />

  <RunTimeCountersPerThread Thread="3" ActualRows="0" ActualEndOfScans="1" ActualExecutions="1" />

  <RunTimeCountersPerThread Thread="2" ActualRows="49153977" ActualEndOfScans="1" ActualExecutions="1" />

  <RunTimeCountersPerThread Thread="1" ActualRows="51723098" ActualEndOfScans="1" ActualExecutions="1" />

  <RunTimeCountersPerThread Thread="8" ActualRows="56013568" ActualEndOfScans="1" ActualExecutions="1" />

  <RunTimeCountersPerThread Thread="0" ActualRows="0" ActualEndOfScans="0" ActualExecutions="0" />

</RunTimeInformation>

 

It is normal for thread 0 to have zero actual rows. The other threads ideally will have roughly the same number of rows. In this case, five threads have zero ActualRows counts. This pattern usually arises from a nested loop join that has a small number of outer rows qualifying (equal to the number of threads doing work). You can work around the problem by modifying the query to force the use of a hash join.

For example consider a query like the following (based on the Project REAL schema [Wyatt05]):

 

   select COUNT(*)

   from Tbl_Fact_Store_Sales f, Tbl_Dim_Date d

   where f.SK_Date_ID = d.SK_Date_ID

   and d.Date_Value in

   (N'December 1, 2004', N'December 2, 2004', N'December 3, 2004')

 

With a large fact table, this query typically gets a parallel nested loop join query plan. That plan will use at most three threads because there are only three qualifying outer rows. To improve parallelism, you can rewrite the query to the following equivalent form:

     

select Date_Value, COUNT(*)

   from Tbl_Fact_Store_Sales f, Tbl_Dim_Date d

   where f.SK_Date_ID = d.SK_Date_ID

   and f.SK_Date_ID between 20041201 and 20041203

   group by Date_Value

   option (hash join)

 

This query gets the same answer as the previous one but it eliminates the nested loop join and can be fully parallelized—it can use all the processor cores on the computer. The important changes from the original query are that the range predicate is now expressed directly on the fact table and not on the dimension table, and that the query hint option(hash join) has been added. The BETWEEN predicate on the fact table allows a parallel range seek to be done against the fact table to find the rows for the specified days. The use of a hash join avoids the limit of one thread per outer row faced with a loop join.

If the number of days of data that qualify is greater than or equal to the maximum degree of parallelism at which the query executes, filtering the fact table by using a join with the date dimension typically will perform well, and this type of query tuning is not needed.

Forcing Seek Plans When Scan Is Chosen and Vice Versa

As mentioned earlier, two main types of plans are often possible for a data warehouse star join query: scan plans and seek plans. Depending on the selectivity of the filters on dimensions, there is a crossover point where a scan plan’s execution time equals that of an equivalent seek plan, as illustrated below:


Figure 2. Scan-seek cost crossover as a function of join selectivity (fraction of rows from fact table that qualify after joining with filtered dimension)

To the left of where the curves cross, a seek plan is best. To the right, a scan plan is best. The query optimizer’s internal cost estimates for plans are quite good, but they are not perfect. If the cost characteristics of a query place it near the crossover point for actual scan versus seek execution time, the plan chosen by the optimizer could be a scan when a seek would execute faster or vice versa.

The following example based on the AdventureWorksDW database illustrates this:

 

use AdventureWorksDW;

 

select d.ModelName, SUM(f.SalesAmount)

from FactInternetSales f

, DimProduct d

where d.ProductKey = f.ProductKey

and d.ModelName like N'B%'

group by d.ModelName;

 

The plan for this query obtained by default by the optimizer looked like this in one of our tests (it is a scan plan because it scans the fact table rather than using the nonclustered index on ProductKey on the fact table):


Figure 3. Scan plan for sample query (scans fact table and performs hash join with dimension(s))

The following query has been modified to include hints to force a plan that finds the dimension table rows that qualify, then seeks into the nonclustered index on the key of that dimension on the fact table, and finally looks up the qualifying rows of the fact table:

 

select d.ModelName, SUM(f.SalesAmount)

from FactInternetSales f with(forceseek)

, DimProduct d

where d.ProductKey = f.ProductKey

and d.ModelName like N'B%'

group by d.ModelName;

 

Following is a fragment of this plan (the aggregation part is excluded):


Figure 4. Seek plan for sample query

This seek plan actually executes several times faster than the equivalent scan plan above. If you think you may need a seek plan but you see a scan plan, you can force a seek plan by using a forceseek hint similar to that shown above. Optionally, you may also wish to use an index hint to force seeking into a specific index.

Similarly, if you are getting a seek-style plan (joining a dimension to a nonclustered fact table index, then looking up the associated fact table rows), you can convert it into a plan that scans all or a range of the fact table with hints. For example, consider this query:

 

select d.ModelName, SUM(Fact.SalesAmount)

from Fact /*with(index=1)*/, #tempProd d

where Fact.ProductKey = d.ProductKey

and OrderDateKey between 10 and 200

group by d.ModelName;

 

In this example, Fact is a large fact table with a clustered index on OrderDateKey and a nonclustered index on ProductKey. #tempProd is a small table with two Product dimension rows in it. The default plan for this, without hints, is a seek plan that seeks the ProductKey index on Fact. Adding the index=1 hint (commented in the example) changes the plan so a range seek is performed on the Fact table’s clustered OrderDateKey index and the results are hash-joined with #tempProd. In general, consider using index=1 hint on the fact table, optionally with an option(hash join) query hint, to force a scan plan if that is beneficial.

One important consideration regarding using scan versus seek plans on a large SMP is related to the costing of seek operations. Because it assumes that a seek requires a random I/O (the worst case), the query optimizer tends to over-cost seeks, both

a.    When the I/O system is very fast, and

b.    When all relevant data fits in the main memory buffer pool.

In other words, scan plans may be chosen instead of seeks in these situations, even when seeks would be better. If you are seeing scan plans and you are not getting the performance you need, and a) or b) apply to you, consider forcing a seek plan with hints.

ETL

Scale-up systems such as the HP Integrity Superdome can build indexes very quickly because of their high I/O bandwidth and parallel processing capability. An important part of ETL is index maintenance. Index creation and maintenance can both be time consuming. Depending on the amount of data you will change or add during your ETL, you may wish to maintain the existing indexes in place (by using DML statements), or drop the indexes, do your load, and then rebuild the indexes. If you are changing a very large number of rows, the drop-and-rebuild approach is typically better. Otherwise, maintaining the indexes in place is preferred. Consider experimenting with both approaches to find the best one for your situation.
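If you choose the drop-and-rebuild approach, a hedged sketch using the hypothetical FactSales indexes from earlier in this paper looks like the following; disabling a nonclustered index is equivalent to dropping it while preserving its definition, and the rebuild options shown are examples.

-- Disable nonclustered indexes before a very large load...
ALTER INDEX NCI_FactSales_ProductKey ON dbo.FactSales DISABLE;
ALTER INDEX NCI_FactSales_StoreKey   ON dbo.FactSales DISABLE;

-- ...bulk load the new data here (BULK INSERT, bcp, or SSIS)...

-- ...then rebuild all indexes on the table afterward.
ALTER INDEX ALL ON dbo.FactSales REBUILD WITH (MAXDOP = 0, SORT_IN_TEMPDB = ON);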

When it comes to loading and maintaining a data warehouse by using the ETL approach, SQL Server 2008 and SQL Server 2008 Integration Services provide some definite advantages over the previous release. In this section we briefly highlight some of the new features and how to take advantage of them to speed up your ETL processes.

Pipeline Performance

SQL Server Integration Services (SSIS) enables you to construct ETL solutions by using an interactive development environment with the required business functionality represented visually as a directed graph composed of linked data sources, transformations, and destinations. At run time, the SSIS engine converts this abstraction into a series of pipelines that move the data down the graph, while keeping the data in memory buffers (to avoid staging) and minimizing the number of data-copy operations. Because of these optimizations, the SSIS engine can efficiently process data even within an arbitrarily complex graph.

In SQL Server 2005 the pipeline execution engine allocates worker threads in an affinitized manner, meaning that once a thread has finished its work it is not able to take on any other pending work. This means that ETL solutions often reach a plateau of scalability, beyond which further performance gains are difficult to achieve.

In SQL Server 2008, the resource assignment uses a thread-pooling mechanism whereby a thread can be dynamically assigned to a different task as soon as it has completed its current work. This feature is known as Pipeline Performance since the net result is that the SSIS pipeline engine makes more optimal use of the available hardware, leading to predictable performance, increased parallelism, faster load times, and a higher return on investment. The performance improvement due to this enhancement compared to SQL Server 2005 can be impressive—speedups of a factor of four for the same package running on the same hardware running on SQL Server 2008 versus SQL Server 2005 are not uncommon. A second benefit is that you need to spend less time manually tuning the solution, which means higher productivity.

Lookup Performance

The Lookup component in SSIS tests whether each row in a stream of rows has a matching row in another dataset. A lookup is akin to a database join operation (specifically a relational hash join); the key difference is that the comparison takes place outside the context of the relational engine. Data from one source is hashed into an in-memory cache within a Lookup component and data from the other source is streamed past the Lookup so that specified keys can be compared. This is useful because the two datasets may come from completely different (and possibly non-relational) data sources.

Lookup in SQL Server 2008 Integration Services has several major advantages over the previous version:

·         The in-memory cache can be loaded from any source, unlike Integration Services in SQL Server 2005, which only permitted OleDb sources. For instance, the cache can now be loaded from a flat file, ADO.NET, OleDb, ODBC, a Visual Basic®.Net or C#® script, or even a separate pipeline. This lessens the amount of staging in a solution, and increases the ultimate flexibility of the solution.

·         The cache content can be located in virtual memory or persisted to permanent file storage. This means that within the same package, multiple lookup components can share the same cache. If the cache is saved to disk, it can be shared across different packages. The cache file format is optimized for speed and access to it can be orders of magnitude faster than reloading the reference dataset from the original relational source.

·         Lookup introduces the miss-cache feature, which caches the key values that have no matching entries in the reference dataset, avoiding redundant and expensive trips to the database. For instance, if a value is requested from the Lookup transformation but is not found, the next time that value is requested the Lookup does not try to locate it again; it immediately returns a ‘not found’ result. The miss-cache feature alone can contribute up to a 40 percent performance improvement in some scenarios. In addition to the partial caching described above, Lookup also supports full caching (all key values are loaded into the cache up front) and no caching of key values. Full caching requires that the entire key value set fit into memory. The no-caching option is recommended only when repeated lookups are rare.

Other enhancements to the lookup component include optimized I/O routines leading to faster cache loading and lookup operations. Lookup is also easier to use and program thanks to its more intuitive user interface, improved error handling and reporting, and the ability to make changes to query statements at run time. Compared to Lookup in the previous version of SSIS, there are now more options that provide better control and hence the potential for improved performance. Users are encouraged to take advantage of the ability to reuse the lookup cache across packages and choose a specific caching option (full, partial, no caching) for best results.

ETL Features in the SQL Server Engine

Though the following features are not provided directly by SSIS—but rather by the SQL Server 2008 Database Engine—they are directly usable by SSIS and thus help deliver high performance ETL solutions with the programming convenience and flexibility of SSIS. They are also useful with custom ETL solutions.

Change Data Capture

One of the biggest performance bottlenecks in an ETL solution can be the identification and application of the set of rows that changed during some time period, known as the change set or delta. In many operational systems, there is no direct way to identify which rows were changed within the last 24 hours (or however long the batch window is) and so the question of which rows comprise the last day’s operations is difficult to answer. Even if the rows are tagged with a change timestamp, they are often not tagged with the operation that changed them—whether a specific change was an update, insert or delete. The ETL developer is forced to come up with imaginative ways of dealing with this issue. Often the source system is augmented with timestamps, triggers, and other database objects to track changes, but these changes can be intrusive and affect the operational behavior and performance of the source.

SQL Server 2008 introduces a compelling new feature to mitigate this problem. Change data capture uses an asynchronous log reader to propagate changes, at the table grain, to capture instance tables in a transactionally consistent manner, and exposes them through windowed API functions:

·         Asynchronous Log Reader: Whenever a change is made to the database, the engine first writes an entry to the database log before writing it to the database store. This is an existing SQL Server feature that change data capture uses to good effect. A change data capture log reader job can be scheduled to run at regular intervals or during server idle time. The log reader trawls the logs and copies the changes into a configured set of tables along with metadata to help identify when and how the original changes were made. Because the reader operates asynchronously, the cost on the source system can be amortized over a longer period (or moved to non-operational hours such as 3 A.M.). Also, the mechanism requires no changes to the schema of the source system (though it needs to be hosted on SQL Server 2008).

·         Table grain: Change data capture functionality can be switched on or off at the table level. This means the solution can be implemented incrementally—change data capture is not an all-or-nothing proposition. When a table is enabled for change data capture, the database automatically generates a capture instance table (also called a shadow table) that contains the data changes along with extra metadata to audit the operations. If the shape of the source table changes (for instance a column is added or deleted) the technology can continue to deliver a stable shape in order to not break any downstream processes while they are updated.

·         Transactionally-consistent: Since the change data capture feature relies on the database log, it is fully aware of transactions and thus can provide related changes across multiple tables correctly.

·         Windowed API functions: A host of functions are available to query and manage the capture instance tables. For instance, the APIs can be used to select all the data changes from the last batch window (such as 24 hours), with each row tagged with its operation (insert, update, or delete) as well as a mask identifying which specific columns changed.

By using the change data capture feature, ETL solutions can be designed to be much more robust and deliver increased performance and scalability. Lowering the cost of change capture by using change data capture can lower ETL processing time significantly, shortening your batch window, increasing availability of your data warehouse, and reducing ETL hardware costs. Change data capture can also enable “trickle feed” scenarios without the need for changes to your operational applications.
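
For illustration, the following Transact-SQL sketch shows how change data capture might be enabled and queried. The database and table names (SourceDB, dbo.Orders), the capture-instance name, and the 24-hour batch window are assumptions for this example; the system procedures and functions shown are part of the change data capture feature.

USE SourceDB;   -- hypothetical source database hosted on SQL Server 2008
GO
EXEC sys.sp_cdc_enable_db;   -- enable change data capture for the database
GO
EXEC sys.sp_cdc_enable_table -- enable capture for one table; creates capture instance dbo_Orders
    @source_schema = N'dbo',
    @source_name   = N'Orders',
    @role_name     = NULL;
GO
-- Retrieve the changes from the last 24-hour batch window, tagged with the operation type.
DECLARE @from_lsn BINARY(10), @to_lsn BINARY(10);
SET @from_lsn = sys.fn_cdc_map_time_to_lsn('smallest greater than or equal',
                                           DATEADD(HOUR, -24, GETDATE()));
SET @to_lsn   = sys.fn_cdc_get_max_lsn();
-- The __$operation column identifies each change: 1 = delete, 2 = insert, 4 = update.
SELECT *
FROM cdc.fn_cdc_get_all_changes_dbo_Orders(@from_lsn, @to_lsn, N'all');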

Merge

The MERGE statement (also known as upsert) enables the database to apply a source rowset containing a mixture of inserted, updated, and deleted rows against a destination table within a single operation. The merge clause enables you to specify how each operation type should occur—including advanced semantics such as SCD-2 (slowly changing dimension Type 2) behavior.

The common problem this statement solves is that when the source system provides a batch of rows, some of which are new (inserts) and some of which already exist in the destination but with different values (updates), there is no easy way to distinguish the two. One solution is to use an SSIS Lookup operation to determine which rows are new and which already exist; the easiest way to accomplish this is by using the Slowly Changing Dimension Wizard in SQL Server 2008 Integration Services, which automatically configures the appropriate transformations. However, this approach may not scale well, because the Lookup component requires that one side of the join be explicitly cached in memory. Although that is fine for simple cases, if the data to be cached is large or wide, the overhead and time of physically loading the data into memory, as well as the hardware requirements, can make this approach untenable. The other solution is to join the source and destination in a relational UPDATE operation (updating the existing rows), followed by an INSERT that uses a left outer join to determine which rows to insert (inserting the new rows). The drawback of this solution is that the two operations must be done serially and you incur the expense of two joins.

In SQL Server 2008, the MERGE statement solves this type of problem. It enables you to specify how each row from the source should be treated in relation to the destination. Because it uses only one (internal) join, it offers increased performance, and coupled with its granular syntax options it delivers a powerful solution.

For example, the following Transact-SQL script based on the AdventureWorks and AdventureWorksDW sample databases shows how a Customer dimension table can be updated or have rows inserted into it by using a single MERGE statement.

 

USE AdventureWorksDW;
GO

MERGE dbo.DimCustomer AS [Dest]
USING (
       SELECT
              ContactID,
              N'AW' + RIGHT(N'0000000' + CONVERT(NVARCHAR(10), ContactID), 8) AS [Key],
              FirstName,
              LastName,
              Phone
       FROM AdventureWorks.Person.Contact
) AS [Source]
ON [Dest].CustomerAlternateKey = [Source].[Key]
WHEN MATCHED THEN UPDATE SET
       [Dest].FirstName = [Source].FirstName,
       [Dest].LastName = [Source].LastName,
       [Dest].Phone = [Source].Phone
WHEN NOT MATCHED BY TARGET THEN INSERT (
       GeographyKey, CustomerAlternateKey, FirstName,
       LastName, Phone, DateFirstPurchase
) VALUES (
       1, [Source].[Key], [Source].FirstName,
       [Source].LastName, [Source].Phone, GETDATE()
);

Minimally Logged Insert

Minimally logged insert can have a large impact on the load performance of the ETL process. Under write-ahead logging, data is normally written to the log and then to disk so that it can be recovered. When full point-in-time recovery is not required, for example when inserting large amounts of data into a table during ETL, the data can effectively be written to disk only once.

Minimal logging was introduced in SQL Server 2005 and refers to logging only the information that is required to roll back the transaction without supporting point-in-time recovery. Minimal logging is only available under the bulk logged and simple recovery models. Operations that can be minimally logged in SQL Server 2005 include bulk import operations, SELECT INTO, and index creation and rebuild. SQL Server 2008 extends the optimization to INSERT INTO…SELECT FROM Transact-SQL operations that insert a large number of rows into an existing table under either of the following conditions:

·         Inserting into an empty table that has a clustered index and no non-clustered indexes

·         Inserting into a heap that has no indexes but that can be non-empty

Minimal logging greatly improves large-scale INSERT operations by increasing performance and reducing the amount of log space required. For a discussion of table structure and DML statement requirements for minimal logging, see SQL Server Books Online.
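
As a minimal sketch (the table, database, and file names are hypothetical), the following Transact-SQL shows the general pattern: switch to the bulk-logged recovery model, perform the INSERT…SELECT with a table-level lock hint so that the insert can qualify for minimal logging, and then return to full recovery. Note that point-in-time restore is not possible for a log backup interval that contains minimally logged operations.

ALTER DATABASE AdventureWorksDW SET RECOVERY BULK_LOGGED;
GO
-- Bulk insert into a heap with no indexes; the TABLOCK hint helps the insert
-- qualify for minimal logging.
INSERT INTO dbo.FactSalesStage WITH (TABLOCK)
SELECT *
FROM SourceDB.dbo.DailySalesExtract;   -- hypothetical staging source
GO
ALTER DATABASE AdventureWorksDW SET RECOVERY FULL;
GO
-- Take a log backup after returning to full recovery.
BACKUP LOG AdventureWorksDW TO DISK = N'G:\Backup\AdventureWorksDW_log.trn';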

Periodic Maintenance

This section covers the maintenance tasks that need to be done periodically to keep a SQL Server system running smoothly. These include, but are not limited to, the following tasks:

·         Index maintenance

·         Statistics maintenance

·         Database consistency checks

For the largest databases, the maintenance tasks are often the most challenging part of operations, and can be more difficult than tuning the queries. Because maintenance tasks can parallelize across many CPUs, they may require more CPUs than you might originally estimate for your user workload.

Index Maintenance

Indexes on tables that are frequently modified (via insert, update, and delete statements) need to be rebuilt occasionally. It is recommended that you rebuild your indexes before they reach 50 percent fragmentation. If your table is too big to rebuild its indexes routinely, consider partitioning the table and indexes so that you can rebuild one index partition at a time. If that is not feasible, at least reorganize the indexes; reorganizing does not reclaim unused space the way a rebuild does, but it does put the pages back in order. Use the sys.dm_db_index_physical_stats dynamic management function to check the amount of fragmentation in a table or partition.
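
The following Transact-SQL sketch illustrates the pattern; the table name, index name, and partition number are illustrative, not prescriptive.

-- Check fragmentation for the indexes on a fact table.
SELECT i.name AS index_name,
       ps.partition_number,
       ps.avg_fragmentation_in_percent
FROM sys.dm_db_index_physical_stats(
         DB_ID(N'AdventureWorksDW'),
         OBJECT_ID(N'dbo.FactInternetSales'),
         NULL, NULL, 'LIMITED') AS ps
JOIN sys.indexes AS i
    ON i.object_id = ps.object_id AND i.index_id = ps.index_id;

-- Rebuild a single partition of a partitioned index rather than the whole index.
ALTER INDEX PK_FactInternetSales ON dbo.FactInternetSales REBUILD PARTITION = 12;

-- Reorganize in place when a full rebuild does not fit the maintenance window.
ALTER INDEX ALL ON dbo.FactInternetSales REORGANIZE;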

Index maintenance is a resource-intensive process if you have one or more indexes on a large fact table. Scale-up systems such as the HP Integrity Superdome can be very useful for this process because they can speed up the index maintenance operations, shortening your maintenance window.

Statistics Maintenance

Statistics maintenance is critical for getting the best query plans in SQL Server 2008. For most databases, it is best to leave the defaults of Auto Create Statistics and Auto Update Statistics turned on. Even with these on, you still need to periodically update the statistics on your most frequently modified tables. Create jobs that run these updates periodically, perhaps once a week.

Statistics are still by table, not by partition. SQL Server does not yet have the ability to manage statistics by partition.

Use the sampling rate that gives you the best plans. Most databases are fine with the default sampling rate, but if you start getting bad query plans, experiment with higher sampling rates until the plans improve. When the maintenance window permits, start with FULLSCAN and keep it unless gathering statistics takes too much time; if it does, reduce the sampling rate.
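
For example (the table name is illustrative), statistics on a heavily modified fact table can be refreshed in a weekly job as follows.

-- Refresh statistics with a full scan if time permits.
UPDATE STATISTICS dbo.FactInternetSales WITH FULLSCAN;

-- If FULLSCAN takes too long, fall back to a fixed sampling rate.
UPDATE STATISTICS dbo.FactInternetSales WITH SAMPLE 25 PERCENT;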

Backup

A good backup strategy is essential for a well-managed data center. Use hardware snapshot backups whenever possible to get near-instantaneous backups. If your hardware does not have that feature, use the compressed backups built into SQL Server 2008 or available from third-party vendors that supply compressed backup software. Compression saves valuable disk space and is faster than an uncompressed backup.

If your database is very large, striping the backup across multiple files and tapes is the best way to increase speed. The number of files to use depends on the saturation rate of your disk subsystem.
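
As a sketch (the file paths and the number of stripes are illustrative), a compressed backup striped across multiple files looks like the following.

BACKUP DATABASE AdventureWorksDW
TO DISK = N'G:\Backup\AdventureWorksDW_1.bak',
   DISK = N'H:\Backup\AdventureWorksDW_2.bak'
WITH COMPRESSION, CHECKSUM, STATS = 10;   -- compress, generate backup checksums, report progress every 10 percent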

Although a complete discussion of backup and disaster recovery strategies is beyond the scope of this paper, it is best to use Microsoft Data Protection Manager in your overall data center protection plans. See the product details at https://www.microsoft.com/systemcenter/dpm/default.mspx.

For more discussion on these topics, see the following sites:

·         SQL Server Customer Advisory Team Best Practices

·         SQL Server Customer Advisory Team Blog

·         SQL Server Customer Advisory Team Articles

Internal Scale Testing During SQL Server 2008 Development

During the development of the SQL Server 2008 release, Microsoft conducted extensive internal performance testing on a star schema data warehouse. This testing, and the feedback it provided to the product development team, helped us to achieve sizeable performance gains on large HP SMP systems compared with SQL Server 2005. HP worked closely with Microsoft to make these tests possible, providing a 64-core HP Integrity Superdome for the tests.

Workload

Microsoft developed a set of 123 queries to run against the test database. The queries include:

1.    A subset of queries from a draft proposal for the TPC-DS benchmark [TPC08], a realistic, modern data warehouse benchmark, modified to run against the schema of the test database;

2.    Modified customer queries; and

3.    A collection of queries that examine specific additional scenarios (for example, table scans and GROUPING SETS queries).

The experiments were conducted at two scale points: 600 GB and 6 terabytes; the data was artificially scaled to these sizes. The largest fact table contains 2.5 billion rows in the 600-GB configuration and 25 billion rows in the 6-terabyte configuration.

Hardware Configuration

The primary hardware we used for internal scale testing for SQL Server 2008 development centered around a 64-core HP Integrity Superdome furnished by HP to assist with development of the software release. The Superdome was split into three partitions as follows:

 

Name     | Testing purpose                                                          | Cores | Memory | Total available storage
---------|--------------------------------------------------------------------------|-------|--------|------------------------
ASDune   | Relational data warehousing with SQL Server Database Engine             | 32    | 128 GB | 26-terabyte RAID 5 (Data/Temp); 7-terabyte RAID 10 (Backup); 2-terabyte RAID 10 (Log)
ASDune1  | Analysis Services                                                        | 16    | 256 GB | 7-terabyte RAID 10
ASDune2  | OLTP with SQL Server Database Engine, and ETL with Integration Services | 16    | 64 GB  | 7-terabyte RAID 5

 

We used direct-attached storage for the tests, although for ease of management, you may prefer to use a SAN in your environment. The I/O throughput on the partition of the system named ASDune has the following characteristics, as measured by using SQLIO with a 64-KB block size:

 

                 | Sequential reads | Random reads | Sequential writes | Random writes
-----------------|------------------|--------------|-------------------|--------------
Data (26 LUNs)   | 4100 MB/sec      | 2170 MB/sec  | 1280 MB/sec       | 647 MB/sec
Backup (13 LUNs) | 2675 MB/sec      | 1750 MB/sec  | 1130 MB/sec       | 800 MB/sec

 

For additional test runs on the 600-GB database, we also used the following computer, a four-processor, quad-core HP DL585:

 

Name     | Testing purpose                             | Cores | Memory | Total available storage
---------|---------------------------------------------|-------|--------|------------------------
DWSDL585 | Relational data warehousing with SQL engine | 16    | 64 GB  | 2 terabytes, direct attached
 

The I/O capability of DWSDL585 is as follows:

 

                    | Sequential reads | Random reads | Sequential writes | Random writes
--------------------|------------------|--------------|-------------------|--------------
Data (HP SA P800)   | 320 MB/sec       | 35 MB/sec    | 38 MB/sec         | 31 MB/sec
Tempdb (HP SA P400) | 185 MB/sec       | 10 MB/sec    | 11 MB/sec         | 8 MB/sec

Performance Results

We divided the 123 queries into five groups, from fastest to slowest (Group 1 was the fastest and Group 5 the slowest), based on their runtimes in a baseline test on SQL Server 2005 SP2. The performance results are described below. All queries were run “cold start”: the buffer pool was emptied by using DBCC DROPCLEANBUFFERS before each query. The performance improvement for SQL Server 2008 is likely to be even greater in “warm start” tests that do not clear the buffer pool, because compression allows a larger percentage of the working set to remain in memory and reduces I/O.

 

Computer name: ASDune

Database size (uncompressed): 6 terabytes

Baseline version: SQL Server 2005 SP2

Compared version: SQL Server 2008 internal build, January 2008 (close to the RC0 version), with page compression enabled on fact tables

Geometric mean of query response time improvement over all queries (no change = 100%; higher is better): 254%

Geometric mean of query response time improvement, from the shortest to the longest running group of queries:

·         Group 1 (shortest): 108%

·         Group 2: 309%

·         Group 3: 469%

·         Group 4: 268%

·         Group 5 (longest): 174%

Sum of total query runtime for the compared version as a percentage of the baseline version, excluding queries that did not complete within a timeout period on either version: 60% (that is, on average SQL Server 2008 query response times were only 60% of those on SQL Server 2005)

 

Computer name: DWSDL585

Database size (uncompressed): 600 GB

Baseline version: SQL Server 2005 SP2

Compared version: SQL Server 2008 internal build, January 2008 (close to the RC0 version), with page compression enabled on fact tables

Geometric mean of query response time improvement over all queries (no change = 100%; higher is better): 202%

Geometric mean of query response time improvement, from the shortest to the longest running group of queries:

·         Group 1 (shortest): 122%

·         Group 2: 192%

·         Group 3: 252%

·         Group 4: 335%

·         Group 5 (longest): 176%

Sum of total query runtime for the compared version as a percentage of the baseline version, excluding queries that did not complete within a timeout period on either version: 37% (that is, on average SQL Server 2008 query response times were only 37% of those on SQL Server 2005)

 

The geometric mean of the query runtime change for each group was computed as the geometric mean of the per-query ratio r(i) of runtimes between the baseline version and the compared version. Specifically, for query number i,

r(i) = BaselineRuntime(i) / VersionRuntime(i)

The geometric mean is a special type of average that tends to dampen the effect of very large and small values [Wiki08b].
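
As an aside, if the per-query runtimes for the two versions were stored in a table (the table and column names here are hypothetical), the geometric mean of the speedup ratios could be computed directly in Transact-SQL by using the identity geometric mean = EXP(AVG(LOG(ratio))).

-- Geometric mean of per-query speedup ratios; one row per query, runtimes in seconds.
SELECT EXP(AVG(LOG(BaselineRuntimeSec / VersionRuntimeSec))) AS geometric_mean_speedup
FROM dbo.QueryRuntimes;   -- hypothetical results table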

What does this mean for you? SQL Server 2008 is about twice as fast as SQL Server 2005 for a demanding, I/O-bound data warehouse query workload on the same hardware. The only new feature explicitly used was page compression; no application changes were required. For a system that was I/O bound on an earlier version of SQL Server and is no longer I/O bound with SQL Server 2008 due to the effects of compression, even greater improvement may be observed.

We have described repeated, routine scale tests on 600‑GB and 6‑terabyte data warehouse configurations. By using the scaling guidelines described in this paper, and large-scale SMP systems like the HP Integrity Superdome with a high-performance I/O subsystem and large main memory, SQL Server 2008 can readily scale to handle data warehouse installations of 30 terabytes and beyond.

Conclusion

Scaling up your data warehouse on a large SMP with SQL Server 2008 provides simple system management, high performance, and resilience to hardware failure. New features in SQL Server 2008, including partitioned table parallelism, nested loop join parallelism enhancements, enhanced star joins, and data compression can give impressive speedups on large SMPs by better using the available CPUs, memory, and I/O capacity. Many customers can expect their performance to nearly double on the same hardware, merely by upgrading to SQL Server 2008 and compressing their fact tables. Given the increased number of cores and main memory available at the same price point since SQL Server 2005 was released, those customers setting up a complete new hardware/software system running SQL Server 2008 can expect major performance gains. What could SQL Server 2008 and a state-of-the-art SMP computer like an HP Integrity Superdome mean for you, your data warehouse installation, and your business users?

 

For more information:

·         SQL Server Web site

·         SQL Server TechCenter

·         SQL Server Developer Center

Did this paper help you? Please give us your feedback. On a scale of 1 (poor) to 5 (excellent), how would you rate this paper, and why? For example:

·         Are you rating it high due to having good examples, excellent screenshots, clear writing, or another reason?

·         Are you rating it low due to poor examples, fuzzy screenshots, or unclear writing?

This feedback will help us improve the quality of the white papers we release. Send feedback.

 

End Notes

1.    Alternatively, you may wish to use the new DATE built-in data type as the surrogate key type for the date dimension. However, be aware that seeks into the fact table are not normally supported for operations that use date functions. For best performance, use <, >, <=, >=, BETWEEN, or IN ( … list of keys … ) to filter fact table rows by date, regardless of the type of the column used as the surrogate key for date.
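
For example (the table and column names are hypothetical), a range predicate on the date surrogate key supports a seek into the fact table, whereas wrapping the column in a date function generally does not.

-- Filter by a range of date surrogate keys (here an integer key in yyyymmdd form).
SELECT SUM(f.SalesAmount)
FROM dbo.FactSales AS f
WHERE f.OrderDateKey BETWEEN 20080101 AND 20080131;

-- Avoid predicates such as WHERE YEAR(f.OrderDate) = 2008 AND MONTH(f.OrderDate) = 1,
-- which prevent an index seek on the date column.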

2.    Enterprise Edition of 64-bit SQL Server 2008 uses so-called large memory pages (see Large Page Support in the MSDN Library), which are not reflected in the memory usage of the SQL Server process. In this case, it may appear that SQL Server is not using memory even though the system is starved for memory. This is similar in concept to using the AWE mechanism on a 32-bit system. Information about memory usage by SQL Server should be obtained from either the SQL Server performance counters or the DBCC MEMORYSTATUS command.
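
For example, the following queries report memory use from inside the server; sys.dm_os_process_memory is a dynamic management view introduced in SQL Server 2008.

-- Memory in use by the SQL Server process, including large-page allocations.
SELECT physical_memory_in_use_kb,
       large_page_allocations_kb,
       locked_page_allocations_kb
FROM sys.dm_os_process_memory;

DBCC MEMORYSTATUS;   -- detailed breakdown of memory usage by component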

 

References

[Kim02] Ralph Kimball and Margy Ross, The Data Warehouse Toolkit: The Complete Guide to Dimensional Modeling (Second Edition), Wiley Publishing, Inc., April 26, 2002.

 

[Mun06] Joy Mundy, Warren Thornthwaite, and Ralph Kimball, The Microsoft Data Warehouse Toolkit: With SQL Server 2005 and the Microsoft Business Intelligence Toolset, Wiley Publishing, Inc., Feb 13, 2006.

 

[HP05] Tuning SQL Server 2005 on HP Integrity Servers, Hewlett Packard Co., (http://docs.hp.com/en/8875/WIE-SQLTuning-1006-00.pdf), 2005.

 

HP reference configuration for Business Intelligence: HP Integrity Superdome servers with 64 Intel Itanium 2 processors and Microsoft SQL Server 2005, (http://h71028.www7.hp.com/ERC/downloads/4AA0-8979ENW.pdf), 2005.

 

Best practices for Microsoft SQL Server 2005 on HP Integrity Superdome servers for Very Large Database (VLDB) BI solutions, http://h71028.www7.hp.com/ERC/downloads/4AA0-8981ENW.pdf, 2005.

 

[Han07] Eric N. Hanson et al., An Introduction to New Data Warehouse Scalability Features in SQL Server 2008, December 2007.

 

[Wiki08a] Third Normal Form, Wikipedia, http://en.wikipedia.org/wiki/Third_normal_form, 2008.

 

[Mel07] Edward Melomed, Irina Gorbach, Alexander Berger, and Py Bateman, Microsoft SQL Server 2005 Analysis Services (SQL Server Series), Sams Publishing, 2007.

 

[Mosh08] Mosha Pasumansky, Microsoft OLAP Blog, http://my.bloglines.com/preview?siteid=689772, 2008.

 

Mosha Pasumansky, Microsoft OLAP Resource List, http://www.mosha.com/msolap/util.htm, 2008.

 

[Han05] Eric N. Hanson, Improving Performance with SQL Server 2005 Indexed Views, May, 2005.

 

[Adam06] Christopher Adamson, Mastering Data Warehouse Aggregates: Solutions for Star Schema Performance, Wiley Publishing, Inc., 2006.

 

[Agar08] Sunil Agarwal, Data Compression in SQL Server 2008, white paper in progress, 2008.

 

[CAT08] SQL Server Customer Advisory Team Best Practices, 2008.

     SQL Server Customer Advisory Team blogs.

 

[BOL08a] Deploying a Scalable Shared Database, SQL Server 2008 Books Online.

 

[BOL08b] Setting Server Configuration Options, SQL Server 2008 Books Online.

 

[BOL08c] Resource Governor Concepts, SQL Server 2008 Books Online.

 

[BOL08d] Parallel Query Processing, SQL Server 2008 Books Online, 2008.

 

[RGWP08] SQL Server 2008 Resource Governor Best Practices, white paper in progress, 2008.

 

[WSRM03] Windows System Resource Manager, 2003.

 

[HanKol05] Eric N. Hanson and Lubor Kollar, Statistics Used by the Query Optimizer in SQL Server 2005, 2005.

 

[How03] Michael Howard and David LeBlanc, Writing Secure Code, 2nd edition, Microsoft Press, 2003, pg. 399.

 

[Samp08] Microsoft SQL Server Product Samples: SQL Server Sample Databases, https://www.codeplex.com/MSFTDBProdSamples, 2008.

 

[Wyatt05] Len Wyatt, Project REAL: Technical Overview, 2005.

 

[TPC08] TPC Benchmark™ DS (Decision Support), Draft Specification, Revision 32, Transaction Processing Performance Council (TPC), http://www.tpc.org/tpcds/default.asp, 2008.

 

[Wiki08b] Geometric mean, http://en.wikipedia.org/wiki/Geometric_mean.