Incremental Data Loads (DW)---a Technical Reference Guide for Designing Mission-Critical DW Solutions
Want more guides like this one? Go to Technical Reference Guides for Designing Mission-Critical Solutions.
Incremental data loads are typically the everyday "regular" loads into the data warehouse. These "jobs" are scheduled on recurring basis. The load frequency can vary between real-time to daily, weekly, monthly, and so on. Often times, incremental load strategies have matured slowly over time where the early load jobs were relatively unplanned/unsophisticated ("just get the data in") to more mature environments where loads are carefully planned (error-recovery, minimize data fragmentation, data validation, and so on).
The following resources provide reference material and additional information.
When loading incremental data into a populated, clustered, partitioned table, you will incur full logging if you do not use Trace Flag 610. If the order if the source data is not the same as the clustered or partition key, you will incur full logging due to page split even if you use TF610.
The Data Loading Performance Guide
SQL Server Best Practices Article
Bulk Loading into a Table with Concurrent Queries
Case Studies and References
Questions and Considerations
This section provides questions and issues to consider when working with your customers.
Plan the Load Sequence carefully as certain tables may need to be loaded first to help verify loads of other tables. This is typically the case with loading dimensions first to generate surrogate keys, followed by fact loads where dimension surrogate keys are looked up and inserted in facts. Consider strategy for addressing dimension inferred members which will be applied to handle generating dimension members during fact load where dimension lookups yield no match.
Consider using incremental loading strategy for historical loads, when format and grain of historical files allow this, rather than loading all historical data at once. This will also help with reducing the number of SSIS packages by combining historical and incremental load routines in a single package. Additionally, partitioning of historical loads should have a positive effect on preventing tempdb and log file sizes from abnormal expansion.
Consider incorporating Lookup and Conditional Split SSIS components to identify and manage changes in data being loaded to implement appropriate increment loading strategy (see SSIS Design Pattern – Incremental Loads5).
Take advantage of Change Data Capture in SQL 2008 Enterprise for incremental data loads. In SQL Server, change data capture offers an effective solution to the challenge of efficiently performing incremental loads from source tables to data marts and data warehouses. See Improving Incremental Loads with Change Data Capture6 and How To Process Change Data Capture (CDC) in SQL Server Integration Services SSIS 2008.7
Consider CDC solutions for other platform, such as Oracle CDC for SSIS (see Attunity Oracle-CDC for SSIS8).
If CHECKSUM is considered as an option to assist in determining changes for incremental loads, make sure that columns used for computations do provide for true uniqueness of records. See SSIS – Using a checksum to determine if a row has changed.9
When using SSIS, avoid writing to intermediate work tables as much as possible. Read the data once and write the data once.
Following are the full URLs for the hyperlinked text.
1 The Data Loading Performance Guide http://msdn.microsoft.com/en-us/library/dd425070.aspx
2 SQL Server Best Practices Article http://technet.microsoft.com/en-us/library/cc966380.aspx
3 Bulk Loading into a Table with Concurrent Queries http://sqlcat.com/te
4 Project REAL: Business Intelligence ETL Design Practices http:/
5 SSIS Design Pattern – Incremental Loads http://sqlblog.com/blogs/
6 Improving Incremental Loads with Change Data Capture http://msdn.microsoft.com/en-us/library/bb895315.aspx
7 How To Process Change Data Capture (CDC) in SQL Server Integration Services SSIS 2008 http://www.mssqltips.com/tip.asp?tip=1755
8 Attunity Oracle-CDC for SSIS http://www.attunity.com/oracle_cdc_for_ssis
9 SSIS – Using a checksum to determine if a row has changed http://www.ssistalk.com/2007/03/09/ssis-using-a-checksum-to-determine-if-a-row-has-changed/