Initial or One-Time Data Loading (DW): A Technical Reference Guide for Designing Mission-Critical DW Solutions


Initial data loads, or large one-time loads, are extremely important because it is with these loads that the database design is often verified from an ETL/ELT perspective, and it is with these loads that many issues across the entire data lineage stream first appear. Initial data loads might not be representative of the "regular" data loads; for example, they may need to be set up as special jobs that load historical data from archived data sources that are not in the same format, and do not contain the same business logic, as the regular data sources. In these scenarios, the fact table data requires different ETL logic for historical data than for incremental data. Initial loads can run for long periods of time, and might require additional attention to prevent uncontrolled growth of tempdb and the transaction log, as well as customized logic to prevent duplicates and other referential integrity issues.
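
As an illustration of the batching point above, the following is a minimal T-SQL sketch; the staging table dbo.SalesHistory_Stage, the fact table dbo.FactSales, and the file path are hypothetical, so adjust names, file format, and recovery model to the environment. BULK INSERT with BATCHSIZE commits the load in separate transactions so the transaction log can be truncated between batches (under the SIMPLE or BULK_LOGGED recovery model), and a keyed NOT EXISTS check is one simple way to prevent duplicates when moving staged rows into the fact table.

    BULK INSERT dbo.SalesHistory_Stage
    FROM 'D:\loads\sales_history_2009.dat'      -- hypothetical archived source file
    WITH
    (
        FIELDTERMINATOR = '|',
        ROWTERMINATOR   = '\n',
        BATCHSIZE       = 500000,   -- commit every 500,000 rows to keep log growth bounded
        TABLOCK                     -- table-level lock; required for a minimally logged load
    );

    -- Prevent duplicates: insert only staged rows whose business key is not already in the fact table.
    INSERT INTO dbo.FactSales (SalesKey, OrderDate, Amount)
    SELECT s.SalesKey, s.OrderDate, s.Amount
    FROM dbo.SalesHistory_Stage AS s
    WHERE NOT EXISTS
    (
        SELECT 1
        FROM dbo.FactSales AS f
        WHERE f.SalesKey = s.SalesKey
    );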

Best Practices

The following resources provide additional information. (Note that the full URLs for the hyperlinked text are provided in the Appendix at the end of this document.)

  • Using BULK INSERT to Load a Text File1

  • SQL Server Best Practices Article2

  • Bulk Loading Data into a Table with Concurrent Queries3

Case Studies and References

An example is described in the article, Load 1TB in Less than 1 Hour.4

Questions and Considerations

This section provides questions and issues to consider when working with your customers.

  • Is it possible to handle historical loads with the SSIS packages used for regular (incremental) loads? This is the ideal scenario for handling historical loads, because it doesn't require additional SSIS development and management. It can often be accomplished by adding conditional logic to the SSIS packages and including Foreach Loops (see the article, Foreach Loop Container5) to handle multiple source files (for example, historical files) based on some distinguishing attribute (such as client_id or order_date). A simple T-SQL analogue of this pattern is sketched after this list.

  • Plan the load sequence carefully, because certain tables may need to be loaded first to help verify the loads of other tables. Typically, master data/dimensions need to be loaded before the facts. Identify a strategy for handling late-arriving facts and inferred members (a sketch of inferred-member handling appears after this list).

  • It is usually better to perform initial loads in increments (for example, a month at a time, up to a certain file size, and so on).

  • Establish a detailed load strategy with expected time frames:

    • How much historical data will be loaded?

    • How many historical data sources are there, and is the data consistent across them?

    • Do historical dimension attributes differ?

    • Will RDBMS sources or flat files (or both) be used?

    • What is the data extract strategy from source systems?

    • How will the loads be verified? (A simple reconciliation query is sketched after this list.)

    • What are the clean-up steps?

    • How will fragmentation be minimized?

    • Should only the most recent instance of each dimensional change be included in the dimensions, or should all of the historical changes available in the historical files be loaded?
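
For the first consideration above, the following is a minimal T-SQL analogue of the Foreach Loop pattern; in SSIS the same logic is normally expressed with a Foreach File enumerator driving a variable-based connection. The dbo.HistoricalFiles control table, the staging table, and the file format shown here are hypothetical.

    DECLARE @FileName nvarchar(260),
            @Sql      nvarchar(max);

    DECLARE file_cursor CURSOR LOCAL FAST_FORWARD FOR
        SELECT FileName
        FROM dbo.HistoricalFiles          -- hypothetical control table listing archived source files
        WHERE Loaded = 0
        ORDER BY FileName;

    OPEN file_cursor;
    FETCH NEXT FROM file_cursor INTO @FileName;

    WHILE @@FETCH_STATUS = 0
    BEGIN
        -- Load each historical file into the same staging table used by the regular packages.
        SET @Sql = N'BULK INSERT dbo.SalesHistory_Stage FROM ''' + @FileName + N''' '
                 + N'WITH (FIELDTERMINATOR = ''|'', ROWTERMINATOR = ''\n'', BATCHSIZE = 500000, TABLOCK);';
        EXEC sys.sp_executesql @Sql;

        UPDATE dbo.HistoricalFiles
        SET Loaded = 1
        WHERE FileName = @FileName;

        FETCH NEXT FROM file_cursor INTO @FileName;
    END;

    CLOSE file_cursor;
    DEALLOCATE file_cursor;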
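
For the late-arriving facts and inferred members mentioned above, one common approach is sketched below (the dbo.DimCustomer and dbo.StageFactSales tables are hypothetical): staged facts that reference a customer not yet present in the dimension get a placeholder ("inferred") dimension row, which a later dimension load can update with the full attributes.

    -- Create placeholder dimension rows for business keys that appear in the staged facts
    -- but do not yet exist in the dimension.
    INSERT INTO dbo.DimCustomer (CustomerBusinessKey, CustomerName, IsInferred)
    SELECT DISTINCT s.CustomerBusinessKey, N'Unknown', 1
    FROM dbo.StageFactSales AS s
    WHERE NOT EXISTS
    (
        SELECT 1
        FROM dbo.DimCustomer AS d
        WHERE d.CustomerBusinessKey = s.CustomerBusinessKey
    );

    -- The fact load can now always resolve a surrogate key.
    INSERT INTO dbo.FactSales (CustomerKey, OrderDate, Amount)
    SELECT d.CustomerKey, s.OrderDate, s.Amount
    FROM dbo.StageFactSales AS s
    INNER JOIN dbo.DimCustomer AS d
        ON d.CustomerBusinessKey = s.CustomerBusinessKey;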
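
For load verification, a simple reconciliation query such as the following can compare the staged extract against the loaded fact table month by month (the table and column names are hypothetical); any month with mismatched row counts or totals needs investigation.

    SELECT COALESCE(s.LoadYear, f.LoadYear)   AS LoadYear,
           COALESCE(s.LoadMonth, f.LoadMonth) AS LoadMonth,
           s.SourceRows, f.FactRows,
           s.SourceAmount, f.FactAmount
    FROM
    (
        SELECT YEAR(OrderDate) AS LoadYear, MONTH(OrderDate) AS LoadMonth,
               COUNT(*) AS SourceRows, SUM(Amount) AS SourceAmount
        FROM dbo.StageFactSales
        GROUP BY YEAR(OrderDate), MONTH(OrderDate)
    ) AS s
    FULL OUTER JOIN
    (
        SELECT YEAR(OrderDate) AS LoadYear, MONTH(OrderDate) AS LoadMonth,
               COUNT(*) AS FactRows, SUM(Amount) AS FactAmount
        FROM dbo.FactSales
        GROUP BY YEAR(OrderDate), MONTH(OrderDate)
    ) AS f
        ON f.LoadYear = s.LoadYear AND f.LoadMonth = s.LoadMonth
    WHERE s.SourceRows IS NULL
       OR f.FactRows IS NULL
       OR s.SourceRows <> f.FactRows
       OR s.SourceAmount <> f.FactAmount;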

Appendix

Following are the full URLs for the hyperlinked text:

1 Using BULK INSERT to Load a Text File http://www.sqlteam.com/article/using-bulk-insert-to-load-a-text-file

2 SQL Server Best Practices Article http://technet.microsoft.com/en-us/library/cc966380.aspx

3 Bulk Loading Data into a Table with Concurrent Queries http://sqlcat.com/technicalnotes/archive/2009/04/06/bulk-loading-data-into-a-table-with-concurrent-queries.aspx

4 Load 1TB in Less than 1 Hour http://blogs.msdn.com/b/sqlcat/archive/2006/05/19/602142.aspx

5 Foreach Loop Container https://msdn.microsoft.com/en-us/library/ms141724.aspx