Real-Time Data Loading (DW)---a Technical Reference Guide for Designing Mission-Critical DW Solutions

Want more guides like this one? Go to Technical Reference Guides for Designing Mission-Critical Solutions.

Data warehouse industry pundits have long argued that "real-time data loading cannot happen in data warehousing." With ever-growing business pressure from TELCO’s, financial institutions, operations management, and so on, we have to provide access to real-time data in the data warehouse. Depending on the hardware platform, the strategy used to capture/load the data can be radically different than the "conventional" method (e.g. transformations, dimensional lookups to load facts, and so on). There are many products on the market that offer a complex (and expensive) technology "stack" to capture data and load in real-time.

Best Practices

The following resources provide reference material and additional information.

  • The "near real time" latency is a very important factor. You can generally get to around one minute with Integration Services—after that you have to consider other load mechanisms (.NET applications and/or StreamInsight).

  • Real time need NOT be a push model. A "busy loading" model may be used instead. It can bring you down into the minute range if you just loop and check for incoming data.

  • Must use technologies such as RCSI for SQL Server and ROLAP for analysis services. Design of database and cube becomes less tolerant of errors when going real time resulting in increased risk of failure. One must cherry-pick best implementers and partners as there is no room for "learning on the job" when building the team.

  • Real-time data warehouse loading methodology 1

  • Considerations for Building a Real-time Data Warehouse 2

  • SQL Server 2008 R2 – StreamInsight 3

  • Creating a Real Time Data Warehouse 4

Case Studies and References

Examples of successful architectures are described in the following case studies and white papers:

Questions and Considerations

This section provides questions and issues to consider when working with your customers.

  • Consider positioning of StreamInsight as the primary platform to capture/load data realtime.

  • Identify and understand requirements for real-time data loads versus near real-time: Near real-time.7

  • If near real-time is acceptable, consider Change Data Capture (detailed in Incremental Data Loads).

  • Based on accepted latency, consider loading the Output of a Package into Another Program. This is a SSIS feature where a client application can also read the output of a package directly from memory, without the need for an intermediate step that persists the data. See Loading the Output of a Package into Another Program.8

  • Understand the differences between a "push" and "pull" model. Real-time data is normally a "push" model.

  • Understand customer’s position on Read-Committed versus Read uncommitted ("dirty reads").

  • Data capture/loading is often "event-driven." While there may be a steady stream of real-time date (e.g. web logs, trading feeds, operations monitoring) only certain events or combination of events (Complex Event Processing or "CEP") need be captured.

  • Determine how to handle "catch-up" files or late-arriving data. This is sometimes referred to "out-of-sequence" processing and usually requires a table design (e.g. seq_id) to help you manage the sequence of data.

  • Understand the impact on real-time data loads:

    • Impact on query performance

    • Up-stream system requirements (data latency, transformation, and so on)

    • Can data be loaded directly from sources (e.g. CDR data) or must it be heavily transformed (e.g. Insurance/Financial data)?

    • Down-stream system requirements (e.g. for SSAS or SSRS cached reports, aggregations, enrichment, and so on)

  • Understand Microsoft Integration Technologies for establishing most appropriate strategy for real-time data integrations (see Understanding Microsoft Integration Technologies9).

  • Consider integration with Biz Talk for Real-Time Data Loads. See the Microsoft BizTalk Server site10 and Working With BizTalk Adapter for SQL Server.11

Appendix

Following are the full URLs for the hyperlinked text:

1 Real-time data warehouse loading methodology http://portal.acm.org/citation.cfm?id=1451949

2 Considerations for Building a Real-time Data Warehouse http://www.grcdi.nl/considerations.pdf

3 SQL Server 2008 R2 – StreamInsight https://www.microsoft.com/sqlserver/2008/en/us/r2-complex-event.aspx

4 Creating a Real Time Data Warehouse http://www.andrewscg.com/pdfs/Creating_RealTime_DW.pdf

5 Speeding Up SSIS Bulk Inserts into SQL Server http://henkvandervalk.com/speeding-up-ssis-bulk-inserts-into-sql-server

6 StreamInsight Case Study https://blogs.msdn.com/b/streaminsight/archive/2010/10/11/streaminsight-case-study.aspx

7 Near real-time http://en.wikipedia.org/wiki/Near_real-time

8 Loading the Output of a Package into Another Program https://msdn.microsoft.com/en-us/library/ms135917.aspx

9 Understanding Microsoft Integration Technologies https://technet.microsoft.com/en-us/library/dd879265(BTS.10).aspx

10 Microsoft BizTalk Server site https://www.microsoft.com/biztalk/en/us/default.aspx

11 Working With BizTalk Adapter for SQL Server https://msdn.microsoft.com/en-us/library/ms935658(BTS.10).aspx