Data Sources (DW)---a Technical Reference Guide for Designing Mission-Critical DW Solutions

Want more guides like this one? Go to Technical Reference Guides for Designing Mission-Critical Solutions.

Data warehouses do not, generally speaking, create their own data. Instead they are always "fed" data from other sources—in many cases, from a myriad of sources. In a perfect world, data could be loaded directly from the sources into the DW, but that is seldom (if ever!) the case. Over 80% of a DW project is dedicated to the Extract, Transformation and Load (or more recently Extract, Load, then Transform), so it is imperative that data sources be carefully understood.

Given the three data loading scenarios (initial, incremental, real-time), data source characteristics can present numerous challenges to the ETL/ELT process.

Best Practices

Following are some best practices:

  • Manage the data source identification process:

    • Identify Subject Matter Experts (SMEs).

    • Identify dimension data sources.

    • Identify fact data sources.

  • When the major data sources have been identified it is time to quickly gain detailed understanding of each one:

    • Obtain existing documentation.

    • Model and define the input.

    • Profile the input.

    • Improve data quality.

    • Save results for further reuse.

Case Studies and References

Data Warehouse Architecture 1 is a nice summary of the various data layers that may comprise a well-built data warehouse.

Questions and Considerations

This section provides questions and issues to consider when working with your customers.

  • What is the quality/cleanliness of the raw data?

  • Are there any data profiling requirements?

  • What are the data sources (flat file, XML, streaming, relational) and how reliable are they in terms of connectivity?

  • Is there any special security requirements needed to access the data source(s)?

  • Are there any "political" considerations (e.g. blockers) for getting access to the raw data sources?

  • What are the metadata requirements (e.g. data lineage, transformation mapping, and so on)?

  • Are there any Master Data considerations?

  • What is the timeliness (aka "latency") of the data (instantaneous, every hour, daily, weekly, and so on)? How does the timeliness of the data compare to when the business needs the information?

  • What is the complexity of data transformation? Is it a simple matter of surrogate key lookups or is there a significant amount of data cleansing and enriching that must occur?

  • What is the proximity of the data feeds to the final data warehouse destination? Consider proximity of initial data versus ongoing incrementally loaded data.

  • What is the disposition of the data once it has been loaded to the data warehouse? Is it simply discarded or must it be stored/archived in load-ready format?


Following are the full URLs for the hyperlinked text.

1 Data Warehouse Architecture