Enterprise Data Warehouse: A Patterns Approach to Data Integration
Published: June 2010
Microsoft Information Technology (MSIT) uses a set of solution patterns which provide a balance between the need to collect and integrate data for its Enterprise Data Warehouse (EDW) and the need to have immediate, tactical solutions. This approach creates a powerful framework for a single commonly supported infrastructure with high data quality and flexibility.
Article, 237 KB, Microsoft Word file
IT Professionals and Business Decision Makers
When large organizations implement an EDW, major challenges include deciding what data to integrate and how to implement the data integration. MSIT found that an evolutionary approach was effective for creating an EDW. This approach helped MSIT identify solution patterns that they can apply when migrating Business Intelligence (BI) data.
Determining the Right Approach
Data integration involves searching through data residing in different sources and providing users with a unified view of that data. The process includes:
- Defining common entities and attributes around subject areas of importance to the business
- Sourcing, mapping, and loading that data into common structures
Typically, an enterprise is comprised of several Lines of Business (LOB), each with overlapping and distinct information needs. Although the various LOBs may agree that data integration is a requirement to achieve a critical overall view of the enterprise, different businesses may have different perspectives on what integration means. For example, each business may have a different definition of "customer" which works well within the context of that business, but is an obstacle to conformity within the EDW. LOBs may not have the resources or motivation to work with IT and other businesses to analyze and agree on common definitions and the integration of data and business rules.
The following commonly used organizational approaches can jeopardize the success of an EDW initiative:
- Top-down. Forcing business groups down the integration path in a top-down approach usually does not work because too much time is spent defining and mapping data for the entire organization. In addition to being resource intensive, top-down efforts usually fail because of an inability to deliver the entire solution in a timely manner.
- Subject area. A subject area approach leaves enormous data gaps in the EDW and leads to low adoption and "shadow" solution development. This occurs because of competing business priorities on subject area implementation.
Likewise, locking a solution into a particular architecture, whether it is a centralized warehouse architecture or a federated data mart architecture, limits flexibility by pushing integration when businesses might not be ready. Locking a solution into an architecture may also require development of redundant processes to build data marts from the same sources. What may be needed instead is an evolutionary approach to BI and the EDW which adapts to the BI maturity levels of the enterprise.
Figure 1 shows the three steps in the BI evolutionary process.
Figure 1. BI evolutionary process
Step 1: Consolidate Data Sources
The first step in this process is critical. It involves acquiring data once in a single store, and then distributing the data in a mart fashion—one mart per LOB or solution. This is a hybrid approach to building the EDW which utilizes both a centralized store and federated marts.
The goal is to collocate the data and processes into a single, consistent infrastructure. This infrastructure consists of:
Shared hardware and storage
A single set of tools and processes for Extract, Transform, and Load (ETL)
A cohesive deployment environment
Although the data may not be integrated, it allows each LOB access to a far greater range of data in a more timely fashion. The problem focus shifts from how to get the needed data, to what the data means, why there are so many versions of the data, and how to link the data together in new and insightful ways.
The following are characteristics of this stage of the evolutionary process:
LOB project dependencies are limited to data source acquisition priorities.
Due to the variability of business rules, there is little alignment of data.
The business case is platform optimization.
There is little need for data governance.
Step 2: Start to Apply Standard Rules
In the second step of the process, integration and governance discussions occur, resulting in the agreement and development of common, integrated data and business rules. This begins the migration from disparate LOB rules embedded in each data mart solution, to common rules for loading the central data store, which reduces data and rule redundancy and conflict.
At this stage, there is variability of measurements, but there is also increasing alignment of integration rules. There is a platform for "truth" from the point of view of data governance.
Step 3: Align Insight and Action by Conforming the Data Marts
In the third step of the process, the data marts are conformed, either in a federated or a consolidated fashion. Since data disparity and semantics are no longer reasons to isolate the data marts, physical segmentation of the data is used for other compelling reasons, such as performance, latency, security, privacy, or unique feature sets dependent on structure and format. Effort is made to drive alignment of performance measurements, scorecards, analytics, and decision support. The executive scorecard records these statistics.
Utilizing the phases of the BI evolutionary process as a backdrop, MSIT defined a set of three solution patterns which provide a balance between the need to collect and integrate data for the EDW and the need for the LOBs to have immediate, tactical solutions. A pattern offers well-established solutions to software engineering problems by capturing essential elements of an architecture and depicting those elements in a way that enables categorization of the components. MSIT uses the three patterns, which align to the phases of the BI evolutionary process (with Pattern 3 being the first phase), as a framework to classify how and where a solution fits within the architecture based on the BI maturity level of the business. These patterns are predicated on an overall EDW architecture which utilizes a common infrastructure, an integrated central data store, and a business-driven distribution (data mart) model.
Pattern 3—Point Solutions. These are stand-alone, point solutions, either inherited as legacies or developed for business areas in which there may be limited or no strategic value for integrating within the EDW. They are characterized as uncommon data models and data sources supported by IT, and may utilize EDW infrastructure such as hardware, storage, and process (ETL engine). Pattern 3 provides the bridge between a legitimate independent data warehouse and the EDW.
Pattern 2—On Infrastructure. These solutions benefit from fully utilizing the EDW infrastructure, common data sources, and the integrated data within the central data store. A key assertion for Pattern 2 solutions is that they may utilize data directly from the staging area (acquisition) or the central data store. This is an intermediate benefit in that it allows for the quick acquisition and use of data without waiting for the eventual integration or collocation of that data in the central data store. In addition, it provides the ability to define and implement LOB-specific data models. Although such collocated data may be redundant and in conflict with the enterprise-level model, it provides enormous advantages in flexibility to the business and in acquiring and reusing data that other solutions and businesses may find useful. The sourcing and common storage of data from across the enterprise is a critical first step in ensuring that LOBs get the data that they need, thus providing tactical advantage while providing better understanding of existing data assets, including associated problems with sources, definition, gaps, and data management quality and control.
Pattern 1—On Strategy. These are fully integrated solutions which utilize a common, subject-oriented data model and conformed domains and dimensions. The data is tightly governed and adheres to the data management policies of the enterprise.
Figure 2 describes the three patterns and shows how they work together.
Figure 2. Solution patterns
The boundaries between the patterns are not discrete. By design, a solution can have components that fit in one, two, or all three patterns. This creates a powerful framework that:
Ensures that the data is sourced (acquired) once
Ensures that the data is available from a common store, either integrated or collocated
Provides insight into problems with data quality, redundancy, and gaps
Facilitates analysis and profiling of data for later integration
Enables integration evolution over time when the business is ready
Provides a single, commonly supported infrastructure
This architecture is not an either or proposition on the two classic schools of thought for warehousing (centralized or federated); it uses the best of both. The architecture provides the flexibility for integration when it is needed or warranted but it also accommodates LOB requirements in a timely manner by reducing the cross-dependencies associated with reconciling business needs and requirements into a single application. The architecture provides the added benefit of supporting self-service BI by allowing the publishing of centralized data for use in self-service BI applications, such as a sandbox environment.
MSIT utilized an evolutionary approach to create an EDW and identified three patterns that create a framework for migrating to an integrated data environment for BI. Given the challenges associated with traditional data warehouse implementations, patterns provide a measurable and systematic approach to migrating BI solutions to an enterprise-class warehouse while at the same time meeting the needs for immediate tactical solutions. These advantages could be the difference between successfully implementing an EDW over time and having a project failure statistic.
For More Information
For more information about Microsoft products or services, call the Microsoft Sales Information Center at (800) 426-9400. In Canada, call the Microsoft Canada Order Centre at (800) 933-4750. Outside the 50 United States and Canada, please contact your local Microsoft subsidiary. To access information via the World Wide Web, go to:
© 2010 Microsoft Corporation. All rights reserved.
This document is for informational purposes only. MICROSOFT MAKES NO WARRANTIES, EXPRESS OR IMPLIED, IN THIS SUMMARY. Microsoft, Windows, and Windows Server are either registered trademarks or trademarks of Microsoft Corporation in the United States and/or other countries. The names of actual companies and products mentioned herein may be the trademarks of their respective owners.