Hub-And-Spoke: Building an EDW with SQL Server and Strategies of Implementation
SQL Server Technical Article
Writers: Mark Theissen, Eric Kraemer
Published: February 2009
Applies to: SQL Server 2008
Summary: This document describes common data warehouse implementation strategies and proposes an effective hub-and-spoke architecture using a massively parallel processing system with multiple instances of SQL Server databases.
To date, the implementation of a true hub-and-spoke architecture for a data warehouse environment has been an idealized and elusive goal. Although building a centralized “hub,” or enterprise data warehouse (EDW) that supports company-wide detail data is achievable, building and maintaining “spokes,” or dependent departmental data marts has proved to be the challenge. Most data warehouse environments have evolved to one of two architectures: a centralized EDW or a series of distributed and/or federated data marts.
In response to a few crucial problems with federated approaches to data warehousing, vendors and pundits alike began to promote a centralized, monolithic EDW approach. This approach has historically helped some vendors differentiate themselves from competitors and justify unreasonably high Total Cost of Ownership (TCO) for their products.
Centralization is seldom scalable or easy to manage. Conflicting end-user and management-use cases, demanding performance requirements, increasing user sophistication, and on-going demands for scalability and flexibility all work to defeat even the most efficiently managed monolithic system. In many cases, these issues cause departments and business units to create their own independent data marts.
In August 2008, Microsoft Corporation acquired data warehouse appliance vendor DATAllegro. This strategic acquisition will enable Microsoft to offer a complete data warehousing solution that addresses the issues of centralized EDWs and distributed independent data marts. Code named Project “Madison”, the first release of a Microsoft appliance-style solution will be available in 2010.
The Madison hub-and-spoke solution, with the power of massively parallel processing (MPP) and SQL Server 2008 at its core, will offer a scalable, low-cost alternative to existing centralized warehouses. This solution combines the ease of data management from a unified enterprise data warehouse (hub) with the flexibility of matching hardware to specific use-case requirements for federated data marts (spokes). In Madison, spokes can be in either MPP or SMP (symmetric multi-processing) configurations. In addition, this architecture maintains the security of centralized data management by leveraging rapid distribution of hub data through the Microsoft data warehouse.
Traditional approaches to data warehousing include:
- A centralized EDW, or “monolithic” approach
- Federation of independent, distributed data marts
- Hub-and-spoke architectures
Each approach has its appeal and advocates, and each one has serious stumbling blocks in practice. The following section focuses on the disadvantages of these approaches.
The Centralized EDW
The centralized EDW arose from the pursuit of a “single version of the truth” and the recognition of the issues with distributed independent data marts. The centralized EDW approach uses a mainframe model with all data and applications residing on a monolithic platform. Although there are some advantages to this approach, there are also some significant disadvantages:
- Complex workloads: Efficiently managing the workloads of ad hoc queries, enterprise reporting, application queries, near real-time loads, batch windows, high concurrency loads, and large aggregations is a difficult challenge.
- Personnel requirements: A centralized approach implies increasing complexity as use cases are added. The combination of scaling workloads and incremental use cases drives the overall system complexity. Complexity demands increasing resources and sophistication to manage effectively. This self-reinforcing cycle generates increasingly non-linear management costs.
- Slow-to-adapt technology: Some parallel solutions rely on older, proprietary architectures that don’t take advantage of the rapid performance advances in commodity server hardware. By themselves, common SMP solutions don’t take full advantage of performance increases (Moore’s Law) due to the inherent bottlenecks of shared resource architectures.
- Expensive maintenance: Factors such as expensive, proprietary “platform lock-in” and limited scalability of SMP based data warehouses make maintenance costs increasingly difficult to manage under a monolithic architecture.
- Expensive upgrades: Upgrades are particularly challenging with a monolithic architecture. Single-system resources mean extended outages and are difficult to schedule and manage. Monolithic architectures scale poorly as capacity stretches into the tens or even hundreds of terabytes, which costs exponentially more, for increasingly less return.
- High total cost of ownership: Poor scalability, expensive maintenance, and limited flexibility create a massive challenge to managing TCO. Proprietary lock-in and limited hardware extensibility limit the ability to take advantage of advances in performance, power, and space management.
- Limited environmental agility: Layering often-causes conflicting use cases on a single system and a single monolithic architecture demands increasingly rigid processes and procedures.
Although a single environment for data warehousing sounds attractive, long-term success is difficult. Businesses often struggle to justify the cost of this approach, hoping for success with a centralized approach, but watching the benefits take a long time to materialize.
Distributed Independent Data Marts
The opposite approach to a monolithic design is the practice of distributing data across an enterprise within departmental data marts. Distributed data marts are created independently from all other data marts and are derived from non-centralized source data processes. Each organization within an enterprise has a unique version of the truth. Such data marts are often a result of individual departments getting frustrated with a centralized EDW, because of the issues listed in the preceding section. The end result is disparate data marts across an organization with highly specialized and unique views of corporate information.
The advantage in this multiple data mart approach is that the data marts are highly relevant to the departments using them and the environments are appropriately sized based on need. Departmental ownership is clear.
The disadvantages of this approach include:
- Data silos: Each department has its own version of a data mart that does not align to those in other departments or business units. Getting cooperation between departments can be difficult and implementing a cross-organizational view of the data silos requires significant time and expense.
- Data duplication: When large volumes of data are duplicated, control and enforcement of standards can become a significant challenge.
- Impact to operational systems: Multiple and duplicate ETL processes compete for small batch windows against production systems.
- Proliferation of technologies: Multiple hardware and software technologies are deployed across departments, eliminating economies of scale at the enterprise level.
- Data management costs: The cost of data management rises unnoticed because the costs are spread across multiple departmental budgets.
Hub-and-spoke architectures, sometimes referred to as “federated” EDWs, match the business structure of most large enterprises by offering a centralized EDW and a set of dependent data marts. The EDW hub allows the enterprise as a whole to set and enforce common standards while enabling analysis and reports that cut across business units. The data mart spokes allow business units to meet their own needs quickly at relatively low cost and also conform to the IT requirements of the overall enterprise. A well-managed hub-and-spoke architecture allows business units to set their own budgets and priorities, while contributing as necessary to the central EDW.
This close fit between the architecture of the business and the architecture of the data warehouse means that hub-and-spoke systems are widely regarded as the best overall approach, in theory.
However, hub-and-spoke systems have been notoriously difficult to implement in the past. Although it’s possible to build a centralized EDW that can handle the enterprise-wide detail data and create the necessary data mart structures, traditional technology has been unable to distribute the required data quickly enough to meet the needs of the business units.
As a result, most efforts to build a hub-and-spoke architecture degenerate into a set of distributed, siloed data marts after being segmented by conflicting business units and requirements. Centralized EDWs often fail because of a lack of buy-in and interest from the business units.
Overcoming the Limitations of Traditional Approaches
Understanding the challenges of implementing hub-and-spoke architectures, DATAllegro developed a parallel data movement technology for data warehousing. Through the acquisition of DATAllegro, Microsoft will provide a comprehensive solution that includes the hardware reference architectures, software, and services to support a true hub-and-spoke architecture. Microsoft’s grid DW technology enables companies to build a scalable, affordable hub-and-spoke environment. See Using SQL Server To Build A Hub-and-Spoke Enterprise Data Warehouse Architecture for an overview discussion of grid-enabled EDW.
Benefits of Microsoft’s technology include:
- The interconnection of multiple Microsoft data warehouse units via a dedicated high-speed network. This enables parallel high-speed data movement at the node level between units, making possible data transfer rates approaching 500GB per minute.
- Simplification of data mart ETL/ELT processes via a data publishing model. Data mart tables can be generated on a hub without impacting data mart users and then be published to one or many spokes. The use of a publishing model for creating data marts not only simplifies ETL/ELT processes, it also reduces the personnel costs associated with maintaining and managing these processes.
- Separation of both management and user workloads. This eliminates traditional conflicts between processes required for IT data management and end-user reporting and analysis.
- The ability to size a database solution (MPP or SMP) to the specific needs of a group of users or processing requirement (i.e., archive, performance, capacity). Microsoft grid-enabled database hub-and-spoke solutions support both MPP and traditional SMP SQL Server 2008 solutions within the grid, enabling true cost and performance scalability.
- The ability to independently expand any database in the EDW based on requirements. This enables ownership and control of spokes to be within the department or business unit.
- The ability to add additional spokes to the EDW at any time without impacting other users.
- Deployment of development and test environments that leverage parallel connectivity for easy production system replication. Time to deploy test data in support of agile prototyping, audit, or data management is not hindered by slow and costly export and load operations.
- Centralized metadata management and provisioning. Madison will include basic grid management and provisioning facilities that are sufficient to support standalone data warehouses. In the future, these capabilities will be broadened to enable administrators to manage a grid of data warehouse units.
Figure 1: Users see department- or subject-area views of enterprise data. IT manages a single version of the enterprise data model on the hub system. High-speed parallel replication enables simple, easy-to-manage data transformations to spoke systems.
Using SQL Server to Replace a Monolithic or Distributed Solution
Many enterprise data warehouse owners find themselves to be stewards of solutions at one or the other extreme end of the spectrum. Whether their warehouses are rigidly monolithic or chaotically distributed, they can be locked into heavy investments and spiraling infrastructure costs. The need for change is triggered by degradation in performance, lack of storage space, changes in business practices, increasingly stringent governance, increases in users, or any combination of these scenarios, all of which underscore the importance of a scalable, agile environment.
Grid-enabling hub-and-spoke technology from Microsoft allows you to choose a flexible, scalable environment that aligns with both management and delivery needs. Microsoft architectural features make this a fast and low risk process.
Companies who improve their EDW environment commonly take one of three approaches:
- Forklift Migration
- Divide and Conquer
- Green Fields
A “forklift migration” is the complete re-hosting of an existing EDW or data mart to Microsoft Madison. The key advantage to a forklift database migration is that an existing data model can be ported to the Microsoft EDW without change. This can be a very compelling advantage, considering the cost of running an EDW re-hosting project.
Common challenges faced in this approach include:
- Data model requirements: Traditional alternatives for very large enterprise data EDW installations are limited and focus on specialized solutions that require specific data models. It is rare for an existing data model to match the idealized model required to achieve optimal performance.
- Mirrored inefficiencies: Fork lifting does not allow the necessary data model changes that take full advantage of a hub-and-spoke architecture because it does not allow correction of architectural flaws or inefficiencies in the originating environment.
- “All or nothing:” A forklift approach can achieve significant time-to-delivery advantage in part because complex decoupling of dependant data sources across subject areas can be completely avoided. Instead of investing time to safely move single-use cases or business subject areas, the entire data model is moved at once. This limits opportunities to demonstrate early successes with smaller deliverables. The project becomes vulnerable to bottlenecks because all process dependencies become potential showstoppers.
Although these challenges are daunting, the Microsoft EDW solution delivers several mitigating advantages:
- Flexible, low-cost prototyping: Samples of complete schema can be copied easily between the central hub repository and the development spokes. This allows project teams to focus testing on the dependencies in dedicated environments, scaled to suit their needs. Potential defects can be identified early in the development process. Benchmarking on dedicated test spokes allows measured adoption performance enhancements and features unique to the new system.
- Bandwidth that makes replication feasible: A forklift migration can make the incorporation of high-value or problematic use cases difficult. Because decoupling of process dependencies rarely occurs during the initial stages, the only alternative is to move significant portions of the existing data model to a spoke. Parallel copying, which is possible in Madison, makes this high volume bulk transfer of both schema and data achievable.
- Data-model agnosticism: Microsoft promotes a best-fit approach to data model and information delivery. No single data model is required to unlock the performance and scalability of the Microsoft EDW solution, providing an environment that is flexible for both system and data architecture.
Divide and Conquer
To divide and conquer by focusing on key subject areas or use cases is a proven method for large scale data warehouse migration. This approach fits well with iterative project and development methodologies, which are also proven to be highly effective for this type of project. An important advantage to this approach is the ability to demonstrate iterative successes to business constituencies.
Common challenges faced in a divide-and-conquer approach include:
- System-project alignment: The hardware platform is often overlooked as an integral part of the overall development methodology. The divide and conquer approach suffers when prototyping must occur in a limited or shared testing environment, because it is difficult to simulate real-world scenarios. A key advantage to this method is the ability to deploy multiple development teams in parallel efforts. Single-resource environments act as bottlenecks in this type of migration.
- Early deployment: Realizing the greatest advantage of a divide-and-conquer strategy means the early deployment of a tightly scoped use case. Once that deployment occurs, development teams face the challenge of working in a production environment. This situation gets worse in a monolithic or limited-resource environment. Migrations often slow down considerably after the first expectation-setting deployment success.
The Microsoft hub-and-spoke solution aligns well with a divide-and-conquer migration strategy. Some key advantages are:
- Separation of development and production: New subject areas can be included in a single environment while production processes run on a separate hub.
- Insulation of users: Spokes can be created at the business unit or subject area level, allowing existing users to be unaffected by the roll-out of new subject areas or business units.
- High-speed data movement: Ease data moves and processes between development, test, and production environments. High-speed data movement also enables spokes to serve as back-ups to each other in the event of a system outage.
- Staged purchases: Additional spokes can be added in synch with development efforts. There is no requirement to purchase all hardware and software at the start of a divide-and-conquer implementation.
- Centralized data model: The hub maintains a centralized data model for all subject areas.
Starting fresh with a new hardware stack, new architecture, new database essentially “green fields” is the riskiest approach to data warehouse migration but the benefits can be tremendous in the right situation. This approach counters all of the disadvantages of a forklift migration. In the case of an environment with deeply entrenched practices and architecture, a green field approach can save time over remodeling and reconfiguring, resolve major architectural flaws of an existing environment, and deliver better long term performance. Migrating to an entirely new architecture can be an effective way to manage the organizational change inherent in any data migration.
Some of the risks for green field migration include:
- Overwhelming scope: Scope must be broken down to iterative projects that can be delivered regularly. Management of scope must be rigorous in order to keep scope creep to a minimum.
- Funding: Large initiatives are difficult to fund and to keep funded over the length of time required by starting fresh. Organizations must have the discipline and financial fortitude to commit to such an effort.
- First deliverables take time: The first iteration of a green field implementation is the longest iteration because a baseline data model must be developed that meets the requirements of multiple user constituencies. Many architectural decisions are made at this time.
Like the previous strategies, the Microsoft hub-and-spoke solution works to mitigate these risks in the following ways:
- Right-size funding: The Microsoft EDW solution provides the flexibility to right-size one or more systems, tailored to a specific EDW use case. This allows capacity to be tailored to the early phases of a green field project and delivers easy scalability throughout the project timeline. This match between smaller scale early deliverables and funding minimizes the large up-front costs often associated with green field migrations.
- Agile prototyping: High speed copy between database systems and scalable deployment from MPP to SMP platforms support agile development efforts. Although complex and extensive development efforts may be required to achieve first-production deliverables, the Microsoft EDW solution provides an environment that makes early prototyping or demonstrations attainable.
It takes significant effort to balance the CPU, memory, networking and storage resources of the system against the loading, processing, reporting, and query processing requirements of users in an EDW environment. Without a carefully balanced system, adding new user groups and new data, extending data retention, and adding new processes or applications impacts the performance of existing system activities. In some cases, there is not enough capacity in the EDW system (capacity and/or CPU cycles) to support the new requirements. As a result, many organizations face the dilemma of decreasing the performance of the EDW system for all users or upgrading the system.
Migrating a portion of the EDW workload to a SQL Server-based MPP data warehouse provides a low-cost alternative. In most Teradata implementations, numerous processing tasks or subsets of users can be easily migrated. Typically, workloads migrated from an EDW to an MPP SQL Server data warehouse can run five to twenty times faster. Workloads and users remaining on the existing EDW also see an increase in performance through the elimination of users and processes competing for the same resources.
Migrating workload from an existing EDW to SQL Server is a straightforward process that requires a minimal amount of change. Data models and tables from Teradata can be migrated with minor changes to the table DDLs. Queries can be migrated with little to no change. ETL and ELT processes can also be migrated with minimal impact. Business Intelligence tool universes and projects can also be redirected to SQL Server with ease. Best of all, Microsoft provides tools, utilities, and services to make migration even easier.
Many organizations are exhausted by inflexible and unresponsive centralized EDWs coupled with independent data marts and inconsistent data, duplicate ETL processes, and no real controls. By resolving these issues using a grid-enabled data warehouse architecture that supports key features, such as the ability to move data at high speeds, data mart publishing, centralized management, support of disaster recovery, and metadata management, Microsoft Madison will be the only provider of a comprehensive data warehouse infrastructure capable of supporting a hub-and-spoke architecture.
The availability of a usable hub-and-spoke EDW platform represents the next advancement in data warehousing. Microsoft customers are already outpacing competitors by applying the technology in diverse and innovative ways. With Project Madison, Microsoft will have a completely integrated end- to- end software stack offering best-of-breed ETL, data warehouse infrastructure and Business Intelligence products, tools, and solutions.
For more information:
Did this paper help you? Please give us your feedback. Tell us on a scale of 1 (poor) to 5 (excellent), how would you rate this paper and why have you given it this rating? For example:
- Are you rating it high due to having good examples, excellent screen shots, clear writing, or another reason?
- Are you rating it low due to poor examples, fuzzy screen shots, or unclear writing?
This feedback will help us improve the quality of white papers we release.