Using SQL Server to Build a Hub-and-Spoke Enterprise Data Warehouse Architecture
SQL Server Technical Article
Writer: Stuart Frost, Group Manager, Data Warehousing Product Unit
Technical Reviewer: Eric Kraemer, Consulting Manager, and Mark Theissen, Data Warehouse Product Unit
Published: February 2009
Applies to: SQL Server 2008 and later
Summary: Scalable, cost-effective EDW implementation is an elusive goal for many large organizations. Three common approaches include centralized EDW the “monolithic” approach, decentralized collections of data marts, and attempts at hub-and-spoke architectures that combine the previous two. With the acquisition of DATAllegro and the pending release of Project codename “Madison,” Microsoft is poised to deliver a unique and compelling Massively Parallel Processing (MPP) data warehouse solution that combines the best features of both EDW and decentralized data marts. Madison data warehouse architecture will include high speed, parallel transfers between other Madison or SQL Server 2008 databases. This provides the basis for solving one of the most intractable problems in large-scale data warehousing.
While many organizations have obtained significant value from their data warehouse (DW) installations, very few have been entirely successful at implementing full-scale, cost-effective enterprise data warehouses (EDW). An often overlooked reason that EDW efforts fail is database technology itself. Data warehouse installations at large companies generally fall into two broad architectural categories: centralized EDW, or the “monolithic approach,” and decentralized collections of data marts.
Data management (quality and governance) is a rapidly growing concern of IT management. To that end, data model and platform centralization (the monolithic approach) has become more common in large EDW installations. However, centralized EDW tends to be extremely expensive and very inflexible. As a result, business units become frustrated at the inability of the EDW to meet their needs at a sensible cost and within a reasonable timeframe.
A divide-and-conquer strategy is a natural and effective approach to addressing large-scale problems such as the creation of an integrated, enterprise-wide data warehouse. Decentralized EDW architectures align well with this approach and fit the many compartmentalized demands presented by large organizations. Decentralized data marts are more responsive to business unit needs, but often result in many versions of the same data that are very difficult to keep consistent across the enterprise.
Each approach tends to evolve (or degenerate) into the other, but neither is a tenable long-term solution for most large organizations.
Microsoft’s EDW Platform: Enabling a Hub-and-Spoke Architecture
Microsoft recently acquired DATAllegro, a DW appliance company. DATAllegro’s DW appliances are massively parallel processing (MPP) systems that consist of multiple, shared-nothing nodes. A DATAllegro appliance enables a true “hub-and-spoke” architecture, where the centrally managed “hub” contains detailed enterprise data, and departments or business units use “spokes” to exchange data with the hub according to their unique schemas. This architecture can exist across many connected databases.
Project code name "Madison" is the upcoming Microsoft scale-out solution for very large data warehouses (VLDW). Madison is based on the MPP technology developed by DATAllegro and the proven SQL Server 2008 database. Madison expands the DATAllegro hub-and-spoke solution to include not only MPP appliances but also standard symmetric multi-processing (SMP) instances of SQL Server 2008 and SQL Server Analysis Services, (SSAS), allowing either to be viewed as nodes within a grid.
The result is a highly flexible, affordable, and scalable platform that, for the first time, makes large-scale hub-and-spoke EDW architectures a practical reality. They combine the benefits of central control and governance with the agility of decentralized data marts, but without the inherent delivery pains, administrative headaches, risks, and costs associated with previous strategies.
Hub-and-spoke architectures match the structure of most large enterprises by offering a combination of a centralized EDW and a set of dependent data marts. The EDW hub allows the entire enterprise to set and enforce common standards while answering questions that cut across business units. The data mart spokes allow business units to meet their own needs quickly and at relatively low cost while still conforming to the needs of the overall enterprise. There is also an obvious fit with the realities of large enterprise budgetary and political environments. A well-managed hub-and-spoke architecture allows business units to set their own budgets and priorities, while contributing as necessary to the central EDW.
This close fit between the architecture of the business and the architecture of the DW platform means hub-and-spoke systems are widely regarded as the best overall approach. In practice, hub-and-spoke systems have been notoriously difficult to implement. Common challenges with hub-and-spoke approaches include:
- Network Resources. Distributing data from a centralized EDW reliably and quickly enough to meet the needs of the business units is a big challenge in the face of growing data volumes. Traditional database systems rely on existing customer network infrastructure that is often overwhelmed. To try to compensate for this, IT organizations develop complex change data capture ETL processes to update data marts, resulting in high maintenance costs and an inability to change with the business.
- Database Functionality. Although it is possible to build a centralized EDW that can handle the enterprise-wide detail data and create the necessary data mart structures, traditional database platforms have been unable to transfer data, at very high bandwidth, between “hubs” and “spokes.” This also leads to expensive and cumbersome ETL operations to facilitate data distribution.
- Loosely Coupled Federation. Data transfers between the centralized hub and satellite spokes must be done outside of the database itself. These ETL processes must be managed independently and add layers of complexity to data governance efforts. In addition, timelines to deliver new information to spoke systems must include both database and ETL development.
In general, efforts to build a hub-and-spoke architecture have quickly degenerated into a set of siloed data marts after being torn apart by conflicting business units and requirements. Any attempts to maintain a central EDW often fail because of a lack of buy-in and interest from the business units.
One response to the difficulties of building a hub-and-spoke architecture has been to simply centralize everything onto one monolithic EDW. One well-known vendor is a key proponent of this approach, since it helps to justify the very high price of their platforms and plays to their ability to scale and handle a mixed workload. Key limitations of a centralized EDW approach include:
- Lack of Flexibility. A centralized EDW platform quickly becomes overloaded with conflicting use cases. Solving any one problem requires evaluation of all existing dependencies, which drives rigid change control processes. This ultimately impacts time-to-delivery for new projects.
- Complexity Spiral. The more successful the EDW team is at delivery, the more complex workloads become within the same environment. As the monolithic environment grows more complex, it becomes more inflexible and expensive to maintain.
In the ultimate expression of this approach, an EDW schema is combined with a set of logical data mart schemas and the necessary ETL/ELT processes (often involving near real-time loads) on a single platform. In some cases, the same platform even handles operational data store (ODS) workloads.
The result is a highly complex workload consisting of the following, often conflicting, use cases:
- Large batch data loads
- Large batch updates
- Frequent smaller batch loads
- Frequent row-by-row inserts and updates
- Large aggregation queries to create data marts
- Complex ad hoc queries
- Large-scale business intelligence implementations, often involving a number of different tools
- A large number of smaller, simple queries, often involving single-row lookups
- Large data extracts for external tools such as SAS and disparate data marts
While a few very large enterprises have been successful in getting acceptable results, most are being overwhelmed by the complexity and cost of scaling and maintaining these massive proprietary systems. In addition, business units are becoming frustrated by the inability of IT to quickly meet new requirements via the central EDW and have started to build their own independent data marts as a result.
With a decentralized approach (which the other two tend to degenerate into anyway), business units simply build their own independent data marts. Although such an approach is obviously responsive to business needs, it doesn’t allow management to answer cross-enterprise questions easily or quickly.
Even worse, it is very difficult to apply any real measure of enterprise-wide standards and controls. In the age of SOX, GLB and PCI, independent data marts are very risky to the enterprise.
Additional risks of a decentralized approach to EDW include:
- Data Management. Keeping all copies of data across a decentralized infrastructure current and accurate can become overwhelming. This problem becomes worse as relatively low bandwidth data movement options drive complex data transformation and audit processes.
- Fractured Control. Over time it becomes increasingly difficult to maintain a cohesive perspective of the enterprise data model as stakeholders at the data mart level drive conflicting requirements.
- Limited Resources. Data federation requires costly database export and load operations over limited and expensive corporate network resources. This is worsened by existing database technologies that scale poorly, in both cost and resources, at enterprise scale.
Microsoft’s Next Generation EDW Platform
As mentioned in the introduction, Microsoft recently acquired DATAllegro. DATAllegro’s MPP data warehouse appliances are often used by customers as data warehousing ‘black boxes’, with data access being from a single point (the control node). However, the nodes within a DATAllegro appliance are in fact self-contained database servers running their own database and operating system. Therefore, a DATAllegro appliance can be viewed as a highly-specialized grid of servers being pulled together to collectively form a DW appliance.
Taking this view, it is a small step to think of a connected set of DATAllegro systems as both a grid of appliances and a grid of nodes. Moving data across this grid of appliances is incredibly efficient, since data can be moved directly from node to node within the grid. This maximizes parallelism across the environment and minimizes the conversion overhead associated with export and load operations. Such a grid of appliances can be used to implement a data warehousing hub-and-spoke architecture.
With the upcoming release of Madison, MPP scalability and grid connectivity can be taken to a new level. Madison expands the DATAllegro hub-and-spoke solution to include not only MPP appliances but also standard SMP instances of SQL Server 2008 and SSAS to be viewed as nodes within a grid. A grid of SMP databases and MPP appliances can be used as the basis for any large-scale data warehouse environment or architecture. However, it is particularly suitable for a hub-and-spoke architecture.
Madison appliances are built on a platform of standard Microsoft Windows 2008 Enterprise servers connected via a high-speed network, with each running the SQL Server 2008 database in a very pure shared-nothing architecture. One way of looking at this is as a specialized grid technology, focused on creating a stand-alone data warehouse appliance.
Figure 1. A Madison appliance consists of a set of independent database servers connected via a high-speed network and can therefore be viewed as a specialized use of grid technology.
Given the independence of the various servers in our appliances, it is a relatively easy task to extend this technology to create a grid of independent appliances that can work together, each responding to queries from different groups of users.
Figure 2. The grid concept can be easily extended to allow parallel data transfers between nodes in separate appliances, thereby creating a grid of appliances.
MPP for Hub-and-Spoke
The Madison MPP product with high speed parallel database copy is fundamental to solving one of the most intractable problems in large-scale data warehousing: building an effective, scalable, and affordable hub-and-spoke solution. The basic idea is to take a "divide-and-conquer" approach to building an EDW:
- Avoid performance problems due to conflicts between queries from different business units or IT management functions at peak times.
- Provide dedicated, high-speed, network interconnecting all hub and spoke databases.
- Support a centralized data model for metadata management and data governance without the limitations of single-platform centralization.
- Allow business units to scale their dedicated data mart platforms separately from the hub system.
- Allow central IT to scale the hub appliance to meet overall enterprise requirements without having to scale every dependent data mart appliance at the same time.
- Physically separate the processes associated with data management and development from those associated with the consumption of information by end users.
Figure 3. Users view the appliances as a set of separate data marts, but can drill into detailed data on the hub where required.
The Madison Hub-and-Spoke Solution
Imagine a fairly large MPP appliance acting as the hub for a set of MPP appliance and SMP database data marts. The hub holds detailed data, probably in a normalized schema, for a number of business units or the entire enterprise. The hub is loaded in near real time or in daily batches from source systems leveraging a preferred ETL solution. Data is then transformed or restructured to a denormalized structure (e.g., star, cube, etc.), as needed, and transferred to the appropriate data mart(s) via the high speed grid for consumption by end users. If a data mart requires data from sources that are not covered by the hub, this data is loaded independently using standard ETL tools. However, most of the data required (both fact and dimensions) comes from the hub.
Users connect to the independent data mart appliances as usual for running queries. This allows each data mart to be tuned for the needs of a particular set of users and sized to handle the required level of performance and concurrency. While the data marts can be independently designed to meet the needs of each business, it will be possible to leverage existing data mart applications such as Microsoft Analysis Services, Reporting Services, Excel, or other BI vendor products.
Bandwidth within the grid is large enough to enable the direct copy of detailed fact data or entire data marts. This can greatly simplify the data mart creation and update process by using a publish model as opposed to complex transformation logic that, coupled with expensive export and load scenarios, creates significant challenges for traditional federated approaches.
The end result is an EDW platform that can handle a very complex workload while being extremely scalable at a sensible cost. The ability to size and scale each component individually also fits well with budget cycles and structures.
Disaster Recovery and High Availability
The Microsoft EDW platform provides the capability to set alternate database systems within the dedicated high speed network as failover targets. As an example, a user attempting to connect to a spoke that is currently unavailable would automatically be redirected to an alternate spoke specified within the standard connection protocol. This simple approach becomes very powerful when combined with the hub-and-spoke architecture. The high-speed and bandwidth of the grid copy facility allows full copies of end-user data marts to be moved to multiple spokes. This effectively recreates the end-user view of the data on multiple spoke systems, each a valid failover option for the other in an outage scenario.
This concept can also be leveraged across multiple data centers to provide effective disaster recovery architecture. Individual appliances can be replicated on a second site and automatically kept up-to-date through ETL replication. Note that not all of the appliances on a grid would need to be replicated. In most scenarios only the hubs need to be replicated, as spokes can be recreated from the hubs. This provides the flexibility for each business unit to decide whether or not to provide a disaster recovery capability, based on their own service-level agreements (SLAs).
Microsoft’s Grid-Enablement Strategy
The version 1 release of Madison is expected to contain all of the grid functionality currently available in the DATAllegro V3 product. Also planned for the release is support of SMP SQL Server 2008 instances as spokes. Different reference architectures for both SMP and MPP platforms will be released with Madison. These reference architectures are designed to support various mixes of performance and capacity that can be used to create a solution that meets the specific needs of a customer.
Madison data warehouse architecture will include high speed, parallel transfers between other Madison or SQL Server 2008 databases. This provides the basis for solving one of the most intractable problems in large-scale data warehousing. Enabling large enterprises to build an effective and highly scalable hub-and-spoke EDW at sensible cost allows the data warehouse to better meet today’s business needs and realities.
This new approach offers customers an attractive alternative to centralized, monolithic approaches. Data marts can be tailored to meet the individual needs of business units (both in terms of capacity and performance). Furthermore, customers can buy into the Microsoft EDW approach with the deployment of a few stand-alone data marts on standard SQL Server DW reference architectures. From this relatively low-cost start point, Microsoft provides EDW solutions that scale into the hundreds of terabytes while delivering manageable flexibility without sacrificing cost and performance.
For more information:
Did this paper help you? Please give us your feedback. Tell us on a scale of 1 (poor) to 5 (excellent), how would you rate this paper and why have you given it this rating? For example:
- Are you rating it high due to having good examples, excellent screen shots, clear writing, or another reason?
- Are you rating it low due to poor examples, fuzzy screen shots, or unclear writing?
This feedback will help us improve the quality of white papers we release.