.gif)
SQL Server
Technical Article
Writers: Mark Theissen, Eric Kraemer
Published: February 2009
Applies to: SQL Server 2008
Summary: This document describes common data
warehouse implementation strategies and proposes an effective hub-and-spoke
architecture using a massively parallel processing system with multiple
instances of SQL Server databases.
Introduction
To date, the
implementation of a true hub-and-spoke architecture for a data warehouse
environment has been an idealized and elusive goal. Although building a
centralized “hub,” or enterprise data warehouse (EDW) that supports
company-wide detail data is achievable, building and maintaining “spokes,” or
dependent departmental data marts has proved to be the challenge. Most data
warehouse environments have evolved to one of two architectures: a centralized
EDW or a series of distributed and/or federated data marts.
In response
to a few crucial problems with federated approaches to data warehousing,
vendors and pundits alike began to promote a centralized, monolithic EDW
approach. This approach has historically helped some vendors differentiate
themselves from competitors and justify unreasonably high Total Cost of
Ownership (TCO) for their products.
Centralization
is seldom scalable or easy to manage. Conflicting end-user and management-use
cases, demanding performance requirements, increasing user sophistication, and
on-going demands for scalability and flexibility all work to defeat even the
most efficiently managed monolithic system. In many cases, these issues cause
departments and business units to create their own independent data marts.
In August
2008, Microsoft Corporation acquired data warehouse appliance vendor
DATAllegro. This strategic acquisition will enable Microsoft to offer a
complete data warehousing solution that addresses the issues of centralized
EDWs and distributed independent data marts. Code named Project “Madison”, the
first release of a Microsoft appliance-style solution will be available in
2010.
The Madison
hub-and-spoke solution, with the power of massively parallel processing (MPP)
and SQL Server 2008 at its core, will offer a scalable, low-cost alternative to
existing centralized warehouses. This solution combines the ease of data
management from a unified enterprise data warehouse (hub) with the flexibility
of matching hardware to specific use-case requirements for federated data marts
(spokes). In Madison, spokes can be in either MPP or SMP (symmetric
multi-processing) configurations. In addition, this architecture maintains the
security of centralized data management by leveraging rapid distribution of hub
data through the Microsoft data warehouse.
Traditional Approaches
Traditional
approaches to data warehousing include:
- A
centralized EDW, or “monolithic” approach
- Federation
of independent, distributed data marts
- Hub-and-spoke
architectures
Each approach
has its appeal and advocates, and each one has serious stumbling blocks in
practice. The following section focuses on the disadvantages of these
approaches.
The Centralized EDW
The
centralized EDW arose from the pursuit of a “single version of the truth” and
the recognition of the issues with distributed independent data marts. The
centralized EDW approach uses a mainframe model with all data and applications
residing on a monolithic platform. Although there are some advantages to this
approach, there are also some significant disadvantages:
- Complex
workloads: Efficiently managing the workloads of ad hoc queries, enterprise
reporting, application queries, near real-time loads, batch windows, high
concurrency loads, and large aggregations is a difficult challenge.
- Personnel
requirements: A centralized approach implies increasing complexity as use cases
are added. The combination of scaling workloads and incremental use cases
drives the overall system complexity. Complexity demands increasing resources
and sophistication to manage effectively. This self-reinforcing cycle generates
increasingly non-linear management costs.
- Slow-to-adapt
technology: Some parallel solutions rely on older, proprietary architectures
that don’t take advantage of the rapid performance advances in commodity server
hardware. By themselves, common SMP solutions don’t take full advantage of
performance increases (Moore’s Law) due to the inherent bottlenecks of shared
resource architectures.
- Expensive
maintenance: Factors such as expensive, proprietary “platform lock-in” and
limited scalability of SMP based data warehouses make maintenance costs
increasingly difficult to manage under a monolithic architecture.
- Expensive
upgrades: Upgrades are particularly challenging with a monolithic architecture.
Single-system resources mean extended outages and are difficult to schedule and
manage. Monolithic architectures scale poorly as capacity stretches into the
tens or even hundreds of terabytes, which costs exponentially more, for
increasingly less return.
- High
total cost of ownership: Poor scalability, expensive maintenance, and limited
flexibility create a massive challenge to managing TCO. Proprietary lock-in and
limited hardware extensibility limit the ability to take advantage of advances
in performance, power, and space management.
- Limited
environmental agility: Layering often-causes conflicting use cases on a single
system and a single monolithic architecture demands increasingly rigid
processes and procedures.
Although a
single environment for data warehousing sounds attractive, long-term success is
difficult. Businesses often struggle to justify the cost of this approach,
hoping for success with a centralized approach, but watching the benefits take
a long time to materialize.
Distributed Independent Data
Marts
The opposite
approach to a monolithic design is the practice of distributing data across an
enterprise within departmental data marts. Distributed data marts are created
independently from all other data marts and are derived from non-centralized
source data processes. Each organization within an enterprise has a unique
version of the truth. Such data marts are often a result of individual
departments getting frustrated with a centralized EDW, because of the issues
listed in the preceding section. The end result is disparate data marts across
an organization with highly specialized and unique views of corporate
information.
The advantage
in this multiple data mart approach is that the data marts are highly relevant
to the departments using them and the environments are appropriately sized
based on need. Departmental ownership is clear.
The
disadvantages of this approach include:
- Data
silos: Each department has its own version of a data mart that does not align
to those in other departments or business units. Getting cooperation between
departments can be difficult and implementing a cross-organizational view of
the data silos requires significant time and expense.
- Data
duplication: When large volumes of data are duplicated, control and enforcement
of standards can become a significant challenge.
- Impact
to operational systems: Multiple and duplicate ETL processes compete for small
batch windows against production systems.
- Proliferation
of technologies: Multiple hardware and software technologies are deployed
across departments, eliminating economies of scale at the enterprise level.
- Data
management costs: The cost of data management rises unnoticed because the costs
are spread across multiple departmental budgets.
Hub-and-Spoke Architecture
Hub-and-spoke
architectures, sometimes referred to as “federated” EDWs, match the business
structure of most large enterprises by offering a centralized EDW and a set of
dependent data marts. The EDW hub allows the enterprise as a whole
to set and enforce common standards while enabling analysis and reports
that cut across business units. The data mart spokes allow business units to
meet their own needs quickly at relatively low cost and also conform to the IT
requirements of the overall enterprise. A well-managed hub-and-spoke
architecture allows business units to set their own budgets and priorities,
while contributing as necessary to the central EDW.
This close
fit between the architecture of the business and the architecture of the data
warehouse means that hub-and-spoke systems are widely regarded as the best
overall approach, in theory.
However,
hub-and-spoke systems have been notoriously difficult to implement in the past.
Although it’s possible to build a centralized EDW that can handle the
enterprise-wide detail data and create the necessary data mart structures,
traditional technology has been unable to distribute the required data quickly
enough to meet the needs of the business units.
As a result,
most efforts to build a hub-and-spoke architecture degenerate into a set of
distributed, siloed data marts after being segmented by conflicting business
units and requirements. Centralized EDWs often fail because of a lack of buy-in
and interest from the business units.
Overcoming the Limitations
of Traditional Approaches
Understanding
the challenges of implementing hub-and-spoke architectures, DATAllegro
developed a parallel data movement technology for data warehousing. Through the
acquisition of DATAllegro, Microsoft will provide a comprehensive solution that
includes the hardware reference architectures, software, and services to
support a true hub-and-spoke architecture. Microsoft’s grid DW technology
enables companies to build a scalable, affordable hub-and-spoke environment.
See Using SQL Server To Build A
Hub-and-Spoke Enterprise Data Warehouse Architecture for an overview discussion of
grid-enabled EDW.
Benefits of
Microsoft’s technology include:
- The
interconnection of multiple Microsoft data warehouse units via a dedicated high-speed
network. This enables parallel high-speed data movement at the node level
between units, making possible data transfer rates approaching 500GB per
minute.
- Simplification
of data mart ETL/ELT processes via a data publishing model. Data mart tables can
be generated on a hub without impacting data mart users and then be published
to one or many spokes. The use of a publishing model for creating data marts
not only simplifies ETL/ELT processes, it also reduces the personnel costs
associated with maintaining and managing these processes.
- Separation
of both management and user workloads. This eliminates traditional conflicts
between processes required for IT data management and end-user reporting and
analysis.
- The
ability to size a database solution (MPP or SMP) to the specific needs of a
group of users or processing requirement (i.e., archive, performance,
capacity). Microsoft grid-enabled database hub-and-spoke solutions support both
MPP and traditional SMP SQL Server 2008 solutions within the grid, enabling
true cost and performance scalability.
- The
ability to independently expand any database in the EDW based on requirements.
This enables ownership and control of spokes to be within the department or
business unit.
- The
ability to add additional spokes to the EDW at any time without impacting other
users.
- Deployment
of development and test environments that leverage parallel connectivity for
easy production system replication. Time to deploy test data in support of
agile prototyping, audit, or data management is not hindered by slow and costly
export and load operations.
- Centralized
metadata management and provisioning. Madison will include basic grid
management and provisioning facilities that are sufficient to support
standalone data warehouses. In the future, these capabilities will be broadened
to enable administrators to manage a grid of data warehouse units.
.gif)
Figure 1: Users see department- or subject-area
views of enterprise data. IT manages a single version of the enterprise data
model on the hub system. High-speed parallel replication enables simple,
easy-to-manage data transformations to spoke systems.
Using SQL Server to Replace
a Monolithic or Distributed Solution
Many
enterprise data warehouse owners find themselves to be stewards of solutions at
one or the other extreme end of the spectrum. Whether their warehouses are
rigidly monolithic or chaotically distributed, they can be locked into heavy
investments and spiraling infrastructure costs. The need for change is
triggered by degradation in performance, lack of storage space, changes in
business practices, increasingly stringent governance, increases in users, or
any combination of these scenarios, all of which underscore the importance of a
scalable, agile environment.
Grid-enabling
hub-and-spoke technology from Microsoft allows you to choose a flexible,
scalable environment that aligns with both management and delivery needs.
Microsoft architectural features make this a fast and low risk process.
Companies who
improve their EDW environment commonly take one of three approaches:
- Forklift
Migration
- Divide
and Conquer
- Green
Fields
Forklift Migration
A “forklift
migration” is the complete re-hosting of an existing EDW or data mart to
Microsoft Madison. The key advantage to a forklift database migration is that
an existing data model can be ported to the Microsoft EDW without change. This
can be a very compelling advantage, considering the cost of running an EDW
re-hosting project.
Common
challenges faced in this approach include:
- Data
model requirements: Traditional alternatives for very large enterprise data EDW
installations are limited and focus on specialized solutions that require
specific data models. It is rare for an existing data model to match the
idealized model required to achieve optimal performance.
- Mirrored
inefficiencies: Fork lifting does not allow the necessary data model changes
that take full advantage of a hub-and-spoke architecture because it does not
allow correction of architectural flaws or inefficiencies in the originating
environment.
- “All
or nothing:” A forklift approach can achieve significant time-to-delivery
advantage in part because complex decoupling of dependant data sources across
subject areas can be completely avoided. Instead of investing time to safely
move single-use cases or business subject areas, the entire data model is moved
at once. This limits opportunities to demonstrate early successes with smaller
deliverables. The project becomes vulnerable to bottlenecks because all process
dependencies become potential showstoppers.
Although
these challenges are daunting, the Microsoft EDW solution delivers several mitigating
advantages:
- Flexible,
low-cost prototyping: Samples of complete schema can be copied easily between
the central hub repository and the development spokes. This allows project
teams to focus testing on the dependencies in dedicated environments, scaled to
suit their needs. Potential defects can be identified early in the development
process. Benchmarking on dedicated test spokes allows measured adoption
performance enhancements and features unique to the new system.
- Bandwidth
that makes replication feasible: A forklift migration can make the
incorporation of high-value or problematic use cases difficult. Because
decoupling of process dependencies rarely occurs during the initial stages, the
only alternative is to move significant portions of the existing data model to
a spoke. Parallel copying, which is possible in Madison, makes this high volume
bulk transfer of both schema and data achievable.
- Data-model
agnosticism: Microsoft promotes a best-fit approach to data model and
information delivery. No single data model is required to unlock the
performance and scalability of the Microsoft EDW solution, providing an
environment that is flexible for both system and data architecture.
Divide and Conquer
To divide and
conquer by focusing on key subject areas or use cases is a proven method for
large scale data warehouse migration. This approach fits well with iterative
project and development methodologies, which are also proven to be highly
effective for this type of project. An important advantage to this approach is
the ability to demonstrate iterative successes to business constituencies.
Common
challenges faced in a divide-and-conquer approach include:
- System-project
alignment: The hardware platform is often overlooked as an integral part of the
overall development methodology. The divide and conquer approach suffers when
prototyping must occur in a limited or shared testing environment, because it
is difficult to simulate real-world scenarios. A key advantage to this method
is the ability to deploy multiple development teams in parallel efforts.
Single-resource environments act as bottlenecks in this type of migration.
- Early
deployment: Realizing the greatest advantage of a divide-and-conquer strategy
means the early deployment of a tightly scoped use case. Once that deployment
occurs, development teams face the challenge of working in a production
environment. This situation gets worse in a monolithic or limited-resource
environment. Migrations often slow down considerably after the first
expectation-setting deployment success.
The Microsoft
hub-and-spoke solution aligns well with a divide-and-conquer migration
strategy. Some key advantages are:
- Separation
of development and production: New subject areas can be included in a single
environment while production processes run on a separate hub.
- Insulation
of users: Spokes can be created at the business unit or subject area level,
allowing existing users to be unaffected by the roll-out of new subject areas
or business units.
- High-speed
data movement: Ease data moves and processes between development, test, and
production environments. High-speed data movement also enables spokes to serve
as back-ups to each other in the event of a system outage.
- Staged
purchases: Additional spokes can be added in synch with development efforts.
There is no requirement to purchase all hardware and software at the start of a
divide-and-conquer implementation.
- Centralized
data model: The hub maintains a centralized data model for all subject areas.
Green Fields
Starting
fresh with a new hardware stack, new architecture, new database essentially “green fields” is the riskiest approach to data warehouse
migration but the benefits can be tremendous in the right situation. This
approach counters all of the disadvantages of a forklift migration. In the case
of an environment with deeply entrenched practices and architecture, a green
field approach can save time over remodeling and reconfiguring, resolve major
architectural flaws of an existing environment, and deliver better long term
performance. Migrating to an entirely new architecture can be an effective way
to manage the organizational change inherent in any data migration.
Some of the
risks for green field migration include:
- Overwhelming
scope: Scope must be broken down to iterative projects that can be delivered
regularly. Management of scope must be rigorous in order to keep scope creep to
a minimum.
- Funding:
Large initiatives are difficult to fund and to keep funded over the length of
time required by starting fresh. Organizations must have the discipline and
financial fortitude to commit to such an effort.
- First
deliverables take time: The first iteration of a green field implementation is
the longest iteration because a baseline data model must be developed that
meets the requirements of multiple user constituencies. Many architectural
decisions are made at this time.
Like the
previous strategies, the Microsoft hub-and-spoke solution works to mitigate
these risks in the following ways:
- Right-size
funding: The Microsoft EDW solution provides the flexibility to right-size one
or more systems, tailored to a specific EDW use case. This allows capacity to
be tailored to the early phases of a green field project and delivers easy
scalability throughout the project timeline. This match between smaller scale
early deliverables and funding minimizes the large up-front costs often
associated with green field migrations.
- Agile
prototyping: High speed copy between database systems and scalable deployment
from MPP to SMP platforms support agile development efforts. Although complex
and extensive development efforts may be required to achieve first-production
deliverables, the Microsoft EDW solution provides an environment that makes
early prototyping or demonstrations attainable.
Not
Ready to Migrate? Assume a Portion of the EDW Workload
It takes significant effort to balance
the CPU, memory, networking and storage resources of the system against the
loading, processing, reporting, and query processing requirements of users in
an EDW environment. Without a carefully balanced system, adding new user groups
and new data, extending data retention, and adding new processes or
applications impacts the performance of existing system activities. In some
cases, there is not enough capacity in the EDW system (capacity and/or CPU
cycles) to support the new requirements. As a result, many organizations face
the dilemma of decreasing the performance of the EDW system for all users or
upgrading the system.
Migrating a portion of the EDW
workload to a SQL Server-based MPP data warehouse provides a low-cost
alternative. In most Teradata implementations, numerous processing tasks or
subsets of users can be easily migrated. Typically, workloads migrated from an
EDW to an MPP SQL Server data warehouse can run five to twenty times faster.
Workloads and users remaining on the existing EDW also see an increase in
performance through the elimination of users and processes competing for the
same resources.
Migrating workload from an existing
EDW to SQL Server is a straightforward process that requires a minimal amount
of change. Data models and tables from Teradata can be migrated with minor
changes to the table DDLs. Queries can be migrated with little to no change.
ETL and ELT processes can also be migrated with minimal impact. Business
Intelligence tool universes and projects can also be redirected to SQL Server
with ease. Best of all, Microsoft provides tools, utilities, and services to
make migration even easier.
Conclusion
Many
organizations are exhausted by inflexible and unresponsive centralized EDWs
coupled with independent data marts and inconsistent data, duplicate ETL
processes, and no real controls. By resolving these issues using a grid-enabled
data warehouse architecture that supports key features, such as the ability to
move data at high speeds, data mart publishing, centralized management, support
of disaster recovery, and metadata management, Microsoft Madison will be the
only provider of a comprehensive data warehouse infrastructure capable of supporting
a hub-and-spoke architecture.
The
availability of a usable hub-and-spoke EDW platform represents the next
advancement in data warehousing. Microsoft customers are already outpacing
competitors by applying the technology in diverse and innovative ways. With
Project Madison, Microsoft will have a completely integrated end- to- end
software stack offering best-of-breed ETL, data warehouse infrastructure and
Business Intelligence products, tools, and solutions.
For more information:
Using SQL Server To Build A
Hub-and-Spoke Enterprise Data Warehouse Architecture
Introduction to SQL Server Fast Track
Data Warehouse Architectures
Implementing a SQL Server Fast Track
Data Warehouse
Project codename Madison
SQL Server 2008 Data Warehousing
SQL Server CAT Team 10 Best Practices
for Building Large Scale Relational Data Warehouses
SQL Server TechCenter
SQL Server DevCenter
Did this
paper help you? Please give us your feedback. Tell us on a scale of 1 (poor) to
5 (excellent), how would you rate this paper and why have you given it this
rating? For example:
- Are
you rating it high due to having good examples, excellent screen shots, clear
writing, or another reason?
- Are
you rating it low due to poor examples, fuzzy screen shots, or unclear writing?
This feedback
will help us improve the quality of white papers we release.
Send feedback.