Data Warehousing (DW)---a Technical Reference Guide for Designing Mission-Critical DW Solutions
Want more guides like this one? Go to Technical Reference Guides for Designing Mission-Critical Solutions.
Data warehouses are characterized by queries that scan larger numbers of rows, large ranges of data and may return relatively large results for the purposes of analysis and reporting. Data warehouses are also characterized by relatively large data loads versus small transaction-level inserts/updates/deletes. Database (DB) schemas are often de-normalized, and many times in the form of a star-schema. However, in the case of a hub-and-spoke architecture, the "hub" (also known as the operational data store [ODS]) is usually normalized, while the "spokes" (data marts) are de-normalized.
While a relational database engine is at the heart of a data warehouse, most relational technologies are not tuned to deal with the large-query and data-loading workloads of data warehousing (DW). This is why we have developed special architectures (called reference architectures) to help the database engine work more efficiently with the hardware.
The following resources provide some general information and best practices.
Data Warehouse Fundamentals
Articles: Data Warehouse Fundamentals
Best Practices for Data Warehousing with SQL Server 2008
Understanding Concurrent Users versus Number of users: Concurrent users are defined as the number of users actively running tasks (usually queries) at any moment in time. Number of users usually refers to the total population of users having access to the database. Usually the number of users is a factor of 10 larger than the concurrent users. So if a customer claims to have 1000 users, it is acceptable to protract that number into 100 concurrent users for the purposes of capacity planning.
Understanding mixed-query workloads: Queries can often be categorized into small, medium and large based on a number of criteria, but usually it is based on time. In most data warehouses, there is a mixed workload of fast-running versus long-running queries. In each case, it is important to determine this mix and to determine its frequency (hourly, daily, month-end, quarter-end, and so on). It is important to understand that the mixed query workload, coupled with concurrency, lead to proper capacity planning for a data warehouse.
Case Studies and References
SQL Server has been deployed for many Tier-1 and lower Tier applications by customers. Examples include:
Fast Track is a reference architecture with specific guidelines for Hardware and software to configure SQL Server for large-scale data warehouses < 50TB in size. Note that Fast Track is NOT a product with a SKU, but is recommended, vendor-specific hardware configuration and best practices. Fast Track Reference Architecture6
Questions and Considerations
This section provides questions and issues to consider when working with your customers.
Data Warehouse Maturity Level
7: Fully understand the customer’s data warehouse maturity level: are they just beginning (Basic) or do they have fully-managed DW’s and using data-governance (Dynamic).
Determine if the client has a DW enterprise initiative or policy, or are they just trying to fulfill the needs of a specific project. This will govern the solutions we may present to the client (e.g. Fast Track or PDW).
Understand the entire DW workload that is particular to the client. The workload is everything from loading data, to running DW queries, to reporting, to analytics and beyond. The depth and breadth of these components are often commensurate with the DW Maturity Level.
Understand the data warehouse size, sizes of individual databases, incremental load volumes, concurrent users, and so on.
Understand the nature of the DW queries; in other words, understand the client’s business. TECLO and investment bank customers tend to have less complex queries, but very large data-scan requirements. Conversely, Retailers, Transportation, Petrochemical customers have more complex queries.
Understand the DW user-population mix: is it comprised of a large number of "Consumers" who want regularly-scheduled reports, or are there a few DW analysts running complex/elaborate queries requiring massive amounts of data?
Understand the client’s hardware allegiance and acceptance of an appliance-based DW solution versus a hand-built/customized H/W solution.
Following are the full URLs for the hyperlinked text.
1 Data Warehouse Fundamentals http://it.toolbox.com/wiki/index.php/Data_Warehouse_Fundamentals
2 Articles: Data Warehouse Fundamentals http://www.rkimball.com/html/articlesFundamental.html
3 Inmoncif.com http://www.inmoncif.com/home
4 Best Practices for Data Warehousing with SQL Server 2008 http://msdn.microsoft.com/library/cc719165.aspx
5 DWMantra.com http://www.dwmantra.com/dwconcepts.html
6 Fast Track Reference Architecture http://sharepoint/sites/SSDWPU/default.aspx"
7 IBM DB2 Driver for ODBC and CLI Overview http://sharepoint/sites/bpdcxstg/SQLCAT/Documents/Tier%201%20DW%20Toolbox