Project REAL: Technical Overview

Article
01/28/2010

By Len Wyatt

Applies To: SQL Server 2005

Summary: In Project REAL we are creating a reference implementation of a business intelligence (BI) system using real large-scale data from a real customer. The goal is to discover the best practices for creating BI systems with SQL Server 2005 and to build a system that exhibits as many of those best practices as we can. This project is not just a demo —we are creating this system for ongoing operation. It is a complete system, including daily incremental updates of the data, large multiuser workloads, and system monitoring. This paper introduces the Project REAL system, its data models, subsystems and deployment scenarios. It is the first in a series of papers which will give more details about various aspects of Project REAL and the best practices that have been uncovered during this exercise.

Project REAL is a cooperative endeavor between Microsoft and a number of its partners in the BI area. These partners include (in alphabetical order): Apollo Data Technologies, EMC, Intellinet, Panorama, Proclarity, Scalability Experts, and Unisys. The business scenario for Project REAL and the source data set were graciously provided by Barnes & Noble.

Introduction

Project REAL is an effort to discover best practices for creating business intelligence (BI) applications that are based on Microsoft® SQL Server™ 2005. In Project REAL we are doing this by creating reference implementations based on actual customer scenarios. This means that customer data is brought in-house and is used to work through the same issues that customers face during deployment. These issues include:

Design of schemas — both relational schemas and those used in Analysis Services.
Implementation of data extraction, transformation, and loading (ETL) processes.
Design and deployment of client front-end systems, both for reporting and for interactive analysis.
Sizing of systems for production.
Management and maintenance of the systems on an ongoing basis, including incremental updates to the data.

By working with real deployment scenarios, we gain a complete understanding of how to implement a BI system using SQL Server BI tools. Our goal is to attempt to address the full gamut of concerns that a company wishing to analyze potentially large data sets would face during their own real-world deployment.

This paper provides a general overview of the data, designs, systems, and methods used in Project REAL. In most areas, more detailed white papers that describe the lessons learned and best practices uncovered through this work are available on the Project REAL Web site. It is our hope that anyone planning or implementing a business intelligence system based on SQL Server 2005 will find the reference implementation useful.

Project REAL will result in a number of papers, tools and samples over its lifetime. To find the latest information, visit the following site:

https://www.microsoft.com/sql/bi/ProjectREAL

Cooperative effort

Project REAL is a cooperative effort between Microsoft and a set of partner companies known for their expertise in their respective fields. Each partner committed resources to the project and agreed to perform technical work that focused on developing general best practices, not on distributing marketing information. The partners are listed below along with their areas of focus:

Apollo Data Technologies implemented the data mining models and integrated them so they would always be "live" as the data is updated.

Barnes & Noble provided the business scenario for Project REAL and the source data set. They did this knowing that the purpose of the project was not to create the precise system that they would deploy, but to create best practices and instructional information for a wide audience.

EMC provided substantial storage resources for the project, implemented the data integrity features of the system, and provided a backup system for the project.

Emulex provided host bus adapters (HBAs) for connecting the servers to the storage subsystems.

Intellinet designed and implemented the overall ETL system at Barnes and Noble and then modified it to meet the needs of a model implementation for Project REAL.

Panorama developed various client access strategies for the system to model the connectivity and security needs of intranet, wide-area network, and Internet users, and also developed a showcase front-end system for Project REAL.

Proclarity developed and documented guidelines for migrating Analysis Services 2000 implementations to Analysis Services 2005, and also developed a showcase client front-end system for Project REAL.

Scalability Experts designed and implemented the data lifecycle management functionality, including the partitioning of relational tables and Analysis Services cubes, and implemented the management of partitions in the ETL processing.

Unisys contributed expertise from their Business Intelligence Center of Excellence and substantial hardware resources including 32-bit and 64-bit servers. Unisys also designed and implemented the monitoring system used for ongoing operations.

Phase 1 and Phase 2

Project REAL currently uses data from two Microsoft business intelligence customers. Phase 1 of the project was modeled on a large electronics retailer that keeps sales and inventory data in an SQL Server 2000 data warehouse. Phase 1 implementation focused primarily on the concerns that an existing SQL Server 2000 customer might have when migrating to SQL Server 2005. Our results largely represent the migration of existing functionality, with a few new capabilities used where appropriate. The following papers about Phase 1 have been (or are about to be) published on the Microsoft Web site:

"SQL Server 2005 Integration Services: Lessons from Project REAL" at (https://msdn2.microsoft.com/library/bb469733.aspx)
A paper that describes building custom components for the SSIS pipeline is currently in progress.

Phase 2 of Project REAL is based on the business scenario, schema, and data used by Barnes & Noble in their sales and inventory data warehouse. Where Phase 1 was concerned primarily with the migration process, Phase 2 is a complete reimplementation of the system using best practices for SQL Server 2005. As such, it exercises more parts of the system, including features not in the Barnes and Noble implementation. The remainder of this paper is about Phase 2 of Project REAL.

Business scenario

Barnes & Noble booksellers (B&N) have 40,000 employees in about 800 stores around the U.S. They use the Barnes & Noble and B. Dalton names, and own BookStop and barnesandnoble.com. The initial data warehouse Barnes & Noble created is intended to support planners and buyers at the corporate level, store managers, and distribution center managers.

The goals of the B&N warehouse included:

Improving the visibility and prediction of out-of-stock conditions.
Enabling better forecasting and purchasing decisions.
Improving the reporting system and the exposure of Key Performance Indicators (KPIs).
The ability to perform ad hoc analysis.
Better integration of data on sales, store inventories, and distribution center inventories.

To illustrate the "realness" of this data, Figure 1 shows the sales of certain titles in a five-state area for the months in 2004. It shows that titles managed with the "backlist" strategy had the highest sales in the winter and summer months, and dipped in the spring and fall. Titles managed with the "Frontlist" strategy had fairly even sales through most of the year, but increased sharply in November and December — the Christmas holiday season.

Figure 1: Sales by ordering strategy

For purposes of Project REAL, B&N delivered their existing data warehouse databases plus three months of daily incremental update data. This way the Project REAL systems can be operated to simulate ongoing usage of the system. The data has been "masked" for Project REAL to protect the confidentiality of information. The B&N databases, originally developed using SQL Server 2000, were converted to SQL Server 2005 and adapted in various ways that would illustrate best practices for SQL Server 2005. Note that B&N chose to do their ETL work using prerelease versions of SQL Server Integration Services (SSIS) because of its new design features and high performance. The packages developed for use at B&N formed the basis for the ETL work in Project REAL, but again, the packages have been adapted to show additional best practices and features.

Note: Data Transformation Services (DTS) in SQL Server 2000 has been redeveloped into a new ETL facility called SQL Server Integration Services (SSIS).

Overview of configurations and tests

Project REAL is not a single system implementation — it is a set of architectural variants that show the merits and performance of different configurations. Regardless of the physical implementation, there are certain logical and software components that will always be present. As illustrated in Figure 2, these include:

The "source" database from which data is extracted to feed the data warehouse.
The ETL packages and SSIS service that integrate data from the data source into the relational warehouse database.
The relational data warehouse database and service which are used for relational reporting and as a source for creating cubes, and provide long-term highly reliable storage.
The Analysis Services database and service for analytical queries, for reporting, and for data mining.
The Reporting Services service provides a vehicle for distributing reports in many formats to a wide audience, using a report cache for high performance.
There may be multiple instances of Web servers, allowing a 3-tier implementation for intranet, extranet, or public Internet scenarios.
Client emulation facilities allow a workload to be created that emulates the behavior of a large number of users accessing the system.

Figure 2: High-level architecture

These components can be implemented in a variety of physical implementations, which will have differing advantages in terms of performance, ease of deployment, client connectivity, security, and cost:

Consolidated vs. distributed server architecture. Both consolidated and distributed architectures will be tested since it is possible to run each principle service on a separate server, or to consolidate services onto a single larger server. The goal of Project REAL is to reflect actual customer choices and tradeoffs. Unisys provided a variety of servers so that both configurations could be tested and the approximately 30 TB of storage that EMC provided allows us to provision both configurations as the same time.
32-bit and 64-bit performance. 32-bit vs. 64-bit performance will be compared and evaluated for the consolidated scenario.
Client connectivity scenarios. Clients may exist on the corporate network (intranet scenario), may tunnel in through firewalls but still use corporate security (extranet scenario), or may exist on the public Internet accessing BI services (Internet scenario). These have various access and configuration concerns. In Project REAL we can implement all three.

There are also variants of the software architecture and design. Two examples you will see in Project REAL:

Prior to SQL Server 2005, partitioning of large amounts of relational data (such as fact tables reaching hundreds of gigabytes of data) would be accomplished by creating multiple fact tables (called partition tables) — for example, there might be one table for each week of data. SQL Server 2005 introduces partitioning within a relational table. Project REAL will exercise both approaches.
Reporting Services can generate reports using relational data or Analysis Services cubes. Again, Project REAL will exercise both approaches.

Each of the above configurations will appear in appropriate places in the various white papers about Project REAL that are published. The papers will explore architectural impacts, end-user functionality, configuration details, ease of maintenance, or any number of other topics. However, for all configurations, there are certain performance metrics that will be consistently of interest. We will be consistent about measuring at least these things:

Update rates. How fast is incremental update data processed into the system? This includes both relational data warehouse and the Analysis Services cubes. Update rates are measured in rows per second.
Query or report response times. How fast does the system respond to interactive queries or requests for reports? Response times are in seconds or milliseconds. Experience shows that users are not very sensitive to the average response time of a system; they notice when things run slowly. Therefore we will typically report on the 90th percentile response time — meaning that 90% of requests are faster than this time. Response times clearly depend on the overall load on the system, so they will always be stated in the context of an overall load (such as the number of emulated users).

SQL Server 2005 BI Features

This section will briefly reiterate some of the key features of SQL Server 2005 for BI. This will be a very high-level treatment, since a great deal of information is available in other places. In particular, check the Microsoft SQL Server Web site at https://www.microsoft.com/sql/2005. In addition, if you have an early SQL Server 2005 build (one of the Community Technical Previews) there is a resource section on that CD.

SQL Server 2005 includes the tools that you need to create BI applications that are highly functional, scalable, and robust. You do not pay extra for them.

Data from various sources can be integrated and written to create the relational data warehouse using SQL Server Integration Services (SSIS). SSIS has a new pipeline architecture that allows data to be retrieved from various sources, manipulated in memory, and stored in the destination database without being written to disk in intermediate locations. This has tremendous performance benefits. The kinds of manipulations that can be performed in the pipeline include data validation, surrogate key lookups, derivation of new columns, aggregation, sorting, and more.

The SQL Server relational engine provides reliable and scalable long-term storage for information in a data warehouse. From this point, it can be used for other analysis and reporting purposes.

Analysis Services takes data from the relational warehouse and prepares it for fast access by users. Analysis Services has an aggregation engine that helps it respond rapidly to queries at various levels of detail. The use of a query language known as MDX facilitates the expression of complex queries. Many client tools, including Proclarity and Panorama, use the Multidimensional Expressions (MDX) language to access data through Analysis Services. Web clients can submit MDX queries via HTML or XML for Analysis (XMLA).

Reporting Services facilitates easy creation and retrieval of reports in many formats (HTML, Excel, PDF, etc.). Reports are authored and stored in a common definition format, which may then be rendered in any of the available report formats. Reports can be cached for rapid retrieval, and point-in-time snapshots of reports can be saved for later retrieval. Reporting Services reports can be defined to run using either a relational database or Analysis Services as the data source.

Data for Project REAL

Description of the data

Barnes and Noble keeps their data warehouse in several related databases. For purposes of Project REAL, the primary ones of interest are a Sales database and an Inventory database. The data from those two databases is combined into one larger database called REAL_Warehouse for Project REAL. A star schema is used, with a set of dimension tables that define the entities in the warehouse and fact tables with measures for each subject area. An excellent overview of the subject of dimensional modeling of this sort can be found in Ralph Kimball's book “The Data Warehouse Toolkit: The Complete Guide to Dimensional Modeling” (http://www.kimballgroup.com/html/books.html).

The primary subject areas are store sales (Tbl_Fact_Store_Sales_YYYY_MM_DD tables), store inventory (Tbl_Fact_Store_Inventory_YYYY_MM_DD tables) and distribution center (DC) inventory (Tbl_Fact_DC_Inventory_YYYY_MM_DD tables). These fact areas are divided into weekly partition tables, each tagged with the closing date of the partition. There are 15 dimension tables in all, the primary ones representing:

Buyers (Tbl_Dim_Buyer, 584 rows)
Customers (Tbl_Dim_Customer, 5,636,425 rows)
Time (Tbl_Dim_Date, 6,615 rows, because it is pre-populated to the year 2020)
Vendors (Tbl_Dim_DC_Vendor, 12,686 rows)
Products (Tbl_Dim_Item, 6,975,084 rows)
Stores (Tbl_Dim_Store, 4,127 rows)
Employees (Tbl_Dim_Store_Employee, 30,226 rows)

Table 1 shows the initial space used in the REAL_Warehouse database, before applying any incremental updates. Because weekly partitions are maintained for the sales and inventory fact tables, the number of rows per table will vary, but an average is given.

Note: The table sizes will increase as we add indexes in support of relational queries. Creating cubes over the fact tables does not require indexes on the tables.

	Tables	Rows	Size (MB)	Bytes/Row	Rows/Table
Dimension tables	15	12,666,277	6,420	n/a	n/a
DC Inventory facts	18	54,405,164	4,363	84	3,022,509
Store Inventory facts	53	8,630,298,635	435,983	53	162,835,823
Store Sales facts	157	1,366,052,628	192,354	148	8,700,972
Division Strategy facts	1	33,417,014	2,013	63	33,417,014

Table 1: Initial data warehouse statistics

At B&N, incremental updates are pulled using SSIS from two Oracle databases — one for point-of-sale transactions and one for inventory management — and retained in a staging database before being integrated (again using SSIS) into the warehouse databases. For Project REAL a copy of the staged incremental updates has been forwarded to Microsoft. There are three months of daily incremental update available for the project.

Updates to various tables in the source databases are captured as rows with a date stamp indicating when the modification occurred. There is one table for each data type that is fed in — stores (BN_Store_Full), buyers (IMM_Buyer_Full), sales transactions (SRS_Sales_Trans_Full), etc. Table 2 shows the amount of data available for incremental updates.

	Tables	Rows	Size	Bytes/Row
Dimension data	7	12,050,392	8,734	n/a
DC Inventory facts	1	298,496,583	31,525	111
Store Inventory facts	1	294,776,968	65,713	234
Store Sales facts	1	148,801,022	29,129	205
Division Strategy facts	1	6,782,314	517	80

Table 2: Incremental update source data statistics

It should be noted that the initial data set contained three years of sales data, one year of store inventory data, and just three months of DC inventory data. This means that when three months of incremental update data are added to the system, the data volume of the store inventory will grow proportionally more than the sales data, and the DC inventory can be expected to double in volume.

Transporting and masking the data

To meet the unusual challenge of moving this volume of data from B&N's operations center in New York to the Microsoft lab in Redmond, WA, we took advantage of some recent technologies. First, the databases were backed up using LiteSpeed backup from Imceda Software. This resulted in a significant space savings due to the compression methods employed by LiteSpeed. Then the backups were written on USB hard drives and sent through the mail! The choice of USB hard drives offered great flexibility in connecting and shipping the drives.

After the data was restored it was also "masked" to protect the value of the data for B&N. A significant effort was made to obscure confidential information and to render the data non-useful for certain types of analysis that competitors might be interested in. At the same time, it was important that the data really reflect the operational characteristics of B&N's data, and that the data be real enough for analysis and demos. Thus there were sometimes contradicting objectives, but a resolution was found that all parties believe meet the requirements. The procedures used to mask the data will not be disclosed. It should suffice to say that the resulting data is very realistic, though not totally accurate. You might see a demo using this data and never realize that the data has been modified, if we didn't tell you here.

Logical Components of the Implementation

The overall architecture used in Project Real was shown earlier in Figure 2. This is very typical of a BI system created using SQL Server. Source data is pulled from one or more locations using Integration Services, transformed in various ways, and placed in the relational data warehouse. Analysis Services cubes and data mining models are created based on the data in the relational data warehouse. End users access the data either as reports defined against the cubes, as relational data using Reporting Services, or as exploratory views running against Analysis Services.

The relational database, Integration Services, Reporting Services, and Analysis Services all run on a Microsoft Windows® operating system — at this time, Microsoft Windows Server™ 2003. Whether they run on one or multiple physical machines is discussed in the next section.

ETL using Integration Services

Barnes and Noble have elected to copy their incremental update data to a staging database using SSIS before performing any ETL transformations on it. This is not required by the architecture of the Microsoft BI tools; in fact, one of the advantages of the new SSIS architecture is the ability to perform complex transformations without having to incur the overhead of writing intermediate data to disk. The reason for the staging database in the B&N implementation is to allow re-running of the ETL (for example, if a problem is discovered after the update is performed) without having to extract again from the source systems. This design is fortunate for Project REAL because we can re-create the incremental update stream by obtaining a copy of the staging database. For our purposes the staging database is a substitute for multiple source systems.

In the ETL processing, the main functions being performed are:

Checking for consistency of all the dimension and fact data.
Converting data to the correct data types for the relational data warehouse.
Checking for changes to dimension members and handling them according to the business rules for each dimension.
Inserting new dimension members when they arrive.
Creating new weekly partitions for fact data at the correct times.
Inserting large numbers of fact table entries.
Maintaining an audit history.

Figure 3 shows an example of a high-level package that calls other packages to handle dimension and fact data. It uses the control flow aspect of SSIS to invoke other packages, so that the updating of dimensions and fact tables will occur in the proper order. There are four packages to update dimensions (Merchant Division, Return Type, Buyer, and Item) that must be updated before updates are brought into the Division Strategy fact table. When that is complete and the Store dimension is updated, the Store Sales facts are updated. This is an example of a load group package in the B&N parlance. That is, it controls a group of packages that run on a daily basis. There are other load groups for daily inventory data and there is a weekly load group that manages partition creation. (In the live system at B&N, the weekly load groups serve other business functions as well that are not replicated in Project REAL.)

Figure 3: Example ETL control flow

Each of the packages called by the load groups has a control flow customized to some entity in the system (a dimension or fact table) and has at its heart a data flow task. A simple data flow for a dimension is shown in Figure 4.

Figure 4: Example ETL data flow

Whereas a control flow deals with the conditions under which various tasks occur and the order in which they occur, a data flow deals with the flow of data through the system. Data records will follow the indicated paths. The data flow shown in Figure 4 reads source data with vendor information and checks to see if the inputs represent new records or changes to existing records. For each case the data will follow a different path. This is one of the simplest data flows in our ETL — some of them are much more complex with rich data validation and multiple error-handling paths.

The data flow architecture, sometimes also referred to as a pipeline architecture, is substantially faster than the architecture of previous releases. This, combined with the fact that SSIS encourages data flows from sources through transformations to destinations without intermediate on-disk storage, can result in dramatic performance gains.

For a much more thorough examination of the ETL processes involved in Project REAL, and the lessons learned from creating then, see the "Project REAL ETL Design: Integration Services Lessons Learned" white paper.

The Relational Data Warehouse

The general characteristics of the relational data were given in Description of the data earlier in this paper. The discussion here is limited to adding a few interesting points. These relate to the use of views to access the relational data, the use of table partitions, and the addition of a schema for configuration data.

A note about referential integrity: Although there is a logical primary key / foreign key relationship between dimension tables and fact tables, this is not explicitly declared in the Project REAL schema. It is common in large data warehouses to leave these relationships undeclared in order to ease maintenance. It is then even more important for the ETL processes to maintain referential integrity in the data.

Tables and views

B&N uses a naming convention for objects in the database. Tables have a "Tbl_" prefix and views have a "vTbl_" prefix. In addition, dimension tables have "Dim_" and fact tables have "Fact_" following the prefix. The large fact tables are partitioned by week, so fact partition tables have the date of the last day of the week as a suffix. Thus the store dimension table is called "Tbl_Dim_Store" and the table with store inventory for the week ending June 19, 2004 is "Tbl_Fact_Store_Inventory_WE_2004_06_19".

One principle in a good data warehouse design is to provide a view for each table that may be accessed by other systems or by end users. This can provide a place to implement simple business logic, and more importantly, provides an isolation layer in case there are changes in the system design. This view of an inventory fact table provides an example:

CREATE VIEW vTbl_Fact_Store_Inventory_WE_2004_06_19 AS 
SELECT 
        INV.SK_Store_ID,
        STORE.SK_Parent_Store_ID,
        INV.SK_Item_ID,
        ITEM.SK_Parent_Item_ID,
        SK_Dept_ID = CAST(CASE WHEN ITEM.Dept_Num = -1 THEN 0
                               ELSE ITEM.Dept_Num
                          END AS Tinyint),
        INV.Sk_Date_Id, 
        INV.Sk_Buyer_Id, 
        INV.Sk_Model_Strategy_Id, 
        SK_Measure_ID = CAST(1 as Tinyint),
        INV.Model_Qty, 
        INV.On_Hand_Qty, 
        INV.On_Order_Qty,
        INV.Return_Qty, 
        INV.Retail_Amt,
        INV.Days_In_Stock
FROM dbo.Tbl_Fact_Store_Inventory_WE_2004_06_19 INV
  INNER JOIN Tbl_Dim_Store STORE ON STORE.Sk_Store_ID = INV.Sk_Store_ID
  INNER JOIN Tbl_Dim_Item ITEM   ON ITEM.Sk_Item_ID   = INV.Sk_Item_ID

This view ensures that only fact data with referential integrity on the Store and Item dimensions will be returned (in other words, inventory rows will not be returned if they do not have corresponding entries in the Store and Item dimensions). In addition, a simple bit of logic is applied (Dept_Num of -1 is changed to 0) and the SK_Measure_ID field is cast to a tinyint.

The caveat to the idea of implementing logic in views is that the logic must be executed each time data is retrieved from the views. The value must exceed the performance costs. In the example, the operations on SK_Measure_ID and SK_Dept_ID are trivial, but the joins between the fact table and dimension tables (especially a large one like Item) can be expensive and must be carefully considered.

Relational partitioning

Prior to SQL Server 2005, the way to maintain partitioned tables in the relational database was to create multiple tables. For example, in the original Project REAL database, we had 157 tables just for store sales data, and that was before the three months of incremental updates!

In Project REAL we operated a version of the system this way, but we also created a version of the system that took advantage of the new partitioned tables feature in SQL Server 2005. Using partitioned tables clearly simplified the database, as shown in Figure 5, because all those partition tables became internal partitions within a single table, "Tbl_Fact_Store_Sales". More importantly, the new structure simplified the implementation of the Integration Services packages, the Analysis Services cubes, and Reporting Services reports. For more information on the partitioning scheme and its impacts, see the "Project REAL Data Lifecycle – Partitioning" white paper.

Figure 5: Cleaner design with Partitioned Tables

Configuration schema

SQL Server allows tables within a database to be placed in logically related groups, each of which is known as a schema. All of the dimension and fact tables in the data warehouse are in the default schema, "dbo". At B&N a small separate database is used by the ETL processes to maintain state information. For this project we moved the state data into the data warehouse. This ensures that the state information always stays consistent with the data. For Project REAL there is an additional advantage to this approach: It enables us to keep multiple copies of the warehouse in different states. We might have a warehouse on one set of servers being updated, and a warehouse on another set of servers supporting a multiuser reporting experiment.

To prevent the configuration tables from becoming a potential source of confusion to users, those tables are placed in a different schema, "config". Access to the config schema can be granted to administrators and system accounts only.

The use of the tables in the config schema is discussed in the "Project REAL ETL Design: Integration Services Lessons Learned" white paper.

OLAP Cubes using Analysis Services

Analysis Services, in combination with end-user query tools, provides a rich environment for the interactive exploration and analysis of data. The organization provided on the data makes it easy for users to navigate. The internal aggregation engine provides rapid response to queries that would otherwise require significant time and resources to answer. This speed advantage is also a benefit when generating reports, so Reporting Services is often used in combination with Analysis Services.

Traditionally OLAP systems have been structured to provide a hierarchical view of dimension data. This is useful because it is easy for users to comprehend. Analysis Services cubes function this way too. Figure 6 shows a hierarchy in the product dimension for the B&N cube. The dimension starts at the Product Type level (Book, Multimedia, etc.). Next is the Subject level (a few of the book subject areas are Art, Cookbooks, Education, and Fiction), then the Category level (one category of Fiction is Literary Fiction) and the Subcategory level (there is only one subcategory under Literary Fiction). Finally, underneath the subcategories is the Item level — the individual titles (Bel Canto, Catch-22, etc.)

Figure 6: Item dimension hierarchy sample

Many things of interest can naturally be modeled with this pattern of levels in a dimension. Stores are in Countries, Regions, States, and Cites, and so on. But there are also some things that are not as easily modeled this way, and SQL Server 2005 Analysis Services introduces the concept of attribute hierarchies for these cases. For example, it may not be sensible to include information indicating whether there is a large print edition of the book as part of a hierarchy Likewise an attribute indicating whether the book is classified as a bargain book is not likely to be included in a hierarchy. Still, these are useful properties of a book to analyze. Previous versions of Analysis Services had the concept of member properties and virtual dimensions for these cases, but a great deal of modeling flexibility can be achieved by using attribute hierarchies. Figure 7 shows an analysis in which the size of a store (in square feet) and the amount of shelf space in a store (linear feet) are compared, using sales as the measurement. Both the size and shelf space are attributes of a store — they are not part of any natural hierarchy. (It appears that stores with more shelf space sell less than stores with less shelf space. This would certainly be an interesting finding if it were true, but it may simply be that there aren't many of the largest stores. The next version of our schema should make it easier to analyze this.)

Figure 7: Analyzing attributes

The ability to do analyses like these is possible because Analysis Services keeps metadata about the underlying data. Figure 8 shows two snippets from the Business Intelligence Design Studio, which is the user interface for doing design work. It shows parts of the Store dimension structure definition. On the left side is the linkage between the City attribute of the Store dimension and the underlying data source. It comes from the view vTbl_Dim_Store and the column City. At the bottom right is the collection of attributes of a store that are available for analysis, and above that is the definition of a hierarchy called Geography in the Store dimension. The hierarchy expresses a logical path for navigating some of the attributes: Division, Region, District, City, Store.

Figure 8: Excerpts from the BI Design Studio

Another important piece of metadata is the association between dimensions and measure groups. Measure groups are the subject areas in a cube. They correspond directly to fact tables in the relational database, and are related to each other by the dimensions. Figure 9 shows the measure groups and dimensions in the Project REAL cube. Readers familiar with data warehousing literature will note the resemblance between this screen and the "data warehouse bus" technique described by Ralph Kimball.

Figure 9: Measure groups and dimensions

In addition to the metadata shown very superficially above, there are other important aspects of Analysis Services and its use in Project REAL:

The use of semi-additive measures, as exemplified by the inventory data. Semi-additive measures were not simple to implement in prior releases and did not always perform well. Now the system directly supports them, and their use in Project REAL illustrates this. (For readers not familiar with the term: Semi-additive measures do not sum up over time. Sales for the three months in a quarter can be added to get the total sales for the quarter; this is an additive measure. Inventories for the three months cannot be added to get the "total" inventory — it makes no sense. Inventories at a particular time at different locations can however be added to get a total. Hence, inventories are the classic example of a semi-additive measure.)
Many-to-many relationships were extremely difficult to model in Analysis Services before this release. For Project REAL we implemented a many-to-many relationship between the Vendor and Item dimensions.
Implementation of a partitioning scheme is important in large cubes for performance and manageability reasons. The scheme used in this project was to follow the weekly partition model used for the relational data. Partitions are dynamically created in the ETL processing and populated with data.
Design of aggregations to optimize query performance. Aggregations are key to achieving the rapid interactive query response that Analysis Services provides. There are guidelines for managing aggregations well.

These and all the design, implementation and deployment aspects of Analysis Services in Project REAL are discussed in more detail in the "Project REAL: Analysis Services Technical Drilldown" white paper.

Managing reports with Reporting Services

At the time of publication, little of our planned work with Reporting Services has been performed. We will be defining a set of reports representing common use of the system and running those reports against both the Analysis Services cubes and the relational data, rendering the reports in various formats (HTML, Excel, PDF, etc.) We will experiment with storing a large number of snapshot reports and with various deployment options. We will test the deployment options with a multiuser workload representing hundreds or thousands of users.

This paper will be updated with an overview of the work performed and highlights of the results. A more extensive exploration of the use of Reporting Services in Project REAL will be in a paper tentatively titled "Project REAL: Reporting Services Technical Drilldown.".

Data mining with Analysis Services

At the time of publication, none of our planned work with Reporting Services has been performed. We will be exploring the Project REAL data with multiple data mining algorithms. For example, a clustering algorithm might be used to identify clusters of stores based on customer buying patterns. A prediction algorithm might be used to forecast out-of-stock conditions, and an association algorithm might be used to perform market basket analysis.

One of they key goals of the data mining work will be to capture and describe methods of preparing data for successful data mining. Star schemas, which work so well for end-user exploration, may not be the ideal format for data mining.

Another goal will be to illustrate the use of data mining in everyday applications. Often data mining is thought of as primarily a discovery tool, but the data mining tools in SQL Server are designed for easy integration into production applications.

This paper will be updated with an overview of the work performed and highlighting the results. A more extensive exploration of the use of data mining in Project REAL will be in a paper tentatively titled "Project REAL: Data Mining Technical Drilldown.".

Client access to the BI system

There are a number of approaches to setting up client connectivity to a BI system. At the simplest end, Windows client tools can connect to Analysis Services or Reporting Services directly on the corporate intranet. This poses the fewest configuration questions and security concerns. Most of the focus in Project REAL will be on deploying Web-based applications, and it may be desirable to expose the applications on the intranet or though extranet connections to the intranet. The latter poses some security and performance questions, but the greatest number of questions arise when considering offering a service that would be available to users who access it from the Internet at large. These scenarios will be the focus of the client connectivity work in Project REAL. At a later date this paper will be updated with an overview of the work performed and highlights of the results. A more extensive exploration of the use of data mining in Project REAL will be in a paper tentatively titled "Implementing Secure Web Applications on the SQL Server BI Platform" to be published in the future.

It would be easy to get so involved in the technical discussions that we forget to show the amazing things that good client tools enable. In this section, we will digress long enough to show a few screen shots from the Proclarity and Panorama tools that are being used with Project REAL.

First, a reminder to the reader: The original Barnes and Noble data was masked. It cannot be assumed that the data shown in these charts is accurate.

Figure 10 is an example of a surface chart in the Panorama tool. On the two horizontal axes are five product departments and five states, and the vertical axis shows the sales volume of those products in those states. This is a handy way to look at the interrelationship between two dimensions. The use of color coding makes it easier to see patterns in the data.

Figure 10: Surface chart

(In the interest of full disclosure, it should be noted that Figures 1, 6, and 7 are also taken from Panorama client displays.)

Figure 11 is a Proclarity decomposition tree. This is a powerful tool for interactive exploration of data. Imagine a user starting at the top node, and breaking down all sales by departments. The chart shows that 40% of sales are from paperback books, and 21% from hardcover books. Then the user decides to look at what replenishment strategies are being used for hardcover books. This is a simple drill down using the tool, but it's important to note that the user changed dimensions at this point — the replenishment strategies are in a different dimension than the department. Finally the user drills down again on store managed books (hardcover only, because of the first level of drill down) to see the sales by month. Below the tree is a chart showing the amounts that correspond to each item at the lowest level; this is an aid to visualizing the amounts involved. It's clear that there is a big spike in sales late in the year, with another small spike in the second quarter.

Figure 11: Decomposition tree

Figure 12 is a Proclarity perspective view. This chart possibly has more points than a perspective view normally would (there are over 6,000), but even here it is clearly possible to see patterns in the data. The horizontal axis shows the average price of an individual title in the Business books; the vertical axis shows the number of each item that were sold. The dots are color-coded according to category within the business books subject area. It appears that real estate books command high prices, while books in the marketing/advertising and accounting/economics categories sell much higher volumes at lower prices. There are a lot of low-volume, mid-priced books on sales.

Figure 12: Perspective view

Physical Components of the Implementation

Up to now we have discussed the Project REAL implementation from a logical point of view — the software systems and the data they support. Now we begin to look at the physical implementation — the server, storage, and network configurations. Many parts of the system were implemented multiple ways to help understand the tradeoffs in various approaches.

Server systems: distributed and consolidated

One of the most common questions about Integration Services, Reporting Services, and Analysis Services is whether the service should be run on the same server as the relational database or a different one. While there are some "rules of thumb" that have been used in the past to answer this question, there never has been a single correct answer, and with SQL Server 2005 the parameters may have changed. Therefore, one aspect of Project REAL is to deploy into each of those configurations, and to explore the tradeoffs in this context.

In order to support the objectives of Project REAL, Unisys provided significant hardware resources to facilitate exploration and testing. Table 3 lists the primary servers. The complete set of servers provides flexibility for testing and is not reflective of the typical customer environment. For each architectural scenario, the benefits and best practices will be explored and shared via the Project REAL experience.

There are four servers for the distributed environment, so that each machine can support one of the major services: the relational data warehouse, Integration Services, Analysis Services, and Reporting Services. The configurations of these machines were chosen to represent typical "commodity" servers. In addition, there is one machine whose sole purpose is to provide the source of the incremental update data feed.

To support the consolidated environment, Unisys has provided two ES7000 servers — one 32-bit and one 64-bit. Each of them will be tested in the consolidated role independently. In the consolidated role, one system will run the relational data warehouse, Integration Services, Analysis Services, and Reporting Services.

Role	Server Name	Model	CPU	Cache	Memory
Data source	BI-REAL-DS		4x 700 MHz		4 GB
Distributed relational DW	BI-REAL-DW	ES3040L	4x 2.2 GHz	2 MB	8 GB
Distributed Integration Services	BI-REAL-IS	ES3040L	4x 2.2 GHz	2 MB	4 GB
Distributed Analysis Services	BI-REAL-AS	ES3040L	4x 2.2 GHz	2 MB	4 GB
Distributed Reporting Services	BI-REAL-RS	ES3040L	4x 2.2 GHz	2 MB	4 GB
Consolidated 32-bit server	BI-REAL-ES32	ES7000	16x 2.2 GHz	2 MB	32 GB
Consolidated 64-bit server	BI-REAL-ES64	ES7000-420	16x 1.3 GHz	3 MB	32 GB

Table 3: Servers provided by Unisys

It is assumed that there will be tradeoffs in performance, manageability, and reliability between these configurations. For example:

Does the system perform better if each service has its own dedicated server, or is it better to share a larger pool of processors?
Does the network slow down interactions between the services, or are other factors more significant?
Is it easier to maintain a single large server or multiple small ones?
Are there more failures when more systems are in the configuration? Are failures easier to isolate?
What is the price/performance tradeoff between these configurations?

The results of these comparisons will be reported in a paper tentatively titled "Project REAL: Performance Results and Architectural Comparisons" to be published in the future.

It should be noted that the distributed / consolidated comparison is not the same as a scale-out vs. scale-up comparison. These latter terms have to do with multiple servers cooperating in a single role vs. a single larger server fulfilling the role. In a BI implementation, these concepts tend to be more relevant in the front-end systems: multiple reporting servers or analysis servers supporting end-user queries. These will be more thoroughly discussed in a paper tentatively titled "Implementing Secure Web Applications on the SQL Server BI Platform" to be published in the future. For a complete description of the servers used in Project REAL, see the upcoming paper tentatively titled "Project REAL: Architecture of the Hardware Environment."

Storage configurations

A common question in large BI deployments is how the storage system should be configured for optimal performance. EMC provided significant resources to help address this question. The available equipment includes both Symmetrix storage (DMX1000-P) and CLARiiON storage (CX700), a CLARiiON disk Library (CDL) for backups, and two Fiber Channel switches. Together this provides a fault-tolerant Storage Area Network and allows storage volumes to be attached to any machine. In addition, Emulex has provided Host Bus Adapters (HBAs) for connecting the servers to the switch.

Figure 13 illustrates the storage connectivity. Each of the storage systems (the DMX1000-P and the CX700), which contain many physical disks, is configured as a number of logical volumes. Through the switches, each volume is exposed to one of the servers. On the server the volume is made available via a drive letter (e.g., the H: drive) or a mount point (e.g., C:\mount\DW4). The figure is a high-level illustration of the multiple paths available from the hosts to the storage. Each server can access many logical volumes simultaneously.

Figure 13: Storage connectivity

The logical volumes are designed for optimal performance in a particular role in the overall system — for example, a volume that will hold relational log files is configured for sequential write performance. There are a number of storage roles in the overall system:

Source data for incremental updates (relational)
Relational data warehouse
Relational database log files
Analysis Services cubes
Report databases (relational)

EMC has provided sufficient storage so that multiple copies of the data can be retained while we experiment with different configurations. For example, we can store cubes created with different aggregation designs, which will have different performance characteristics even though they have the same underlying data. In addition, we have the ability to keep data for various roles on both Symmetrix and CLARiiON storage, so that both storage systems can be tested with each server configuration (consolidated and distributed).

The cross-product of the above data roles, storage systems, and multiple server configurations means that the overall needs of Project REAL are much larger than a typical BI system with this amount of data. Any single configured scenario, however, is indicative of the type of configuration that would be in a typical deployment.

As is typical with storage systems of this class, the logical volumes are configured as RAID arrays or mirrored storage, depending on the role. This is done to improve reliability, and means that the sum of the logical volume sizes is less than the physical storage available. The available space provided by EMC is:

System	Physical Space
Symmetrix (DMX1000)	21 TB
CLARiiON (CX700)	24 TB

For a complete description of the storage configurations used in Project REAL, see the paper tentatively titled "Project REAL: Architecture of the Hardware Environment."

Network configurations

The back-end functions in a BI system — the ETL, relational storage, and Analysis Services — run on servers in a corporate domain, inside the firewalls that protect the company network from the outside world. The network connectivity for machines in these roles is simple: they typically have high-speed Ethernet connections.

Depending on the client connectivity desired for the BI application, the front-end systems — the ones that support end-user access — can be considerably more interesting. Intranet clients (those on the corporate network) have simple connectivity, but for extranet clients (those accessing the corporate network from outside the firewalls) care must be taken to ensure successful connectivity and adequate throughput. Internet clients (also outside the firewalls, but also not sharing the company security model) add the challenge of authenticating users and providing access to the data they have rights to use, but not more. To support experimenting with these configurations, a lab environment has been constructed to allow various servers to be placed in a simulated corporate LAN, in the perimeter network (also known as DMZ, demilitarized zone, or screened subnet, i.e., between two firewalls), or in a simulated Internet (outside the firewalls). It is expected that the Internet client will use a Web service to access the reports or for interactive analysis; the intranet and extranet clients may or may not use a Web service.

These three scenarios are illustrated in Figure 14. For simplicity, only the Internet client is shown connecting through a mid-tier Web server. Each of the configurations is implemented in the Project REAL lab. Using these three configurations we have the ability to confirm how to properly configure services, ports, routers, and security settings. For complete information on the client connectivity work in Project REAL, see the upcoming paper tentatively titled "Implementing Secure Web Applications on the SQL Server BI Platform."

Figure 14: Network Connectivity

Outcomes from Project REAL

This section outlines some of the key things we have done and will do to show the best practices or to understand tradeoffs or to measure performance. At the time of writing of this draft, some of these activities are either still underway or not yet started.

ETL

In Phase 1 of Project REAL, a complete migration of a customer's ETL processing from SQL Server 2000 DTS to SQL Server 2005 SSIS was performed. Based on this early experience with SSIS, some practices information was reported in "SQL Server 2005 Integration Services: Lessons from Project REAL" Also, custom components were developed to allow transaction logs (TLOGs) files from cash registers to be directly read into an Integration Services data flow. This experience, and some advice on doing similar custom data sources and transforms, is described in a paper tentatively titled "Developing Custom Components for SQL Server 2005 Integration Services".
In Phase 2 of Project REAL, the B&N SSIS packages have been migrated to the Project REAL environment. The logging feature has been enriched for auditing purposes, packages have been converted to use the deployment features, and partitioning support and the ability to process cubes have been added. This work is reported in "Project REAL ETL Design: Integration Services Lessons Learned" white paper. The performance of the system will be measured as incremental updates are performed. This will be done in both the distributed and consolidated configurations. The entire ETL system will be reviewed by the SSIS product team for best practices input.

Relational data warehouse

The B&N data has been migrated from SQL Server 2000 to SQL Server 2005. Large data masking operations were performed which placed significant stress on the relational database, although they are not at all typical of BI deployments. The data was converted from using multiple partition tables to using one partitioned table for each subject area.
The complexity of the IS packages and Analysis Services cubes that use these tables will be compared to the originals, and the performance of the system using each design will be measured. This partitioning work is reported in the "Project REAL Data Lifecycle – Partitioning" white paper.

Analysis Services cubes

Cubes have been defined based on the Project REAL schema and data set. Attribute hierarchies are used much more extensively than the current implementation at B&N. A many-to-many relationship was added, semi-additive measures are used, the partitioning scheme was implemented, aggregations were designed, and the entire data set was processed into MOLAP storage. These cubes were then browsed by interactive users using Excel, Proclarity and Panorama tools. The cube design and a number of lessons learned and best practices are described in the "Project REAL: Analysis Services Technical Drilldown" white paper.
A set of realistic business queries will be defined as templates for query performance work, and multiuser performance tests measuring response times for various numbers of users (hundreds to over 1,000) will be carried out, on both the distributed and consolidated configurations.
Based on Project REAL and other real-world experiences, a paper has been developed to assist SQL Server 2000 Analysis Services users to migrate to SQL Server 2005. See the "Migrating to SQL Server 2005 Analysis Services" white paper.

Analysis Services data mining

Data mining scenarios will be defined based on realistic scenarios with input from B&N. Appropriate algorithms will be chosen and data preparation steps will be performed and documented. The resulting models will be tested for their usefulness in meeting the needs of the given scenario. (For example, if the purpose of a model is to predict out-of-stock conditions, how often does it make a good prediction?)
In addition to using the models to discover patterns in the Project REAL data, we will explore options for using data mining in a production application, and create a representative example.
A paper will be published detailing the models developed, the data preparation steps, the example application, and best practices information learned.

Reporting Services

Typical reports will be developed for different user communities (store management, purchasing, inventory management, etc.). The report definitions will be stored in Reporting Services and rendered with various tools. The effects of retaining a substantial number of report snapshots will be explored.
A multiuser workload tool will be implemented to allow us to emulate large numbers of users requesting reports from the system. This will enable us to observe the effects of various configuration options, including:
Running reports against the relational database or against Analysis Services.
Running in the consolidated or distributed scenario.
Placing the report server inside or outside the firewall.
Various percentages of cached reports or snapshots.
We will explore using Report Builder with the Project REAL data, and describe the practices we find for using this new tool.
A paper will be published detailing the various environments created and best practices found.

Client connectivity

There are two key areas that will be addressed as part of the client connectivity work:

There are particular connectivity concerns when working with Internet scenarios. These have to do with how to configure firewalls and the services that must function together across firewalls. If, for example, an IIS server is placed in the perimeter network, how does it connect to the Analysis Services or Reporting Services servers to obtain data? Or, since Reporting Services is itself a Web service, it could be placed in the perimeter network, but then what ports need to be opened to allow it access to databases?
There are also questions of authentication and authorization which must be addressed to maintain a secure environment. Internet users cannot use the domain authentication mechanisms normally used with Windows — another authentication mechanism must be implemented.

The paper for client connectivity will discuss both of these areas, outlining alternatives for each.

Multiuser workload testing

As "BI for the masses" becomes more common, sites have greater needs to test new BI applications to make sure they will scale to large numbers of users. As part of Project REAL we will be developing tools to execute large multiuser workload tests. Clearly we will use these tools to validate the scalability of our own implementations, but in addition, we plan to make these tools available.
In all probability, there will be two workload tools — one for Analysis Services and one for Reporting Services. Although a single tool would be desirable, the needs of the services are different enough that two tools seems a better approach. These tools each will have a paper describing how to use the tool. Another paper will discuss how to perform successful tests of scalability.

Deployment techniques

During the development cycle of a BI application, the Business Intelligence (BI) Development Studio (based on the Visual Studio product) plays a major role. This brings the option to integrate with a source code control system, which was not really an option before SQL Server 2005. We will explore how to beneficially use these tools in a paper on deployment techniques.
The development of a BI application typically involves moving from a development environment to a test environment to a production environment. The ETL packages, relational data, cube designs, and report definitions must all move together. The paper on deployment techniques will also examine ways to accomplish a deployment when the services must move from stage to stage and system to system.

System management

In the section on client connectivity, we discussed the need for security regarding client access, but servers must be operated in a secure manner as well. We will explore security for the various services that are part of the BI system.
BI systems need to be backed up just like any other important system. With several services involved, there are some important questions we will try to address: What needs to be backed up? At what times? What mechanisms are effective with large volumes of data?
Many companies find that after some period of time (usually a number of years), the value of data is diminished to the point that it is not worth the expense of keeping. Managing the lifecycle of the data becomes important, especially in the largest systems. We will explore ways to accomplish this — either by removing old data or moving it to less expensive storage.
The various components of a BI system should be monitored for their health. We will implement a system based on Microsoft Operations Manager (MOM) that monitors the ongoing health of servers and their levels of resource utilization.
Backup, data lifecycle management, and health monitoring are likely to be covered in three different papers. Security will probably be included in other papers on the major components of the system.

Conclusion

Project REAL represents an extensive commitment to discover and develop best practices information that will make deployments easier for our customers. The results of this work will come to fruition over several months, so check back from time to time at https://www.microsoft.com/sql/bi/ProjectREAL to see what has been published.

This paper has presented a high-level overview of many aspects of the project. Be sure to read the other papers for more in-depth analysis of each area.

AI Skills Fest

Share via