Chapter 12 - Data Warehousing and Online Analytical Processing

A data warehouse is often used as the basis for a decision support system (also referred to as a business intelligence system). It is designed to overcome some of the problems encountered when an organization attempts to perform strategic analysis using the same database that is used to perform online transaction processing (OLTP).

A typical OLTP system is characterized by having large numbers of concurrent users actively adding and modifying data. The database represents the state of a particular business function at a specific point in time, such as an airline reservation system. However, the large volume of data maintained in many OLTP systems can overwhelm an organization. As databases grow larger with more complex data, response time can deteriorate quickly due to competition for available resources. A typical OLTP system has many users adding new data to the database while fewer users generate reports from the database. As the volume of data increases, reports take longer to generate.

As organizations collect increasing volumes of data by using OLTP database systems, the need to analyze data becomes more acute. Typically, OLTP systems are designed specifically to manage transaction processing and to minimize disk storage requirements by storing data in a series of related, normalized tables. However, when users need to analyze their data, a number of problems often prevents the data from being used effectively:

  • Users may not understand the complex relationships among the tables, and therefore cannot generate ad hoc queries. 

  • Application databases may be segmented across multiple servers, making it difficult for users to find the tables in the first place. 

  • Security restrictions may prevent users from accessing the detail data they need. 

  • Database administrators may prohibit ad hoc querying of OLTP systems to prevent analytical users from running queries that could slow the performance of mission-critical production databases. 

By copying an OLTP system to a reporting server on a regularly scheduled basis, an organization can improve response time for reports and queries. Yet a schema optimized for OLTP is often not flexible enough for decision support applications, largely due to the volume of data involved and the complexity of normalized relational tables.

For example, each regional sales manager in a company may wish to produce a monthly summary of the sales per region. Because the reporting server contains data at the same level of detail as the OLTP system, the entire month's data is summarized each time the report is generated. The result is longer-running queries that lower user satisfaction.

Additionally, many organizations store data in multiple heterogeneous database systems. Reporting is more difficult because data is not only stored in different places, but in different formats.

Data warehousing and online analytical processing (OLAP) provide solutions to these problems. Data warehousing is an approach to storing data in which heterogeneous data sources (typically from multiple OLTP databases) are migrated to a separate homogenous data store. Data warehouses provide these benefits to analytical users:

  • Data is organized to facilitate analytical queries rather than transaction processing. 

  • Differences among data structures across multiple heterogeneous databases can be resolved. 

  • Data transformation rules can be applied to validate and consolidate data when data is moved from the OLTP database into the data warehouse. 

  • Security and performance issues can be resolved without requiring changes in the production systems. 

Sometimes organizations maintain smaller, more topic-oriented data stores called data marts. In contrast to a data warehouse, which typically encapsulates all of an enterprise's analytical data, a data mart is typically a subset of the enterprise data targeted at a smaller set of users or business functions.

Whereas a data warehouse or data mart is the data store for analytical data, OLAP is the technology that enables client applications to access the data efficiently. OLAP provides these benefits to analytical users:

  • Pre-aggregation of frequently queried data, enabling a very fast response time to ad hoc queries. 

  • An intuitive multidimensional data model that makes it easy to select, navigate, and explore the data. 

  • A powerful tool for creating new views of data based upon a rich array of ad hoc calculation functions. 

  • Technology to manage security, client/server query management and data caching, and facilities to optimize system performance based upon user needs. 

The terms data warehousing and OLAP are sometimes used interchangeably. However, it is important to understand their differences because each represents a unique set of technologies, administrative issues, and user implications.

SQL Server Tools for Data Warehousing and OLAP

Microsoft SQL Server provides several tools for building data warehouses, data marts, and OLAP systems. Using DTS Designer, you can define the steps, workflow, and transformations necessary to build a data warehouse from a variety of data sources. After the data warehouse is built, you can use Microsoft SQL Server OLAP Services, which provides a robust OLAP server that can be used to analyze data stored in many different formats, including SQL Server and Oracle databases.

See Also 

In This Volume 

Data Transformation Services Data Warehousing Support

Installing OLAP Services

Server Improvements and New Features

OLAP Services Data Warehousing Support

In Other Volumes 

"Overview of Data Transformation Services" in Microsoft SQL Server Distributed Data Operations and Replication 

Data Warehouse Characteristics

A data warehouse can assist decision support and online analytical processing (OLAP) applications because it provides data that is:

  • Consolidated and consistent. 

  • Subject-oriented. 

  • Historical. 

  • Read-only. 

Consolidated and Consistent Data 

A data warehouse consolidates operational data from a variety of sources with consistent naming conventions, measurements, physical attributes, and semantics.

For example, in many organizations, applications can often use similar data in different formats: dates can be stored in Julian or Gregorian format; true/false data can be represented as one/zero, on/off, true/false, or positive/negative. Different applications can also use different terms to describe the same type of data. One application can use the term "balance" instead of "total amount" to represent the amount of money in a bank account.

Data should be stored in the data warehouse in a single, acceptable format agreed to by business analysts, despite variations in the external operational sources. This allows data from across the organization, such as legacy data on mainframes, data in spreadsheets, or even data from the Internet, to be consolidated in the data warehouse, and effectively cross-referenced, giving the analysts a better understanding of the business.

Subject-oriented Data 

Operational data sources across an organization tend to hold a large amount of data about a variety of business-related functions, such as customer records, product information, and so on. However, most of this information is also interspersed with data that has no relevance to business or executive reporting, and is organized in a way that makes querying the data awkward. The data warehouse organizes only the key business information from operational sources so that it is available for business analysis.

Historical Data 

Data in OLTP systems correctly represents the current value at any moment in time. For example, an order-entry application always shows the current value of stock inventory; it does not show the inventory at some time in the past. Querying the stock inventory a moment later may return a different response. However, data stored in a data warehouse is accurate as of some past point in time because the data stored represents historical information.

The data stored in a data warehouse typically represents data over a long period of time, perhaps ten years or more. OLTP systems often contain only current data, because maintaining the large volumes of data needed to represent ten years of information in an OLTP system can affect performance. In effect, the data warehouse stores snapshots of a business's operational data generated over a long period of time. Each snapshot is accurate for a specific moment in time and does not change. This contrasts with an OLTP system, where data is always current and can be updated when necessary.

Read-only Data 

After data has been moved to the data warehouse successfully, it typically does not change unless the data was incorrect in the first place. Because the data stored in a data warehouse represents a point in time, it should not be updated. Deletes, inserts, and updates (other than those involved in the data loading process) are not applicable in a data warehouse. After the data warehouse has been set up, the only operations that occur in it are loading and querying data.


Data Granularity

A significant difference between an OLTP or operational system and a data warehouse is the granularity of the data stored. An operational system typically stores data at the lowest level of granularity: the maximum level of detail. However, because the data warehouse contains data representing a long period in time, simply storing all detail data from an operational system can result in an overworked system that takes too long to query.

A data warehouse typically stores data at different levels of granularity, or summarization, depending on the data requirements of the business. If an enterprise needs data to assist strategic planning, then only highly summarized data is required. The lower the level of granularity of the data required by the enterprise, the greater the resources (particularly data storage) required to build the data warehouse. The different levels of summarization, in order of increasing granularity, are:

  • Current operational data 

  • Historical operational data 

  • Aggregated data 

  • Metadata 

Current and historical operational data are taken, unmodified, directly from operational systems. Historical data is operational level data no longer queried on a regular basis, and is often archived onto secondary storage.

Aggregated, or summary, data is a summarized version of the current operational data. The design of the data warehouse affects how the current data is aggregated. Considerations for generating summary data include the period of time used to aggregate the data (for example, weekly, monthly, and so on), and the parts of the operational data to be summarized. For example, an organization can choose to aggregate, at the part level, the quantity of parts sold per sales representative per week.

There may be several levels of summary data. It may be necessary to create summary level data based on an aggregated version of existing summary data. This can give an organization an even higher level view of the business. For example, an organization can choose to aggregate summary level data further by generating the quantity of parts sold per month.
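A minimal Transact-SQL sketch of this second-level aggregation follows; the Sales_Weekly and Sales_Monthly summary tables and their columns are hypothetical, and weeks are assigned to the month in which they end.

  -- Roll the weekly part-level summaries up to a monthly level.
  -- Assumes Sales_Weekly(PartID, RepID, WeekEnding, QuantitySold)
  -- and Sales_Monthly(PartID, SalesMonth, SalesYear, QuantitySold).
  INSERT INTO Sales_Monthly (PartID, SalesMonth, SalesYear, QuantitySold)
  SELECT PartID,
         MONTH(WeekEnding),
         YEAR(WeekEnding),
         SUM(QuantitySold)
  FROM Sales_Weekly
  GROUP BY PartID, MONTH(WeekEnding), YEAR(WeekEnding)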

Metadata does not contain any operational data, but is used to document the way the data warehouse is constructed. Metadata can describe the structure of the data warehouse, source of the data, rules used to summarize the data at each level, and any transformations of the data from the operational systems.

Data Marts

A data mart is typically defined as a subset of the contents of a data warehouse, stored within its own database. A data mart tends to contain data focused at the department level, or on a specific business area. The data can exist at both the detail and summary levels. The data mart can be populated with data taken directly from operational sources, similar to a data warehouse, or data taken from the data warehouse itself. Because the volume of data in a data mart is less than that in a data warehouse, query processing is often faster.

Characteristics of a data mart include:

  • Quicker and simpler implementation. 

  • Lower implementation cost. 

  • Focus on the needs of a specific business unit or function.

  • Protection of sensitive information stored elsewhere in the data warehouse. 

  • Faster response times due to lower volumes of data. 

  • Distribution of data marts to user organizations. 

  • Built from the bottom upward. 

Departmental or regional divisions often determine whether data marts or data warehouses are used. For example, if managers in different sales regions require data from only their region, then it can be beneficial to build data marts containing specific regional data. If regional managers require access to all the organization's data, then a larger data warehouse is usually necessary.

Although data marts are often designed to contain data relating to a specific business function, there can be times when users need a broader level of business data. However, because this broader-level data is often only needed in summarized form, it is acceptable to store it within each data mart rather than implementing a full data warehouse.

Building a Data Warehouse from Data Marts 

Data warehouses can be built using a top-down or bottom-up approach. Top-down describes the process of building a data warehouse for the entire organization, containing data from multiple, heterogeneous, operational sources. The bottom-up approach describes the process of building data marts for departments, or specific business areas, and then joining them to provide the data for the entire organization. Building a data warehouse from the bottom up, by implementing data marts first, is often simpler because it is less ambitious.

A common approach to using data marts and data warehouses involves storing all detail data within the data warehouse, and summarized versions within data marts. Each data mart contains summarized data per functional split within the business, such as sales region or product group, further reducing the data volume per data mart.

[Figure: detail data stored in the data warehouse, with summarized data marts for each business function]

Data Mart Considerations 

Data marts can be useful additions or alternatives to the data warehouse, but issues to consider before implementation include:

  • Additional hardware and software. 

  • Time required to populate each data mart regularly. 

  • Consistency with other data marts and the data warehouse. 

  • Network access (if each data mart is located in a different geographical region). 

Designing and Building a Data Warehouse and OLAP System

The steps required to build a data warehouse include:

  • Determining business, user, and technical requirements. 

  • Designing and building the database. 

  • Extracting and loading data into the data warehouse. 

  • Designing and processing aggregations using OLAP tools. 

  • Querying and maintaining the data warehouse and OLAP databases. 

Determining Business, User, and Technical Requirements

Before a data warehouse can be built, a detailed project and implementation plan should be written. The project and implementation plan includes:

  • Building a business case. 

  • Gathering user requirements. 

  • Determining the technical requirements. 

  • Defining standard reports required by users. 

  • Analyzing client application tools being used. 

Building the business case is common at the beginning of any project. It involves determining the business needs solved by the project, the costs of the project, and the return on the investment.

Gathering user requirements largely involves interviewing the intended users of the data warehouse. The user requirements determine:

  • Data requirements (level of granularity). 

  • Operational systems within the enterprise containing the data. 

  • Business rules followed by the data. 

  • Queries required to provide the users with data. 

The technical requirements may involve determining:

  • Hardware architecture and infrastructure (for example, links to remote geographical regions where data marts might be located). 

  • Backup and recovery mechanisms. 

  • Security guidelines. 

  • Methods of loading and transforming data from operational systems to the data warehouse. 

Standard reports required by users should be analyzed to determine the tables, columns, and selection criteria necessary to create the reports, and the frequency with which they are generated. Provisions should also be made for expanding or modifying the scope of reports as required.

Client application tools should be analyzed to determine if they can provide enhanced processing capabilities that help in processing data, performing queries, or generating reports.

Designing and Building the Database

Designing and building the database is a critical part of building a successful data warehouse. This step is often performed by experienced database designers because it can involve taking data from multiple (sometimes heterogeneous) sources and combining it into a single, logical model.

Unlike OLTP systems that store data in a highly normalized fashion, the data in the data warehouse is stored in a very denormalized manner to improve query performance. Data warehouses often use star and snowflake schemas to provide the fastest possible response times to complex queries, and the basis for aggregations managed by OLAP tools.

The components of schema design are dimensions, keys, and fact and dimension tables.

Fact tables

Contain data that describes a specific event within a business, such as a bank transaction or product sale. Alternatively, fact tables can contain data aggregations, such as sales per month per region. Except in cases such as product or territory realignments, existing data within a fact table is not updated; new data is simply added.

Because fact tables contain the vast majority of the data stored in a data warehouse, it is important that the table structure be correct before data is loaded. Expensive table restructuring can be necessary if data required by decision support queries is missing or incorrect.

The characteristics of fact tables are:

  • Many rows; possibly billions. 

  • Primarily numeric data; rarely character data. 

  • Multiple foreign keys (into dimension tables). 

  • Static data. 

Dimension tables

Contain data used to reference the data stored in the fact table, such as product descriptions, customer names and addresses, and suppliers. Separating this verbose (typically character) information from specific events, such as the value of a sale at one point in time, makes it possible to optimize queries against the database by reducing the amount of data to be scanned in the fact table.

Dimension tables do not contain as many rows as fact tables, and dimensional data is subject to change, as when a customer's address or telephone number changes. Dimension tables are structured to permit change.

The characteristics of dimension tables are:

  • Fewer rows than fact tables; possibly hundreds to thousands. 

  • Primarily character data. 

  • Multiple columns that are used to manage dimension hierarchies. 

  • One primary key (dimensional key). 

  • Updatable data. 

Dimensions

Are categories of information that organize the warehouse data, such as time, geography, organization, and so on. Dimensions are usually hierarchical, in that one member may be a child of another member. For example, a geography dimension may contain data by country/region, state, and city. A city member is a child of a state member, which is itself a child of a country/region member. Thus, the dimension comprises three hierarchical levels: all countries, all states, and all cities in the dimension table. To support this, the dimension table should include the relationship of each member to the higher levels in the hierarchy.

Dimensional keys

Are unique identifiers used to query data stored in the central fact table. The dimensional key, like a primary key, links a row in the fact table with one dimension table. This structure makes it easy to construct complex queries and support drill-down analysis in decision support applications. An optimal data warehouse database contains long, narrow fact tables and small, wide dimension tables.
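As a concrete illustration, a minimal Transact-SQL sketch of these components follows; all table and column names are hypothetical.

  -- Two small, wide dimension tables, each with a single dimensional key.
  CREATE TABLE Product_Dim (
      ProductKey      int         NOT NULL PRIMARY KEY,  -- dimensional key
      ProductName     varchar(40) NOT NULL,
      Category        varchar(25) NOT NULL               -- hierarchy level
  )

  CREATE TABLE Region_Dim (
      RegionKey       int         NOT NULL PRIMARY KEY,  -- dimensional key
      City            varchar(30) NOT NULL,
      StateOrProvince varchar(30) NOT NULL,
      Country         varchar(30) NOT NULL               -- hierarchy levels
  )

  -- One long, narrow fact table: dimensional keys plus numeric measures.
  CREATE TABLE Sales_Fact (
      ProductKey int      NOT NULL REFERENCES Product_Dim (ProductKey),
      RegionKey  int      NOT NULL REFERENCES Region_Dim (RegionKey),
      Quantity   smallint NOT NULL,
      UnitPrice  money    NOT NULL
  )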

Star Schema 

The most popular design technique used to implement a data warehouse is the star schema. The star schema structure takes advantage of typical decision support queries by using one central fact table for the subject area, and many dimension tables containing denormalized descriptions of the facts. After the fact table is created, OLAP tools can be used to preaggregate commonly accessed information.

The star schema design helps to increase query performance by reducing the volume of data that is read from disk. Queries analyze data in the smaller dimension tables to obtain the dimension keys that index into the central fact table, reducing the number of rows to be scanned.

[Figure: star schema, with a central fact table and denormalized dimension tables]
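Continuing the hypothetical tables sketched earlier, a typical decision support query filters the small dimension tables first and uses the resulting dimensional keys to scan only the matching fact rows:

  -- Total sales of one product category per country.
  SELECT r.Country,
         SUM(f.Quantity * f.UnitPrice) AS TotalSales
  FROM Sales_Fact f
       JOIN Product_Dim p ON f.ProductKey = p.ProductKey
       JOIN Region_Dim r ON f.RegionKey = r.RegionKey
  WHERE p.Category = 'Bicycles'
  GROUP BY r.Country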

Snowflake Schema 

The snowflake schema is a variation of the star schema in which dimension tables are stored in a more normalized form. Normalizing the dimension tables reduces redundant storage and can improve performance for some queries by reducing the number of disk reads, although other queries may require additional joins.

[Figure: snowflake schema, with normalized dimension tables]

Creating a Database Schema

The database schema should support the analytical requirements of the business rather than the transaction-oriented requirements of an OLTP database design. For example, given the following database schema from an order entry system:

[Figure: order entry OLTP database schema]

The steps involved in converting this OLTP schema into a star schema include:

  • Determining the fact and dimension tables. 

  • Designing the fact tables. 

  • Designing the dimension tables. 

Determining Fact and Dimension Tables

It is important to determine correctly what existing tables and data in the operational systems should comprise the fact and dimension tables. If these are not correctly identified, then the data warehouse can suffer from poor performance, or may have to be redesigned at a later stage. Redesigning an operational data warehouse, possibly containing large amounts of data, can be a prohibitively expensive task.

Although the process of determining the structure and composition of fact and dimension tables can be difficult, especially when there are multiple (heterogeneous) operational systems to consider, the two most important steps to follow are identifying the:

  • Fundamental business transactions on which the data warehouse will focus (fact tables). 

  • Data associated with the business transactions that determine how business data will be analyzed (dimension tables and hierarchies). 

Identify Fundamental Business Transactions 

The first step involves identifying the transactions that describe the basic operations of the business which the data warehouse will be used to analyze. For example, using the sample order entry system described earlier, the underlying business transaction is a sale of a product. Identifying the fundamental business transactions yields the information that will be represented in the fact tables. The information needed to describe the sale of a product is largely found in the Order_Details table.

[Figure: the Order_Details table in the order entry schema]

When analyzing existing operational systems for potential fact tables, always look for the underlying business processes involved. Many operational systems are designed based on necessity rather than an accurate business model. For example, a school database may record only the grade per student for all subjects for a year because it does not have enough disk space to store the data at a lower level of detail.

In this instance, a data warehouse used to store student data from all schools in a region should be designed to capture this summarized data as well as a lower level of detail when the schools are able to provide the information in the future. For example, the fact table might store details regarding the grades for each subject, per student, per school, per region, per date period.

Identify the Dimension Tables 

The next step involves identifying the entities that describe how the fact data will be analyzed. For example, given that the order entry system fundamental transaction is the sale of a product, dimension data from the operational schema could include payment method, product name, date of sale, or shipping method. However, the dimension data chosen should represent the focus of the business analysis. As an example, the business analysis performed on the order entry data warehouse will include variations of:

  • Sales of a specific product per region. 

  • Sales of a specific product per time period (for example, a quarter). 

  • All sales per region. 

  • All sales per time period. 

Therefore, the dimension tables will include product data, region data, and time period data. In this example, payment or shipping methods were not required because the business will not use the data warehouse to analyze that data.

From the original order entry OLTP schema, all the fact and dimension data for the data warehouse can be found in the Customers, Orders, Products, and Order_Details tables.

[Figure: the Customers, Orders, Products, and Order_Details tables in the order entry schema]

Designing Fact Tables

The primary goal when designing fact tables is to minimize their size without compromising the data requirements. Fact tables are the largest tables in the database because they contain detail-level data representing the underlying business transactions. However, the costs of storing and maintaining these large tables should be considered. For example, larger tables require more online, and potentially offline, storage; take longer to back up and restore in the event of a system failure; and take longer to query when building OLAP aggregations.

The easiest ways to reduce the size of fact tables include:

  • Reducing the number of columns. 

  • Reducing the size of each column where possible. 

  • Archiving historical data into separate fact tables. 

Reducing the Number of Columns 

Remove any columns that are not required to analyze the operations of the business. For example, if the data does not represent a business transaction, or if the data can be derived using aggregates, remove the data from the fact table. Although aggregated columns often improve query performance, the size of a typical fact table can prohibit using them. For example, if the Order_Details fact table contains one billion rows, and a column, Total_Price, is added representing Quantity multiplied by UnitPrice, one billion new values now exist permanently in the table.
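Instead, the derived value can be computed when it is needed. A sketch, using the Order_Details columns named above:

  -- Derive the extended price at query time rather than storing
  -- one billion precomputed Total_Price values in the fact table.
  SELECT OrderDetailID,
         Quantity * UnitPrice AS Total_Price
  FROM Order_Details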

Important If a column is moved from a fact table to another table, and is referenced frequently in queries involving data from the fact table, large join operations may be required. These joins can affect query performance. Therefore, the trade-off between reducing storage costs and affecting query performance should be determined.

Although Order_Details forms the basis of the fact table, the OrderID column is not required in the final fact table because OrderDetailID is the unique identifier for the business transaction: a sale of a product. In fact, OrderID does not represent a single business transaction; it represents the sale of one or many products to a single customer, and so cannot be used.

Reducing the Size of Each Column 

Because fact tables tend to have a large number of rows, even one redundant byte per row can add up to a large amount of wasted database space. For example, a fact table containing one billion rows, with one unused byte in one of the columns, represents almost 1 GB of unused database space. To reduce column widths (a sketch follows this list):

  • Ensure that all character and binary data is variable length. 

  • Use data types that require fewer bytes of storage where possible. For example, if a column contains integer values only in the range from 1 through 100, use tinyint rather than int, saving 3 bytes per row for that column. 
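A sketch applying both guidelines to a hypothetical fact table definition:

  CREATE TABLE Order_Fact (
      OrderDetailID int         NOT NULL,
      ProductKey    int         NOT NULL,
      Quantity      tinyint     NOT NULL,  -- values 1 through 100 only; saves 3 bytes per row vs. int
      PromoCode     varchar(10) NULL       -- variable length instead of a fixed char(10)
  )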

Archiving Historical Data 

If data within fact tables is rarely used, such as sales data from several years ago, it may be useful to archive the data. This approach reduces the volume of data in the fact table, hence increasing the performance of queries. Exceptional queries, on older data, can be run against multiple fact tables without affecting the majority of users querying the fact tables containing recent data. When Microsoft SQL Server OLAP Services is used in conjunction with multiple fact tables, the OLAP Services engine manages queries against multiple back end fact tables. This simplifies the management and use of multiple fact tables containing historical data.

Designing Dimension Tables

The primary goal in designing dimension tables is to denormalize the data that references the fact tables into single tables. The most commonly used dimension data should reference the fact tables directly, rather than indirectly through other tables. This approach minimizes the number of table joins, and speeds up performance. For example, the order entry star schema should support the business queries:

  • Sales of a specific product per region 

  • All sales per region 

Currently, the dimension data describing a region (City, StateOrProvince, and Country) is part of the Customers table. However, Customers is related to Order_Details (the fact data) only indirectly, through Orders.

[Figure: region columns in the Customers table, related to Order_Details through Orders]

To better support the business analysis required, the region data should be placed into a new table, Region, directly referencing Order_Details. To implement this, a foreign key from the Region dimension table is added to Order_Details (now renamed to Sales). Any queries involving sales per region now require only a two-table join between the Region dimension table, and the Sales fact table.

Note The existing relationship between the Sales fact table and the Products dimension data is unchanged.

[Figure: star schema with the Region and Products dimension tables referencing the Sales fact table]
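With the Region dimension in place, sales-per-region questions reduce to the two-table join described above. A sketch, assuming a RegionKey dimensional key and Quantity and UnitPrice measures in the Sales fact table:

  -- All sales per region: a single join between dimension and fact.
  SELECT r.Country,
         r.StateOrProvince,
         SUM(s.Quantity * s.UnitPrice) AS TotalSales
  FROM Sales s
       JOIN Region r ON s.RegionKey = r.RegionKey
  GROUP BY r.Country, r.StateOrProvince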

Date and Time Information 

Date information is a common requirement in a data warehouse. To minimize the fact table width, a foreign key is often created in the fact table referencing a dimension table containing a representation of the date and/or time. The representation of the date depends on business analysis requirements.

For example, the business analysis to be performed on the order entry system requires product sales summarized by month, quarter, and year. The date information should be stored in a form that represents these increments. This is achieved by creating a foreign key in the fact table referencing a date dimension table (named Period) containing the date of the sale in a month, quarter, year format. To increase the flexibility of this data, additional dimension tables are created, referenced by the Period dimension table, that contain months, quarters, and years in more detail. When designing dimension tables for use with Microsoft SQL Server OLAP Services, only a date is needed. The OLAP Services Time Wizard enables dates to be summarized into any combination of weeks, months, quarters, and years.

[Figure: star schema with the Period dimension table and related date tables]
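Assuming the Period table carries a PeriodKey dimensional key and hypothetical SalesMonth, SalesQuarter, and SalesYear columns, a quarterly sales summary becomes a simple join:

  -- Product sales summarized by quarter.
  SELECT p.SalesYear,
         p.SalesQuarter,
         SUM(s.Quantity * s.UnitPrice) AS TotalSales
  FROM Sales s
       JOIN Period p ON s.PeriodKey = p.PeriodKey
  GROUP BY p.SalesYear, p.SalesQuarter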

Implementing the Database Design

After the fact and dimension tables have been designed, the final step is to physically implement the database in Microsoft SQL Server.

Creating the Database 

When creating the database, consider the partitioning strategy, if any, that may be used. SQL Server offers filegroups that can be used to stripe data, in addition to the disk striping available with Microsoft Windows NT and hardware-based implementations.

Creating the Tables 

When creating the tables used to store the fact and dimension data, consider creating the tables across the partitions available to the database, based on usage. For example, create separate fact tables containing data segmented by year or division on separate partitions (such as SQL Server filegroups) to improve read performance.
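A sketch of year-segmented fact tables placed on separate filegroups; the filegroup names, and the assumption that they have already been added to the database, are hypothetical:

  -- Assumes filegroups FG1998 and FG1999 exist in the database.
  CREATE TABLE Sales_1998 (
      ProductKey int   NOT NULL,
      RegionKey  int   NOT NULL,
      PeriodKey  int   NOT NULL,
      Quantity   int   NOT NULL,
      UnitPrice  money NOT NULL
  ) ON FG1998

  CREATE TABLE Sales_1999 (
      ProductKey int   NOT NULL,
      RegionKey  int   NOT NULL,
      PeriodKey  int   NOT NULL,
      Quantity   int   NOT NULL,
      UnitPrice  money NOT NULL
  ) ON FG1999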

Creating Any User-defined Views 

Create user-defined views if necessary. SQL Server views can be used to logically merge horizontally partitioned tables, to serve as interfaces to predefined queries, or to act as a security mechanism.
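Continuing the year-segmented fact tables sketched above, a view can merge the partitions into one logical fact table:

  -- Present the year-segmented fact tables as a single logical table.
  CREATE VIEW Sales_All AS
  SELECT ProductKey, RegionKey, PeriodKey, Quantity, UnitPrice FROM Sales_1998
  UNION ALL
  SELECT ProductKey, RegionKey, PeriodKey, Quantity, UnitPrice FROM Sales_1999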

Creating Indexes 

Indexes should be created to maximize performance; a sketch follows this list. Consider creating indexes on:

  • Key columns. 

  • Columns involved in joins. 

  • Multiple columns, to take advantage of index coverage. 

  • All dimension table keys used by the fact table. 
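A minimal indexing sketch for the Sales fact table used in this chapter; the key column names are assumptions:

  -- Index every dimensional key used by the fact table, plus a
  -- composite index to cover a common join-and-group query.
  CREATE INDEX IX_Sales_Product ON Sales (ProductKey)
  CREATE INDEX IX_Sales_Region ON Sales (RegionKey)
  CREATE INDEX IX_Sales_Period ON Sales (PeriodKey)
  CREATE INDEX IX_Sales_Region_Product ON Sales (RegionKey, ProductKey)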

See Also 

In Other Volumes 

"CREATE VIEW" in Microsoft SQL Server Transact-SQL and Utilities Reference 

"Overview of Creating and Maintaining Databases" in Microsoft SQL Server Database Developer's Companion 

"Indexes" in Microsoft SQL Server Database Developer's Companion 

"Physical Database Design" in Microsoft SQL Server Diagnostics 

Extracting and Loading Data

Extracting and loading data from operational systems to a data warehouse varies in complexity. The process can be simple if there is a direct correlation between the source data and the data that should appear in the data warehouse: for example, if all the source data from a single operational system is in the correct format, and does not have to be modified in any way. The process can also be complex: for example, if source data resides in multiple, heterogeneous operational systems, and requires significant formatting and modification before loading.

The extraction and load process involves:

  • Validating data in the operational systems. 

  • Migrating data from the operational systems. 

  • Scrubbing data. 

  • Transforming data to the data warehouse. 

Validating Data

Before data is extracted from the operational systems, it may be necessary to ensure that the data is completely valid. If the data is not valid, the integrity of the business analysis relying on the data may be compromised. For example, a value representing a monetary transfer between banks in different countries must be in the correct currency.

Data should be validated at the source by business analysts who understand what the data represents. Any changes should be made in the operational systems, rather than the data warehouse, because the source data is incorrect regardless of where it is located.

Validating data can be a time-consuming process. The validation process can be automated by writing stored procedures that check the data for domain integrity. However, it may be necessary to validate data manually. If any invalid data is discovered, determine where the fault originated and correct any processes contributing to the error.

For example, the data in the order entry system should be validated, to ensure that:

  • Region information (City, State, Country) represents a valid city, state, country/region combination. 

  • Product information (ProductID, ProductName, UnitPrice) represents valid products. 

This information can be validated using the Data Transformation Services import and export wizards. A Microsoft ActiveX script, executed by the DTS Import and DTS Export wizards when copying data from the source to the destination, can determine if the region and product information is valid. Any invalid data can be saved to the exception log for later examination by business analysts, to determine why it is incorrect.
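The same checks can also be expressed as plain Transact-SQL and run against the source tables before migration, along the lines of the stored procedure approach described earlier. In this sketch, the ValidRegions reference table and the CustomerID column are assumptions:

  -- Flag order detail rows whose product does not exist in Products.
  SELECT od.OrderDetailID, od.ProductID
  FROM Order_Details od
  WHERE NOT EXISTS (SELECT 1 FROM Products p
                    WHERE p.ProductID = od.ProductID)

  -- Flag customers whose city/state/country combination is unrecognized.
  SELECT c.CustomerID, c.City, c.StateOrProvince, c.Country
  FROM Customers c
  WHERE NOT EXISTS (SELECT 1 FROM ValidRegions v
                    WHERE v.City = c.City
                      AND v.StateOrProvince = c.StateOrProvince
                      AND v.Country = c.Country)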

See Also 

In Other Volumes 

"Data Transformation Services Import and Export Wizards" in Microsoft SQL Server Distributed Data Operations and Replication 

"Column Mappings" in Microsoft SQL Server Distributed Data Operations and Replication 

Migrating Data

Migrating data from operational systems typically involves copying the data to an intermediate database before it is finally copied to the data warehouse. Copying data to an intermediate database is necessary if the data must be scrubbed.

Ideally, copying data should occur during a period of low activity on the operational system; otherwise, system performance may degrade, affecting its users. Additionally, if the data warehouse is composed of data from multiple interrelated operational systems, it is important to ensure that data migration occurs when the systems are synchronized. If the operational systems are not synchronized, the data in the warehouse can produce unexpected results when queried.

The DTS Import and DTS Export wizards can be used to create a series of tasks that copy data from many heterogeneous operational systems to an intermediate database running on Microsoft SQL Server. Alternatively, you can use a Microsoft ActiveX script with DTS to scrub the data, and then copy it directly to the SQL Server data warehouse, avoiding the need for an intermediate database.

See Also 

In Other Volumes 

"Data Transformation Services Import and Export Wizards" in Microsoft SQL Server Distributed Data Operations and Replication 

Scrubbing Data

Scrubbing data involves making the data consistent. It is possible that the same data is represented in different ways in multiple operational systems. For example, a product name might be abbreviated in one operational system, but not in another. If the two values were not made consistent, any queries using the data likely would evaluate the values as different products. If the detail data in the data warehouse is to produce consistent information, the product name must be made consistent for all values.

Data scrubbing can be achieved:

  • Using the DTS Import and DTS Export wizards to modify data as it is copied from the operational systems to the intermediate database, or directly to the data warehouse. 

  • By writing a Microsoft ActiveX script, executed by a program using the DTS API, to connect to the data source and scrub the data. Any data manipulation that can be achieved using ActiveX scripting, or a programming language such as Microsoft Visual C++, can be performed on the data. 

  • Using a DTS Lookup, which provides the ability to perform queries using one or more named, parameterized query strings that allow a custom transformation to retrieve data from locations other than the immediate source or destination row being transformed.

For example, the data in the order entry system should be scrubbed, such as the values for:

  • State (must always be a two-character value, such as WA). 

  • ProductName (must always be the full product name, with no abbreviations). 

Using the DTS Import and DTS Export wizards, an ActiveX script can be executed during the copy process that checks the State value and changes it to a known two-character value. Alternatively, the ProductName value could be scrubbed by writing a Visual C++ program that calls the DTS API to execute Microsoft JScript scripts and other executable modules.
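Expressed as Transact-SQL against an intermediate (staging) database, the scrubbing rules might look like the following sketch; the StateCodes and ProductNameMap lookup tables are hypothetical:

  -- Normalize State to the standard two-character abbreviation.
  UPDATE c
  SET State = sc.Abbreviation
  FROM Customers c
       JOIN StateCodes sc ON c.State = sc.FullName
  WHERE LEN(c.State) > 2

  -- Expand abbreviated product names to the full product name.
  UPDATE p
  SET ProductName = pm.FullName
  FROM Products p
       JOIN ProductNameMap pm ON p.ProductName = pm.Abbreviation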

See Also 

In Other Volumes 

"Column Mappings" in Microsoft SQL Server Distributed Data Operations and Replication 

"Data Transformation Services Import and Export Wizards" in Microsoft SQL Server Distributed Data Operations and Replication 

"DTS Lookup" in Microsoft SQL Server Distributed Data Operations and Replication 

"Programming DTS Applications" in Microsoft SQL Server Building Applications 

Transforming Data

During the data migration step, it is often necessary to transform operational data into a separate format appropriate to the data warehouse design. Transformation examples include:

  • Changing all alphabetic characters to uppercase. 

  • Calculating new values based on existing data, including data aggregation and summarization. 

  • Breaking up a single data value into multiple values, such as a product code in nnnn-description format into separate code and description values, or a date value in MMDDYY format into separate month, day, and year values. 

  • Merging separate data values into a single value, such as concatenating a first name value with a surname value. 

  • Mapping data from one representation to another, such as converting data values (1, 2, 3, 4) to (I, II, III, IV). 

Data transformation also involves formatting and modifying extracted data from operational systems into merged or derived values that are more useful in the data warehouse. For example, copying the OrderDate value from the order entry system to the data warehouse star schema involves splitting the date into Month, Quarter, and Year components. These date components are required for the type of business analysis performed on the data warehouse.
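A sketch of this date split in Transact-SQL, assuming the Period table described earlier with hypothetical PeriodKey, SalesMonth, SalesQuarter, and SalesYear columns:

  -- Split OrderDate into month, quarter, and year components while
  -- loading the Period dimension table; yyyymmdd serves as the key.
  INSERT INTO Period (PeriodKey, SalesMonth, SalesQuarter, SalesYear)
  SELECT DISTINCT
         CONVERT(int, CONVERT(char(8), OrderDate, 112)),
         DATEPART(month, OrderDate),
         DATEPART(quarter, OrderDate),
         DATEPART(year, OrderDate)
  FROM Orders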

The transformation process usually takes place during the migration process, when data is copied either directly from the operational sources or from an intermediate database after the data has been scrubbed. For complex data migrations, DTS provides skip return values to assist in splitting data into multiple tables.

Data transformation and migration can be completed in a single step using the DTS Import and DTS Export wizards. Transforming and migrating data from the order entry OLTP operational system schema to the data warehouse star schema involves using the DTS Import and DTS Export wizards to:

  • Create a query to extract all the required detail level (fact) data. 

  • Split OrderDate in the Orders table into Month, Quarter, and Year components and add to Period using a Microsoft ActiveX script. 

  • Extract the City, StateOrProvince, and Country data relating to the detail data and add to Region using an ActiveX script. 

  • Perform a simple table copy of Products. 

  • Create a query to generate the data for Summary. 

Each step, for example, can be built as a separate package, which is stored in the Microsoft SQL Server msdb database, and scheduled to be executed every Friday night at midnight.

In addition to performing insert-based transformations of data, DTS provides data-driven-queries, in which data is read from the source and transformed, and a parameterized query is executed at the destination, using the transformed values in the destination row.

Note When using DTS to create fact tables for use with Microsoft SQL Server OLAP Services, do not create any aggregations while migrating the data. OLAP Services is specifically designed to create the optimal aggregations after the data warehouse has been populated with DTS. It is also unnecessary to segment a date into week, month, quarter, or year columns in the Time dimension table. The OLAP Services Time Wizard provides an automated facility for this type of time transformation.

See Also 

In Other Volumes 

"Column Mappings" in Microsoft SQL Server Distributed Data Operations and Replication 

"Data-Driven Queries" in Microsoft SQL Server Distributed Data Operations and Replication 

"Data Transformation Services Import and Export Wizards" in Microsoft SQL Server Distributed Data Operations and Replication 

"Understanding Data Transformation Services" in Microsoft SQL Server Building Applications 

Designing and Processing Aggregations

OLAP tools are typically used to create and manage summary data. Microsoft SQL Server OLAP Services allows aggregations to be stored in a variety of formats and locations, with dynamic connections to underlying details in the data warehouse. Summary data is often generated to satisfy the commonly executed queries in the data warehouse. Storing preaggregated data increases query performance, and reduces the load on the data warehouse.

If a data warehouse is built so the data in it does not change, then preaggregating data in the fact table saves only the disk space required by the fact table. The processing time that would have been spent preaggregating the fact table is instead used by OLAP Services when it processes the fact table to build a cube. However, precalculated aggregations are stored in the cube and do not need to be recalculated for each query. If a hybrid OLAP (HOLAP) or relational OLAP (ROLAP) cube is used, the fact table is not copied into the cube as it is in multidimensional OLAP (MOLAP) cubes, so the overhead required to retain availability of the detail data is only the fact table size, not processing time or query response time.

Preaggregation strategy when designing a data warehouse for use by OLAP Services depends on the following variables:

  • Stability of the data.

    If the source data changes, the preaggregations must be performed again each time, whether they are stored in the fact table or in OLAP cubes that must be rebuilt from the fact table. 

  • Query response time.

    With properly designed OLAP cubes, the granularity of detail in the fact table has no effect on query response time for queries that do not access detail facts. 

  • Storage requirements.

    A finer level of granularity in the fact table requires more storage for the fact table and for MOLAP cubes. This is a trade-off against detail availability and choice of OLAP cube storage mode. OLAP cubes tend to be large regardless of the storage type; therefore the storage required to retain fine granularity in the fact table may not be particularly significant when compared to OLAP storage needs. 

When designing the data warehouse for OLAP, the user's needs should drive the preaggregation strategy. The fact table should only be preaggregated to the level of granularity below which no user would want to access detail.

For more information, see your OLAP Services documentation.

Maintaining the Data Warehouse

Maintenance of the data warehouse is an ongoing task that should be designed before the data warehouse is made available to users. Maintenance involves:

  • Implementing a backup and recovery mechanism to protect the data in the event of a system failure, or some other problem. 

  • Archiving the database. This may be necessary to purge the database of unused historical data, and free up space. 

  • Running SQL Server Profiler to determine which indexes to create to enhance query performance. 

See Also 

In Other Volumes 

"Monitoring with SQL Server Profiler" in Microsoft SQL Server Administrator's Companion 

Data Transformation Services Data Warehousing Support

Using Data Transformation Services (DTS), you can import and export data between multiple heterogeneous sources using an OLE DB-based architecture, and transfer databases and database objects (for example, indexes and stored procedures) between computers running Microsoft SQL Server version 7.0. You can also use the data transformation capabilities of DTS to build a data warehouse from an online transaction processing (OLTP) system. You can build data warehouses and data marts in SQL Server by importing and transferring data from multiple heterogeneous sources interactively or automatically on a regularly scheduled basis.

DTS components include the DTS Import Wizard, DTS Export Wizard, and DTS Designer, which are available through SQL Server Enterprise Manager. DTS also includes COM programming interfaces you can use to create customized import, export, and transformation applications.

A transformation is the set of operations applied to source data before it is stored at the destination during the process of building a data warehouse. For example, the DTS transformation capability allows calculating new values from one or more source columns, or even breaking a single column into multiple values to be stored in separate destination columns. Transformations, therefore, make it easy to implement complex data validation, scrubbing, and enhancement during import and export.

Data Transformation Services (DTS) allows you to import, export, or transform data in a process that can be saved as a package. Each package defines a workflow that includes one or more tasks executed in a coordinated sequence as steps. Tasks can copy data from a source to a destination, transform data using a Microsoft ActiveX script, execute an SQL statement at a server, or even run an external program. Tasks can also transfer database objects between computers running SQL Server 7.0.

The DTS package can be created manually by using a language that supports OLE Automation, such as Microsoft Visual Basic, or interactively by using the Data Transformation Services wizards or DTS Designer. After a DTS package has been created and saved, it is completely self-contained and can be retrieved and run using SQL Server Enterprise Manager or the dtsrun utility.

DTS packages can be stored in the Microsoft Repository, providing the ability to record data lineage. This allows you to determine the source of any piece of data and the transformations applied to that data. Data lineage can be tracked at the package level and at the row level of a table, providing a complete audit trail of data transformation and DTS package execution information in your data warehouse.

DTS Designer is a graphical design environment for creating and executing complex sets of data transformations and workflows, in preparation for moving data to a data warehouse. Experienced users can use DTS Designer to integrate, consolidate, and transform heterogeneous data from multiple sources. Packages created can be stored in a SQL Server msdb database, the Repository, or a COM-structured storage file.

The visual objects used by DTS Designer are based on the DTS object model, an API that includes objects, properties, methods, and collections designed for programs that copy and transform data from an OLE DB data source to an OLE DB destination. This object model can be accessed through ActiveX scripts from within DTS Designer, and through external programs written in languages such as Visual Basic and Microsoft Visual C++. You can also access custom programs through DTS Designer, and include their tasks and icons as part of the package. Because DTS Designer accesses an underlying programming model, it does most of the programming work for you.

See Also 

In Other Volumes 

"DTS Designer" in Microsoft SQL Server Distributed Data Operations and Replication 

"Overview of Data Transformation Services" in Microsoft SQL Server Distributed Data Operations and Replication 

OLAP Services Data Warehousing Support

Microsoft SQL Server OLAP Services provides online analytical processing (OLAP) services to applications.

OLAP focuses on finding trends in aggregated or summarized data. The main objects used by OLAP programs are multidimensional cubes. A multidimensional cube records a set of data derived from fact tables and dimensions. A fact table records data about a set of transactions. Measures are numeric columns from the fact table that are of interest to the user. For example, a car sales fact table could provide measures such as sale price, invoice cost, tax paid, and discount. A cube represents how these measures vary over multiple dimensions, such as by car dealer, by location, by customer, or by date.

OLAP Services provides the capability of designing, creating, and managing cubes from a data warehouse, and then making them available to client applications written using either the OLE DB 2.0 OLAP extensions or the Microsoft ActiveX Data Objects 2.0 Multidimensional Objects (ADO MD).

[Figure: OLAP Services architecture]

The OLAP server performs multidimensional queries of data and stores the results in its multidimensional storage. It speeds the analysis of fact tables into cubes, stores the cubes until needed, and then quickly returns the data to clients.

The OLAP server is managed through an API called the Microsoft Decision Support Objects (DSO). OLAP Services provides a snap-in for the Microsoft Management Console (MMC). This MMC snap-in uses DSO to provide administrators with an easy-to-use graphical interface for defining, creating, and managing the cubes built by the OLAP server. DSO can also be called from custom applications, which in turn can be added to the OLAP Manager as an add-in.

OLAP Services passes its multidimensional data to a middle-tier PivotTable Service. The PivotTable Service operates as an OLE DB for OLAP provider. It exposes the multidimensional data to applications using either the OLE DB 2.0 OLAP extensions or the ADO MD API that wraps the OLE DB OLAP extensions.

See Also 

In This Volume 

Installing OLAP Services

Server Improvements and New Features