Best Practices for Business Intelligence Using the Microsoft Data Warehousing Framework

Microsoft Corporation

July 2002

Summary: Discover best practices and suggestions for working with Microsoft SQL Server and Microsoft Analysis Services for analytic applications. These practices and suggestions have been compiled from answers to questions submitted by developers and customers, and cover topics ranging from dimensional modeling to designing OLAP cubes and writing MDX expressions. This is not an introductory paper. (25 pages)

Contents

Introduction
Relational Data Warehouse Design and Management
DTS and ETL Processes
Cube Design and Dimensional Modeling with Analysis Services
Analysis Services Deployment and Maintenance
MDX Expressions

Introduction

This paper compiles a number of best practices and suggestions for using Microsoft® SQL Server™ and Analysis Services in data warehousing and analysis. This information has been compiled over time in response to interactions with developers, DBAs, analysts, customers, and others working with SQL Server and Analysis Services to gain insight into business information.

The information contained in this paper ranges from general best practices to very precise suggestions or tips, such as how to make the current month the default time dimension member. Some of these topics are discussed in other Microsoft white papers, while others existed primarily in the database of responses. They are assembled here in the hope that putting them together in one place is useful to the developer or designer working with the Microsoft Data Warehouse Framework.

This paper assumes you are a developer familiar with Microsoft SQL Server and Analysis Services; it is not an introductory paper. The SQL Server Web site is a good first source of additional information, including a number of helpful white papers.

We thank the many Microsoft employees, consultants, and partners who have contributed their time and expertise by answering questions and tracking their answers in a database; their work is the source of this information.

The general topic areas discussed in this paper are:

  • Relational Data Warehouse Design and Management
  • Data Transformation Services and Extract, Transform, Load (ETL) Processes
  • Cube Design and Dimensional Modeling with Analysis Services
  • Analysis Services Deployment and Maintenance
  • MDX Expressions

Relational Data Warehouse Design and Management

This topic area addresses the relational data warehouse that supports the Analysis Services cubes; the design approach is intended for SQL Server 2000 implementations.

Choosing Star or Snowflake Schemas

When creating the relational data store for your analysis application, you can select either a star or snowflake logical schema for the data store. A star schema has a central fact table surrounded by dimension tables. A snowflake schema uses a similar structure, except that a dimension table may reference another, subordinate dimension table.

Analysis Services works well with both logical schema layouts. In general, the snowflake schema is the most efficient way to populate an Analysis Services cube. Only consider a star schema if end users will be accessing the relational data store directly.

  • A snowflake schema helps you maintain standard dimensions and clean hierarchies. The snowflake schema is fully normalized. Using this schema, it's easier to be sure that each child has one and only one parent and that no two members have the same name. Of course, you need to define Primary Key/Foreign Key relationships and the appropriate indexes to ensure hierarchical integrity, but this is much easier to do in the normalized form of the dimension. If using the snowflake schema, create a surrogate key at each level of the dimension.
  • A snowflake schema is the most efficient schema for populating an Analysis Services cube. If the only role of the dimensional relational schema is populating cubes, then there is no reason to create a star schema.
  • Create de-normalized star dimensions if there is any direct end user access to the relational database. (We recommend that all user access be through cubes rather than the relational data store.) A star dimension is generally easier for users to navigate, and queries against a star schema are often more efficient than queries against the equivalent snowflake schema.
  • One simple approach is to define the star schema as a view of the snowflake tables. Taking this approach solves the navigation problem. You will need to verify that performance does not suffer. For large, deep dimensions it is usually better to physically instantiate the dimension a second time as a star.
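For example, a minimal sketch of the view-based approach, assuming a Product dimension snowflaked into hypothetical Product, Subcategory, and Category tables joined on surrogate keys:

CREATE VIEW dbo.DimProductStar
AS
SELECT p.ProductKey,        -- surrogate key referenced by the fact table
       p.ProductName,
       s.SubcategoryName,
       c.CategoryName
FROM dbo.Product AS p
INNER JOIN dbo.Subcategory AS s ON p.SubcategoryKey = s.SubcategoryKey
INNER JOIN dbo.Category AS c ON s.CategoryKey = c.CategoryKey

End users (and cube-building queries, if you prefer) see a single flat dimension table, while the warehouse continues to maintain the normalized snowflake underneath.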

Directing User Access Through Views

If you must give users direct access to the relational data store (instead of access only through cubes), do not grant direct access to any of the tables in your relational data store. Instead, create views on each base table and grant SELECT privileges on the views, directing all user access through those views. Make sure the object names (views and columns) make sense to the business users; these can be the same names you will use in the Analysis Services cubes.

Using views is good system design for a number of reasons:

  • Using views improves flexibility by providing a buffer between the physical database and downstream systems, like user access and cube population.
  • It increases the likelihood that you can make significant physical changes to the system while minimizing disruption to other applications.
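As a minimal sketch of this practice (the table, view, column, and role names are hypothetical), you might expose a fact table through a business-friendly view and grant SELECT only on the view:

CREATE VIEW dbo.[Customer Sales]
AS
SELECT f.DateKey     AS [Date Key],
       f.CustomerKey AS [Customer Key],
       f.SalesAmount AS [Sales Amount]
FROM dbo.FactSales AS f
GO

-- Users and downstream tools query the view, never dbo.FactSales directly
GRANT SELECT ON dbo.[Customer Sales] TO AnalysisUsers
GO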

For similar reasons, we recommend that all user access into Analysis Services should be to virtual cubes instead of physical cubes (described later in this paper).

Using Surrogate Keys

Use surrogate keys to ensure that your data warehouse can grow or expand over time. This practice is critical for large scale implementations.

The primary key for dimension tables should be a numeric surrogate key (typically integer) maintained by the data warehouse staging process. The business key (customer account number, vendor number, promotion code) should be an attribute of the dimension member.

Using the business key instead of a surrogate key as the primary key in the dimension table introduces a number of risks:

  • The source system may re-use a business key.
  • You may want to combine information from several sources in the future. Managing keys in the data warehouse enables you to resolve namespace collisions.
  • The source system may substantively change the attributes of a dimension member, such as a customer, in a way that you want to track in the data warehouse.

Using surrogate keys avoids these potential problems. But there is a cost associated with managing surrogate keys. There is a minor cost with managing the dimension itself, although the IDENTITY column property makes it pretty easy. The most difficult or expensive part of using surrogate keys is performing the key lookup during fact table loads. However, experienced practitioners make the up-front investment in surrogate keys, knowing they (or their customers) will reap the benefits down the road.
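A minimal sketch of a dimension table that follows this practice, using an IDENTITY surrogate key and keeping the business key as an ordinary attribute (the table and column names are illustrative):

CREATE TABLE dbo.DimCustomer (
    CustomerKey       int IDENTITY(1,1) NOT NULL PRIMARY KEY,  -- surrogate key
    CustomerAccountNo varchar(20)  NOT NULL,                   -- business key from the source system
    CustomerName      varchar(100) NOT NULL,
    City              varchar(50)  NULL
)

The fact table then stores CustomerKey, and the staging process looks up the surrogate key from the business key during fact loads.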

Using Partitions in the Data Warehouse

Partitioning is an important strategy for improving cube performance, and is critical for large scale implementations. Relational databases support a similar kind of partitioning, which is very useful in the relational data store.

The concept behind partitioning is to horizontally partition data by a key, such as date. In a relational database, you do this by creating separate physical tables for each partition and defining a union view over the member tables. The complexity of the physical storage is hidden from the analytical user.
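In SQL Server 2000, this is typically implemented as a partitioned view; a minimal sketch with two monthly fact tables (the table names, columns, and date ranges are illustrative):

-- Each member table constrains the partitioning key to its own range
CREATE TABLE dbo.FactSales_200206 (
    DateKey     int   NOT NULL CHECK (DateKey BETWEEN 20020601 AND 20020630),
    CustomerKey int   NOT NULL,
    SalesAmount money NOT NULL
)
GO
CREATE TABLE dbo.FactSales_200207 (
    DateKey     int   NOT NULL CHECK (DateKey BETWEEN 20020701 AND 20020731),
    CustomerKey int   NOT NULL,
    SalesAmount money NOT NULL
)
GO
-- The union view hides the physical partitioning from queries and cube processing
CREATE VIEW dbo.FactSales
AS
SELECT DateKey, CustomerKey, SalesAmount FROM dbo.FactSales_200206
UNION ALL
SELECT DateKey, CustomerKey, SalesAmount FROM dbo.FactSales_200207
GO

Pruning old data then becomes a matter of dropping the oldest member table and re-creating the view.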

Partitioning the data warehouse has several compelling benefits:

  • It reduces query time.
  • It improves load time and the maintainability of the database.
  • It can solve the problem of pruning old data from the active database.

Selecting a Recovery Mode for the Data Warehouse

The processes for recovering or protecting a data warehouse from standard disasters are the same as with most other database systems. The DW DBA must be familiar with techniques for backing up and restoring very large databases. In data warehouses, most of the data is usually static, loaded in the distant past. You can use this information to your advantage when designing backup and recovery strategies.

The recovery model you choose for the SQL Server 2000 relational data store has a significant effect on your data load performance and storage requirements. SQL Server 2000 supports the following recovery models:

  • Full Recovery: Full transaction logging offers point-in-time recovery – essential for critical transaction databases.
  • Bulk-logged Recovery: This mode reduces log space consumption and improves performance for large-scale operations, such as create index and bulk copy.
  • Simple Recovery: The transaction log is truncated on checkpoint. This approach offers the highest performance and the lowest log space requirements. However, you can recover only to the last full backup; any changes since that point must be redone.

Point-in-time recovery is critical for many transactional systems. For relational data stores that do not manage user transactions, however, you can probably sacrifice point-in-time recovery to minimize the performance impact of logging. If you use either the bulk-logged or simple recovery model, compensate by performing backups after data loads, or by ensuring that your ETL system can re-create data loads. Recovery models are discussed in more detail in SQL Server Books Online.
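For example, you might switch the relational data store to the bulk-logged model and back up after the nightly load; a sketch assuming a database named SalesDW and an illustrative backup path:

ALTER DATABASE SalesDW SET RECOVERY BULK_LOGGED
GO
-- ...run the bulk loads here...
-- Compensate for reduced logging by backing up once the load completes
BACKUP DATABASE SalesDW TO DISK = 'E:\Backup\SalesDW_after_load.bak'
GO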

DTS and ETL Processes

The following topics cover using SQL Server Data Transformation Services (DTS) for Extract, Transform, and Load (ETL) processes.

Designing and Maintaining Resilient ETL Processes

A data warehouse is part of a system of information. A relational data store is populated by an ETL system such as Data Transformation Services (DTS), which also processes the OLAP cubes in Analysis Services. There are several things you can do in the design and maintenance of the data warehouse to make this system more resistant to problems and failures.

ETL Design

Poor communication between source system DBAs and data warehouse DBAs is a common source of problems. A "minor" change in a source system may cause a load failure in the warehouse, or result in an incorrect data load.

Exception Processing

In designing the ETL system, make it resilient to the potential problems of load failures, incomplete data loads, and incorrect data loads.

Design for load failures: All ETL toolsets should be able to set an alarm if a step fails. Recovering from a complete load failure requires troubleshooting the error and restarting data processing from that point. Modularize the ETL application and design for restartability at any checkpoint.

Design for incomplete data loads: The ETL system should check data completeness at each process step by comparing rowcounts, checksums, or hash values at the source and target. This process does not catch problems in the source data itself. If data volumes are relatively predictable, check rowcounts and aggregate fact amounts against recent loads to ensure each day's data set is in an expected range. To prepare for these problems, tag all data rows in the data warehouse with an audit key that identifies the rows loaded during any specific process. To recover from an incomplete load, you can either back out the partial load and reload in full, or load only the missing data.
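A minimal sketch of the rowcount check and audit-key tagging described above (the staging database, table, and column names are hypothetical):

DECLARE @AuditKey int, @staged int, @loaded int
SET @AuditKey = 20020715   -- key identifying the current load batch (illustrative)

-- Rows staged from the source versus rows tagged with this batch's audit key
SELECT @staged = COUNT(*) FROM Staging.dbo.SalesExtract
SELECT @loaded = COUNT(*) FROM dbo.FactSales WHERE AuditKey = @AuditKey

IF @staged <> @loaded
    RAISERROR ('Incomplete load: staged %d rows, loaded %d rows.', 16, 1, @staged, @loaded)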

Design for incorrect data problems: If incorrect data stems from an error in the ETL logic, it is best to restore the DW tables to the point before the error and reapply the loads with the corrected system. If load volumes are high relative to the load window, you may choose to develop a Transact-SQL script to correct the erroneous data (after first testing the script with a copy of the data, of course). Remember that incorrect data may not be discovered for days or weeks. If confronted with a serious error, most data warehouse teams will have to plan carefully how to recover while delivering acceptable service to end users. To that end, improve the ETL efficiency beyond that required for the load window – it gives you extra room when abnormal situations arise.

Problems In Cubes

Using Analysis Services MOLAP cubes offers a layer of protection from ETL problems, because users access cubes while processing occurs on the relational data warehouse in the background. Analysis Services offers incremental cube processing with no downtime for users.

If a past data load is skipped or incomplete, you can process it incrementally using the same process as a normal, daily cube processing. Be sure the query against the source data picks up only the missing data. Making an audit key part of the DW schema helps in this.

Analysis Services has no notion of "delete" or "update" for fact tables. There are two ways to "fix" data in a MOLAP partition and its downstream aggregates:

  • Back out bad data by inserting offsetting transactions so the totals work out. This can have undesirable side effects, such as throwing off transaction counts and other computed measures.
  • Fully refresh the data in the partition. This is the most common approach and works well if you have designed the cube to be partitioned by time.

The cube can remain available for queries during reprocessing.

Loading Data from a Text File into a Table

There are several methods for loading data from a text file into a table. These are listed below, from fastest to slowest:

  1. Bulk Insert via Transact-SQL.
  2. Bulk Insert via the DTS ExecSQL task (very close in performance to #1).
  3. Bulk Insert DTS task.
  4. BCP (although the transform data task is faster in some scenarios).
  5. Transform Data DTS task using only the Copy transform, with correct minimal logging options.
  6. Transform Data DTS task using one or more predefined transforms other than copy, but no ActiveX® scripts.
  7. Transform Data DTS task using one or more ActiveX scripts – VBScript is faster than Microsoft JScript®.

Other considerations may affect which method you choose. For example, when dealing with very large data sets (100 million rows or larger), it is advisable to use BCP to import the raw data, then transform the raw data using Join operations in SQL statements in SQL Server.
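As a sketch of the fastest method above (Bulk Insert via Transact-SQL), with an illustrative file path and delimiters:

BULK INSERT dbo.SalesStaging
FROM 'E:\Loads\sales_20020715.txt'
WITH (
    FIELDTERMINATOR = '|',
    ROWTERMINATOR   = '\n',
    TABLOCK          -- helps the load qualify for minimal logging
)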

Read the SQL Server Books Online topic "Bulk Copy Performance Considerations" for details and tips.

Loading Data from a Relational Source into a Table

There are several techniques for loading data from a relational source into a target table. Again, the performance varies. Here are our choices for the fastest to slowest methods:

  1. Transact-SQL SELECT INTO statement, using local or linked servers. The Transact-SQL statement could be executed in any environment, but the ExecuteSQL DTS task is generally preferred. The SELECT INTO statement creates the target table and populates it with the data set resulting from the query. This is the only way to perform a non-logged load directly from a relational source (see the sketch after this list).
  2. Transform Data DTS task, using only the Copy transform, with correct minimal logging options.
  3. Transform Data DTS task using one or more predefined transforms other than copy, but no ActiveX scripts.
  4. Transform Data DTS task using one or more ActiveX scripts. VBScript is faster than JScript.
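A minimal sketch of the first method, SELECT INTO across a linked server (the linked server, database, and table names are hypothetical):

-- Creates and populates the target table in a single minimally logged operation
SELECT CustomerID, OrderDate, SalesAmount
INTO dbo.OrdersStage
FROM SourceServer.SalesDB.dbo.Orders
WHERE OrderDate >= '20020701'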

Updating a Table During ETL Processing

Because update statements are logged, updating a table is generally slower than inserting data during ETL processes.

There are several techniques for updating table data based on one or more source relational tables. The fastest technique is to perform all of the updates in one transaction, using a Transact-SQL UPDATE statement across local or linked servers. This is usually executed from the ExecuteSQL DTS task. The efficiency comes from performing the update on a set of rows, instead of generating row-by-row updates.
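As a sketch of this set-based approach, updating warehouse rows from a staging table in a single UPDATE statement (all names are illustrative):

UPDATE c
SET    c.CustomerName = s.CustomerName,
       c.City         = s.City
FROM   dbo.DimCustomer AS c
INNER JOIN Staging.dbo.CustomerUpdates AS s
       ON c.CustomerAccountNo = s.CustomerAccountNo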

Two problems lessen the usefulness of this technique:

  1. The UPDATE transaction is logged as a single entity. If the update affects many rows, the log can grow significantly. There is no inherent notion of batch size in the UPDATE statement, although such logic could be written into the Transact-SQL statement.
  2. The logic and syntax of a single UPDATE statement may not support extremely complex business requirements. A stored procedure can often be written to execute complex logic, but some ETL system developers prefer to avoid stored procedures.

The Data Driven Query (DDQ) DTS task is a safe and effective alternative. The DDQ task includes batching logic, and ActiveX scripts can handle complex logic and transformations. Performance for the DDQ task is roughly comparable to executing the UPDATE on a row-by-row basis using a cursor.

The DDQ DTS task is the only way to perform an update operation directly from information in a file. It is generally more efficient to load the "update" file into a staging area, then execute a SQL UPDATE statement.

Getting a List of DTS Packages

To get a list of DTS packages (for example, to populate a list box), query the system tables for the msdb database:

/* List DTS packages stored on the server */
SELECT DISTINCT name
FROM msdb.dbo.sysdtspackages

Skipping the First Row in a Transform Data Task

If you need to skip the first row of a source file in a transform data task, modify the DTS "Text file (source)" connection definition to the flat file. Click the "Properties" button, and specify how many rows you want to skip.

Changing the "First row" and "Last row" settings in the options tab of the Transform Data task does not work for skipping the first row.

Executing DTS Packages: Which Machine?

Understanding where DTS packages execute can be a problem. For example, if you're moving data from serverA to serverB, using a DTS package stored on serverB, you might assume that the package would execute on serverB. This is not necessarily the case.

DTS packages execute on the computer that launched the execution – regardless of where the package is stored and where the data resides. Data makes a round trip to the package machine, and any package steps such as ActiveX scripts also execute on that machine.

In many cases, particularly early in the development cycle, the machine that launches the DTS package is a desktop machine. This can lead to suboptimal performance for the DTS package.

One way to avoid this is to use Terminal Server during the development cycle, to run the package from an appropriate machine. During production, schedule jobs using SQLAgent.

Implementing a Wait in a VB ActiveX Script Task

When programming a loop in a DTS ActiveX Script task, it is common to want to "wait" for a time before retrying an operation. This can be problematic, because VBScript does not have a "wait" function, nor can you use the Timer control as you would do for a normal VB program. There are Win32 APIs that support waiting for a period of time, but these cannot be used from VBScript.

One technique is to create a stored procedure spDelay with a single SQL statement in it:

CREATE PROCEDURE spDelay (@interval CHAR(9))
AS
WAITFOR DELAY @interval
GO

Then in your code, place the amount of time you want to wait in a global variable (for example, "000:00:13" for 13 seconds) and use an Execute SQL task with "EXECUTE spDelay ?" passing the global variable as an input parameter.

An astute programmer might want to use an Execute SQL task with the WAITFOR Transact-SQL command directly as the task's SQL text. Unfortunately, the DTS Designer does not recognize WAITFOR as valid Transact-SQL. In practice, using a stored procedure is a good idea in this case anyway: it lets you easily insert Transact-SQL code immediately before or after the wait in the future, and because you are already in Transact-SQL you can add that code to the stored procedure without changing any DTS code.

Cube Design and Dimensional Modeling with Analysis Services

The following topics describe designing and working with cubes in Analysis Services.

Using Multiple Hierarchies in Analysis Services

Analysis Services lets you create multiple hierarchies for a single dimension to offer alternate views of dimension members. The Time dimension is a good example; Accounting might want to look at data by fiscal year, while Marketing is interested in the calendar year.

Using multiple hierarchies has several advantages over creating separate dimensions.

  • Better usability: Users see the hierarchies (such as time hierarchies) grouped in a logical way.
  • Shared aggregates: If you define a dimension with multiple hierarchies rather than building separate dimensions, Analysis Services builds a set of aggregations that are useful to both hierarchies. For example, if the time dimension has both YMD and YQD hierarchies, the two hierarchies can share aggregates built at the year level.
  • Shared key: Sharing a single key in the fact table reduces the size and complexity of the fact table.

Using Partitioning to Improve OLAP performance

Using partitioning improves query performance because there is less data for the server to scan on a query. For example, if a year of data were partitioned by month, a query for data in July would have only 1/12th of the data to consider. Even with Analysis Services' effective indexing scheme, it's better to have less data to scan per query.

Partitions improve query performance with increased selectivity for querying. You get this increased selectivity by setting the slice information for the partition. It's easy to forget to set the slices – and lose the performance benefits of partitions. Setting the slice effectively tells the query engine what data resides in each partition.

However, partitioning raises the concern that queries requiring data from multiple partitions will be slower.

We tested the effects of extensive partitioning on a very large data warehouse project, which had 5 years of data partitioned into 13 4-week months (on a retail calendar) per year, for a total of 65 monthly partitions per cube. We wanted to discover if there would be benefit in partitioning the data more, or if the extra overhead of hundreds of partitions would be a problem.

As an experiment, we changed the Week-Item cube (the largest of the cubes) from 65 monthly partitions to 585 partitions by Month and Department, so that the data was sliced along the two most important dimensions in this cube.

We loaded 50 users running queries representing a complex set of business questions. Many of these queries spanned multiple months of data, operating either on a year's worth of data or on a rolling 13-week quarter.

The results were interesting:

  • Under this load, the average query ran 3.2 times faster.
  • The longest running queries benefited the most, while the quickest queries got a little slower.
  • Administrative operations got slower. With over 1000 partitions, Analysis Manager takes some time to open the metadata, and saving cubes is slower. Since routine administrative tasks should be automated, this should not offset the benefits of the query performance gains.

For more information on partitions, see the paper Using Partitions in a SQL Server 2000 Data Warehouse.

Creating an Empty Partition

When developing very large BI systems, waiting for processing cycles to complete can be a problem. To avoid the needless waiting, a good practice is to create an empty partition in each cube. This allows the developer or administrator to process that one partition during a testing cycle and eliminate the wasted time.

For example, this approach was valuable for a customer with a cube with 450 partitions, each with about 100 million rows of data. By creating a single partition with no data, we could perform programmatic testing without waiting for a partition with data to process.

To create an empty partition, base the partition on either an empty table or a view that returns no data.

Cloning Partitions and Aggregations

One of the best features of Analysis Services is the ability to clone objects. The Clone method of the MDStore interface copies the property values and optionally the collections of major and minor objects of an existing object to a target object of the same class type. This applies to the following classes:

  • Databases
  • Cubes
  • Partitions
  • Aggregations

When developing very large cubes with many partitions, it is very efficient to perform the following optimization:

  1. Create a single partition.
  2. Use the usage-based optimizer with standard types of queries running against that single partition.
  3. Tune the aggregations.
  4. Clone this partition into all the other partitions.
    Note  This can be done either programmatically or by using the Partition Wizard and selecting the "Copy the aggregation design from an existing partition" option in the last screen of the wizard. Creating a custom tool to programmatically manipulate large numbers of partitions is a good idea, and such a tool should include this feature.

Using Numeric Keys for Dimensions

If you are using CHAR member keys, you can dramatically improve dimension processing performance by changing to NUMERIC member keys. There are several reasons for this:

  • A simple number comparison is much faster than CHAR comparison.
  • Storage of numeric values is simpler and symmetric, because all members have the same length. String values, in contrast, are variable length and require an additional level of indirection.
  • Memory requirements for strings are higher than those for numeric values. A string requires a four-byte pointer, plus the string length times two (the size of a Unicode character), plus two bytes for the terminating character. An integer, in contrast, is stored in only four bytes.
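To make the arithmetic concrete, using the sizes just described: a 10-character member key costs 4 + (10 × 2) + 2 = 26 bytes per member, while an integer key costs 4 bytes.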

Using Changing Dimensions

Changing dimensions are a wonderful feature of Analysis Services, but this functionality doesn't come without a cost. This topic summarizes some of the benefits and effects of using changing dimensions.

Before using changing dimensions in your cube, remember that you can add new members (time periods, products, customers, and so on) without using changing dimensions. The changing dimension feature allows a member to be moved to a different parent, such as moving a salesperson from one territory to another. See SQL Server Books Online for more information.

Using Changing Dimensions has administrative, processing, and query performance impacts.

Administrative impact

A changing dimension cannot be used to set slices on partitions. A partition holds data for a certain collection of members, determined by the slice; if some of those members were later reorganized in the hierarchy so that they belonged in a different partition, the cube would be invalidated.

Processing benefits & impact

The huge benefit of changing dimensions is that you don't have to re-process a cube when a member is moved. However, aggregations on the cube can be invalidated by such a change. The system deals with this by marking aggregations between the All level and the lowest level as soft aggregations, meaning that they can be dropped and re-computed when the dimension changes.  

When the system detects a hierarchy change in a changing dimension, the soft aggregations on all cubes using that dimension are dropped and recomputed as a background activity. This begins after the dimension is incrementally processed, and continues in the background until all the cubes involved have been re-aggregated. The system does not lock the cubes during this time – users can still query the cubes.

The net effect is that cubes stay available for querying, and the cube data is not invalidated, so Analysis Services does not have to re-process all the data from the relational database.

Query benefits & impact

Although cubes remain available during these types of changes, this introduces a query performance issue. Once the soft aggregations are dropped, and until they have been recomputed, queries cannot benefit from pre-computed aggregates. Aggregates are key to making OLAP queries fast. In this situation, a query requesting some aggregation that might normally have been directly available or easily computable from another aggregation must instead compute the aggregation from the fact level data. The effect is a substantial query performance penalty during the re-aggregation process.

This suggests that even though cubes are fully available following a dimension change, it is still better to make such a change during the lightest use periods for the Analysis system.

Testing for Large Numbers (>64K) of Children

No member in an Analysis Services database can have more than 64,000 children. If a member in a cube will have more than 64,000 children, you need to create a "dummy" level, either by hand in the relational database or by using the automatic member grouping feature in Analysis Services 2000. (See the next topic, Handling Dimensions with Many Children.)

It is important to identify any cubes that exceed this 64K limit before dimension processing. There are multiple ways to check if you have hit the 64K limit.

The simplest solution to the problem is to issue a query in the database. You can do this by hand during design time, and programmatically before updating dimensions, during your periodic DW processing. Suppose we have dimension table A that snowflakes out to table B:

Table A 
dim_key 
parent_key (fk to B.parent_key) 
[other attributes] 

Table B 
parent_key 
[other attributes] 

Query to count children

SELECT B.parent_key, count(A.dim_key) as child_cnt 
FROM B inner join A ON (A.parent_key = B.parent_key) 
GROUP BY B.parent_key 
HAVING count(A.dim_key) > 64000 

Handling Dimensions with Many Children

The maximum number of children for any parent member is 64,000. If you think of the poor user browsing through a dimension hierarchy, it's clear that 64,000 is an upper limit anyway. Most Analysis Services practitioners agree that you do not want more than 1000-3000 members per parent, just for usability. Some client tools handle this scenario more gracefully than others.

Your objective, then, is to reduce the number of children for any dimension member.

The first solution is to manage the data on the relational design side – don't build a dimension with too many members per parent (perhaps > 1000 members). Figure out some way to group members between the parent and child in the original design. For example, you could group businesses belonging to each state by City, or City and Zip code. For a parent-child dimension, this approach is the only way to solve the problem.

For standard (non parent-child) dimensions, you can use the "Member Groups" feature in Analysis Services 2000. Analysis Services automatically inserts a "directory" level with this feature. To create automatic grouping:

  1. Identify the level that has too many members per parent.
  2. Introduce an additional level right above it and change the type of that level to have automatic grouping.

You can even hide the automatic grouping level, although doing so is not recommended for browsing usability reasons.

One drawback of automatic grouping is that it applies throughout that level of the dimension. If your hierarchy is unbalanced and only one parent (perhaps "All Other") has a lot of children, the automatic grouping is added even for other, reasonably sized parts of the hierarchy.

The best practice is to resolve the problem in the design phase by inserting a meaningful level to support graceful drilldown. The automatic Member Groups help you if you can't change the dimensional design, or if the dimension grows beyond the cardinality expected during the design phase.

Understanding Member Unique Names

A member's unique name can be generated in different ways, according to the MDX Unique Name Style property of the connection string.

Initially, OLAP 7.0 generated unique names using a traditional "Name path" algorithm. For example, [Time].[1997].[Q1].[1].

Another alternative is a key path algorithm, where the path is defined by the keys rather than the member names. In the FoodMart Time dimension, the keys are the same as the names, so the unique name would be: [Time].&[1997].&[Q1].&[1]. However, keys are often meaningless surrogate integers, and are usually shorter than names, so the key path is a good alternative.

Analysis Services 2000 can intelligently create a member's unique name using the information in the member properties. Not only are these names the same length or shorter than the key path or name path algorithm names, they are guaranteed to be stable and unique over time.

Although the unique name algorithm should be transparent to the DBA, it's not. Some client tools specifically parse the unique name, and so older algorithms must be maintained. The MDX Unique Name Style property of the connection string sets the possible name algorithms:

Style property    Description
0                 Default (maps to 2)
1                 Key path
2                 Name path
3                 AS2K algorithm

There is one other complication. If the dimension advanced property Allow Duplicate Names is FALSE (as it usually is), and if the MDX Compatibility property of the connection is 1 (the default), then the MDX Unique Name Style is set to 2 (name path) for that dimension, regardless of how you have set the property in the connection string. This is because if MDX Compatibility is set to 1, you may be using an older or less functional client, and so should use the older unique name algorithm. Otherwise, the default is the new AS2K style (3).
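For example, a client application could request the AS2K algorithm explicitly in its connection string (the server and catalog names here are illustrative):

Provider=MSOLAP.2;Data Source=MyOLAPServer;Initial Catalog=FoodMart 2000;MDX Unique Name Style=3;MDX Compatibility=2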

Using Virtual Dimensions vs. Physical Dimensions for Member Properties

If you have a property in your dimension table, 98% of the time you want to make this attribute a member property and then create a virtual dimension from it. In SQL Server 2000/Analysis Services 2000, virtual dimensions have aggregates and multiple levels. You can map attributes as member properties and create virtual dimensions from there.

One exception is the case in which the member property takes on many values – perhaps hundreds of thousands of values. In this case, you should take one of the following approaches:

  • Create a member property that groups the attribute in more manageable buckets, such as age ranges, and build the virtual dimension from the smaller domains.
  • Alternatively, build a regular (physical) dimension from the attribute. You may find it more effective to propagate the attribute's key into the fact table in the relational database.

Building the Dates Dimension from a Separate Table

Although you can build a Dates dimension directly from a datetime column in the fact table, it's usually not a good idea to do so. If you build the Dates dimension from the fact table, you get a dimension member only for those dates that appear in the fact table. Users generally expect to see a full calendar in the dimension.

Instead, create a physical Dates dimension table.

At its simplest, the Dates dimension table could contain one datetime column and a row for each day. Frequently you will need to add attributes to indicate holidays, peak periods, and so on.

Of course, it would be better to design the fact table with an integer surrogate key to the Dates dimension.
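A minimal sketch of such a Dates dimension table, with an integer surrogate key and a few descriptive attributes (the column names are illustrative):

CREATE TABLE dbo.DimDate (
    DateKey      int         NOT NULL PRIMARY KEY,  -- e.g. 20020715, referenced by the fact table
    CalendarDate datetime    NOT NULL,
    MonthName    varchar(10) NOT NULL,
    CalendarYear smallint    NOT NULL,
    IsHoliday    bit         NOT NULL DEFAULT 0
)

Populate the table with one row per day for the full range of dates the business needs, whether or not facts exist for every day.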

Speeding Computations Using the NON EMPTY Keyword

Computations that include the NON EMPTY keyword can be slow and consume a lot of client resources. You can improve the performance of these calculations by editing the definition of the calculated member to set its Non Empty Behavior property to point to one of the base measures. This triggers an optimized code path that typically cuts the computation time dramatically.

Using One Distinct Count Measure per Cube

Analysis Services cubes are limited to one distinct count measure per cube.

Although this is one more than competitive products, you still may find yourself wanting more than one in some cases. If so, the solution is simple: build a separate physical cube for each distinct count measure, and combine them with a virtual cube.

It's good practice to always separate the distinct count measure into its own physical cube, to protect the performance of queries that do not use the distinct count measure.

Directing User Access through Virtual Cubes

Do not grant any end users access to physical cubes in your Analysis Services database. Instead, create a virtual cube on each physical cube, and grant privileges on the virtual cubes.

Using virtual cubes is good system design. It improves the flexibility of the system, providing a buffer between the physical cubes, which cannot easily be renamed, and the user access applications. Using virtual cubes greatly increases the likelihood that you can make significant physical changes to the system with minimum disruption to the business users.

For the same reasons, all user access into relational databases should be through views rather than physical tables.

Analysis Services Deployment and Maintenance

These topics cover the operational issues of deploying and maintaining Analysis Services cubes.

Using Views for Cube Building Queries

When building cubes, particularly when doing so programmatically, it is a good practice to create a view on the dimension table in the relational database and build the cube dimension from that view, instead of from the table.

This practice has the following advantages:

  • It provides a layer of insulation between the relational database and the cubes, making it easier to change one without rebuilding the other.
  • It pushes the definition of the cube-building query into the relational database, where it is visible to the DBAs who can then understand and optimize it.
  • If the column names in the cube are different from the relational column names, then using this approach you only change column names once (in the view definition) instead of every time you rebuild the cube in Analysis Manager.

Creating Local Cubes Programmatically

Local cubes let you access cube data while you are disconnected from the Analysis Server. If you are making extensive use of local cubes, you may want to create them programmatically and push them, via the web, to remote computers.

There are several ways to do this. The SQL Server 2000 Resource Kit includes code for building custom cubes. SDG Consulting has developed an external tool, which you can purchase or download for trial at http://localcubetask.com, for programmatically creating the cubes.

To distribute the cubes, you could either email the results or develop a web application to provide the most recent cubes on demand. Here are a few options:

  • A simple server-side Active Server Page could use the FileSystemObject to find out what local cubes were created by the DTS task. (Use a folder structure to add the semantics.) The ASP file generates a hyperlink to each cube, or walks down the folder structure to the local cube.
  • Use the approach above but link to an HTML file that the DTS package creates with the .cub file. The HTML file includes detailed information, such as what is in the file, when it was created, and what the trigger was. It also includes a hyperlink to the file itself. To complete the loop you could create an ASP page that generates the required DTS task to create the .htm and .cub files, so the users could request daily, weekly and monthly cubes themselves.

Both of these approaches require some development work, but are quite manageable.

Changing the Data Source for a Dimension

Analysis Manager does not let you change the data source for a dimension. In most cases this is good, because changing the data source can cause problems with cubes.

However, you might want to change data sources during the development processes. For example, you might legitimately want to rename the data source if you didn't give thought to the name when first creating the data source and need to change it to something more descriptive before production.

If you are careful, you can use DSO to change the data source name:

  1. Create a new data source.
  2. Point all objects to the new data source.
  3. Drop the old data source.

A better option might be to simply recreate the cube.

You can also script out the metadata and recreate the cube with the new data source names.

Configuring Pagefiles on a Large Analysis Server

Loading large dimensions can overload the default virtual memory pagefile on Analysis Server. When this happens the system reports the following error: "Memory error [Failed to reallocate space]", and the application log shows the error "The server is out of memory."

Use the following guidelines when configuring pagefiles for large servers:

  1. Add a pagefile on a second disk as big as the physical memory on the system.
  2. Expand the default pagefile to its maximum on the system disk (typically C:).

Using Archive/Restore with Large Databases

The Archive/Restore add-in for Analysis Services 2000 (and earlier versions) does not work with any file over 2GB in size. This restriction applies to any database containing a cube larger than 2GB stored in a single partition, and to any cube with an individual partition larger than 2GB.

To determine if you are affected by this restriction, go to the Analysis Services data folder to see the subfolders for each database. Using Windows Explorer, do an advanced search for any files in the database folder and its subfolders that are over 2GB. If a database has a file larger than 2GB, you cannot use Archive/Restore for that database, and should use a file system backup/restore utility instead, such as the Windows 2000 utility or a third-party backup program.

When using an alternative, file system-based backup for an Analysis Services database, observe the following practices:

  • Pause or stop the MSSQLServerOLAPService while performing the backup, to ensure the integrity of the repository and data folder structure.
  • Back up the entire data folder and the Analysis Services repository together as one backup operation. This ensures that the repository matches the data folder metadata.
  • If you have migrated the repository to SQL Server, then you need to back up that data from SQL Server.

Restoring an Analysis Server Database Programmatically

In general, the best way to restore an Analysis Services database is to use the Restore Database action in Analysis Manager. But if you want to restore your database programmatically, perhaps as part of the deployment process, you can use the MSMDARCH command.

If you're using DTS, use the DTS Execute Process task to execute the command; this is thoroughly documented in the SQL Server Books Online topic "Msmdarch command."

The following VB code snippet shells out to execute the MSMDARCH command. There is no return code; check the error log file to track success or error.

Dim Servername As String
Dim DataPath As String
Dim ExePath As String
Dim CabFileName As String
Dim LogFileName As String
Dim Execstr As String

Servername = "Servername"
' OLAP data folder
DataPath = """C:\Program Files\Microsoft Analysis Services\Data\"""
' OLAP binaries
ExePath = """C:\Program Files\Microsoft Analysis Services\Bin\"""
' Backup file
CabFileName = """C:\temp\tutorial.cab"""
' Log file
LogFileName = """C:\temp\tutorial.log"""

' Build the command line: MSMDARCH /R server datapath backupfile logfile
Execstr = ExePath & "MSMDARCH.EXE"" /R " & Servername & " " & DataPath _
 & " " & CabFileName & " " & LogFileName
Shell Execstr, vbHide

Improving Processing Performance by Increasing the Processing Buffer Size

You can improve processing performance significantly by increasing the Process Buffer Size setting until all cube aggregation calculations fit inside the process buffer without paging to disk.

There is no easy formula for determining the optimum Process Buffer Size. You can monitor the Temp file bytes written/sec counter of the Analysis Server:Proc Aggs performance object to determine whether Analysis Services is paging to disk during cube processing. A non-zero value indicates that Analysis Services is paging aggregations to disk: the Process Buffer Size is too small.

Depending on the size of incremental data loads, dimension sizes, and other server demands on memory, you may not be able to fit all aggregations into the process buffer during processing. Nonetheless, it is generally helpful to increase the Process Buffer Size substantially, to 50 MB or even 1 GB on a large system with ample installed memory.

Minimizing Memory Usage for Dimensions

Analysis Services loads all dimensions for all databases into volatile memory on startup. During processing, the server consumes additional memory to process updates to dimensions and cubes.

The executable behind the Analysis Services service is called msmdsrv. With Windows NT 4.0 and Windows 2000 Server, the msmdsrv process is limited to 2GB of RAM. With Enterprise Edition, you can enable Application Memory tuning and configure Analysis Services to use up to 3GB of RAM.

You can estimate the amount of memory that will be used for dimensions by opening a command prompt and navigating to the folder for your database under the Microsoft Analysis Services\Data directory. Once there, issue the following command:

dir *.dim /s

This works for a single database, and does not include private dimensions.

Memory requirements can increase substantially when processing dimensions. You should be able to work around an out-of-memory condition in the following ways:

  • After processing dimensions (especially very large dimensions), you may want to stop and restart Analysis Services. This eliminates any memory fragmentation that may have occurred during processing.
  • As we discuss elsewhere in this paper, increasing the Process Buffer Size setting on Analysis Services will greatly improve processing performance. However, memory assigned to the Process Buffer is not available to hold dimension data during processing. Ideally, the Process Buffer will be large enough to hold all aggregation calculations during cube processing, and still leave enough memory to hold all the dimension data. If the Processing Buffer Size has been set too large, leaving insufficient memory to hold the dimensions, you should decrease the Processing Buffer Size.
  • Increase the paging file size or create a second page file.
  • Create remote partitions, which will distribute memory and CPU utilization.
  • If the dimensions are very large, you may want to reduce the size of the dimension as follows:
    • Convert dimension keys to integers.
    • Eliminate any unnecessary, redundant data elements.
    • Delete or reduce the number of member properties used in the dimension.

    Alternatively, you can convert to ROLAP storage.

Process large dimensions in a separate transaction. If you process all dimensions in a single transaction, all of the dimensions and their shadow copies must be in memory at the same time. If you process the large dimensions in separate transactions, memory requirements are reduced. The tradeoff is that you lose the ability to roll back changes to one dimension based on a processing error in a second dimension; that capability is available only when the two dimensions are processed within a single transaction.

Using Analysis Services Timeouts

There are several kinds of timeouts in the SQL Server 2000 DSO object model.

  • DSO Server.ConnectTimeout is the timeout for DSO to connect to the server.
  • DSO Server.Timeout is the timeout for DSO to communicate with the server once the connection is made.
  • There is no configurable timeout for the server and DSO to connect to the data source (such as SQL Server or Oracle). DSO always uses 15 seconds for this timeout, and the server uses the default.
  • The registry entry CurrentVersion\MaxOLEDBTimeout is the timeout for the server and DSO to communicate with the data source (executing SQL statements).
  • The DSO LockTimeout property is not used at this time.

Using the Server Properties dialog box in Analysis Manager, you can set a Server Timeout property. This sets both DSO Server.Timeout and CurrentVersion\MaxOLEDBTimeout to the same value. It does not make sense for the two timeouts to be different, because the smaller of the two will cause the operation to fail.

OLAP Services 7.0 has different timeout settings:

  • The Server Connection Info\ServerConnectTimeout registry entry, if specified, is the timeout for DSO to connect to the server.
  • DSO Server.Timeout is the timeout for DSO to communicate with the server once connected.
  • There is no configurable timeout for the server and DSO to connect to the data source (such as SQL Server or Oracle). DSO always uses 15 seconds for this timeout, and the server uses the default.
  • The registry entry CurrentVersion\MaxOLEDBTimeout is the timeout for the server to communicate with the data source, executing SQL statements.
  • There is no LockTimeout property.

In the Server Properties dialog in OLAP Manager, you can set an OLE DB Timeout property, which in turn sets the CurrentVersion\MaxOLEDBTimeout to this value. The only way to set the DSO Server.Timeout property is to edit the repository in Access or write the following DSO code:

svr.Timeout = 10
svr.Update

Using the MDX Compatibility Property, or Hiding Hidden Members

If you have a ragged and/or unbalanced hierarchy in your database, the MDX Compatibility property manages the way that the "holes" that make your hierarchy ragged are displayed in the client tool.

As an example, consider a Geography dimension that has levels for Country, State or Province, and City. The USA has all of these levels, but a smaller country (such as Israel) may skip the State or Province level, while a truly small country may not have State or City.

The Hide Member If property can be used on the dimension level to hide the member if it's missing or the same as the parent. But what happens on the client side when a user drills down to a "hidden" member?

Some user interfaces, such as Excel, are very tightly bound to the rows-and-columns paradigm. These client applications may not want to hide the "hidden" member because they want the columns to line up. Also, client applications written for SQL Server 7 may not handle hidden members gracefully because that feature is new in SQL Server 2000.

  • If the hidden member is at the bottom of the tree (such as the country of Andorra, which has no City or State), Analysis Services assumes any client tool can deal with this case.
  • If the hierarchy is missing a member somewhere else in the hierarchy, such as Israel, which has Cities and Country but no province or state, the MDX Compatibility property in the connection string controls how this is displayed on the client side. If it is missing, set to the default value (zero), or set to 1, then Analysis Services behaves in the most conservative way and does not hide the hidden member. In this case, you would see "Israel..Israel..Haifa". If the connection string sets the MDX Compatibility Property to 2, the hidden member is truly hidden. This is the behavior displayed by the Analysis Manager's browser tool on a dimension level with a ragged/unbalanced hierarchy.

Setting the Default Time Period to the Current Month

The default member for each dimension is usually set to the first member in the dimension. In the Time/Period dimension, this would be either "All" or the first year of the time period if there is no "All."

To change a dimension's default member to another specific member, change the Default Member property in the dimension editor. This would work if you want to set the default to a specific month or year.

To change the default member to the current month, you'll need to create an MDX expression that returns a member, such as:

Time.[ThisYear].[ThisQtr].&[ThisMonthNumber]

To do this, you need to construct a string and convert that string to a member. We'll use the FoodMart database as an example. Using the philosophy that complex MDX statements are best constructed a piece at a time, we'll first get the month right, and leave in basic values for the other parts:

StrToMember("Time.[1998].[Q1].&["Str(Month(Now()))+"]")
  • Now() is a VBA function that returns the current date/time. Month() is a VBA function that returns the month number (1-12) of a date.
  • Str() is a VBA function that casts the month number to a string.
  • The "&" lets me refer to the month by number [1..12] rather than by name.

If I write this in January, everything inside the StrToMember function resolves to the string "Time.[1998].[Q1].&[1]". The StrToMember function casts that string to a member.

The next step is to fix the quarter. Adding the quarter logic makes the MDX expression longer, but not very complicated. We're nesting three iif statements that assign the quarter based on the month:

StrToMember("Time.[1998].Q"+iif(Month(Now())<=3,"1",iif(Month(Now())<=6,"2",iif(Month(Now())<=9,"3","4")))+"].&["+Str(Month(Now()))+"]")

Finally, we'll add the logic for the current year. In theory, this should be as simple as the month. However, the FoodMart database has only two years, 1997 and 1998 – neither of which is the current year. So, for this to work in FoodMart, we'll set the date to the current year if it's 1998 or earlier (highly unlikely), and otherwise set it to 1998. In practice, it will always use 1998, but this code shows the logic for setting the current year:

StrToMember("Time.["+iif(Year(Now())>=1998,"1998",Str(Year(Now())))+"].[Q"+iif(Month(Now())<=3,"1",iif(Month(Now())<=6,"2",iif(Month(Now())<=9,"3","4")))+"].&["+Str(Month(Now()))+"]")

Enforcing Cube Security: Client or Server

Using Analysis Services, roles can be enforced on the client or the server. There are several factors to consider when implementing security:

  • Cell level security is available only when security is enforced on the client.
  • When security is enforced on the client, the full results of a query may exist in the PivotTable Service cache on the client. Any query returns only those cells available to the user, but because the restricted data exists temporarily in the client cache, client-enforced security is less secure than server-enforced security.

There is no simple rule of thumb for determining which security model delivers the best performance. Factors affecting performance include the following:

  • If the security scenario aligns with the dimension's hierarchies, server enforcement is faster, all else being equal. If the restricted members are split across the hierarchy, the cube should perform better with security enforced on the client.
  • If "Show visual totals" is defined, client-side security should perform better. The extra computation to display visual totals can be performed from the client cache.

MDX Expressions

The following topics offer suggestions for using Multidimensional Expressions (MDX) with Analysis Services.

Finding the First Period with Data in the Cube

Finding the first member in a dimension is simple with the FirstChild function. But finding the first member that has data associated with it is a little different.

We'll use the FoodMart Sales cube as an example. To find the first time member that has data, combine the NonEmptyCrossJoin and Head functions. Execute the following query against the FoodMart database in the MDX Sample Application:

SELECT
  HEAD(NonEmptyCrossJoin([Time].Members,1),1) ON COLUMNS
FROM SALES 

The NonEmptyCrossJoin function above filters out all of the members in the time dimension that don't have data associated with them. The Head function returns the first member in the resulting set. You can change Head to Tail to see the last member with data. Remember that the Head and Tail functions return a set. See Microsoft Knowledge Base article Q301934 for more information.

Testing for the Highest Level

Using nested iifs (Immediate Ifs) in MDX is useful for creating expressions that compute differently for different levels. How do you test to see if you're at the top of the hierarchy?

As usual, there are several ways to do this with MDX. The following two methods work whether or not there is an (All) level:

iif ([Organization].CurrentMember.Parent is null, …, …)
iif ([Organization].CurrentMember.Level.Ordinal = 0, …,…)

If your dimension has an (All) level, you can test explicitly. For example:

iif( [Store].CurrentMember.Level.Name = "(All)", …, …)

It would be better to compare the objects using the IS comparison operator:

iif([Store].CurrentMember.Level is [Store].[(All)], …, …)

It is always better to compare objects than strings: comparing objects is faster, and the equal (=) operator for string comparison is case sensitive, so it is less robust.

Listing Member Properties

When populating a list, you might want a distinct list of the values a member property may take.

Here is a sample MDX expression for getting a list of unique values, based on the FoodMart example:

WITH 
  SET PropSet AS 'CreatePropertySet([Product].[All Products], [Customers].[Name], [Customers].CurrentMember.Properties("Education"))' 
  MEMBER measures.ProdName AS 'Product.CurrentMember.Name' 
SELECT {PropSet} ON COLUMNS FROM Sales WHERE (measures.ProdName)

This solution has the following drawbacks:

  • It works with at most 759 distinct property values, and fails with more.
  • The property values are exposed as strings, even if they are numbers. You can convert them using the StrToValue function.
  • We need an additional "parent member" (like the All Products in the example above), because we're adding a calculated member to the "All" level for every distinct member property value.

However, it does the job and executes quickly.

Computing a "Since Time Began" Total

You may want to accumulate a measure from "the beginning of time" in a cube. Create a new calculated measure, CumulativeSales, in the FoodMart Sales cube, defined as:

Sum(Time.CurrentMember.Level.Members.item(0):Time.CurrentMember, [Measures].[Unit Sales])

Browse the Sales cube, and put Time on Rows. Compare [CumulativeSales] to [Unit Sales] to highlight the effects of the new calculated measure. Alternatively, execute the following query on the FoodMart database in the MDX Sample Application:

WITH MEMBER [Measures].[CumulativeSales] as 'Sum(Time.CurrentMember.Level.Members.item(0):Time.CurrentMember, [Measures].[Unit Sales])'
SELECT {[CumulativeSales], [Unit Sales]} ON Columns, [Time].[Month].Members ON Rows 
FROM Sales

Determining Your Best Customers' Favorite Products

It's often useful to identify your best customers' favorite products. If you are dealing with very large dimensions, you should approach the problem thoughtfully.

For example, assume you want to construct an MDX query to get the 10 best customers, and the 5 best products for each.

First, define the top 10 customers in a named set:

WITH SET Top10Cust AS 'TopCount(Customers.[Name].Members, 10, ([Time].[1997], [Unit Sales]) )' 

Then use the Generate and CrossJoin functions in the body of the query:

SELECT 
  { [Time].[1997] } ON COLUMNS, 
  Generate(Top10Cust, 
           CrossJoin( {Customers.CurrentMember}, 
                      TopCount([Product].[Product Name].Members, 5, ([Time].[1997], [Unit Sales])) ) ) ON ROWS 
FROM [Sales]

Creating an MDX Expression Using the Current Day

It's quite easy to create an MDX function that has knowledge of the current day, because MDX can use VBA functions transparently.

Try creating the following computed measures to see what they return. (You should probably put Measures on columns and Time on rows in your cube's display grid.)

Now()
Month(Now()) 
Year(Now()) 
StrToValue(Ancestor(Time.CurrentMember, Year).Name)

Creating a Calculated Measure that Sums a Product

For some calculated measures, you want to control exactly when and how Analysis Services aggregates. For example, if you are trying to create a calculated measure, Extended Price, that multiplies quantity by price and aggregates the results, you might try:

Extended Price = [Measures].[Quantity] * [Measures].[Price]. 

But Analysis Services would first aggregate the quantity, then aggregate prices, then multiply the aggregates – coming up with a completely different measure than what you had intended.

The best solution to this problem is to create the calculation (Q*P) in the source database and aggregate the result. You can do this in a view definition, without physically instantiating the calculation. This approach is preferable because it offers the best query performance.
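A minimal sketch of that view, assuming hypothetical Quantity and Price columns on the fact table:

CREATE VIEW dbo.vFactSalesExtended
AS
SELECT DateKey,
       ProductKey,
       Quantity,
       Price,
       Quantity * Price AS ExtendedPrice   -- computed per row, then summed by the cube
FROM dbo.FactSales

The cube's Extended Price measure is then defined as an ordinary SUM over the ExtendedPrice column.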

Alternatively, you could create a calculated member that sums the extended price over the members at the correct level, like the following:

WITH MEMBER measures.demoSum AS 
  'Sum(Descendants([Product].CurrentMember, [Product].[Product Name]), 
       [Measures].[Unit Sales] * [Measures].[Store Cost])' 
SELECT 
  {[Measures].[Unit Sales], [Measures].[Sales Count], [Measures].[demoSum], [Measures].[Store Sales Net]} ON COLUMNS, 
  [Product].[Product Department].Members ON ROWS 
FROM Sales

This computation may take a while if your cube is large and the calculation is made at a high level.
