
SQL Server 2005

Published: April 1, 2005

By Paul Sanders

This paper covers features in SQL Server 2005 Analysis Services that overcome many of the barriers to gathering and analyzing real-time business intelligence data faced by companies today.

On This Page

Introduction
Pushing Data into Analysis Services
Reading Consistent Data
Combining Heterogeneous Data Sources
Defining Caching Policies
Obtaining Notifications from Analysis Services
Summary
Appendix

Introduction

The goal of business intelligence (BI) systems is to enable better, more informed, and faster decisions. Hence any discussion of real-time business intelligence must be in the context of how close to real time the information must be to support those decisions. What constitutes real-time information can vary widely for different business activities, even within a single enterprise. For example, consider a chain of retail outlets:

  • An analyst using historical data as input to the sales forecast for the next period would likely only need information as of the last month.

  • A marketing manager evaluating the success of a campaign, and responsible for deciding how long the campaign should run, would need much more timely information, probably no more than one day old.

  • A store manager might be making frequent decisions during the day; for example, deciding when to put a perishable item on sale. The manager would need information that was certainly no more than a few hours out of date.

Many times a combination of up-to-date and historical information is necessary. For example, while it may be useful for the store manager to know the number of sales of a particular item so far this morning, it is more useful if he can put that information in the context of the average number of sales of that same item on the same day of the week over the last year.

Reports can include data from different systems. In our retail example, the data for today’s store sales would be taken from the source systems that are updated immediately at the time of sale, which would be combined with historical data in an enterprise data warehouse (DW). However, this means we no longer have "one version of the truth." When we compare data that is drawn from two sources, we must exercise great care to make sure that we apply the same business policies that were enforced during the load of the data warehouse. These include the consolidations that were applied, the treatment of discounts, and the approach to currency conversions. This means that the business rules embodied in the DW load process must also be embodied in the reports or applications that combine data from the two systems. The result is a less agile, more error-prone system, which can lead to decisions based on incorrect information.

Ideally, we want a single source that produces data that is close enough to real time to support all decision makers in the different scenarios. So what are the barriers that prevent business intelligence (BI) systems from supplying truly real-time data?

Barriers to Obtaining Real-Time BI

It should be stressed that there are many formidable challenges to obtaining real-time BI data, some of which are not addressable by technology. It is important to keep this in mind, as many presentations of Enterprise Information Integration (EII) technologies appear to trivialize the extent of the problem. Barriers to obtaining data that is truly real-time include the following:

  • It is necessary to provide an historical view. Often, the original source systems do not provide historical data at all; therefore, it is necessary to maintain a separate copy of the data, with full history, in a DW.

  • It is necessary to coordinate with business processes and across systems. It might be meaningless to perform a particular analysis until certain business processes are complete. For example, in our simple retail chain example, store-to-store comparisons might not be meaningful until all stores, in all time zones, have closed. Similarly, the mechanisms by which some legacy systems are updated might set a lower limit on the latency that is achievable.

  • Frequently the data must go through a transformation process to enforce data quality. Depending on the complexity of those transformations and the volume of data, it might simply not be cost effective to reduce the load time below a certain level.

  • It is often necessary to integrate data from multiple data sources. Even in the absence of significant transformations, this integration is commonly handled by a transformation process that consolidates the data into a separate store.

  • The goal of providing satisfactory user performance is often at odds with true real-time results. Reporting directly from the source system certainly provides real-time information, but common aggregate-level queries generally perform poorly against systems oriented for frequent updates, and such a query load often cannot be tolerated on transactional systems. Thus, a separate store must often be maintained, with special attention paid to providing precomputed aggregates (likely using Analysis Services). Maintaining those aggregates on each update to the source data takes time.

  • Continually reporting on data from a system that is constantly being updated requires that we consider the issue of data consistency.

  • Even if the data in a report is sufficiently up-to-date, how do we ensure that the report is immediately delivered to the necessary decision makers when they need it?

Such considerations mean that the user is often a long way from the original source data, as shown in Figure 1. Each link in the chain introduces an additional time delay.


Figure 1: Barriers to real-time BI

The source transactional systems in each store are updated immediately from point-of-sale registers. Every day, data from all stores is consolidated, and extensive transformations applied to clean the data before it is loaded into a DW. Once loaded, a process is initiated to reprocess the Analysis Services cubes that are used for all user access. After the cube is processed on a staging server, it is validated and then transferred to the production server where it is available for user reporting. Sometime after the cube is refreshed, the relevant manager runs a report against the cube, sees the changed information, and (finally!) action is initiated.

Breaking Down the Barriers

Although some barriers are difficult to address by technology, there are a number of features in both SQL Server Analysis Services and other components of the SQL Server product that facilitate providing more timely information to end-users.

  • Pushing data into Analysis Services. The task of updating Analysis Services with changes is made easier by the addition of SQL Server Integration Services (SSIS) transformations that can push data directly into Analysis Services. This means that the same task that extracts and cleans the data can directly refresh the cube.

  • Combining heterogeneous data sources. An Analysis Services cube can be sourced from data drawn from different heterogeneous sources. The data is combined by Analysis Services and presented in a unified manner to the user.

  • Reading consistent data. The SQL Server 2005 relational engine includes an additional transaction isolation level allowing the reader to obtain a snapshot view of the data. Analysis Services exploits this so that a cube can be processed such that, even if processing takes many minutes, the cube presents a consistent view of the data.

  • Updating the Analysis Services cube incrementally. As in Analysis Services 2000, it is possible to reduce the time required to refresh the Analysis Services cube through the use of partitions, where the impact of updates can be limited so that only a small part of the cube is actually affected. In addition, it is possible to process a cube incrementally so that new fact rows can be added to the existing cache. Analysis Services 2005 builds on this by also allowing the incremental update of dimensions, catering for cases where new dimension records are only added, and never updated.

  • Defining caching policies. Analysis Services 2005 introduces the ability to define how the cube should be refreshed to reflect changes to the source data. These proactive caching policies allow the business demands for up-to-date data to be balanced against the need for high performance.

  • Obtaining notification from Analysis Services. By means of integration with Notification Services, it is possible to subscribe to data changes of interest. For example, a manager could register interest in a particular Key Performance Indicator (KPI), and be notified by e-mail the moment that the KPI changes status.

These technologies can be used in isolation or in tandem. In many cases they will make it easier to provide real-time BI. An example of a simplified architecture, applicable in some simple cases, is shown below, where the Analysis Services cube is built directly over the source databases.


Figure 2: A simplified architecture

Note: Even in such simple cases, it is often a better practice to build the cube over a database that is a replicated copy of the actual source OLTP database. The cost of maintaining such a replica is outweighed by the advantage of relieving the transaction system from extra overhead.

The source transactional systems in each store are updated immediately from point-of-sale registers. Every twenty minutes, proactive caching policies ensure that the cube is incrementally refreshed to reflect the new data. If a user has subscribed to a KPI that has changed status, then an e-mail message containing a URL to the relevant report is sent directly to them.

The remainder of this paper goes into detail on these features, focusing on those that are new to Analysis Services 2005.

Pushing Data into Analysis Services

In SQL Server 2005, SSIS introduces pipeline tasks, allowing data to be read, transformed via a succession of transformations, and then output to some destination. Analysis Services is included as one of the possible destinations, allowing the transformed data to be piped directly to a cube. This covers the load of both partitions and dimensions, and includes the ability to load data incrementally. New to Analysis Services 2005, dimensions can also be loaded incrementally, which is particularly relevant for cases where dimension rows are only added, never updated, as is the case for Type 2 dimensions.
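The pipeline concept can be illustrated with a small sketch. The stage names and the CubeDestination class below are purely illustrative stand-ins, not part of the SSIS or Analysis Services APIs; the point is only the shape of a source-to-transform-to-destination flow in which the same task that cleans the data also refreshes the cube.

```python
# Conceptual sketch of a pipeline: source -> transformation -> cube destination.
# All names here are hypothetical; SSIS pipelines are configured graphically
# or via its object model, not written like this.

def extract(rows):
    """Source adapter: yield raw rows from the source system."""
    for row in rows:
        yield row

def clean(rows):
    """Transformation: drop rows with missing keys, normalize amounts."""
    for row in rows:
        if row.get("product_id") is None:
            continue  # reject rows that would break referential integrity
        row["amount"] = round(float(row["amount"]), 2)
        yield row

class CubeDestination:
    """Destination: accumulates rows as a stand-in for a partition load."""
    def __init__(self):
        self.fact_rows = []
    def load(self, rows):
        self.fact_rows.extend(rows)

source = [
    {"product_id": 1, "amount": "19.999"},
    {"product_id": None, "amount": "5.00"},   # dirty row, filtered out
    {"product_id": 2, "amount": "7.5"},
]
cube = CubeDestination()
cube.load(clean(extract(source)))
print(len(cube.fact_rows))  # → 2
```

Because the destination sits at the end of the same pipeline, no intermediate staging table is needed between cleansing and cube refresh.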

Reading Consistent Data

SQL Server 2005 introduces a new transaction isolation level known as snapshot isolation. A client that starts such a transaction is guaranteed that all the data they read during the course of the transaction is as it was at the start of the transaction, even if updates are being committed concurrently by other users.

Processing a cube over changing data can be problematic if it is not isolated from changes of other users. For example, the Product dimension might be processed, and then before partition processing completes, a new product might be entered, along with a sale for that product. The sale might then get picked up in the partition processing, leading to an apparent referential integrity error.

Analysis Services 2005 can, at the discretion of the developer, exploit snapshot isolation. If a data source has this policy set, then Analysis Services starts a snapshot transaction at the start of processing, and all objects are processed from that same snapshot. It should be noted that there is no large, fixed overhead required to obtain a snapshot. There is, however, a cost that is incurred by a slight overhead on all updates made during the course of the transaction.
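The semantics can be illustrated with a minimal sketch. This models the behavior only, not SQL Server's row-versioning implementation: the reader works from a copy taken when its "transaction" starts, so concurrent writes are invisible until processing finishes.

```python
# Minimal model of snapshot-isolation semantics (not SQL Server internals).
import copy

class VersionedStore:
    def __init__(self, data):
        self._data = data
    def begin_snapshot(self):
        # A snapshot transaction sees the data as of this moment.
        return copy.deepcopy(self._data)
    def update(self, key, value):
        self._data[key] = value

store = VersionedStore({"sales": 100, "products": 10})
snapshot = store.begin_snapshot()      # cube processing starts here

store.update("sales", 250)             # concurrent update commits
store.update("products", 11)           # new product added mid-processing

# Dimension and partition processing both read the snapshot, so they see
# the same consistent state, avoiding the apparent referential-integrity
# error described above.
print(snapshot["sales"], snapshot["products"])   # → 100 10
```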

Combining Heterogeneous Data Sources

A cube in Analysis Services can be sourced from multiple data sources. For example, while Product and Sales data might come from the main enterprise DW, the Quotas might be drawn from a small, departmental database. Analysis Services will issue the necessary queries to join the data across the different sources.

One of the data sources must be SQL Server, as Analysis Services exploits the distributed query processor of SQL Server. The SQL Server database need not contain any data—for example, there might be data from two Oracle databases joined together using the query processor of a third SQL Server database.

Defining Caching Policies

Analysis Services provides high query performance by maintaining a cache of the underlying data in a form optimized for analytical queries, often including precomputed aggregations. The presence of such a cache raises a number of important questions:

  • What should be done when the underlying data changes, and therefore the cache becomes stale?

  • How often should the cache be refreshed? How should queries be answered while the cache refresh is underway—from the stale cache, or by instead reverting to the underlying source, which would result in poorer query performance but up-to-date data?

  • How do we even know that a change has occurred to the source data? How often do we check?

The new Analysis Services 2005 proactive caching feature provides a way for the cube designer to define policies that balance the business needs for performance against those for real-time data. The system runs on “auto pilot,” without the need for explicit processing.

The available policy settings allow for a variety of system behaviors. The following example shows a cube where the policy specifies that the refresh of the cube’s multidimensional OLAP (MOLAP) cache begins immediately after an update occurs to the source data. Also, due to the strong need for up-to-date data, if the cache refresh is not completed within a short period of time, the system will revert to relational OLAP (ROLAP) mode, sending queries directly to the underlying database until the cache refresh completes.

1. The cube has been processed from the data in the relational database. All user queries against the cube are satisfied using the MOLAP cache.

Figure 3

2. An update occurs to the underlying database. Analysis Services is notified of the update via an event.

Figure 4

3. Processing starts to rebuild a new MOLAP cache. Initially, while the new cache is being built, user queries are still answered using the old, stale MOLAP cache.

Figure 5

4. After a period of time, the new MOLAP cache is not yet completed. The policies dictate a maximum acceptable latency. Upon reaching that limit, the system reverts to ROLAP mode, and all new user requests will be satisfied by issuing SQL queries against the underlying relational source, to retrieve the up-to-date data.

Figure 6

5. Eventually, the new MOLAP cache is built, and the system switches back to MOLAP mode.

Figure 7
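The sequence above can be sketched as a simple state machine. The class and method names below are illustrative assumptions (a single latency policy, times as plain seconds); this is not an Analysis Services API, only a model of the MOLAP/ROLAP switch the policy describes.

```python
# Model of the MOLAP/ROLAP switch driven by a latency policy.
class ProactiveCache:
    def __init__(self, max_latency):
        self.max_latency = max_latency
        self.mode = "MOLAP"          # serving queries from the current cache
        self.stale_since = None      # time the source was last updated

    def on_source_update(self, now):
        if self.stale_since is None:
            self.stale_since = now   # cache is now stale; rebuild begins

    def on_query(self, now):
        # Revert to ROLAP once staleness exceeds the allowed latency.
        if self.stale_since is not None and now - self.stale_since > self.max_latency:
            self.mode = "ROLAP"
        return self.mode

    def on_rebuild_complete(self):
        self.stale_since = None
        self.mode = "MOLAP"          # new cache ready; back to MOLAP

cache = ProactiveCache(max_latency=60)
cache.on_source_update(now=0)
print(cache.on_query(now=30))    # → MOLAP (stale cache, still within latency)
print(cache.on_query(now=90))    # → ROLAP (latency exceeded; queries hit the source)
cache.on_rebuild_complete()
print(cache.on_query(now=95))    # → MOLAP (new cache in use)
```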

Different policies allow for quite different behaviors. For example:

  • Automatic MOLAP. Behavior is as in the example above, except that the system never reverts to ROLAP mode, because the need for consistent high performance outweighs the need for real-time data.

  • Scheduled MOLAP. The cube is simply refreshed periodically, irrespective of updates to the source data.

  • Incremental MOLAP. The cube is refreshed incrementally with new data in the source database.

  • Real-time ROLAP. The cube is always in ROLAP mode, and all updates to the source data are immediately reflected in the query results.

The proactive caching settings can be set on each different Analysis Services object (dimension and partition). This provides great flexibility in cases where the policies and the tradeoffs between performance and latency vary for different parts of the cube.

Proactive caching settings are described more fully in the appendix.

Notification Schemes

There are three notification schemes by which Analysis Services can be notified of updates to the underlying source:

  • Trace Events. This only applies where the underlying data source is SQL Server. Analysis Services will register to receive trace events on the necessary tables. This scheme requires that the service account of Analysis Services has Administrator privileges on the SQL Server database. Event delivery is not guaranteed; for example, an event can be missed if the connection between the servers is down (e.g., due to network issues) at the time of the update.

  • Client Initiated. A client can simply send a notification message to Analysis Services, indicating that a specified table has been updated. This is relevant in cases where the application that is actually performing the update is aware of the impact on the Analysis Services cubes.

  • Polling. The most generally applicable approach is to use polling. Polling queries are defined for each table, returning a single value (i.e., a single row, with one column), a change to which indicates an update to the table. For example, a query returning the maximum value of the LastUpdated timestamp column would be a suitable polling query. Along with the query, the designer also specifies the frequency with which Analysis Services should send the polling query.

Incremental Processing

Generally, when a cache is refreshed, it will fully process the affected Analysis Services object to completely reload the cache. To be more exact, partitions use process option Full, while dimensions use process option Update. These options will ensure that related partition data is preserved. However, it is also possible to perform incremental processing of both partitions and dimensions where it is known that rows are only added and not updated.

This capability is available when using the polling notification scheme. Along with a polling query that indicates that an update has occurred, the designer also provides an incremental processing query, which returns the new rows. This query is parameterized, taking the previous and new value as returned by the polling query. An example is provided in the appendix.
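The handshake between the polling query and the incremental processing query can be sketched as follows. The SQL strings mirror the appendix example, but nothing here executes against a real database; an in-memory list stands in for the fact table.

```python
# Sketch of the polling / incremental-processing handshake.
POLLING_QUERY = "SELECT MAX(LastUpdated) FROM Sales"
INCREMENTAL_QUERY = (
    "SELECT * FROM Sales WHERE LastUpdated > COALESCE(?, -1) "
    "AND LastUpdated <= ?"
)

sales = [  # stand-in fact table: (row_id, last_updated)
    (1, 10), (2, 20), (3, 30),
]

def poll(table):
    """Evaluate the polling query: a single value that changes on update."""
    return max(ts for _, ts in table)

def fetch_new_rows(table, previous, current):
    # Parameter 1 = value returned by the previous poll (None on startup),
    # parameter 2 = value returned by the current poll.
    lower = previous if previous is not None else -1
    return [row for row in table if lower < row[1] <= current]

previous = None
current = poll(sales)                    # initial poll
loaded = fetch_new_rows(sales, previous, current)
previous = current

sales.append((4, 40))                    # a new sale arrives
current = poll(sales)                    # poll value changed: 30 -> 40
if current != previous:
    loaded += fetch_new_rows(sales, previous, current)

print(len(loaded))  # → 4 (every row loaded exactly once)
```

Because each incremental query is bounded above by the current poll value, rows arriving between the poll and the fetch are safely deferred to the next cycle rather than loaded twice.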

Obtaining Notifications from Analysis Services

Notification Services is a recent addition to SQL Server. This service provides a scalable infrastructure by which clients can register interest in an event (e.g., the value of a query changing), and define how they wish to be notified if the event occurs (e.g., by e-mail or Instant Messenger). Notification Services then provides standard services in support of this, such as guaranteeing delivery of the notification and ensuring that notification occurs only once upon a transition. For example, when a KPI changes from Ok to Bad, the user will be notified once, and not repeatedly while the KPI remains in the Bad state.
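The once-per-transition behavior can be sketched as follows. The KPI evaluation function and threshold below are illustrative assumptions, not part of Notification Services; the point is that a notification fires on a status change, not on every evaluation.

```python
# Model of once-per-transition notification semantics.
def evaluate_kpi(value, threshold=100):
    """Hypothetical KPI: Ok at or above the threshold, Bad below it."""
    return "Ok" if value >= threshold else "Bad"

def run_subscription(readings, threshold=100):
    notifications = []
    last_status = None
    for value in readings:
        status = evaluate_kpi(value, threshold)
        if status != last_status and last_status is not None:
            notifications.append(status)   # fire only on a transition
        last_status = status
    return notifications

# The KPI dips below the threshold once and later recovers: two
# notifications (one per transition), not one per evaluation.
print(run_subscription([120, 110, 90, 80, 70, 130]))  # → ['Bad', 'Ok']
```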

In SQL Server 2005, Notification Services includes a special adapter for Analysis Services. Using this, a client can subscribe to an event by specifying an MDX query that defines the data of interest, along with the frequency of polling, the notification scheme, and so on. For example, a store manager could request notification if the sales per hour of a particular product fell below a specified level.

This makes it easy to build BI applications that ensure that users can be notified in a timely manner when specified events occur.

Summary

Analysis Services 2005, in concert with other SQL Server 2005 products, provides a number of features applicable to making real-time BI data available. Some of these features are:

  • Maintaining a set of aggregates is made easier by the ability to define caching policies, and then have the system run in ‘auto-pilot’ mode without the need to explicitly reprocess the aggregates.

  • Efficiency is improved by the ability to directly load Analysis Services from a transformation, and to incrementally add new records.

  • The ability to obtain a consistent view, even while data is changing, is provided by snapshot isolation.

  • Integrating data from multiple heterogeneous sources can be facilitated by the ability of Analysis Services cubes to combine data from multiple sources.

  • The ability to inform users of notable events is provided by Notification Services.

While there are many barriers to obtaining completely real-time BI data, these features can be used independently or in tandem to bring information close enough to real time to support better, and faster, business decisions.

Appendix

This appendix provides additional details of the proactive caching settings, and the use of polling queries and incremental processing queries to incrementally refresh a cube using proactive caching.

Proactive Caching Settings

  • Silence Interval and Silence Override Interval. Generally, if a new update occurs while a new cache is being processed, the processing is cancelled and restarted. Hence if updates to the underlying database are known to occur in batches, it would be wasteful to commence a refresh of the cache immediately on the first update in a batch. The Silence Interval specifies the amount of time that must pass without further updates before a refresh commences. The Silence Override Interval provides an upper limit on the amount of time; if after this period there has not been any quiet period, the refresh starts anyway. Setting the Silence Interval to ‘infinite’ means that processing will not be based on update events, but instead will be driven by the Force Rebuild Interval setting.

  • Force Rebuild Interval. If a Force Rebuild Interval is not specified and another update occurs while a new cache is being processed, then the processing is cancelled and restarted. The Force Rebuild Interval specifies the length of time after the last cache was refreshed whereupon a new cache refresh will be started, and will be completed, regardless of any further updates. Hence it provides a means to avoid continual restarts in the presence of continual updates. A common use of the Force Rebuild Interval is with an infinite Silence Interval, resulting in a simple periodic refresh of the cache.

  • Latency. The Latency setting specifies how out-of-date the cache can be before reverting to ROLAP mode. Often, where consistent high performance is needed, this would be set to ‘infinite,’ indicating that queries will always use the MOLAP cache.

  • Online Mode. The Online Mode controls whether during initial processing the cube is brought online immediately in ROLAP mode, even before the initial cache is built.

  • Aggregation Storage. Generally for ROLAP cubes, aggregations result in materialized views being built in the underlying database. This setting controls whether such views are also built in the case of a MOLAP cube reverting to ROLAP.
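A rough sketch of how the Silence Interval and Silence Override Interval interact follows. The function below is an illustration of the described policy, not Analysis Services' actual scheduler; times are plain numbers, and the update batch is given up front for simplicity.

```python
# Model of the silence-interval decision: wait for a quiet gap after the
# last update before refreshing, but never wait longer than the override.
def refresh_time(update_times, silence, override):
    """Return the time at which the cache refresh starts."""
    deadline = update_times[0] + override    # override: refresh by this time
    start = update_times[0] + silence
    for i in range(1, len(update_times)):
        if update_times[i] >= update_times[i - 1] + silence:
            break                            # a quiet period already occurred
        start = update_times[i] + silence    # batch continues; keep waiting
    return min(start, deadline)

# A batch of updates at t=0, 2, 4 with silence=5 would refresh at 4+5=9,
# unless the override (here 8) forces the refresh earlier at t=8.
print(refresh_time([0, 2, 4], silence=5, override=8))   # → 8
print(refresh_time([0, 2, 4], silence=5, override=30))  # → 9
```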

Incremental Processing Using Proactive Caching

The following table shows an example polling query and associated incremental processing query that would be used to detect the presence of new sales rows in the fact table and incrementally add those rows to a partition.

Polling Query:

SELECT MAX(LastUpdated) FROM Sales

Incremental Processing Query:

SELECT * FROM Sales WHERE LastUpdated > COALESCE(?, -1) AND LastUpdated <= ?

Note: The COALESCE is needed simply because on startup the previous value is NULL.

Example

On the last polling query, the max(LastUpdated) was 3/18/2005 1:00 PM.

On the next polling query, this changes to 3/18/2005 1:15 PM.

Analysis Services sends the following query to process the new rows:

SELECT * FROM Sales WHERE
LastUpdated > COALESCE('3/18/2005 1:00 PM', -1) AND
LastUpdated <= '3/18/2005 1:15 PM'
